Data ?

Data are pieces of information about individuals organized into variables. By an individual, we mean a particular person or object. By a variable, we mean a particular characteristic of the individual.

Variables can be classified into one of two types: categorical or quantitative.

Types of variables

Quantitative variables take numerical values whose “size” is meaningful. Quantitative variables answer questions such as “how many?” or “how much?” For example, it makes sense to add, to subtract, and to compare two persons’ weights, or two families’ incomes: These are quantitative variables. Quantitative variables typically have measurement units, such as pounds, dollars, years, volts, gallons, megabytes, inches, degrees, miles per hour, pounds per square inch, BTUs, and so on.

Some variables, such as social security numbers and zip codes, take numerical values, but are not quantitative: They are qualitative or categorical variables. The sum of two zip codes or social security numbers is not meaningful. The average of a list of zip codes is not meaningful. Qualitative and categorical variables typically do not have units. Qualitative or categorical variables—such as gender, hair color, or ethnicity—group individuals. Qualitative and categorical variables have neither a “size” nor, typically, a natural ordering to their values. They answer questions such as “which kind?” The values categorical and qualitative variables take are typically adjectives (for example, green, female, or tall). Arithmetic with qualitative variables usually does not make sense, even if the variables take numerical values. Categorical variables divide individuals into categories, such as gender, ethnicity, age group, or whether or not the individual finished high school.

Until now, a simple distinction was made between quantitative and categorical variables. However, there is a more precise method of categorizing variables: it is called scale of measurement. The four different scales of measurement, from least to most precise, are

  • Nominal
  • Ordinal
  • Interval
  • Ratio

As we saw in Data and Variables, the data for each variable are a long list of values (whether numerical or not), and are not very informative in that form. In order to convert these raw data into useful information we need to summarize and then examine the distribution of the variable. By distribution of a variable, we mean:

  • what values the variable takes, and
  • how often the variable takes those values.

We can achieve this by frequency distribution. example frequency distribution

Soon after this comes, numerous number of plots, cumulative freq etc. used to study data distribution. We will not go into those details.

But to make sense out of any distribution we should look out for certain things. Let’s Summarize

When examining the distribution of a quantitative variable, one should describe the overall pattern of the data by these factors [shape, center, spread, outliers].

When describing the shape of a distribution, one should consider: Symmetry/skewness of the distribution.

Peakedness (modality)—the number of peaks (modes) the distribution has.

Statistics ?

Statistics is all about is converting data into useful information. Statistics is therefore a process in which we

Collect data, Summarize data, and Interpret data.

Difference between probability & statistics

Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.

Big picture of statistics

How statistics, probability, exploratory data analysis, inferences are linked?

The process of statistics starts when we identify what group we want to study or learn something about. We call this group the population. Population, then, is the entire group that is the target of our interest. In most cases, the population is so large that, as much as we want to, there is absolutely no way we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty). A more practical approach would be to examine and collect data only from a subgroup of the population, which we call a sample. We call this first step, which involves choosing a sample and collecting data from it, producing data.

It should be noted that since, for practical reasons, we need to compromise and examine only a sub-group of the population rather than the whole population, we should make an effort to choose a sample in such a way that it will represent the population well. For example, if we choose a sample from the population of U.S. adults, and ask their opinions about the death penalty, we do not want our sample to consist of only Republicans or only Democrats.

Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way. This second step, which consists of summarizing the collected data, is called exploratory data analysis.

Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results. Before we can do so, we need to look at how the sample we’re using may differ from the population as a whole, so that we can factor that into our analysis. To examine this difference, we use probability.

In essence, probability is the “machinery” that allows us to draw conclusions about the population based on the data collected about the sample.

Finally, we can use what we’ve discovered about our sample to draw conclusions about our population. We call this final step in the process inference.

Big picture of stats

Example

At the end of April 2005, a poll was conducted (by ABC News and the Washington Post) for the purpose of learning the opinions of U.S. adults about the death penalty.

  1. Producing Data: A (representative) sample of 1,082 U.S. adults was chosen, and each adult was asked whether he or she favored or opposed the death penalty.

  2. Exploratory Data Analysis (EDA): The collected data were summarized, and it was found that 65% of the sampled adults favor the death penalty for persons convicted of murder.

3 and 4. Probability and Inference: Based on the sample result (of 65% favoring the death penalty) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those who favor the death penalty in the population is within 3% of what was obtained in the sample (i.e., between 62% and 68%). The following figure summarizes the example:

Big picture of stats

What is random variable ?

A random variable is actually a function that map the outcomes of a random process to a numeric value.

There can be 2 types of random variable. Discrete & Continous

What is probability distribution ?

A probability distribution is a list of all possible outcomes for that random variable, along with their probabilities

What is stochastic process/random process ?

Collection of random variables.