Lecture 2


Outline


Displaying Quantitative Variables

Example of quantitative data: monthly changes in a stock's price

quant-var


Histograms

A histogram is similar to a bar chart with the bin counts used as the heights of the bars. Note: there are no gaps between bars unless there are actual gaps in the data.

histogram
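As a rough illustration of how bin counts become bar heights, the sketch below bins the eight values used in the stem-and-leaf example later in this lecture. The bin width of 10 and the helper name `bin_counts` are assumptions chosen for illustration; the helper expects integer values and an integer width.

```python
from collections import Counter

def bin_counts(values, width):
    """Count how many values fall into each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    lo, hi = min(counts), max(counts)
    # Keep empty bins too, so real gaps in the data show up as zero-height bars.
    return {start: counts.get(start, 0) for start in range(lo, hi + width, width)}

width = 10  # assumed bin width
data = [21, 22, 24, 33, 33, 36, 38, 41]
counts = bin_counts(data, width)

for start, n in counts.items():
    # Text version of a histogram: one '#' per observation in the bin.
    print(f"[{start}, {start + width}): {'#' * n}")
```

Empty bins are kept with a count of zero, which is how a gap in the data would appear as a gap between bars.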


Other

Example: Create a stem-and-leaf display for the data 21, 22, 24, 33, 33, 36, 38, 41.

2|124
3|3368
4|1
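The display above can be reproduced with a short sketch. It assumes two-digit whole numbers, with the tens digit as the stem and the ones digit as the leaf; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines for whole numbers (stem = tens, leaf = ones)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(str(leaf))
    return [f"{stem}|{''.join(leaves)}" for stem, leaves in sorted(stems.items())]

lines = stem_and_leaf([21, 22, 24, 33, 33, 36, 38, 41])
print("\n".join(lines))
```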

Elementary probability theory


Shape

When you describe a distribution, you should pay attention to its shape, center, and spread.

We describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values.


Mode

Peaks or humps seen in a histogram are called the modes of a distribution.

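A distribution whose histogram has one main peak is called unimodal, one with two peaks is called bimodal, and one with three or more peaks is called multimodal.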

modes


Symmetry

A distribution is symmetric if the halves on either side of the center look, at least approximately, like mirror images.

The thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the distribution is said to be skewed to the side of the longer tail.

tail


Outliers

The outliers in a distribution are those values that stand apart from the body of the distribution.


Center

The mean of a distribution is calculated as the sum of all values, $y_i$, divided by the number of values, $N$.

\[\bar{y} = \frac{\sum_{i=1}^N y_i}{N}\]

The mean is considered to be the balancing point of the distribution.
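A minimal sketch of the formula, using the eight values from the stem-and-leaf example as sample data. It also checks the balancing-point idea: the deviations above and below the mean cancel out.

```python
def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)

data = [21, 22, 24, 33, 33, 36, 38, 41]
y_bar = mean(data)

# Deviations from the mean sum to zero, which is why the mean "balances" the data.
total_deviation = sum(y - y_bar for y in data)
print(y_bar, total_deviation)
```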


Median

The median is the value separating the higher half of the data from the lower half.

The median is resistant to unusual observations and to the shape of the distribution.

median
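To see what resistance means in practice, the sketch below adds one unusual value to the stem-and-leaf data and compares the median with the mean. The value 400 is a made-up observation, included only to show the effect.

```python
import statistics

data = [21, 22, 24, 33, 33, 36, 38, 41]
with_outlier = data + [400]  # hypothetical unusual observation

median_before, mean_before = statistics.median(data), statistics.mean(data)
median_after, mean_after = statistics.median(with_outlier), statistics.mean(with_outlier)

print("median:", median_before, "->", median_after)
print("mean:  ", mean_before, "->", mean_after)
```

With these numbers the median stays at 33, while the mean moves from 31 to 72.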


Spread

Sometimes we need to determine how spread out the data are.

One simple measure of spread is the range, defined as the difference between the extremes.

\[R = \max - \min\]

The range is a single value and it is not resistant to unusual observations.


Quartiles

The quartiles of a ranked set of data values are the three points that divide the data set into four equal groups, each group comprising a quarter of the data.


Interquartile Range

The interquartile range (IQR) is defined to be the difference between the upper quartile, $Q_3$, and the lower quartile, $Q_1$.

\[IQR = Q_3 - Q_1\]
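The sketch below computes the range and the IQR for the stem-and-leaf data, then repeats the calculation with one made-up unusual value (400). Quartile conventions differ between textbooks and software, so the `method="inclusive"` setting of `statistics.quantiles` is an assumption here, not necessarily the rule used in this lecture.

```python
import statistics

def value_range(values):
    """Difference between the extremes."""
    return max(values) - min(values)

def iqr(values):
    """Q3 - Q1, using one common quartile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [21, 22, 24, 33, 33, 36, 38, 41]
with_outlier = data + [400]  # hypothetical unusual observation

print("range:", value_range(data), "->", value_range(with_outlier))
print("IQR:  ", iqr(data), "->", iqr(with_outlier))
```

Under this convention the range jumps from 20 to 379, while the IQR only changes from 13 to 14, which illustrates why the IQR is the more resistant measure.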

Variance

The variance, denoted by $s^2$, is (almost) the average of the squared deviations of the values of the variable $y$ from the mean: the sum of squared deviations is divided by $N-1$ rather than $N$.

\[s^2 = \frac{\sum_{i=1}^N (y_i-\bar{y})^2}{N-1}\]

The variance is expressed in squared units of the data. Taking the square root of the variance corrects this issue and gives us the standard deviation, $s$.
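A short sketch of both quantities, again using the stem-and-leaf data as sample values. The hand-written `sample_variance` follows the formula with $N-1$ in the denominator; the standard library calls are shown only as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

data = [21, 22, 24, 33, 33, 36, 38, 41]
s2 = sample_variance(data)
s = math.sqrt(s2)  # standard deviation, back in the original units

# Cross-check against the standard library, which also divides by N - 1.
print(s2, statistics.variance(data))
print(s, statistics.stdev(data))
```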


Guide


Boxplot

The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).

boxplot
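The five numbers that a boxplot draws can be collected in one small function. This is a sketch under the assumption that quartiles follow the `method="inclusive"` convention of `statistics.quantiles`; other conventions give slightly different quartiles. The name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

summary = five_number_summary([21, 22, 24, 33, 33, 36, 38, 41])
print(summary)
```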


Boxplot (cont)


Example

Example: Wine Prices. The boxplots display case prices (in dollars) of wines produced by vineyards along three of the Finger Lakes in upstate New York.

a. Which lake region produces the most expensive wine?
b. Which lake region produces the cheapest wine?
c. In which region are wines generally more expensive?

winebox


Example (cont.)

a. Seneca Lake
b. Seneca Lake
c. Keuka Lake

Cayuga Lake and Seneca Lake vineyards have approximately the same average case price of about \$200, while a typical Keuka Lake vineyard has a case price of about \$260. Keuka Lake vineyards have consistently high case prices, between \$240 and \$280, with one low outlier at about \$170 per case. Cayuga Lake vineyards have case prices from \$140 to \$270, and Seneca Lake vineyards have highly variable case prices from \$100 to \$300.


Outliers

What should be done with outliers?


Standardizing

\[z = \frac{y-\bar{y}}{s}\]

Example

Compare two companies (from the “top” 100 companies) with respect to the variables Revenue (in \$B) and number of Employees.

For all 100 companies, the mean revenue was \$6.23B with standard deviation \$10.56B; the mean number of employees was 19,629 with standard deviation 32,055.


Example (cont.)

z-score


Example 2

Example: Customer Ages. As part of a marketing team, you send surveys to 25 customers (using an incentive to guarantee a high response rate) asking for demographic information. The average age of respondents is 31.84 years, the standard deviation is 9.84 years, the minimum is 11 years, and the maximum is 48 years. Which has the more extreme z-score, the min or the max?
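One way to answer the question is to apply the standardizing formula to both extremes, using the summary figures stated in the example. The helper name `z_score` is hypothetical.

```python
def z_score(y, mean, sd):
    """Number of standard deviations that y lies from the mean."""
    return (y - mean) / sd

mean_age, sd_age = 31.84, 9.84  # summary values given in the example

z_min = z_score(11, mean_age, sd_age)
z_max = z_score(48, mean_age, sd_age)

# "More extreme" means farther from zero, regardless of sign.
more_extreme = "min" if abs(z_min) > abs(z_max) else "max"
print(f"z_min = {z_min:.2f}, z_max = {z_max:.2f}, more extreme: {more_extreme}")
```

The minimum lies about 2.12 standard deviations below the mean, and the maximum about 1.64 above it, so the minimum has the more extreme z-score.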


Correlation


Scatterplot

Scatterplots are the ideal way to picture associations between two quantitative variables.

scatter


Direction

The direction of the association is important.

The second thing to look for in a scatterplot is its form.

The third feature to look for in a scatterplot is the strength of the relationship.


Example: Bookstore

Data gathered from a bookstore show Number of Sales People Working and Sales (in \$1000). Given the scatterplot, describe the direction, form, and strength of the relationship. Are there any outliers?

scatter2


Example (cont.)

scatter2

The relationship between "Number of Sales People Working" and "Sales" is positive, linear, and strong. As the "Number of Sales People Working" increases, "Sales" tends to increase also. There are no outliers.


Assigning Roles to Variables in Scatterplots


Correlation

The ratio of the sum of the products $z_x z_y$ over all points in the scatterplot to $N-1$ is called the correlation coefficient.

\[r = \frac{\sum z_x z_y}{N-1}\]

Since x’s and y’s are paired, multiply each standardized value of x by the standardized value of y it is paired with and add up those cross products. Divide by $N-1$.
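The recipe above can be sketched directly: standardize each variable, multiply the paired z-scores, sum, and divide by $N-1$. The data below are made up for illustration and are not the bookstore data from the lecture; the function name `correlation` is likewise hypothetical.

```python
import statistics

def correlation(xs, ys):
    """Correlation coefficient as the sum of z_x * z_y divided by N - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # both use N - 1
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

# Hypothetical paired observations (not taken from the lecture).
people = [2, 3, 7, 9, 10, 10, 12, 15]
sales = [10, 11, 13, 14, 18, 20, 20, 22]

r = correlation(people, sales)
print(f"r = {r:.3f}")
```

Because the standard deviations here also use $N-1$, points that fall exactly on an upward-sloping line give $r = 1$, and points on a downward-sloping line give $r = -1$.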


Understanding Correlation

Correlation measures the strength of the linear association between two quantitative variables.


Correlation Properties