Lecture 10
Outline
Inference for Regression
Inference for Regression
The Population and the Sample
But we know observations vary from sample to sample. So we imagine a true line that summarizes the relationship between $x$ and $y$ for the entire population,
$$\mu_y = \beta_0 + \beta_1 x$$
where $\mu_y$ is the population mean of $y$ at a given value of $x$.
We write $\mu_y$ instead of $y$ because the model assumes that the means of the $y$ values for each value of $x$ fall exactly on the line.
The Population and the Sample (cont.)
For a given value x:
Most, if not all, of the y values obtained from a particular sample will not lie on the line.
The sampled y values will be distributed about $\mu_y$.
We can account for the difference between an observed $y$ and $\mu_y$ by adding the error term $\varepsilon$:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Regression Inference
Collect a sample and estimate the population $\beta$'s by finding a regression line:
$$\hat{y} = b_0 + b_1 x$$
where $b_0$ estimates $\beta_0$ and $b_1$ estimates $\beta_1$.
The residuals, $e = y - \hat{y}$, are the sample-based versions of $\varepsilon$.
Account for the uncertainties in $\beta_0$ and $\beta_1$ by making confidence intervals, as we've done for means and proportions.
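A minimal Python sketch of this estimation step, using a small hypothetical data set (the arrays `x` and `y` below are made up for illustration):

```python
import numpy as np

# Hypothetical sample data (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

# Least-squares estimates: b1 estimates beta1, b0 estimates beta0.
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x   # predicted values
e = y - y_hat         # residuals, the sample-based versions of epsilon
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```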
Assumptions and Conditions
The inference methods are based on these assumptions:
Linearity Assumption: This assumption is satisfied if the scatterplot of $y$ against $x$ looks straight.
Independence Assumption: Look for randomization in the sample or the experiment. Also check the residual plot for lack of patterns.
Equal Variance Assumption: Check the Equal Spread Condition, which means the variability of $y$ should be about the same for all values of $x$.
Normal Population Assumption: Assume the errors around the idealized regression line at each value of $x$ follow a Normal model. Check if the residuals satisfy the Nearly Normal Condition.
Assumptions and Conditions (cont.)
Summary of Assumptions and Conditions:
Make a scatterplot of the data to check for linearity. (Linearity Assumption)
Fit a regression, find the residuals, $e$, and predicted values $\hat{y}$.
Plot the residuals against time (if appropriate) and check for evidence of patterns (Independence Assumption).
Make a scatterplot of the residuals against $x$ or the predicted values. This plot should not exhibit a "fan" or "cone" shape. (Equal Variance Assumption)
Make a histogram and a Normal probability plot of the residuals (Normal Population Assumption); a plotting sketch follows below.
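A sketch of these four checks with matplotlib and scipy, continuing with `x`, `y`, `y_hat`, and `e` from the sketch above:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Linearity: the scatterplot should look straight.
axes[0, 0].scatter(x, y)
axes[0, 0].plot(x, y_hat, color="red")
axes[0, 0].set_title("y vs. x")

# Equal variance: no fan or cone shape in the residual plot.
axes[0, 1].scatter(y_hat, e)
axes[0, 1].axhline(0, color="gray")
axes[0, 1].set_title("Residuals vs. predicted")

# Nearly Normal: histogram should be unimodal and roughly symmetric.
axes[1, 0].hist(e)
axes[1, 0].set_title("Histogram of residuals")

# Normal probability plot: points should lie close to the line.
stats.probplot(e, plot=axes[1, 1])

plt.tight_layout()
plt.show()
```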
The Standard Error of the Slope
For a sample, we expect $b_1$ to be close to, but not equal to, the model slope $\beta_1$.
Across samples, the standard error of the slope measures the variability of $b_1$ about the true slope $\beta_1$:
$$SE(b_1) = \frac{s_e}{s_x \sqrt{n-1}}$$
where $s_e$ measures the spread around the line, $s_x$ is the spread of the $x$ values, and $n$ is the sample size.
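Computing $SE(b_1)$ directly from this formula, continuing with the arrays above (a sketch; in practice regression software reports this value):

```python
n = len(x)
s_e = np.sqrt(np.sum((y - y_hat) ** 2) / (n - 2))  # spread around the line
s_x = np.std(x, ddof=1)                            # spread of the x values
se_b1 = s_e / (s_x * np.sqrt(n - 1))
print(f"SE(b1) = {se_b1:.4f}")
```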
Example 1
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?
Example 2
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?
Example 3
Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?
A Test for the Regression Slope
When the conditions are met, the standardized estimated regression slope,
$$t = \frac{b_1 - \beta_1}{SE(b_1)},$$
follows a Student's $t$-model with $n - 2$ degrees of freedom. We estimate the standard error as
$$SE(b_1) = \frac{s_e}{s_x \sqrt{n-1}},$$
where $s_e = \sqrt{\frac{\sum(y-\hat{y})^2}{n-2}}$ and $s_x$ is the standard deviation of the $x$-values.
A Test for the Regression Slope (cont.)
When the assumptions and conditions are met, we can test the hypothesis $H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \ne 0$ using the standardized estimated regression slope
$$t = \frac{b_1 - 0}{SE(b_1)},$$
which follows a Student's $t$-model with $n - 2$ degrees of freedom.
We can use the $t$-model to find the P-value of the test.
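A sketch of the test using `scipy.stats` and the quantities `b1`, `se_b1`, and `n` computed above:

```python
from scipy import stats

t_stat = b1 / se_b1                              # tests H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided P-value
print(f"t = {t_stat:.3f}, P-value = {p_value:.4g}")
```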
A Test for the Regression Slope (cont.)
When the assumptions and conditions are met, we can find a confidence interval for $\beta_1$ from $b_1 \pm t^*_{n-2} \times SE(b_1)$,
where the critical value $t^*$ depends on the confidence level and has $n - 2$ degrees of freedom.
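The corresponding interval, sketched with the same quantities (95% confidence as an example):

```python
t_star = stats.t.ppf(0.975, df=n - 2)  # critical value for 95% confidence
lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1
print(f"95% CI for beta1: ({lo:.3f}, {hi:.3f})")
```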
A Hypothesis Test for Correlation
What if we want to test whether the correlation between $x$ and $y$ is 0?
When the conditions are met, we can test the hypothesis $H_0: \rho = 0$ vs. $H_A: \rho \ne 0$ using the test statistic
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}},$$
which follows a Student's $t$-model with $n - 2$ degrees of freedom.
We can use the $t$-model to find the P-value of the test.
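A sketch of this correlation test, continuing with `x`, `y`, and `n` from above:

```python
r = np.corrcoef(x, y)[0, 1]                        # sample correlation
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided P-value
print(f"r = {r:.3f}, t = {t_stat:.3f}, P-value = {p_value:.4g}")
```

For a one-step alternative, `scipy.stats.pearsonr(x, y)` returns the same $r$ and two-sided P-value directly.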
The Confidence Interval for the Mean Response
When the conditions are met, we find the confidence interval for the mean response value $\mu_\nu$ at a value $x_\nu$ as
$$\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{\mu}_\nu),$$
where the standard error is
$$SE(\hat{\mu}_\nu) = \sqrt{SE^2(b_1)\cdot(x_\nu - \bar{x})^2 + \frac{s_e^2}{n}}.$$
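A sketch of this interval at a hypothetical new value $x_\nu$ (here `x_nu = 4.5`, chosen for illustration), using the quantities computed above:

```python
x_nu = 4.5                 # hypothetical value of interest
y_nu = b0 + b1 * x_nu      # predicted mean response at x_nu
se_mu = np.sqrt(se_b1 ** 2 * (x_nu - x.mean()) ** 2 + s_e ** 2 / n)
print(f"95% CI for the mean response: "
      f"({y_nu - t_star * se_mu:.3f}, {y_nu + t_star * se_mu:.3f})")
```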
The Prediction Interval for an Individual Response
When the conditions are met, we can find a prediction interval for an individual response $y$ at a value $x_\nu$ as
$$\hat{y}_\nu \pm t^*_{n-2} \times SE(\hat{y}_\nu),$$
where the standard error is
$$SE(\hat{y}_\nu) = \sqrt{SE^2(b_1)\cdot(x_\nu - \bar{x})^2 + \frac{s_e^2}{n} + s_e^2}.$$
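The same sketch extended to a prediction interval; the extra $s_e^2$ term accounts for the scatter of individual responses, so this interval is wider than the one for the mean response:

```python
# Extra s_e**2 term: variability of an individual y around the mean response.
se_pred = np.sqrt(se_b1 ** 2 * (x_nu - x.mean()) ** 2 + s_e ** 2 / n + s_e ** 2)
print(f"95% prediction interval: "
      f"({y_nu - t_star * se_pred:.3f}, {y_nu + t_star * se_pred:.3f})")
```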