Regression


Least Squares Estimation

Regression models relate a response variable to one or several predictors. Having observed predictors, we can forecast the response by computing its conditional expectation, given all the available predictors.

\[G(x^{(1)}, \ldots, x^{(k)}) = E \{ Y \mid X^{(1)} = x^{(1)}, \ldots, X^{(k)} = x^{(k)}\}\]

It is a function of $x^{(1)}, \ldots, x^{(k)}$ whose form can be estimated from data.


Method of Least Squares

In univariate regression, we observe pairs $(x_i, y_i)$.

For forecasting, we are looking for the function $G(x)$ that is close to the observed data points. This is achieved by minimizing distances between observed $y_i$ and the corresponding points on the fitted regression line, $\hat{y}_i = \hat{G}(x_i)$.

The difference between the predicted value $\hat{y}_i$ and the observed value, $y_i$, is called the residual and is denoted $e_i$.

\[e_i = y_i - \hat{y}_i\]

The method of least squares finds a regression function $\hat{G}(x)$ that minimizes the sum of squared residuals

\[\sum^n_{i=1} e^2_i = \sum^n_{i=1} ( y_i - \hat{y}_i)^2\]
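
As a concrete illustration, here is a minimal sketch (with made-up data, using NumPy) that evaluates the sum of squared residuals for an arbitrary candidate line and checks that the least squares fit from `np.polyfit` makes it as small as possible.

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_sq_residuals(b0, b1, x, y):
    """Sum of squared residuals for the candidate line y-hat = b0 + b1*x."""
    e = y - (b0 + b1 * x)
    return np.sum(e ** 2)

# Any candidate line has some sum of squared residuals ...
print(sum_sq_residuals(0.5, 1.8, x, y))

# ... and the least squares line makes it as small as possible.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)   # returns [slope, intercept]
print(sum_sq_residuals(b0_hat, b1_hat, x, y))
```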

The Linear Model

The scatterplot below shows Lowe's sales and home improvement expenditures between 1985 and 2007.

[Figure: lin-model]


The Linear Model (cont.)

We see that the points don't all line up, but that a straight line can summarize the general pattern. We call this line a linear model.

A linear model can be written in the form

\[\hat{y} = \beta_0 + \beta_1x\]

where $\beta_0$ (intercept) and $\beta_1$ (slope) are numbers estimated from the data and $\hat{y}$ is the predicted value.


Example 1

In the computer usage model for 301 stores, the model predicts 262.2 MIPS (Millions of Instructions Per Second) and the actual value is 218.9 MIPS. We may compute the residual for 301 stores.
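
Using the definition of the residual,

\[e = y - \hat{y} = 218.9 - 262.2 = -43.3 \text{ MIPS},\]

so the model over-predicts computer usage by 43.3 MIPS at 301 stores.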

[Figure: lin-model]


Estimation in Linear Regression

Let us estimate the slope and intercept by the method of least squares.

\[Q = \sum^n_{i=1} ( y_i - \hat{y}_i)^2 = \sum^n_{i=1} (y_i - \beta_0 - \beta_1 x_i)^2\]

We minimize the sum of squared residuals by taking partial derivatives of $Q$, setting them equal to 0, and solving the resulting equations for $\beta_0$ and $\beta_1$.

\[\begin{cases} \frac{\partial Q}{\partial \beta_0} = -2 \sum^n_{i=1} (y_i - \beta_0 - \beta_1 x_i) = 0 \\ \frac{\partial Q}{\partial \beta_1} = -2 \sum^n_{i=1} (y_i - \beta_0 - \beta_1 x_i) x_i = 0 \end{cases}\]

Estimation in Linear Regression (cont.)

From the first equation, $\beta_0 = \frac{\sum_i y_i - \beta_1 \sum_i x_i}{n} = \bar{y} - \beta_1 \bar{x}$

Substituting this into the second equation,

\[\sum^n_{i=1} (y_i - \beta_0 - \beta_1 x_i) x_i = \sum^n_{i=1} ((y_i - \bar{y}) - \beta_1(x_i - \bar{x})) x_i\]
\[= S_{xy} - \beta_1 S_{xx} = 0 \Rightarrow \beta_1 = S_{xy}/S_{xx}\]

where

\[S_{xx} = \sum^n_{i=1} (x_i - \bar{x})^2, \; S_{xy} = \sum^n_{i=1} (x_i - \bar{x})(y_i - \bar{y})\]
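
A minimal sketch of these formulas in NumPy (the data here are made up purely for illustration):

```python
import numpy as np

# Hypothetical data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

b1 = S_xy / S_xx          # slope
b0 = y_bar - b1 * x_bar   # intercept
print(b0, b1)

# np.polyfit(x, y, 1) minimizes the same criterion and should agree.
print(np.polyfit(x, y, 1))   # returns [slope, intercept]
```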

Example 2

According to the International Data Base of the U.S. Census Bureau, the population of the world (in millions) has grown according to the table below. How can we use these data to predict the world population in the years 2020 and 2030?

| Year | Population | Year | Population | Year | Population |
|------|------------|------|------------|------|------------|
| 1950 | 2558 | 1975 | 4089 | 2000 | 6090 |
| 1955 | 2782 | 1980 | 4451 | 2005 | 6474 |
| 1960 | 3043 | 1985 | 4855 | 2010 | 6864 |
| 1965 | 3350 | 1990 | 5287 | 2020 | ? |
| 1970 | 3712 | 1995 | 5700 | 2030 | ? |
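
One way to carry this out, sketched below with the slope and intercept formulas derived earlier (population figures in millions, taken from the table; the predictions are printed at run time rather than quoted here):

```python
import numpy as np

year = np.array([1950, 1955, 1960, 1965, 1970, 1975, 1980,
                 1985, 1990, 1995, 2000, 2005, 2010])
pop = np.array([2558, 2782, 3043, 3350, 3712, 4089, 4451,
                4855, 5287, 5700, 6090, 6474, 6864])  # millions

# Slope and intercept from the least squares formulas above.
S_xx = np.sum((year - year.mean()) ** 2)
S_xy = np.sum((year - year.mean()) * (pop - pop.mean()))
b1 = S_xy / S_xx
b0 = pop.mean() - b1 * year.mean()

for x_new in (2020, 2030):
    print(x_new, b0 + b1 * x_new)   # predicted world population, in millions
```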

Regression and Correlation

We can find the slope of the least squares line using the correlation and the standard deviations.

\[\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{s_{xy}}{s^2_{x}} = r\frac{s_y}{s_x}\]

where

\[r =\frac{s_{xy}}{s_{x}s_{y}}\]
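
To see why the two expressions for the slope agree, note that $s_{xy} = S_{xy}/(n-1)$ and $s^2_x = S_{xx}/(n-1)$, so that

\[\frac{S_{xy}}{S_{xx}} = \frac{s_{xy}}{s^2_x} = \frac{s_{xy}}{s_x s_y}\cdot\frac{s_y}{s_x} = r\,\frac{s_y}{s_x}.\]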

Understanding Regression from Correlation

If we consider finding the least squares line for standardized variables $z_x$ and $z_y$, the formula for slope can be simplified.

\[\beta_1 = r \frac{s_{z_y}}{s_{z_x}} = r\frac{1}{1}=r\]
\[\beta_0 = \bar{z}_y - \beta_1 \bar{z}_x = 0 - r \cdot 0 = 0\]
\[\hat{z}_y = rz_x\]

From above we see that for an observation 1 SD above the mean in $x$, you'd expect $y$ to have a z-score of $r$.


Regression to the Mean


Checking the Model

Models are useful only when specific assumptions are reasonable:

  1. Quantitative Data Condition: linear models only make sense for quantitative data, so don't be fooled by categorical data recorded as numbers.

  2. Linearity Assumption (check the Linearity Condition): the two variables must have a linear association, or a linear model won't mean a thing.

  3. Outlier Condition: outliers can dramatically change a regression model.

  4. Equal Spread Condition: check a residual plot for equal scatter for all $x$-values (see the sketch after this list).
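
A minimal sketch of such a residual plot (made-up data; any fitted linear model could be substituted):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 + 2 * x + rng.normal(scale=1.0, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

# Plot residuals against x: the vertical scatter should look roughly constant.
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```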


Nonlinear Relationships

| Plot | Description |
|------|-------------|
| nonlinrel | A nonlinear relationship that is not appropriate for linear regression. |
| nonlinrel2 | The Spearman rank correlation works with the ranks of the data, but a linear model is difficult to interpret, so it is not appropriate. |
| nonlinrel3 | Transforming or re-expressing one or both variables by a function such as the square root or logarithm; though sometimes difficult to interpret, the regression models and supporting statistics are useful. |

Analysis of Variance

The total variation among observed responses is measured by the total sum of squares:

\[SS_{TOT} = \sum^n_{i=1}(y_i - \bar{y})^2 = (n-1) s^2_y\]

This is the variation of $y_i$ about their sample mean regardless of our regression model.


Analysis of Variance (cont.)

A portion of this total variation is attributed to predictor $X$ and the regression model connecting predictor and response. This portion is measured by the regression sum of squares:

\[SS_{REG} = \sum^n_{i=1}(\hat{y}_i - \bar{y})^2\]

This is the portion of total variation explained by the model.

Substituting $b_0 = \bar{y} - b_1 \bar{x}$,

\[SS_{REG} = \sum^n_{i=1}(b_0+b_1 x_i - \bar{y})^2 = \sum^n_{i=1} b^2_1 (x_i - \bar{x})^2 = b^2_1 S_{xx}\]

Analysis of Variance (cont.)

The rest of total variation is attributed to "error". It is measured by the error sum of squares:

\[SS_{ERR} = \sum^n_{i=1}(y_i - \hat{y}_i)^2 = \sum^n_{i=1} e^2_i\]

This is the portion of total variation not explained by the model.
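
A quick numerical check of the decomposition $SS_{TOT} = SS_{REG} + SS_{ERR}$ (made-up data, using NumPy):

```python
import numpy as np

# Hypothetical data for illustration only.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 1 + 0.5 * x + rng.normal(scale=0.8, size=x.size)

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_tot = np.sum((y - y.mean()) ** 2)       # total variation
ss_reg = np.sum((y_hat - y.mean()) ** 2)   # explained by the model
ss_err = np.sum((y - y_hat) ** 2)          # left in the residuals

print(ss_tot, ss_reg + ss_err)   # the two numbers should agree up to rounding
```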


R-square

The goodness of fit, i.e., the appropriateness of the predictor and the chosen regression model, can be judged by the proportion of $SS_{TOT}$ that the model can explain.

$R^2$, the coefficient of determination, is the proportion of the total variation explained by the model,

\[R^2 = \frac{SS_{REG}}{SS_{TOT}}\]

It is always between 0 and 1, with high values generally suggesting a good fit.
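
Since the three sums of squares satisfy $SS_{TOT} = SS_{REG} + SS_{ERR}$, an equivalent form is

\[R^2 = \frac{SS_{REG}}{SS_{TOT}} = 1 - \frac{SS_{ERR}}{SS_{TOT}},\]

so an $R^2$ close to 1 means little variation is left in the residuals.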


Variation in the Model

The variation in the residuals shows how well a model fits the data.

Squaring the correlation coefficient $r$ gives $r^2$, a value between 0 and 1. In simple linear regression, $r^2$ equals the coefficient of determination $R^2$, the fraction of the variation in $y$ accounted for by the model.


Inference for Regression

Because observations vary from sample to sample, we imagine a true line that summarizes the relationship between $x$ and $y$ for the entire population.

Under the standard regression assumptions, the mean response at a given $x_i$ is

\[\mu_y = E(Y_i) = \beta_0+\beta_1 x_i\]

The Population and the Sample

For a given value x:

[Figure: regmeans]

\[y = \beta_0+\beta_1 x + \varepsilon\]

Degrees of Freedom

Let us compute degrees of freedom for all three $SS$ in the regression ANOVA.

The total sum of squares $SS_{TOT}$ has $df_{TOT} = n-1$ degrees of freedom because it comes directly from the sample variance $s^2_y$.

The regression sum of squares $SS_{REG}$ has $df_{REG} = 1$.

The error sum of squares has $df_{ERR} = n-2$ degrees of freedom, so that

\[df_{TOT} = df_{REG} + df_{ERR}\]

Variance Estimation

The standard deviation of the residuals, $s_e$, gives us a measure of how much the points spread around the regression line.

\[s_e = \sqrt{\frac{SS_{ERR}}{n-2}} = \sqrt{\frac{\sum e^2_i}{n-2}}\]

[Figure: residuals] It appears that the spread in the residuals is increasing.


ANOVA Table

[Figure: anova]
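
For simple linear regression the ANOVA table has the standard layout below, shown with generic symbols rather than the specific values in the figure:

| Source     | df    | Sum of Squares | Mean Square | F                   |
|------------|-------|----------------|-------------|---------------------|
| Regression | 1     | $SS_{REG}$     | $MS_{REG}$  | $MS_{REG}/MS_{ERR}$ |
| Error      | $n-2$ | $SS_{ERR}$     | $MS_{ERR}$  |                     |
| Total      | $n-1$ | $SS_{TOT}$     |             |                     |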


Regression Variance

Mean squares $MS_{REG}$ and $MS_{ERR}$ are obtained from the corresponding sums of squares by dividing them by their degrees of freedom.

The estimated standard deviation $s$ is usually called the root mean squared error (RMSE).

\[s = \sqrt{MS_{ERR}}\]

The F-ratio

\[F = \frac{MS_{REG}}{MS_{ERR}}\]

is used to test the significance of the entire regression model.


Assumptions and Conditions for ANOVA

Independence Assumption

The groups must be independent of each other.

No test can verify this assumption. You have to think about how the data were collected and check that the Randomization Condition is satisfied.


Equal Variance Assumption

ANOVA assumes that the true variances of the treatment groups are equal. We can check the corresponding Similar Variance Condition in various ways:


Normal Population Assumption

Like Student's t-tests, the F-test requires that the underlying errors follow a Normal model. As before when we faced this assumption, we'll check a corresponding Nearly Normal Condition.


Regression Inference

\[\hat{y} = b_0 + b_1 x\]

where $b_0$ estimates $\beta_0$, $b_1$ estimates $\beta_1$.


Assumptions and Conditions

The inference methods are based on these assumptions:


Assumptions and Conditions (cont.)

Summary of Assumptions and Conditions:


The Standard Error of the Slope

\[SE(b_1) = \frac{s_e}{s_x\sqrt{n-1}} = \frac{s}{\sqrt{S_{xx}}}\]

where $s_e$ is the spread around the line, $s_x$ is the spread of the $x$ values, and $n$ is the sample size.


Example 5

Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

[Figure: regse1]


Example 6

Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

[Figure: regse2]


Example 7

Which of these scatterplots would give the more consistent regression slope estimate if we were to sample repeatedly from the underlying population?

[Figure: regse3]


A Test for the Regression Slope

When the conditions are met, the standardized estimated regression slope,

\[t = \frac{b_1 - \beta_1}{SE(b_1)}\]

follows a Student's $t$-model with $n - 2$ degrees of freedom. We calculate the standard error $SE(b_1)$ as above, where $s_e = \sqrt{\frac{\sum(y-\hat{y})^2}{n-2}}$ and $s_x$ is the standard deviation of the $x$-values.


A Test for the Regression Slope (cont.)

\[t = \frac{b_1 - \beta_1}{SE(b_1)}\]

follows a Student's $t$-model with $n - 2$ degrees of freedom.


CI for the Regression Slope

When the assumptions and conditions are met, we can find a confidence interval for $\beta_1$ from

\[b_1 \pm t^*_{n-2}SE(b_1)\]

where the critical value $t^*$ depends on the confidence level and has $n - 2$ degrees of freedom.
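
A sketch of the slope test and confidence interval under these assumptions (made-up data; SciPy's `stats.t` supplies the $t$ quantiles):

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 25)
y = 1 + 0.5 * x + rng.normal(scale=0.7, size=x.size)
n = x.size

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)

s_e = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # residual standard deviation
S_xx = np.sum((x - x.mean()) ** 2)
se_b1 = s_e / np.sqrt(S_xx)                       # standard error of the slope

t_stat = b1 / se_b1                               # tests H0: beta_1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value

t_crit = stats.t.ppf(0.975, df=n - 2)             # 95% confidence level
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_stat, p_value, ci)
```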


ANOVA F-test

It compares the portion of variation explained by regression with the portion that remains unexplained.

\[MS_{REG} = \frac{SS_{REG}}{df_{REG}} = \frac{SS_{REG}}{1} = SS_{REG}\]
\[MS_{ERR} = \frac{SS_{ERR}}{df_{ERR}} = \frac{SS_{ERR}}{n-2} = s^2\]

Under the null hypothesis $H_0: \beta_1 = 0$, the two mean squares are independent, and their ratio

\[F = MS_{REG}/MS_{ERR} = SS_{REG}/s^2\]

has an $F$-distribution with $df_{REG} = 1$ and $df_{ERR} = n - 2$ degrees of freedom.


A Hypothesis Test for Correlation

What if we want to test whether the correlation between $x$ and $y$ is 0? We use the test statistic

\[t = r \sqrt{\frac{n-2}{1-r^2}}\]

which follows a Student's $t$-model with $n - 2$ degrees of freedom.


F-test and T-test

For univariate regression, the t-test for the regression slope and the ANOVA F-test are equivalent.

\[t^2 = \frac{b^2_1}{s^2/S_{xx}} = \frac{S_{yy}(S_{xy}/S_{xx})^2}{S_{yy}s^2/S_{xx}} = \frac{r^2 SS_{TOT}}{s^2} = \frac{SS_{REG}}{s^2} = F\]

Prediction

Let $x_*$ be a given value of the predictor $X$. The corresponding predicted value of the response $Y$ is

\[\hat{y}_* = \hat{G}(x_*) = b_0 + b_1 x_*\]

How reliable are regression predictions, and how close are they to the true values? We can construct a confidence interval for the mean response and a prediction interval for an individual response.


The Confidence Interval for the Mean Response

When the conditions are met, we find the confidence interval for the mean response value $\mu_*$ at a value $x_*$ as

\[\hat{y}_* \pm t^*_{n-2} SE(\hat{\mu}_*)\]

where the standard error is

\[SE(\hat{\mu}_*) = \sqrt{SE^2(b_1) \times (x_* - \bar{x})^2 + s^2_e/n } = s \sqrt{\frac{1}{n} + \frac{(x_* - \bar{x})^2} {S_{xx}}}\]

The Prediction Interval for an Individual Response

When the conditions are met, we can find the prediction interval for an individual response $y_*$ at a value $x_*$ as

\[\hat{y}_* \pm t^*_{n-2} SE(\hat{y}_*)\]

where the standard error is

\[SE(\hat{y}_*) = \sqrt{SE^2(b_1) \times (x_* - \bar{x})^2 + s^2_e/n +s^2_e}\]
\[= s \sqrt{1 + \frac{1}{n} + \frac{(x_* - \bar{x})^2} {S_{xx}}}\]
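
A sketch computing both intervals at a new predictor value (made-up data; `x_star` is an illustrative choice):

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration only.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 25)
y = 1 + 0.5 * x + rng.normal(scale=0.7, size=x.size)
n = x.size

b1, b0 = np.polyfit(x, y, deg=1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE
S_xx = np.sum((x - x.mean()) ** 2)

x_star = 7.5                      # hypothetical new predictor value
y_star = b0 + b1 * x_star         # point prediction

se_mean = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / S_xx)      # mean response
se_pred = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / S_xx)  # individual response

t_crit = stats.t.ppf(0.975, df=n - 2)   # 95% level
print("CI for the mean response:", (y_star - t_crit * se_mean, y_star + t_crit * se_mean))
print("Prediction interval:", (y_star - t_crit * se_pred, y_star + t_crit * se_pred))
```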