Unit 6: Linear Regression


6a. Identify the assumptions that inferential statistics in regression are based on

  • What are the assumptions about a population we make while conducting inferential statistics?
  • Why do we call the regression line the "least squares" regression line?
  • What conditions must be true of a sample of points to make the correlation or regression line statistically significant?

There are three assumptions we make while conducting inferential statistics.

  1. Linearity: We assume the relationship between the variables is linear. We measure the strength of that linear relationship with the correlation coefficient, r.
  2. Homoscedasticity: The variance of the data around the regression line is the same for every value of X.
  3. Normality: The errors of prediction are normally distributed.

We test for statistical significance in much the same way as we test whether a sample mean or proportion is statistically significant, as we reviewed in Units 4 and 5.

Remember that it is always best to find and interpret the correlation coefficient first. While correlation does not necessarily prove a causal relationship between the two variables, if the correlation is very low, it is unlikely that the regression line will be of any use.

As long as the points do not all lie on a vertical line, you will always get a solution for the least-squares regression line. But think of the phrase, "garbage in, garbage out": if the slope is not significant, the regression line is of little practical use.

The general method behind the regression line formula is as follows: draw a vertical line segment between every point on the scatter plot and the candidate line, then use each segment as one side of a square. The line that gives the lowest total area of those squares (the lowest sum of squared errors) is considered the best fit. This process is called least squares regression.
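As a sketch, the closed-form least-squares formulas can be applied directly. The data below are hypothetical, made up purely for illustration:

```python
import statistics

# Hypothetical sample data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

mx, my = statistics.mean(x), statistics.mean(y)

# Least squares minimizes the sum of squared vertical distances
# (the total "area of the squares") between the points and the line.
ssx = sum((xi - mx) ** 2 for xi in x)                     # sum of squared deviations of x
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of cross-products

slope = sxy / ssx
intercept = my - slope * mx
print(slope, intercept)   # roughly 1.96 and 0.14
```

Minimizing the sum of squared errors has a closed-form answer, which is why no trial-and-error search over candidate lines is needed.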

Review

To review, see:


6b. Compute the standard error of a slope

  • What does the standard error of a slope tell you?
  • How is the standard error computed?

The standard error for a slope tells you basically the same thing that any other standard error tells you. The standard error is the standard deviation of the sampling distribution. So, the standard error for the mean is the standard deviation of a set of sample means. It shows how reliable those samples are by how much they vary. A low standard error produces a narrower confidence interval and makes it more likely that we reject an incorrect null hypothesis.

Carry this logic forward to the interpretation of a slope. The standard error may not tell you much by itself (its computation is more complex than for means and proportions), but it is a component of statistical inference involving the slope of a regression line.

The estimated standard error of b is computed using the formula s_b = \dfrac{s_{est}}{\sqrt{SSX}}, where s_{est} is the standard error of the estimate and SSX is the sum of squared deviations of X from the mean of X, calculated as SSX = \sum (X - M_X)^2.
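A minimal sketch of this computation, using hypothetical data made up for illustration:

```python
import math
import statistics

# Hypothetical sample data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
ssx = sum((xi - mx) ** 2 for xi in x)   # SSX: squared deviations of X from its mean
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ssx
intercept = my - slope * mx

# Errors of prediction (residuals) from the fitted line
errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Standard error of the estimate: s_est = sqrt(SSE / (n - 2))
s_est = math.sqrt(sum(e ** 2 for e in errors) / (n - 2))

# Standard error of the slope: s_b = s_est / sqrt(SSX)
s_b = s_est / math.sqrt(ssx)
print(s_b)   # about 0.0554
```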

Review

To review, see:


6c. Test a slope for significance

  • How would we test a slope for significance?
  • How does this relate to hypothesis testing?

Hypothesis testing works for the slope or correlation of a regression line in the same general way that it works for the mean and proportion: you have a null hypothesis of no relationship (r = 0, or equivalently a population slope of 0) and an alternative that is almost always two-tailed (r ≠ 0). You can use the formulas in the resources below to find the t-statistic and then use the same methods to find the p-value (the combined area of the right and left tails cut off by the positive and negative values of that t-statistic).
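As a sketch with hypothetical data (made up for illustration), the t-statistic is the slope divided by its standard error, compared against a two-tailed critical value:

```python
import math
import statistics

# Hypothetical sample data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
ssx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ssx
intercept = my - slope * mx

# Standard error of the slope (see objective 6b)
errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s_est = math.sqrt(sum(e ** 2 for e in errors) / (n - 2))
s_b = s_est / math.sqrt(ssx)

# Null hypothesis: the population slope is 0.  Test statistic:
t_stat = slope / s_b

# Two-tailed critical value for alpha = 0.05 with df = n - 2 = 3
critical = 3.182
print(t_stat, abs(t_stat) > critical)   # t is about 35.4, so the slope is significant
```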

Review

To review, see:


6d. Construct a confidence interval on a slope

  • What should the confidence interval for a slope look like if the slope is significant?
  • What is the formula needed to construct a confidence interval for the slope?

Remember, the confidence interval gives the range of values most likely to contain the true parameter. In the case of the slope, we want a confidence interval that does not include 0. If the confidence interval is [-0.8, 2.1], then the true slope could be negative, zero, or positive, which would cause us to conclude that the slope we found is not significant.

To find a 100(1-\alpha)% confidence interval for the slope \beta_{1} of the population regression line, use the formula \hat{\beta}_{1} \pm t_{\alpha/2} \dfrac{s_{\epsilon}}{\sqrt{SS_{xx}}} with df = n - 2 degrees of freedom.
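A sketch of the interval with hypothetical data (made up for illustration); 3.182 is the two-tailed t value for a 95% interval with df = n - 2 = 3:

```python
import math
import statistics

# Hypothetical sample data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)
ssx = sum((xi - mx) ** 2 for xi in x)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ssx
intercept = my - slope * mx

# Standard error of the slope
errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s_est = math.sqrt(sum(e ** 2 for e in errors) / (n - 2))
s_b = s_est / math.sqrt(ssx)

# 95% confidence interval: slope +/- t_{alpha/2} * s_b
t_crit = 3.182
lower, upper = slope - t_crit * s_b, slope + t_crit * s_b
print(lower, upper)   # about (1.78, 2.14)
```

Since 0 falls outside this interval, the slope would be judged significant for these data.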

Review

To review, see:


6e. Calculate the coefficient of determination and the correlation coefficient

  • How is the coefficient of determination calculated?
  • How is the correlation coefficient related to the coefficient of determination?

Simply put, the coefficient of determination (a measure of how much of the variation in one variable is explained by another variable in a regression model) is the square of the correlation coefficient. We use the letter r to represent the correlation coefficient and r^2 (often written R^2) to represent the coefficient of determination. So, to calculate the coefficient of determination, square the correlation coefficient.
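A quick sketch with hypothetical data (made up for illustration) computing the correlation coefficient from the sums of squares and then squaring it:

```python
import math
import statistics

# Hypothetical sample data (illustrative values only)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

mx, my = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)

# Coefficient of determination: just square r
r_squared = r ** 2
print(r, r_squared)   # about 0.9988 and 0.9976
```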

Review

To review, see:


6f. Interpret the coefficient of determination and the correlation coefficient

  • What is the correlation coefficient, and what does it tell us?
  • How is the correlation related to the slope of a regression line? Do they tell us roughly the same thing?
  • What information does the coefficient of determination tell us?

The correlation coefficient measures the linear relationship between two variables, x and y. It is a number between −1 and 1, inclusive.

  • 1 means there is a perfect positive correlation. The scatter plot slopes upward in a straight line.
  • −1 means perfect negative correlation. The scatter plot slopes downward in a straight line.
  • 0 means there is no linear correlation, as if every x value produced a completely random value for y.

In this way, the correlation coefficient is related to the slope of the regression line: they always have the same sign. The slope is positive when the correlation is positive, and negative when the correlation is negative. However, if the points lie in a straight line sloping upward, they will have a correlation coefficient of 1 no matter how steep that slope is. Remember, the slope of a line can be any real number, while the correlation coefficient is capped between −1 and +1.
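A quick sketch (with made-up perfect-line data) illustrates the point: every perfect upward-sloping line has r = 1, no matter how steep it is:

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4]
# Perfect straight lines with very different (positive) slopes...
for slope in (0.5, 2.0, 10.0):
    y = [slope * xi + 1 for xi in x]
    # ...all have a correlation coefficient of 1
    print(slope, pearson(x, y))
```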

The coefficient of determination tells us, in effect, the proportion of the variation in the variable y that is explained by the variable(s) x. So if a correlation is 0.8, then the coefficient of determination is 0.64, telling us that roughly 64% of the variation in the dependent variable (y) is explained by the independent variable (x).

Review

To review, see:


Unit 6 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • coefficient of determination
  • correlation coefficient
  • homoscedasticity
  • least squares regression