"To guess is cheap. To guess
wrongly is expensive."
If plotting the data results in a scatterplot that suggests a linear relationship, it is useful to summarize the overall pattern by drawing a line through the scatterplot. Least squares regression is the method for doing this, but only in a specific situation: when one variable is being used to explain or predict the other. A regression line (LSRL - Least Squares Regression Line) is a straight line that describes how a response variable y changes as an explanatory variable x changes. The line is a mathematical model used to predict the value of y for a given x. Regression requires that we have an explanatory variable and a response variable.
No line will pass through all the data points unless the relation is PERFECT. More likely the line will follow the general pattern of the points, and it should be as close to them as possible. Close means "close in the vertical direction." Error is defined as observed value - predicted value, and we are seeking the line that makes these errors as small as possible overall. Specifically, the least squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. Yes, actual squares. See page 152 for visual.
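As a rough illustration of "as small as possible," the sketch below adds up the squared vertical distances for a few candidate lines. The five data points are made up for this example and the function name sse is just a label chosen here. The line y hat = 2.2 + 0.6x is the least squares line for these particular points; the other two lines are arbitrary comparisons.

```python
# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(a, b):
    """Sum of squared vertical distances from the points to the line y hat = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Three candidate lines, each given as (intercept, slope).
candidates = {
    "least squares line": (2.2, 0.6),
    "same slope, lower intercept": (2.0, 0.6),
    "steeper line": (1.5, 0.8),
}

for name, (a, b) in candidates.items():
    print(f"{name}: sum of squared errors = {sse(a, b):.2f}")
```

For this data the three sums come out to about 2.40, 2.60 and 2.85, so the least squares line has the smallest total of the three.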
The least squares regression line has the same form as any line: it has a slope and an intercept. To indicate that this is a calculated (predicted) line we will change from "y =" to "y hat =", so the equation is y hat = a + bx. It can be shown that the slope (b) = r (sy/sx), where r is the correlation coefficient and sx and sy are the standard deviations of x and y. Note: the standard deviations are in the same order as typical slope (change in y / change in x from Algebra I). The y intercept (a) = y bar - b(x bar), where y bar and x bar are the respective means. I don't like to say "memorize" too much, but.....these facts need to be recorded for later use. In real life the slope is the rate of change: the amount by which the predicted y changes when x increases by 1. The intercept is the predicted value of y when x = 0. The equation of the regression line makes prediction easy. Just SUBSTITUTE an x value into the equation.
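The slope and intercept formulas can be checked by hand or with a few lines of code. The sketch below is only an illustration: the data are made up, and the function names correlation, lsrl and predict are labels chosen here, not calculator commands.

```python
from statistics import mean, stdev

# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def correlation(xs, ys):
    """Correlation r: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for the line y hat = a + b*x using b = r(sy/sx) and a = y bar - b(x bar)."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

a, b = lsrl(x, y)

def predict(x_value):
    """Substitute an x value into the regression equation."""
    return a + b * x_value

print(f"y hat = {a:.2f} + {b:.2f}x")
print(f"predicted y when x = 6: {predict(6):.2f}")
```

For these points the slope is about 0.6 and the intercept about 2.2, so each increase of 1 in x raises the predicted y by about 0.6.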
A quantity related to the regression output is "r^2". Although it looks like this quantity is simply the square of "r" (and numerically it is), there is much more to learn about what it means. When r^2 is close to 0 the regression line is NOT a good model for the data. When r^2 is close to 1, the line fits the data well. r^2 has a technical name, the coefficient of determination, and represents the fraction of the variation in the values of y that is explained by least squares regression of y on x.
Let's see the text (pp 158-162) for the complete explanation of how r^2 is developed from previously measured values. Once we understand how the method is derived, we shall use the calculator to calculate the values.
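One way to see "fraction of the variation explained" with numbers is to compare the squared distances of y from its mean with the squared distances left over around the regression line. The sketch below assumes made-up data, and the names sst and sse are labels chosen here, which may differ from the notation in the text.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares slope and intercept (algebraically the same as b = r(sy/sx)).
b = s_xy / s_xx
a = y_bar - b * x_bar

# Total variation in y: squared distances from the mean of y.
sst = s_yy
# Left-over variation: squared distances from the regression line.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (sst - sse) / sst      # fraction of variation explained
r = s_xy / sqrt(s_xx * s_yy)       # correlation

print(f"fraction explained = {r_squared:.3f}, r squared = {r ** 2:.3f}")
```

Here the total variation is 6 and the left-over variation is about 2.4, so about 60% of the variation in y is explained by the line, which matches the square of r.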
Some additional facts about least squares regression are:
Regression is one of the most common statistical settings, and least squares is the most common method for fitting a regression line to data. (Another method is the median-median line, which often produces a line very similar to the LSRL.)
The order of the variables (explanatory and response) is critical when calculating a regression line. Interchanging x and y produces a different line.
There is a close connection between correlation and the slope of the least squares line: b = r (sy/sx).
The least squares regression line always passes through the point (x bar, y bar).
The correlation (r) describes the strength of a straight-line relationship.
The square of the correlation, r^2, is the fraction of the variation in the values of y that is explained by the regression of y on x.
Remember, when you report a regression line it is a good idea to include r^2 as a measure of how successful the regression was in explaining the response.
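Two of these facts are easy to check numerically: the line changes when x and y trade roles, and the line passes through the point of means. The sketch below uses made-up data, and the helper name lsrl is a label chosen here.

```python
from statistics import mean

def lsrl(explanatory, response):
    """Return (intercept, slope) of the least squares line of response on explanatory."""
    e_bar, r_bar = mean(explanatory), mean(response)
    slope = (sum((e - e_bar) * (r - r_bar) for e, r in zip(explanatory, response))
             / sum((e - e_bar) ** 2 for e in explanatory))
    return r_bar - slope * e_bar, slope

# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = lsrl(x, y)   # regression of y on x
a_xy, b_xy = lsrl(y, x)   # regression of x on y (roles interchanged)

print(f"y on x: y hat = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x on y: x hat = {a_xy:.2f} + {b_xy:.2f}y")

# Simply solving the y-on-x line for x would give slope 1 / b_yx, not b_xy.
print(f"1 / b_yx = {1 / b_yx:.2f}, but b_xy = {b_xy:.2f}")

# The y-on-x line evaluated at x bar gives y bar.
print(f"line at x bar: {a_yx + b_yx * mean(x):.2f}, y bar: {mean(y):.2f}")
```

For these points the two slopes are about 0.6 and 1.0, and rearranging the first line would give a slope of about 1.67, so the two regressions really are different lines.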
When the regression line is calculated by least squares and the vertical (y) distances to the regression line are measured, it is implied that there ARE distances, and they represent "left-over" variation. These distances are called residuals. A residual is the difference between an observed value of the response variable and the value predicted by the regression line: residual = observed y - predicted y, or y - y hat. The residuals show how far the data fall from the regression line and help assess how well the line describes the data. THE MEAN OF THE LEAST SQUARES RESIDUALS IS ALWAYS ZERO, so on the calculator the residuals will be plotted around the horizontal line y = 0. A residual plot is a scatterplot of the regression residuals against the explanatory variable. IF the plot shows a uniform scatter of the points above and below the zero line, with no unusual observations or systematic pattern, then the regression line captures the overall relationship well. Residual plots help us assess the "fit" of a regression line. (RESID is a command on the graphing calculator located in the "list" menu as #7 under "names.")
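The residual calculation can be sketched in a few lines. The data below are made up for illustration; the (x, residual) pairs printed are the points that a residual plot would show.

```python
from statistics import mean

# Hypothetical data, invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)

# Least squares slope and intercept (algebraically the same as b = r(sy/sx)).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

# residual = observed y - predicted y
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

for xi, res in zip(x, residuals):
    print(f"x = {xi}: residual = {res:+.2f}")

# Up to rounding error in the arithmetic, the residuals average to zero.
print("mean of residuals is essentially zero:", abs(mean(residuals)) < 1e-9)
```

Some residuals are positive (points above the line) and some negative (points below), and they balance out so that their mean is zero.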
Lots of things can happen when you examine a residual plot:
A curved pattern might appear, showing that the relationship is not linear.
Increasing spread about the line as x increases indicates that prediction of y will be LESS accurate for larger x's. Decreasing spread indicates the reverse.
Individual points with large residuals are outliers in the vertical direction.
Individual points that are extreme in the x direction are also important, because they are often influential observations.
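The first pattern in the list, a curve in the residuals, can be seen by fitting a straight line to data that are not linear. The sketch below assumes made-up data that follow y = x squared exactly; it is only an illustration of what the signs of the residuals look like in that case.

```python
from statistics import mean

# Hypothetical curved data (y = x squared), invented only for this illustration.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

x_bar, y_bar = mean(x), mean(y)

# Least squares slope and intercept (algebraically the same as b = r(sy/sx)).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
signs = ["+" if res > 0 else "-" for res in residuals]

for xi, res, sign in zip(x, residuals, signs):
    print(f"x = {xi}: residual = {res:+.1f} ({sign})")
```

The residuals are positive at both ends and negative in the middle. That systematic pattern, rather than a random scatter about zero, is the signal that a straight line is not the right model.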
An outlier is an observation that lies outside the overall pattern of the other observations.
An observation is influential if removing it would greatly change the result of a statistical calculation.
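To get a feel for influence, one can fit the line with and without a suspect point and compare. In the sketch below the data are made up, the last point (15, 3) is deliberately placed far out in the x direction, and the helper name lsrl is a label chosen here.

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
             / sum((xi - x_bar) ** 2 for xi in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical data; the last point (15, 3) is extreme in the x direction.
x = [1, 2, 3, 4, 5, 15]
y = [2, 4, 5, 4, 5, 3]

a_all, b_all = lsrl(x, y)                   # all six points
a_without, b_without = lsrl(x[:-1], y[:-1])   # extreme point removed

print(f"slope with the extreme point:    {b_all:.3f}")
print(f"slope without the extreme point: {b_without:.3f}")

# Residual of the extreme point from the line fitted to all six points.
residual_extreme = y[-1] - (a_all + b_all * x[-1])
print(f"residual of the extreme point:   {residual_extreme:+.3f}")
```

With the extreme point the slope is slightly negative (about -0.03); without it the slope is about 0.6. Removing one observation changes the result greatly, so that point is influential. In this example its own residual is fairly small, because it pulls the line toward itself.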
We will complete the activity on page 154. You have experience from Algebra 2.