Linear regression

Linear regression is a method of data analysis intended for use with a set of paired observations on two variables on the same set of statistical units. Conventionally, we refer to one of the variables as independent (usually labeled X) and the other as dependent (labeled Y). The notion of an independent variable often (but not always) implies the ability to choose the levels of that variable, with the dependent variable responding naturally, as in the stimulus-response model.

The analysis has several steps:

  1. Summarize the data by computing the sums of the observations, their squares, and their cross products.
  2. Estimate the parameters: first b, the estimate of beta, and then a, the estimate of alpha.
  3. Calculate and display the residuals, {Y - a - b*X}, the differences between the observations and the fitted equation.
  4. Calculate several ancillary statistics that permit evaluation of how well the fitted line describes the data.

Note: A useful alternative to linear regression is [robust regression]? in which mean absolute error is minimized instead of mean squared error as in linear regression. Robust regression is computationally much more intensive than linear regression and is somewhat more difficult to implement as well.
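As a rough sketch of the difference, the fit below minimizes mean absolute error numerically. The sample data are made up, and the use of scipy's general-purpose Nelder-Mead optimizer is one implementation choice among many for illustration, not the standard algorithm for robust regression.

    import numpy as np
    from scipy.optimize import minimize

    # Made-up paired observations.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    # Mean absolute error of the line y = a + b*x for parameters p = (a, b).
    def mae(p):
        return np.mean(np.abs(y - p[0] - p[1] * x))

    # Nelder-Mead copes with the non-smooth absolute-value objective.
    result = minimize(mae, x0=np.zeros(2), method="Nelder-Mead")
    a_robust, b_robust = result.x
    print(f"robust fit: a = {a_robust:.3f}, b = {b_robust:.3f}")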

Summarizing the data
We sum the observations, the squares of the Y's and X's, and the products X*Y to obtain the following quantities:

  n, Sum(X), Sum(Y), Sum(X^2), Sum(Y^2), and Sum(X*Y)

and from these the corrected sums of squares and products:

  Sxx = Sum(X^2) - (Sum(X))^2/n
  Syy = Sum(Y^2) - (Sum(Y))^2/n
  Sxy = Sum(X*Y) - Sum(X)*Sum(Y)/n
Estimating beta
We use the summary statistics above to calculate b, the estimate of beta:

  b = Sxy/Sxx
Estimating alpha
We use the estimate of beta and the other statistics to estimate alpha by:

  a = Sum(Y)/n - b*Sum(X)/n

that is, the mean of Y minus the estimated slope times the mean of X.
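The following is a minimal Python sketch of steps 1 and 2 above, computing the summary sums and then b and a with the formulas just given. The data are made up for illustration.

    # Made-up paired observations (X independent, Y dependent).
    X = [1.0, 2.0, 3.0, 4.0, 5.0]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(X)

    # Step 1: summarize the data.
    sum_x = sum(X)
    sum_y = sum(Y)
    sum_xx = sum(x * x for x in X)
    sum_xy = sum(x * y for x, y in zip(X, Y))

    # Corrected sums of squares and products.
    Sxx = sum_xx - sum_x * sum_x / n
    Sxy = sum_xy - sum_x * sum_y / n

    # Step 2: estimate beta, then alpha.
    b = Sxy / Sxx
    a = sum_y / n - b * sum_x / n
    print(f"a = {a:.3f}, b = {b:.3f}")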

Displaying the residuals
The first method of displaying the residuals uses a histogram or cumulative distribution to depict their similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.

We plot the residuals, (Y - a - b*X), against the independent variable, X. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are:

  * a curved pattern, suggesting that the relationship between Y and X is not linear
  * a spread that widens (or narrows) with X, suggesting that the variance of the residuals is not constant
  * a few points lying far from the rest, suggesting outliers or recording errors
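A small sketch of both displays, using matplotlib (an assumption; any plotting tool would serve) with the made-up data and the parameter estimates computed in the sketch above:

    import matplotlib.pyplot as plt

    # Made-up data and the parameter estimates from the sketch above.
    X = [1.0, 2.0, 3.0, 4.0, 5.0]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    a, b = 0.05, 1.99

    residuals = [y - a - b * x for x, y in zip(X, Y)]

    fig, (left, right) = plt.subplots(1, 2)
    left.hist(residuals)                # shape should look roughly normal
    left.set_title("Histogram of residuals")
    right.scatter(X, residuals)         # no trend or pattern should appear
    right.axhline(0.0, linestyle="--")
    right.set_title("Residuals vs. X")
    plt.show()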

Ancillary statistics
The sum of squared deviations of Y, Syy, can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent variable is explained by the independent variable:

  Syy = (Sxy)^2/Sxx + (Syy - (Sxy)^2/Sxx)

where the first term is the sum of squares due to the regression and the second is the residual sum of squares.

The correlation coefficient, r, can be calculated by

  r = Sxy/sqrt(Sxx*Syy)

This statistic measures how well a straight line describes the data. Values near zero suggest that the model is ineffective. r^2 is frequently interpreted as the fraction of the variability in Y explained by the independent variable, X.
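A short sketch of the partition and of r, on the same made-up data as the earlier sketches:

    from math import sqrt

    # Made-up paired observations, as in the sketches above.
    X = [1.0, 2.0, 3.0, 4.0, 5.0]
    Y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(X)

    # Corrected sums of squares and products.
    Sxx = sum(x * x for x in X) - sum(X) ** 2 / n
    Syy = sum(y * y for y in Y) - sum(Y) ** 2 / n
    Sxy = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

    # Partition of Syy, as in ANOVA.
    ss_regression = Sxy ** 2 / Sxx       # part explained by X
    ss_residual = Syy - ss_regression    # part left unexplained
    r = Sxy / sqrt(Sxx * Syy)            # correlation coefficient

    print(f"Syy = {Syy:.3f} = {ss_regression:.3f} + {ss_residual:.3f}")
    print(f"r = {r:.3f}, r^2 = {r * r:.3f}")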
