Least Squares Linear Regression

The purpose of least squares linear regression is to represent the relationship between one or more independent variables x1, x2, ..., xk and a variable y that is dependent upon them in the following form:

$$ y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ji} + e_i $$

where

$x_{ji}$ = the ith observed value of the independent variable xj

$y_i$ = the ith observed value of the dependent variable y

$e_i$ = the error term or residual (i.e. the difference between the observed y value and that predicted by the model)

$\beta_j$ = the regression slope for the variable xj and

$\beta_0$ = the y-axis intercept

Simple least squares linear regression assumes that there is only one independent variable x. If we assume that the error terms are Normally distributed, the equation reduces to:

yi = Normal(m*xi + c, s)

where m is the slope of the line, c is the y-axis intercept and s is the standard deviation of the variation of y about this line.

Simple least squares linear regression is a very standard statistical analysis technique, particularly when one has little or no idea of the relationship between the x and y variables. It is probably so common because the mathematics of the analysis is simple (thanks to the Normality assumption), rather than because a straight line is a very common rule for the relationship between variables. Least squares regression (LSR) makes four important assumptions:

Individual y values are independent
For each xi, there are an infinite number of possible values of y, which are Normally distributed
The distribution of y given a value of x has equal standard deviation for all x values and is centred about the least squares regression line
The means of the distribution of y at each x value can be connected by a straight line y = mx + c.

Assumptions behind least squares regression analysis

Statisticians often make transformations of the data (e.g. Log(Y), √X) to force a linear relationship. That greatly extends the applicability of the regression model but one must be particularly careful that the errors are reasonably Normal, and one runs an enormous risk in using the regression equations to make predictions outside the range of observations.
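As a minimal sketch of such a transformation, the Python below assumes a hypothetical exponential relationship y = a*exp(b*x) and regresses log(y) on x, which is linear in x. The data, the constants a = 2 and b = 0.5, and the helper name fit_line are illustrative assumptions and do not come from the text.

```python
import math

def fit_line(xs, ys):
    """Least squares slope and intercept for y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

# Hypothetical noise-free data following y = 2 * exp(0.5 * x)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# log(y) = log(a) + b*x, so a straight line can be fitted to (x, log(y))
slope, intercept = fit_line(xs, [math.log(y) for y in ys])
b_est = slope                 # estimate of b
a_est = math.exp(intercept)   # back-transform the intercept to estimate a
```

With real data the residuals of log(y), not of y, are the ones that need to look reasonably Normal.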

Estimation of parameters

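The simple least squares regression model determines the straight line that minimizes the sum of the squares of the errors ei. It can be shown that this occurs when:

$$ m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

$$ c = \bar{y} - m\bar{x} $$

where $\bar{x}$ and $\bar{y}$ are the means of the observed x and y data and n is the number of data pairs (xi,yi).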
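The slope and intercept estimates can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name fit_line and the five data points are made up for the example and are not the survey data used later.

```python
def fit_line(xs, ys):
    """Return (m, c) minimising the sum of squared errors for y = m*x + c."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    m = sxy / sxx
    c = y_bar - m * x_bar
    return m, c

# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c = fit_line(xs, ys)   # m is about 1.99 and c about 0.05 for these points
```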

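The fraction of the total variation in the dependent variable that is explained by the independent variable is known as the coefficient of determination R², which is calculated as:

$$ R^2 = 1 - \frac{SSE}{TSS} $$

where SSE, the sum of squared errors, is given by:

$$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

and TSS, the total sum of squares, is given by:

$$ TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

and where $\hat{y}_i$ are the predicted y values at each xi:

$$ \hat{y}_i = m x_i + c $$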
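A short Python sketch of the coefficient of determination follows. The function name r_squared and the data are illustrative assumptions; the values m = 1.99 and c = 0.05 are the least squares slope and intercept for these particular five points.

```python
def r_squared(xs, ys, m, c):
    """Coefficient of determination for the fitted line y = m*x + c."""
    y_bar = sum(ys) / len(ys)
    # SSE: variation of y left unexplained by the line
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    # TSS: total variation of y about its mean
    tss = sum((y - y_bar) ** 2 for y in ys)
    if tss == 0:
        raise ValueError("all y values are identical; R^2 is undefined")
    return 1.0 - sse / tss

# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r2 = r_squared(xs, ys, m=1.99, c=0.05)   # close to 1: the line explains almost all the variation
```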

For simple least squares regression (i.e. only one independent variable), R² is equal to the square of the simple correlation coefficient r, so the square root of R² gives the magnitude of r:

$$ \sqrt{R^2} = |r| $$

r may alternatively be calculated as:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

r provides a quantitative measure of the linear relationship between x and y. It ranges from -1 to +1: a value of r = -1 or +1 indicates a perfect linear fit, and r = 0 indicates that no linear relationship exists at all. As

$$ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

the sum of squared errors between the observed and predicted y-values, tends to zero, so r² tends to 1 and therefore r tends to -1 or +1, its sign depending on whether m is negative or positive respectively.
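To illustrate how the sign of r follows the direction of the slope, here is a small Python sketch that computes the correlation coefficient directly from the deviations about the means. The function name correlation and both data sets are made up for the illustration.

```python
import math

def correlation(xs, ys):
    """Simple (Pearson) correlation coefficient between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = correlation(xs, [2.1, 3.9, 6.2, 7.8, 10.1])     # rising data: r close to +1
r_down = correlation(xs, [10.1, 7.8, 6.2, 3.9, 2.1])   # same values falling: r close to -1
```

Squaring either value gives the same r², since the two data sets differ only in the direction of the trend.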

The value of r is used to determine the statistical significance of the fitted line, by first calculating the test statistic t as:

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

Under the null hypothesis that there is no linear relationship, the t statistic follows a t-distribution with (n-2) degrees of freedom (provided the linear regression assumption of Normally distributed variation of y about the regression line holds). This distribution is used to determine whether the fit should be rejected or not at the required level of confidence.
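The test statistic itself is a one-line calculation, sketched below in Python. The values r = 0.9 and n = 30 are hypothetical inputs chosen for illustration, and the function name t_statistic is an assumption. Converting t into a probability needs the t-distribution (from tables or a statistics library), which is left out here.

```python
import math

def t_statistic(r, n):
    """Test statistic for the significance of a correlation r from n data pairs."""
    if n <= 2:
        raise ValueError("need more than two data pairs (n - 2 degrees of freedom)")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect fit (|r| = 1)")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical inputs: r = 0.9 observed from 30 data pairs
t = t_statistic(0.9, 30)   # roughly 10.9, to be compared with a t-distribution with 28 degrees of freedom
```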

The standard error of the y-estimate Syx is calculated as:

$$ S_{yx} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}} $$

This is equivalent to the standard deviation of the error terms ei. These errors reflect the true variability of the dependent variable y about the least squares regression line. The denominator (n-2) is used, instead of the (n-1) we have seen before for sample standard deviation calculations, because two values, m and c, have been estimated from the data to determine the regression line. We have therefore lost 2 degrees of freedom instead of the 1 degree of freedom usually lost in determining the mean.
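A minimal Python sketch of the Syx calculation, showing the (n-2) denominator explicitly. The function name and data are illustrative assumptions; m = 1.99 and c = 0.05 are the least squares slope and intercept for these five points.

```python
import math

def standard_error_of_estimate(xs, ys, m, c):
    """Standard error of the y-estimate about the fitted line y = m*x + c."""
    n = len(xs)
    if n <= 2:
        raise ValueError("need more than two data pairs")
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    # Two parameters (m and c) were estimated, so divide by n - 2
    return math.sqrt(sse / (n - 2))

# Illustrative data only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
s_yx = standard_error_of_estimate(xs, ys, m=1.99, c=0.05)
```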

The equation of the regression line and the Syx statistic can be used together to produce a stochastic model of the relationship between X and Y, as follows:

Y = Normal(m*X+c,Syx)

Some caution is needed in using such a model. The regression model is intended to work within the range of the independent variable x for which there have been observations. Using the model outside this range can produce very significant errors if the relationship between x and y deviates from this linear relationship. This is also purely a model of variability, i.e. we are assuming that the linear relationship is correct and that the parameters are known. We should also include our uncertainty about the parameters, and perhaps about whether the linear relationship is even appropriate.

Example

Consider the data set in Table 1 that shows the result of a survey of 30 people. They were asked to provide details of their monthly net income {xi} and the amount they spent on food each month {yi}.

The values of m, c, r² and Syx were calculated using the Excel functions:

m = SLOPE({yi},{xi}) = 0.1356

c = INTERCEPT({yi},{xi}) = 167.8

r² = RSQ({yi},{xi}) = 0.8831

Syx = STEYX({yi},{xi}) = 54.86

The line is plotted against the data points in Figure 2.

Figure 2.

The error terms are shown in Figure 3:

Figure 3.

A distribution fit of these ei values shows that they are approximately Normally distributed. A test of significance of r also shows that, for 28 degrees of freedom (n-2), there is only about a $5 \times 10^{-11}$ chance that such a high value of r could have been observed from purely random data. We would therefore feel confident in modelling the relationship between any net monthly income value N (between the values 505 and 1581) and monthly expenditure on food F using:

F = Normal(0.1356*N+167.8, 54.86)
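The model above could be sampled as in the following Python sketch. The slope, intercept, Syx and income range are the values quoted in this example; the function name sample_food_spend, the seed and the income of 1000 are assumptions made for illustration. The range check reflects the caution about using the model outside the observed values of N.

```python
import random

M, C, S_YX = 0.1356, 167.8, 54.86   # fitted values quoted in the example
N_MIN, N_MAX = 505.0, 1581.0        # observed income range quoted in the example

def sample_food_spend(income, rng=random):
    """Draw one value of F = Normal(M*income + C, S_YX)."""
    if not N_MIN <= income <= N_MAX:
        raise ValueError("income is outside the observed range; the model is not supported there")
    return rng.gauss(M * income + C, S_YX)

rng = random.Random(42)   # fixed seed so the run is repeatable
draws = [sample_food_spend(1000.0, rng) for _ in range(10_000)]
mean_spend = sum(draws) / len(draws)   # should sit near 0.1356*1000 + 167.8 = 303.4
```

This samples variability only: m, c and Syx are treated as exactly known.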

Uncertainty about least squares regression parameters

The parameters m, c, and Syx for the least squares regression represent the best estimate of the variability model where we are assuming some stochastically linear relationship between x and y. However, since we will have only a limited number of observations (i.e. {x,y} pairs), we do not have perfect knowledge of the stochastic system and there is therefore some uncertainty about the regression parameters. The t-test tells us whether the linear relationship might exist at some level of confidence. More useful, however, from a risk analysis perspective is that we can readily determine distributions of uncertainty about these parameters using the Bootstrap.
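One way to sketch this in Python is a non-parametric bootstrap that resamples the {x,y} pairs with replacement and refits the line each time. The text does not say which bootstrap variant is intended, so resampling pairs is an assumption here, as are the function names, the synthetic data and the number of resamples.

```python
import math
import random

def fit(xs, ys):
    """Least squares (m, c, syx) for y = m*x + c."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    c = y_bar - m * x_bar
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return m, c, math.sqrt(sse / (n - 2))

def bootstrap_fits(xs, ys, n_boot=2000, seed=1):
    """Refit the line on n_boot resamples of the (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    while len(fits) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # sample pairs with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:   # skip degenerate resamples with a single x value
            continue
        fits.append(fit(bx, by))
    return fits

# Hypothetical data: true slope 2, intercept 1, Normal noise with standard deviation 1
data_rng = random.Random(0)
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 1.0) for x in xs]

fits = bootstrap_fits(xs, ys)
slopes = sorted(m for m, _, _ in fits)
# Approximate 95% uncertainty interval for the slope from the bootstrap percentiles
lo, hi = slopes[int(0.025 * len(slopes))], slopes[int(0.975 * len(slopes))]
```

Each element of fits is one plausible (m, c, Syx) triple, so a risk model can draw a triple at random and then sample y from Normal(m*x + c, Syx) to combine uncertainty and variability.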
