Generalized least squares

In statistics, generalized least squares (GLS) is a technique for estimating the unknown parameters in a linear regression model. GLS can be used to perform linear regression when there is a certain degree of correlation between the residuals in a regression model. In these cases, ordinary least squares and weighted least squares can be statistically inefficient, or even give misleading inferences. GLS was first described by Alexander Aitken in 1934.^[1]

Method outline

In a typical linear regression model we observe data $\{y_i,x_{ij}\}_{i=1, \dots, n,\ j=2, \dots, k}$ on n statistical units. The response values are placed in a vector $\mathbf{y} = \left( y_{1}, \dots, y_{n} \right)^{\mathtt{T}}$, and the predictor values are placed in the design matrix $\mathbf{X} = \left( \mathbf{x}_{1}^{\mathtt{T}}, \dots, \mathbf{x}_{n}^{\mathtt{T}} \right)^{\mathtt{T}}$, where $\mathbf{x}_{i} = \left( 1, x_{i2}, \dots, x_{ik} \right)$ is a vector of the k predictor values (including the constant) for the ith unit. The model assumes that the conditional mean of $\mathbf{y}$ given $\mathbf{X}$ is a linear function of $\mathbf{X}$, whereas the conditional variance of the error term given $\mathbf{X}$ is a known nonsingular matrix $\mathbf{\Omega}$. This is usually written as

\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}, \qquad \mathbb{E}[\varepsilon|\mathbf{X}]=0,\ \operatorname{Var}[\varepsilon|\mathbf{X}]= \mathbf{\Omega}.

Here $\beta \in \mathbb{R}^{k}$ is a vector of unknown constants (known as “regression coefficients”) that must be estimated from the data.

Suppose $\mathbf{b}$ is a candidate estimate for $\mathbf{\beta}$. Then the residual vector for $\mathbf{b}$ will be $\mathbf{y}- \mathbf{X} \mathbf{b}$. The generalized least squares method estimates $\mathbf{\beta}$ by minimizing the squared Mahalanobis length of this residual vector:

\mathbf{\hat{\beta}} = \underset{b}{\rm arg\,min}\,(\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathtt{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}).

Since the objective is a quadratic form in $\mathbf{b}$, the estimator has an explicit formula:

\mathbf{\hat{\beta}} = \left( \mathbf{X}^{\mathtt{T}} \mathbf{\Omega}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathtt{T}}\mathbf{\Omega}^{-1}\mathbf{y}.
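The explicit formula can be evaluated directly with a linear-algebra library. The following is a minimal sketch in Python with NumPy, not a reference implementation: the function name `gls`, the toy data, and the AR(1)-style covariance $\Omega_{ij} = 0.7^{|i-j|}$ are assumptions chosen only for illustration. Linear systems are solved instead of forming $\mathbf{\Omega}^{-1}$ explicitly.

```python
import numpy as np

def gls(X, y, omega):
    """Return (X' Omega^-1 X)^-1 X' Omega^-1 y for a known covariance omega."""
    # Solve Omega Z = X and Omega z = y rather than inverting Omega.
    omega_inv_X = np.linalg.solve(omega, X)
    omega_inv_y = np.linalg.solve(omega, y)
    return np.linalg.solve(X.T @ omega_inv_X, X.T @ omega_inv_y)

# Hypothetical toy data: intercept plus one predictor, correlated errors.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])

idx = np.arange(n)
omega = 0.7 ** np.abs(idx[:, None] - idx[None, :])    # assumed known Omega
eps = np.linalg.cholesky(omega) @ rng.normal(size=n)  # errors with Var = Omega
y = X @ beta_true + eps

beta_hat = gls(X, y, omega)
```

At the minimizer the weighted normal equations $\mathbf{X}^{\mathtt{T}} \mathbf{\Omega}^{-1} (\mathbf{y} - \mathbf{X}\mathbf{\hat\beta}) = 0$ hold, which gives a simple numerical check of the result.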

Properties

The GLS estimator is unbiased, consistent, efficient, and asymptotically normal:

\sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\,\text{p-lim}\left( \mathbf{X}^{\mathtt{T}} \mathbf{\Omega}^{-1} \mathbf{X} / n \right)^{-1}\right).

GLS is equivalent to applying ordinary least squares to a linearly transformed version of the data. To see this, factor $\mathbf{\Omega} = \mathbf{C} \mathbf{C}^{\mathtt{T}}$, for instance using the Cholesky decomposition. Then if we pre-multiply both sides of the equation $\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}$ by $\mathbf{C}^{-1}$, we get an equivalent linear model $\mathbf{y}^{*} = \mathbf{X}^{*} \mathbf{\beta} + \mathbf{\varepsilon}^{*}$ where $\mathbf{y}^{*} = \mathbf{C}^{-1} \mathbf{y}$, $\mathbf{X}^{*} = \mathbf{C}^{-1} \mathbf{X}$, and $\mathbf{\varepsilon}^{*} = \mathbf{C}^{-1} \mathbf{\varepsilon}$. In this model $\operatorname{Var}[\varepsilon^{*}|\mathbf{X}]= \mathbf{C}^{-1} \mathbf{\Omega} \left(\mathbf{C}^{-1} \right)^{\mathtt{T}} = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Thus we can efficiently estimate $\mathbf{\beta}$ by applying OLS to the transformed data, which requires minimizing

\left(\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b} \right)^{\mathtt{T}} (\mathbf{y}^{*} - \mathbf{X}^{*} \mathbf{b}) = (\mathbf{y}- \mathbf{X} \mathbf{b})^{\mathtt{T}}\,\mathbf{\Omega}^{-1}(\mathbf{y}- \mathbf{X} \mathbf{b}).

This has the effect of standardizing the scale of the errors and “de-correlating” them. Since OLS is applied to data with homoscedastic errors, the Gauss–Markov theorem applies, and therefore the GLS estimate is the best linear unbiased estimator for β.
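The equivalence between GLS and OLS on transformed data can be checked numerically. The sketch below, in Python with NumPy, assumes an arbitrary symmetric positive-definite $\mathbf{\Omega}$ and hypothetical toy data; the variable names are illustrative only. It whitens the data with the Cholesky factor $\mathbf{C}$, runs OLS on the result, and compares with the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# Arbitrary symmetric positive-definite covariance (an assumption for the demo).
A = rng.normal(size=(n, n))
omega = A @ A.T + n * np.eye(n)

C = np.linalg.cholesky(omega)                  # lower triangular, omega = C C^T
y = X @ np.array([0.5, -1.0]) + C @ rng.normal(size=n)

# Transformed ("whitened") data: X* = C^-1 X, y* = C^-1 y.
X_star = np.linalg.solve(C, X)
y_star = np.linalg.solve(C, y)

# OLS on the transformed data.
beta_whitened = np.linalg.lstsq(X_star, y_star, rcond=None)[0]

# Direct GLS formula on the original data.
beta_direct = np.linalg.solve(
    X.T @ np.linalg.solve(omega, X),
    X.T @ np.linalg.solve(omega, y),
)

# Covariance of the transformed errors: C^-1 Omega C^-T, which should be I.
whitened_cov = np.linalg.solve(C, np.linalg.solve(C, omega).T)
```

Up to floating-point error, `beta_whitened` and `beta_direct` agree, and `whitened_cov` is the identity matrix.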

Weighted least squares

Main article: Weighted least squares

A special case of GLS called weighted least squares (WLS) occurs when all the off-diagonal entries of Ω are 0. This situation arises when the variances of the observed values are unequal (i.e. heteroscedasticity is present), but no correlations exist among the errors of different observations. The weight for unit i is proportional to the reciprocal of the variance of the response for unit i.^[2]
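As an illustration of this special case, the following Python/NumPy sketch weights each observation by the reciprocal of its error variance and confirms that the result matches GLS with a diagonal Ω. The variance pattern (variance equal to the squared predictor) and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

sigma2 = x ** 2                      # assumed known, unequal error variances
y = X @ np.array([1.0, 3.0]) + np.sqrt(sigma2) * rng.normal(size=n)

# WLS: weight for unit i is the reciprocal of its error variance.
w = 1.0 / sigma2
beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Same estimate from the general GLS formula with Omega = diag(sigma2).
omega = np.diag(sigma2)
beta_gls = np.linalg.solve(
    X.T @ np.linalg.solve(omega, X),
    X.T @ np.linalg.solve(omega, y),
)
```

Because Ω is diagonal, the weighted form avoids building or solving with an n-by-n matrix.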

Feasible generalized least squares

If the covariance of the errors $\Omega$ is unknown, one can obtain a consistent estimate of $\Omega$, say $\widehat \Omega$.^[3] One strategy for building an implementable version of GLS is the feasible generalized least squares (FGLS) estimator. FGLS proceeds in two stages: (1) the model is estimated by OLS or another consistent (but inefficient) estimator, and the residuals are used to build a consistent estimator of the error covariance matrix (doing so often requires imposing additional constraints on the model; for example, if the errors follow a time series process, some theoretical assumptions on this process are generally needed to ensure that a consistent estimator is available); and (2) GLS is applied with the consistent estimator of the error covariance matrix in place of $\Omega$.

Whereas GLS is more efficient than OLS under heteroscedasticity or autocorrelation, this is not necessarily true for FGLS. Provided the error covariance matrix is consistently estimated, the feasible estimator is asymptotically more efficient, but in a small or medium-sized sample it can actually be less efficient than OLS. This is why some authors prefer to use OLS and base their inferences on an alternative estimator of the estimator's variance that is robust to heteroscedasticity or serial autocorrelation. For large samples, however, FGLS is preferred over OLS under heteroskedasticity or serial correlation.^[3] ^[4] A cautionary note is that the FGLS estimator is not always consistent. One case in which FGLS might be inconsistent is when there are individual-specific fixed effects.^[5]

In general the FGLS estimator has different properties from GLS. Asymptotically, and under appropriate conditions, FGLS shares the properties of GLS, but in finite samples the properties of FGLS estimators are unknown: they vary dramatically with each particular model, and as a general rule their exact distributions cannot be derived analytically. In finite samples, FGLS may even be less efficient than OLS in some cases. Thus, while GLS can be made feasible, it is not always wise to apply this method when the sample is small. A method sometimes used to improve the accuracy of the estimators in finite samples is to iterate: the residuals from FGLS are used to update the error covariance estimator, the FGLS estimate is then updated, and the same idea is applied repeatedly until the estimates change by less than some tolerance. This method does not necessarily improve the efficiency of the estimator very much if the original sample was small. A reasonable option when samples are not too large is to apply OLS, but to discard the classical variance estimator

\sigma^2 (X'X)^{-1}

(which is inconsistent in this framework) and use a HAC (Heteroskedasticity and Autocorrelation Consistent) estimator instead. For example, in the autocorrelation context we can use the Bartlett estimator (often known as the Newey–West estimator, since these authors popularized its use among econometricians in their 1987 Econometrica article), and in the heteroskedastic context we can use the Eicker–White estimator. This approach is much safer, and it is the appropriate path to take unless the sample is large, where "large" is sometimes a slippery issue (e.g. if the error distribution is asymmetric, the required sample would be much larger).
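For the heteroskedastic case, the robust alternative can be sketched as follows in Python with NumPy. This is a hedged illustration only: it uses the simplest sandwich form $(X'X)^{-1} X' \operatorname{diag}(\widehat{u}_i^2) X (X'X)^{-1}$, often labelled HC0, with no small-sample correction, and it does not cover the autocorrelation (Bartlett/Newey–West) case. The function name `white_cov` and the toy data are hypothetical.

```python
import numpy as np

def white_cov(X, resid):
    """Sandwich (HC0-style) covariance estimate for OLS coefficients."""
    xtx_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ ((resid[:, None] ** 2) * X)   # X' diag(u_hat^2) X
    return xtx_inv @ meat @ xtx_inv

# Hypothetical heteroskedastic data: error standard deviation grows with x.
rng = np.random.default_rng(3)
n = 300
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 3.0]) + x * rng.normal(size=n)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_ols

robust_se = np.sqrt(np.diag(white_cov(X, resid)))   # robust standard errors
```

If every squared residual were replaced by a common value $s^2$, the sandwich would collapse to the classical form $s^2 (X'X)^{-1}$, which shows how the two estimators are related.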

The ordinary least squares (OLS) estimator is calculated as usual by

\widehat \beta_{OLS} = (X' X)^{-1} X' y

and estimates of the residuals $\widehat{u}_j= (y-X\widehat \beta_{OLS})_j$ are constructed.

For simplicity consider the model for heteroskedastic errors. Assume that the variance-covariance matrix $\Omega$ of the error vector is diagonal, or equivalently that errors from distinct observations are uncorrelated. Then each diagonal entry may be estimated from the fitted residuals $\widehat{u}_j$, so $\widehat{\Omega}_{OLS}$ may be constructed as

\widehat{\Omega}_{OLS} = \operatorname{diag}(\widehat{\sigma}^2_1, \widehat{\sigma}^2_2, \dots , \widehat{\sigma}^2_n).

It is important to note that the raw squared residuals cannot be used directly in the previous expression; we need an estimator of the error variances. To obtain one, we can use a parametric heteroskedasticity model or a nonparametric estimator. Once this step is complete, we can proceed:

Estimate $\beta_{FGLS1}$ from $\widehat{\Omega}_{OLS}$ by weighted least squares:^[4]

\widehat \beta_{FGLS1} = (X'\widehat{\Omega}^{-1}_{OLS} X)^{-1} X' \widehat{\Omega}^{-1}_{OLS} y

The procedure can be iterated. The first iteration is given by

\widehat{u}_{FGLS1} = y - X \widehat \beta_{FGLS1}

\widehat{\Omega}_{FGLS1} = \operatorname{diag}(\widehat{\sigma}^2_{FGLS1,1}, \widehat{\sigma}^2_{FGLS1,2}, \dots ,\widehat{\sigma}^2_{FGLS1,n})

\widehat \beta_{FGLS2} = (X'\widehat{\Omega}^{-1}_{FGLS1} X)^{-1} X' \widehat{\Omega}^{-1}_{FGLS1} y

This estimation of $\widehat{\Omega}$ can be iterated to convergence.
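The iteration can be sketched in Python with NumPy as follows. The variance model is an assumption made purely for illustration: the log of the error variance is taken to be linear in the regressors, and it is fitted by regressing the log squared residuals on $X$. The function names `wls`, `fit_variances` and `iterated_fgls`, the tolerance, and the toy data are all hypothetical.

```python
import numpy as np

def wls(X, y, sigma2):
    """Weighted least squares with Omega = diag(sigma2)."""
    w = 1.0 / sigma2
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def fit_variances(X, resid):
    """Assumed parametric model: log sigma_i^2 = x_i' gamma."""
    # A small floor guards against log(0) for an exactly zero residual.
    log_u2 = np.log(np.maximum(resid ** 2, 1e-12))
    gamma = np.linalg.lstsq(X, log_u2, rcond=None)[0]
    return np.exp(X @ gamma)

def iterated_fgls(X, y, tol=1e-8, max_iter=50):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # stage 1: OLS
    for _ in range(max_iter):
        sigma2 = fit_variances(X, y - X @ beta)      # update Omega-hat
        beta_new = wls(X, y, sigma2)                 # update beta-hat
        if np.max(np.abs(beta_new - beta)) < tol:    # stop when estimates settle
            return beta_new
        beta = beta_new
    return beta

# Hypothetical data whose error variance follows the assumed model.
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
sigma2_true = np.exp(-1.0 + 0.8 * x)
y = X @ np.array([1.0, 2.0]) + np.sqrt(sigma2_true) * rng.normal(size=n)

beta_fgls = iterated_fgls(X, y)
```

The log-linear fit recovers the variances only up to a constant factor, which does not affect the weighted estimate because WLS depends on the weights only up to proportionality. If the variance model is misspecified, the resulting weights are no longer the efficient ones.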

Under regularity conditions, any of the FGLS estimators (including any of its iterations, if we iterate a finite number of times) is asymptotically distributed as

\sqrt{n}(\hat\beta_{FGLS} - \beta)\ \xrightarrow{d}\ \mathcal{N}\!\left(0,\,V\right),

where n is the sample size and

V = \text{p-lim}\left(X'\Omega^{-1}X/n\right)^{-1}.

Here p-lim denotes the limit in probability.

References

[1] Aitken, A. C. (1934). "On Least-squares and Linear Combinations of Observations". Proceedings of the Royal Society of Edinburgh 55: 42–48.
[2] Strutz, T. (2016). Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8. Chapter 3.
[3] Baltagi, B. H. (2008). Econometrics (4th ed.). New York: Springer.
[4] Greene, W. H. (2003). Econometric Analysis (5th ed.). Upper Saddle River, NJ: Prentice Hall.
[5] Hansen, Christian B. (2007). "Generalized Least Squares Inference in Panel and Multilevel Models with Serial Correlation and Fixed Effects". Journal of Econometrics 140 (2): 670–694. doi:10.1016/j.jeconom.2006.07.011.