Errors-in-variables models

In statistics, errors-in-variables models or measurement error models[1][2] are regression models that account for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

In the case when some regressors have been measured with errors, estimation based on the standard assumption leads to inconsistent estimates, meaning that the parameter estimates do not tend to the true values even in very large samples. For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias. In non-linear models the direction of the bias is likely to be more complicated.[3][4]

Motivational example

Consider a simple linear regression model of the form


    y_{t} = \alpha + \beta x_{t}^{*} + \varepsilon_t\,, \quad t=1,\ldots,T,

where x_{t}^{*} denotes the true but unobserved regressor. Instead we observe this value with an error:


    x_{t} = x_{t}^{*} + \eta_{t}\,

where the measurement error \eta_{t} is assumed to be independent of the true value x_{t}^{*}.

If the y_{t}'s are simply regressed on the x_{t}'s (see simple linear regression), then the estimator for the slope coefficient is


    \hat{\beta} = \frac{\tfrac{1}{T}\sum_{t=1}^T(x_t-\bar{x})(y_t-\bar{y})}
                     {\tfrac{1}{T}\sum_{t=1}^T(x_t-\bar{x})^2}\,,

which converges as the sample size T increases without bound:


    \hat{\beta} \xrightarrow{p}
      \frac{\operatorname{Cov}[\,x_t,y_t\,]}{\operatorname{Var}[\,x_t\,]}
      = \frac{\beta \sigma^2_{x^*}} {\sigma_{x^*}^2 + \sigma_\eta^2}
      = \frac{\beta} {1 + \sigma_\eta^2/\sigma_{x^*}^2}\,.

Variances are non-negative, so in the limit the estimate is smaller in magnitude than the true value of \beta, an effect which statisticians call attenuation or regression dilution.[5] Thus the ‘naïve’ least squares estimator is inconsistent in this setting. However, it is a consistent estimator of the parameter required for the best linear predictor of y given the observed x: in some applications this may be what is required, rather than an estimate of the ‘true’ regression coefficient \beta, although that use assumes that the variance of the errors in observing x^{*} remains fixed. This follows directly from the result quoted immediately above and from the fact that, in a simple linear regression, the coefficient relating the y_{t}'s to the actually observed x_{t}'s is


    \beta_x =      \frac{\operatorname{Cov}[\,x_t,y_t\,]}{\operatorname{Var}[\,x_t\,]} .

It is this coefficient, rather than \beta, that would be required for constructing a predictor of y based on an observed x which is subject to noise.
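
The attenuation effect is easy to reproduce in a small simulation. The sketch below (Python with numpy; all parameter values are illustrative, not taken from any source) generates data from the model above and compares the naive least squares slope with the probability limit \beta/(1+\sigma_\eta^2/\sigma_{x^*}^2).

    # Minimal simulation of attenuation bias; parameter values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    T = 100_000
    alpha, beta = 1.0, 2.0                                  # true intercept and slope
    sigma_xstar, sigma_eta, sigma_eps = 1.0, 0.5, 0.3

    x_star = rng.normal(0.0, sigma_xstar, T)                # true but unobserved regressor
    x = x_star + rng.normal(0.0, sigma_eta, T)              # observed with measurement error
    y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, T)

    beta_hat = np.cov(x, y, bias=True)[0, 1] / np.var(x)    # naive OLS slope of y on x
    beta_plim = beta / (1.0 + sigma_eta**2 / sigma_xstar**2)

    print(f"true beta       = {beta:.3f}")
    print(f"naive OLS slope = {beta_hat:.3f}")
    print(f"attenuated plim = {beta_plim:.3f}")             # 1.6 here, not 2.0

With these values the naive slope settles near 1.6 rather than the true 2.0, in line with the attenuation formula.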

It can be argued that almost all existing data sets contain errors of differing nature and magnitude, so that attenuation bias is extremely common (although in multivariable regression the direction of the bias is ambiguous).[6] Jerry Hausman sees this as an iron law of econometrics: "The magnitude of the estimate is usually smaller than expected."[7]

Specification

Usually measurement error models are described using the latent variables approach. If y is the response variable and x are observed values of the regressors, then it is assumed there exist some latent variables y^{*} and x^{*} which follow the model's “true” functional relationship g(\cdot), and such that the observed quantities are their noisy observations:

\begin{cases}
  x = x^{*} + \eta, \\
  y = y^{*} + \varepsilon, \\
  y^* = g(x^*\!,w\,|\,\theta),
  \end{cases}

where \theta denotes the model's parameters and w are those regressors which are assumed to be error-free (for example, when a linear regression contains an intercept, the regressor corresponding to the constant certainly has no "measurement error"). Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the \eta's are zero.

The variables y, x, w are all observed, meaning that the statistician possesses a data set of n statistical units \left\{ y_{i}, x_{i}, w_{i} \right\}_{i = 1, \dots, n} which follow the data generating process described above; the latent variables x^*, y^*, \varepsilon, and \eta are not observed however.
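
As a concrete illustration, the short sketch below simulates one data set from such a specification, using a hypothetical functional form g(x*, w | θ) = θ0 + θ1(x*)^2 + θ2·w; only the triples (y, x, w) would be available to the statistician.

    # Simulate data from the latent-variable specification; g and all values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000
    theta = (0.5, 1.0, -2.0)

    x_star = rng.normal(size=n)                               # latent regressor
    w = rng.uniform(-1.0, 1.0, size=n)                        # regressor observed without error
    y_star = theta[0] + theta[1] * x_star**2 + theta[2] * w   # "true" relationship g
    x = x_star + rng.normal(scale=0.4, size=n)                # eta: measurement error in x
    y = y_star + rng.normal(scale=0.2, size=n)                # epsilon: error in the response

    observed = np.column_stack([y, x, w])                     # x_star and y_star remain latent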

This specification does not encompass all existing errors-in-variables models. For example, in some of them the function g(\cdot) may be non-parametric or semi-parametric. Other approaches model the relationship between y^* and x^* as distributional rather than functional; that is, they assume that y^*, conditionally on x^*, follows a certain (usually parametric) distribution.

Terminology and assumptions

The observed variable x may be called the manifest, indicator, or proxy variable, while the unobserved x^{*} is the latent or true variable. The latent regressor can be treated either as an unknown constant, giving a functional model, or as a random variable, giving a structural model.[8] The relationship between the measurement error and the latent variable can also be modeled in different ways: under the classical errors assumption the error η is independent of the latent variable x^{*}, whereas under Berkson-type errors the error is independent of the observed value x.[9]

Linear model

Linear errors-in-variables models were studied first, probably because linear models were so widely used and they are easier to analyze than non-linear ones. Unlike standard ordinary least squares (OLS) regression, extending errors-in-variables (EiV) regression from the simple to the multivariable case is not straightforward.

Simple linear model

The simple linear errors-in-variables model was already presented in the "motivation" section:

\begin{cases}
    y_t = \alpha + \beta x_t^* + \varepsilon_t, \\
    x_t = x_t^* + \eta_t,
  \end{cases}

where all variables are scalar. Here α and β are the parameters of interest, whereas σε and ση (the standard deviations of the error terms) are nuisance parameters. The "true" regressor x* is treated as a random variable (structural model), independent of the measurement error η (classical assumption).

This model is identifiable in two cases: (1) the latent regressor x* is not normally distributed, or (2) x* is normally distributed, but neither εt nor ηt is divisible by a normal distribution.[10] That is, the parameters α, β can be consistently estimated from the data set \scriptstyle(x_t,\,y_t)_{t=1}^T without any additional information, provided one of these conditions holds.

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified. The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from an outside source; estimation methods of this kind, which assume for example a known ratio of the error variances or a known reliability ratio, are surveyed by Fuller.[11]

Newer estimation methods that do not assume knowledge of some of the parameters of the model include:

  • Method of moments — the GMM estimator based on the third- (or higher-) order joint cumulants of observable variables. The slope coefficient can be estimated from [12]
    
    \hat\beta = \frac{\hat{K}(n_1,n_2+1)}{\hat{K}(n_1+1,n_2)}, \quad n_1,n_2>0,

    where (n1,n2) are such that K(n1+1,n2) — the joint cumulant of (x,y) — is not zero. In the case when the third central moment of the latent regressor x* is non-zero, the formula reduces to

    
    \hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^T (x_t-\bar x)(y_t-\bar y)^2}
                     {\tfrac{1}{T}\sum_{t=1}^T (x_t-\bar x)^2(y_t-\bar y)}\ .
  • Instrumental variables — a regression which requires that certain additional data variables z, called instruments, be available. These variables should be uncorrelated with the errors in the equation for the dependent variable (valid), and they should also be correlated (relevant) with the true regressors x*; both this estimator and the moment-based one above are illustrated in the numerical sketch following this list. If such variables can be found, the estimator takes the form
    \hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^T (z_t-\bar z)(y_t-\bar y)}
                         {\tfrac{1}{T}\sum_{t=1}^T (z_t-\bar z)(x_t-\bar x)}\ .
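
The sketch below (Python with numpy; the instrument and all parameter values are simulated for illustration only) applies both of these estimators to data generated with a skewed latent regressor, so that its third central moment is non-zero, and contrasts them with the attenuated naive OLS slope.

    # Moment-based (third central moment) and IV estimators for the simple EIV model.
    # The DGP, the instrument z and all parameter values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(2)
    T = 200_000
    alpha, beta = 1.0, 2.0

    x_star = rng.exponential(1.0, T)              # skewed => non-zero third central moment
    z = x_star + rng.normal(0.0, 1.0, T)          # illustrative instrument: relevant and valid
    x = x_star + rng.normal(0.0, 0.5, T)          # mismeasured regressor
    y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

    dx, dy, dz = x - x.mean(), y - y.mean(), z - z.mean()

    beta_mom = np.mean(dx * dy**2) / np.mean(dx**2 * dy)   # third-moment estimator
    beta_iv  = np.mean(dz * dy) / np.mean(dz * dx)         # instrumental-variables estimator
    beta_ols = np.mean(dx * dy) / np.mean(dx**2)           # naive OLS, attenuated

    print(beta_mom, beta_iv, beta_ols)            # first two approach 2.0; OLS stays near 1.6

Both consistent estimators approach the true slope of 2.0, while the naive OLS slope converges to roughly 1.6, as predicted by the attenuation formula from the motivational example.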

Multivariable linear model

The multivariable model looks exactly like the simple linear model, only this time β, ηt, xt and x*t are k×1 vectors.

\begin{cases}
    y_t = \alpha + \beta'x_t^* + \varepsilon_t, \\
    x_t = x_t^* + \eta_t.
  \end{cases}

The general identifiability condition for this model remains an open question. It is known however that in the case when (ε,η) are independent and jointly normal, the parameter β is identified if and only if it is impossible to find a non-singular k×k block matrix [a A] (where a is a k×1 vector) such that a′x* is distributed normally and independently of A′x*.[13]

Some of the estimation methods for multivariable linear models are

  • Total least squares is an extension of Deming regression to the multivariable setting. When all the k+1 components of the vector (ε,η) have equal variances and are independent, this is equivalent to running the orthogonal regression of y on the vector x — that is, the regression which minimizes the sum of squared distances between points (yt,xt) and the k-dimensional hyperplane of "best fit".
  • The method of moments estimator[14] can be constructed based on the moment conditions E[zt·(yt − α − β'xt)] = 0, where the (5k+3)-dimensional vector of instruments zt is defined as
    \begin{align}
    & z_t = \left( 1\ z_{t1}'\ z_{t2}'\ z_{t3}'\ z_{t4}'\ z_{t5}'\ z_{t6}'\ z_{t7}' \right)', \quad \text{where} \\
    & z_{t1} = x_t \circ x_t \\
    & z_{t2} = x_t y_t \\
    & z_{t3} = y_t^2 \\
    & z_{t4} = x_t \circ x_t \circ x_t - 3\big(\operatorname{E}[x_tx_t'] \circ I_k\big)x_t \\
    & z_{t5} = x_t \circ x_t y_t - 2\big(\operatorname{E}[y_tx_t'] \circ I_k\big)x_t - y_t\big(\operatorname{E}[x_tx_t'] \circ I_k\big)\iota_k \\
    & z_{t6} = x_t y_t^2 - \operatorname{E}[y_t^2]x_t - 2y_t\operatorname{E}[x_ty_t] \\
    & z_{t7} = y_t^3 - 3y_t\operatorname{E}[y_t^2]
  \end{align}

    where \circ designates the Hadamard product of matrices, and the variables xt, yt have been preliminarily de-meaned. The authors of the method suggest using Fuller's modified IV estimator.[15]

    This method can be extended to use moments higher than the third order, if necessary, and to accommodate variables measured without error.[16]
  • The instrumental variables approach requires finding additional data variables zt that can serve as instruments for the mismeasured regressors xt. This method is the simplest from the implementation point of view; however, its disadvantage is that it requires collecting additional data, which may be costly or even impossible. When the instruments can be found, the estimator takes the standard form (illustrated in the sketch after this list)
    
    \hat\beta = \big(X'Z(Z'Z)^{-1}Z'X\big)^{-1}X'Z(Z'Z)^{-1}Z'y.
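
A minimal numpy sketch of this estimator on simulated data, with two mismeasured regressors and instruments constructed inside the simulation (all names and parameter values are hypothetical):

    # Instrumental-variables (2SLS-type) estimator for the multivariable EIV model.
    import numpy as np

    rng = np.random.default_rng(3)
    T = 50_000
    beta = np.array([1.5, -0.7])

    x_star = rng.normal(size=(T, 2))                       # true regressors
    Z = x_star + rng.normal(scale=1.0, size=(T, 2))        # instruments: relevant and valid
    X = x_star + rng.normal(scale=0.6, size=(T, 2))        # observed with measurement error
    y = 0.5 + x_star @ beta + rng.normal(scale=0.3, size=T)

    # Add a constant column (the intercept is measured without error).
    X1 = np.column_stack([np.ones(T), X])
    Z1 = np.column_stack([np.ones(T), Z])

    # beta_hat = (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
    PZ_X = Z1 @ np.linalg.solve(Z1.T @ Z1, Z1.T @ X1)      # projection of X on the instruments
    beta_hat = np.linalg.solve(PZ_X.T @ X1, PZ_X.T @ y)

    print(beta_hat)    # roughly [0.5, 1.5, -0.7]; naive OLS on X1 would attenuate the slopes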

Non-linear models

A generic non-linear measurement error model takes the form

\begin{cases}
  y_t = g(x^*_t) + \varepsilon_t, \\
  x_t = x^*_t + \eta_t.
  \end{cases}

Here the function g can be either parametric or non-parametric. When g is parametric it will be written as g(x*, β).

For a general vector-valued regressor x* the conditions for model identifiability are not known. However, in the case of a scalar x* the model is identified unless the function g is of the "log-exponential" form [17]

g(x^*) = a + b \ln\big(e^{cx^*} + d\big)

and the latent regressor x* has density


    f_{x^*}(x) = \begin{cases}
               A e^{-Be^{Cx}+CDx}(e^{Cx}+E)^{-F}, & \text{if}\ d>0 \\
               A e^{-Bx^2 + Cx} & \text{if}\ d=0
             \end{cases}

where constants A,B,C,D,E,F may depend on a,b,c,d.

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information. However, there are several techniques which make use of additional data: either instrumental variables or repeated observations.

Instrumental variables methods

  • Newey's simulated moments method[18] for parametric models — requires that there is an additional set of observed predictor variables zt, such that the true regressor can be expressed as
    x^*_t = \pi_0'z_t + \sigma_0 \zeta_t,

    where π0 and σ0 are (unknown) constant matrices, and ζt ⊥ zt. The coefficient π0 can be estimated using a standard least squares regression of x on z. The distribution of ζt is unknown; however, we can model it as belonging to a flexible parametric family, the Edgeworth series:

    f_\zeta(v;\,\gamma) = \phi(v)\,\textstyle\sum_{j=1}^J \!\gamma_j v^j

    where ϕ is the density of the standard normal distribution.

    Simulated moments can be computed using the importance sampling algorithm: first we generate several random variables {vts ~ ϕ, s = 1,…,S, t = 1,…,T} from the standard normal distribution, then we compute the moments at the t-th observation as

    m_t(\theta) = A(z_t) \frac{1}{S}\sum_{s=1}^S H(x_t,y_t,z_t,v_{ts};\theta) \sum_{j=1}^J\!\gamma_j v_{ts}^j,

    where θ = (β, σ, γ), A is just some function of the instrumental variables z, and H is a two-component vector of moments

    \begin{align}
    & H_1(x_t,y_t,z_t,v_{ts};\theta) = y_t - g(\hat\pi'z_t + \sigma v_{ts}, \beta), \\
    & H_2(x_t,y_t,z_t,v_{ts};\theta) = z_t y_t - (\hat\pi'z_t + \sigma v_{ts}) g(\hat\pi'z_t + \sigma v_{ts}, \beta)
  \end{align}
    With the moment functions mt one can apply the standard GMM technique to estimate the unknown parameter θ; a simplified numerical sketch follows.
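
The sketch below illustrates the idea in a deliberately simplified special case (everything in it is an assumption for illustration): the outcome function is taken to be g(x*, β) = exp(βx*), and ζ is assumed to be exactly standard normal, so the Edgeworth weights Σγ_j v^j drop out and only θ = (β, σ) is estimated, by GMM with instruments A(z) = (1, z, z²)′.

    # Simplified simulated-moments sketch: g(x*, beta) = exp(beta * x*), zeta ~ N(0, 1).
    # All functional forms and parameter values are hypothetical.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    T, S = 20_000, 200
    beta0, pi0, sigma0 = 0.8, 1.0, 0.5

    z = rng.normal(size=T)
    x_star = pi0 * z + sigma0 * rng.normal(size=T)
    x = x_star + rng.normal(scale=0.4, size=T)            # mismeasured regressor
    y = np.exp(beta0 * x_star) + rng.normal(scale=0.2, size=T)

    pi_hat = np.sum(z * x) / np.sum(z * z)                # step 1: least squares of x on z

    v = rng.normal(size=(T, S))                           # common simulation draws
    A = np.column_stack([np.ones(T), z, z**2])            # instruments A(z)

    def gmm_objective(theta):
        beta, sigma = theta
        x_sim = pi_hat * z[:, None] + sigma * v           # simulated draws of x*, shape (T, S)
        resid = y - np.exp(beta * x_sim).mean(axis=1)     # y_t minus simulated E[g(x*) | z_t]
        gbar = A.T @ resid / T                            # averaged moment conditions
        return gbar @ gbar                                # identity-weighted GMM criterion

    res = minimize(gmm_objective, x0=[0.5, 0.3], method="Nelder-Mead")
    print(res.x)                                          # should be near (0.8, 0.5)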

Repeated observations

In this approach two (or more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors; however, those errors are required to be independent:

\begin{cases}
    x_{1t} = x^*_t + \eta_{1t}, \\
    x_{2t} = x^*_t + \eta_{2t},
  \end{cases}

where x* ⊥ η1 ⊥ η2. The variables η1, η2 need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved). With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski's deconvolution technique.[19]
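
The identity behind this technique can be checked numerically. The sketch below (Python with numpy; the distributions are illustrative, and the first measurement error is taken to have mean zero, as the identity requires) recovers the characteristic function of x* from the empirical joint characteristic function of the two noisy measurements and compares it with the truth.

    # Numerical sketch of Kotlarski's identity; distributions and values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(5)
    T = 100_000
    x_star = rng.gamma(shape=2.0, scale=1.0, size=T)       # latent variable
    x1 = x_star + rng.normal(scale=0.5, size=T)            # first measurement (E[eta1] = 0)
    x2 = x_star + rng.normal(scale=0.5, size=T)            # second measurement

    u_grid = np.linspace(0.0, 2.0, 201)                    # evaluate phi_{x*} on [0, 2]

    # Ratio E[i*x1*exp(i*u*x2)] / E[exp(i*u*x2)] equals d/du log phi_{x*}(u).
    num = np.array([np.mean(1j * x1 * np.exp(1j * u * x2)) for u in u_grid])
    den = np.array([np.mean(np.exp(1j * u * x2)) for u in u_grid])
    ratio = num / den

    # Cumulative trapezoidal integration over [0, u], then exponentiate.
    steps = 0.5 * (ratio[1:] + ratio[:-1]) * np.diff(u_grid)
    log_phi = np.concatenate([[0.0], np.cumsum(steps)])
    phi_hat = np.exp(log_phi)

    phi_true = (1.0 - 1j * u_grid) ** (-2.0)               # cf of the Gamma(2, 1) latent variable
    print(np.max(np.abs(phi_hat - phi_true)))              # small for a sample of this size

An estimate of the density of x* would then follow by applying an inverse Fourier transform to \hat\varphi_{x^*}, as in the method described below.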

  • Li's conditional density method for parametric models.[20] The regression equation can be written in terms of the observable variables as
    
    \operatorname{E}[\,y_t|x_t\,] = \int g(x^*_t,\beta) f_{x^*|x}(x^*_t|x_t)dx^*_t ,

    where the integral could be computed if we knew the conditional density function ƒx*|x. If this function were known or could be estimated, the problem would turn into a standard non-linear regression, which can be estimated for example using the NLLS method.
    Assuming for simplicity that η1, η2 are identically distributed, this conditional density can be computed as

    
    \hat f_{x^*|x}(x^*|x) = \frac{\hat f_{x^*}(x^*)}{\hat f_{x}(x)} \prod_{j=1}^k \hat f_{\eta_{j}}\big( x_{j} - x^*_{j} \big),

    where with slight abuse of notation xj denotes the j-th component of a vector.
    All densities in this formula can be estimated using inversion of the empirical characteristic functions. In particular,

    \begin{align}
  & \hat \varphi_{\eta_j}(v) = \frac{\hat\varphi_{x_j}(v,0)}{\hat\varphi_{x^*_j}(v)}, \quad \text{where }
    \hat\varphi_{x_j}(v_1,v_2) = \frac{1}{T}\sum_{t=1}^T e^{iv_1x_{1tj}+iv_2x_{2tj}}, \\
    \hat\varphi_{x^*_j}(v) = \exp \int_0^v \frac{\partial\hat\varphi_{x_j}(0,v_2)/\partial v_1}{\hat\varphi_{x_j}(0,v_2)}dv_2, \\
  & \hat \varphi_x(u) = \frac{1}{2T}\sum_{t=1}^T \Big( e^{iu'x_{1t}} + e^{iu'x_{2t}} \Big), \quad
    \hat \varphi_{x^*}(u) = \frac{\hat\varphi_x(u)}{\prod_{j=1}^k \hat\varphi_{\eta_j}(u_j)}.
  \end{align}

    In order to invert these characteristic functions one has to apply the inverse Fourier transform, with a trimming parameter C needed to ensure numerical stability. For example:

    \hat f_x(x) = \frac{1}{(2\pi)^k} \int_{-C}^{C}\cdots\int_{-C}^C e^{-iu'x} \hat\varphi_x(u) du.
  • Schennach's estimator for a parametric linear-in-parameters nonlinear-in-variables model.[21] This is a model of the form
    \begin{cases}
    y_t = \textstyle \sum_{j=1}^k \beta_j g_j(x^*_t) + \sum_{j=1}^\ell \beta_{k+j}w_{jt} + \varepsilon_t, \\
    x_{1t} = x^*_t + \eta_{1t}, \\
    x_{2t} = x^*_t + \eta_{2t},
  \end{cases}

    where wt represents variables measured without errors. The regressor x* here is scalar (the method can be extended to the case of vector x* as well).
    Were it not for the measurement errors, this would be a standard linear model with the estimator

    
    \hat{\beta} = \big(\hat{\operatorname{E}}[\,\xi_t\xi_t'\,]\big)^{-1} \hat{\operatorname{E}}[\,\xi_t y_t\,],

    where

     \xi_t'= (g_1(x^*_t), \cdots ,g_k(x^*_t), w_{1,t}, \cdots , w_{l,t}).

    It turns out that all the expected values in this formula are estimable using the same deconvolution trick. In particular, for a generic observable wt (which could be 1, w1t, …, wℓt, or yt) and some function h (which could represent any gj or gi·gj) we have

    
    \operatorname{E}[\,w_th(x^*_t)\,] = \frac{1}{2\pi} \int_{-\infty}^\infty \varphi_h(-u)\psi_w(u)du,

    where φh is the Fourier transform of h(x*), but using the same convention as for the characteristic functions,

     \varphi_h(u)=\int e^{iux}h(x)dx,

    and

    
    \psi_w(u) = \operatorname{E}[\,w_te^{iux^*}\,]
              = \frac{\operatorname{E}[w_te^{iux_{1t}}]}{\operatorname{E}[e^{iux_{1t}}]}
                \exp \int_0^u i\frac{\operatorname{E}[x_{2t}e^{ivx_{1t}}]}{\operatorname{E}[e^{ivx_{1t}}]}dv
    The resulting estimator \scriptstyle\hat\beta is consistent and asymptotically normal.
  • Schennach's estimator for a nonparametric model.[22] The standard Nadaraya–Watson estimator for a nonparametric model takes the form
    
    \hat{g}(x) = \frac{\hat{\operatorname{E}}[\,y_tK_h(x^*_t - x)\,]}{\hat{\operatorname{E}}[\,K_h(x^*_t - x)\,]},
    for a suitable choice of the kernel K and the bandwidth h. Both expectations here can be estimated using the same technique as in the previous method; for reference, a sketch of the error-free version of this estimator follows this list.
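
For reference, the sketch below implements the standard (error-free) Nadaraya–Watson estimator with a Gaussian kernel on simulated data; in Schennach's estimator the two sample averages in this ratio are replaced by deconvolution-based estimates built from the repeated, mismeasured observations. The regression function and all values are illustrative.

    # Standard Nadaraya-Watson regression (error-free data, for reference only).
    import numpy as np

    rng = np.random.default_rng(6)
    T, h = 2_000, 0.2
    x_obs = rng.uniform(-2.0, 2.0, size=T)
    y = np.sin(x_obs) + rng.normal(scale=0.3, size=T)      # illustrative g(x) = sin(x)

    def nadaraya_watson(x0, x, y, h):
        k = np.exp(-0.5 * ((x - x0) / h) ** 2)             # Gaussian kernel (constant cancels)
        return np.sum(k * y) / np.sum(k)

    grid = np.linspace(-1.5, 1.5, 7)
    print([round(nadaraya_watson(x0, x_obs, y, h), 2) for x0 in grid])   # tracks sin(x0)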

References

  1. Carroll, Raymond J.; Ruppert, David; Stefanski, Leonard A.; Crainiceanu, Ciprian (2006). Measurement Error in Nonlinear Models: A Modern Perspective (Second ed.). ISBN 1-58488-633-1.
  2. Koul, Hira; Song, Weixing (2008). "Regression model checking with Berkson measurement errors". Journal of Statistical Planning and Inference 138 (6): 1615–1628. doi:10.1016/j.jspi.2007.05.048.
  3. Griliches, Zvi; Ringstad, Vidar (1970). "Errors-in-the-variables bias in nonlinear contexts". Econometrica 38 (2): 368–370. doi:10.2307/1913020. JSTOR 1913020.
  4. Chesher, Andrew (1991). "The effect of measurement error". Biometrika 78 (3): 451–462. doi:10.1093/biomet/78.3.451. JSTOR 2337015.
  5. Greene, William H. (2003). Econometric Analysis (5th ed.). New Jersey: Prentice Hall. Chapter 5.6.1. ISBN 0-13-066189-9.
  6. Wansbeek, T.; Meijer, E. (2000). "Measurement Error and Latent Variables in Econometrics". In Baltagi, B. H. A Companion to Theoretical Econometrics. Blackwell. pp. 162–179. doi:10.1111/b.9781405106764.2003.00013.x.
  7. Hausman, Jerry A. (2001). "Mismeasured variables in econometric analysis: problems from the right and problems from the left". Journal of Economic Perspectives 15 (4): 57–67 [p. 58]. doi:10.1257/jep.15.4.57. JSTOR 2696516.
  8. Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 2. ISBN 0-471-86187-1.
  9. Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 7–8.
  10. Reiersøl, Olav (1950). "Identifiability of a linear relation between variables which are subject to error". Econometrica 18 (4): 375–389 [p. 383]. doi:10.2307/1907835. JSTOR 1907835. A somewhat more restrictive result was established earlier by Geary, R. C. (1942). "Inherent relations between random variables". Proceedings of the Royal Irish Academy 47: 63–76. JSTOR 20488436. He showed that under the additional assumption that (ε, η) are jointly normal, the model is not identified if and only if x*s are normal.
  11. Fuller, Wayne A. (1987). "A Single Explanatory Variable". Measurement Error Models. John Wiley & Sons. pp. 1–99. ISBN 0-471-86187-1.
  12. Pal, Manoranjan (1980). "Consistent moment estimators of regression coefficients in the presence of errors in variables". Journal of Econometrics 14 (3): 349–364 [pp. 360–1]. doi:10.1016/0304-4076(80)90032-9.
  13. Bekker, Paul A. (1986). "Comment on identification in the linear errors in variables model". Econometrica 54 (1): 215–217. doi:10.2307/1914166. JSTOR 1914166. An earlier proof by Willassen contained errors, see Willassen, Y. (1979). "Extension of some results by Reiersøl to multivariate models". Scand. J. Statistics 6 (2): 89–91. JSTOR 4615738.
  14. Dagenais, Marcel G.; Dagenais, Denyse L. (1997). "Higher moment estimators for linear regression models with errors in the variables". Journal of Econometrics 76: 193–221. doi:10.1016/0304-4076(95)01789-5. In the earlier paper Pal (1980) considered a simpler case when all components in vector (ε, η) are independent and symmetrically distributed.
  15. Fuller, Wayne A. (1987). Measurement Error Models. John Wiley & Sons. p. 184. ISBN 0-471-86187-1.
  16. Erickson, Timothy; Whited, Toni M. (2002). "Two-step GMM estimation of the errors-in-variables model using high-order moments". Econometric Theory 18 (3): 776–799. doi:10.1017/s0266466602183101. JSTOR 3533649.
  17. Schennach, S.; Hu, Y.; Lewbel, A. (2007). "Nonparametric identification of the classical errors-in-variables model without side information". Working paper.
  18. Newey, Whitney K. (2001). "Flexible simulated moment estimation of nonlinear errors-in-variables model". Review of Economics and Statistics 83 (4): 616–627. doi:10.1162/003465301753237704. JSTOR 3211757.
  19. Li, Tong; Vuong, Quang (1998). "Nonparametric estimation of the measurement error model using multiple indicators". Journal of Multivariate Analysis 65 (2): 139–165. doi:10.1006/jmva.1998.1741.
  20. Li, Tong (2002). "Robust and consistent estimation of nonlinear errors-in-variables models". Journal of Econometrics 110 (1): 1–26. doi:10.1016/S0304-4076(02)00120-3.
  21. Schennach, Susanne M. (2004). "Estimation of nonlinear models with measurement error". Econometrica 72 (1): 33–75. doi:10.1111/j.1468-0262.2004.00477.x. JSTOR 3598849.
  22. Schennach, Susanne M. (2004). "Nonparametric regression in the presence of measurement error". Econometric Theory 20 (6): 1046–1093. doi:10.1017/S0266466604206028.
