Influential observation

In Anscombe's quartet, the two datasets on the bottom both contain influential points. All four sets are identical when examined using simple summary statistics, but vary considerably when graphed. If the influential point in either bottom dataset were removed, the fitted regression line would look very different.

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation.[1] In particular, in regression analysis an influential point is one whose deletion has a large effect on the parameter estimates.[2]

Assessment

Various methods have been proposed for measuring influence.[3][4] Assume an estimated regression \mathbf{y} = \mathbf{X} \mathbf{b} + \mathbf{e}, where \mathbf{y} is an n×1 column vector for the response variable, \mathbf{X} is the n×k design matrix of explanatory variables (including a constant), \mathbf{e} is the n×1 residual vector, and \mathbf{b} is a k×1 vector of estimates of some population parameter \mathbf{\beta} \in \mathbb{R}^{k}. Also define \mathbf{H} \equiv \mathbf{X} \left(\mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{X}^{\mathsf{T}}, the projection matrix of \mathbf{X}. Then we have the following measures of influence:

  1. \text{DFBETA}_{i} \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \frac{\left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}} e_{i}}{1 - h_{i}}, where \mathbf{b}_{(-i)} denotes the coefficients estimated with the i-th row \mathbf{x}_{i} of \mathbf{X} deleted, and h_{i} = \mathbf{x}_{i} \left( \mathbf{X}^{\mathsf{T}} \mathbf{X} \right)^{-1} \mathbf{x}_{i}^{\mathsf{T}} denotes the i-th diagonal element of \mathbf{H} (the leverage of the i-th observation). Thus DFBETA measures the difference in each parameter estimate with and without the i-th observation. There is one DFBETA for each parameter and each observation (if there are N observations and k parameters there are N·k DFBETAs).[5]
  2. DFFITS measures the difference in the fitted value of the i-th observation with and without that observation, scaled by an estimate of its standard deviation.
  3. Cook's D measures the effect of removing a data point on all the parameters combined.[2]
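The closed-form DFBETA above can be checked against explicit deletion. The following is a minimal numpy sketch on a small synthetic dataset (the data values are illustrative, not from the source); it computes DFBETA both from the formula and by refitting with the i-th row removed:

```python
import numpy as np

# Small synthetic dataset: roughly y = 1 + 2x, with one unusual last point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 15.0])     # last x is far from the rest
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 40.0])    # last y is off the trend

X = np.column_stack([np.ones_like(x), x])          # design matrix with constant
n, k = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                              # OLS estimates
e = y - X @ b                                      # residuals
h = np.diag(X @ XtX_inv @ X.T)                     # leverages h_i (diagonal of H)

def dfbeta(i):
    # Closed-form DFBETA_i = (X'X)^{-1} x_i' e_i / (1 - h_i)
    return XtX_inv @ X[i] * e[i] / (1.0 - h[i])

def dfbeta_by_deletion(i):
    # Brute force: refit with row i deleted and take the difference.
    Xd, yd = np.delete(X, i, axis=0), np.delete(y, i)
    b_minus_i = np.linalg.lstsq(Xd, yd, rcond=None)[0]
    return b - b_minus_i

for i in range(n):
    assert np.allclose(dfbeta(i), dfbeta_by_deletion(i))
```

Both routes agree for every observation, which is exactly the identity stated in the formula: deleting a row shifts the coefficient vector by the rescaled cross-product of that row with its residual.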

Outliers, leverage and influence

An outlier may be defined as a surprising or discrepant data point. Leverage measures how far an observation's explanatory-variable values lie from those of the other observations; equivalently, it is the sensitivity of a point's fitted value to its own observed value. There is one value of leverage for each data point.[6] Data points with high leverage pull the regression line toward themselves.[2] In Anscombe's quartet, only the bottom-right image has a point with high leverage.
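A minimal numpy sketch using the fourth Anscombe dataset (the bottom-right panel) makes the leverage claim concrete. All x-values equal 8 except one at 19, so that point's leverage works out to exactly 1 and the fitted line is forced to pass through it:

```python
import numpy as np

# Anscombe's fourth dataset (x4, y4): the bottom-right panel of the quartet.
x = np.array([8.0, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25,
              12.50, 5.56, 7.91, 6.89])

X = np.column_stack([np.ones_like(x), x])          # constant plus x
H = X @ np.linalg.inv(X.T @ X) @ X.T               # projection ("hat") matrix
h = np.diag(H)                                     # one leverage per data point

# h[7] (the point at x = 19) equals 1.0 exactly: the regression line must
# pass through it.  The leverages always sum to k, here 2.
print(h)
```

For simple regression, each leverage is h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)²; with x̄ = 9 here, the point at x = 19 contributes 1/11 + 100/110 = 1, while every other point has leverage 0.1.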

References

  1. Burt, James E.; Barber, Gerald M.; Rigby, David L. (2009), Elementary Statistics for Geographers, Guilford Press, p. 513, ISBN 9781572304840.
  2. Everitt, Brian (1998). The Cambridge Dictionary of Statistics. Cambridge, UK; New York: Cambridge University Press. ISBN 0-521-59346-8.
  3. Winner, Larry (March 25, 2002). "Influence Statistics, Outliers, and Collinearity Diagnostics".
  4. Belsley, David A.; Kuh, Edwin; Welsh, Roy E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley Series in Probability and Mathematical Statistics. New York: John Wiley & Sons. pp. 11–16. ISBN 0-471-05856-4.
  5. "Outliers and DFBETA" (PDF). Archived (PDF) from the original on May 11, 2013.
  6. Hurvich, Clifford. "Simple Linear Regression VI: Leverage and Influence" (PDF). NYU Stern. Archived (PDF) from the original on September 21, 2006.
