Pooled variance

In statistics, pooled variance (also known as combined, composite, or overall variance) is a method for estimating the variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same. If the populations are indexed i = 1,\ldots,k, then the pooled variance s^2_p can be estimated by the weighted average of the sample variances s^2_i:

s_p^2=\frac{\sum_{i=1}^k (n_i - 1)s_i^2}{\sum_{i=1}^k(n_i - 1)} = \frac{(n_1 - 1)s_1^2+(n_2 - 1)s_2^2+\cdots+(n_k - 1)s_k^2}{n_1+n_2+\cdots+n_k - k}


where n_i is the sample size of population i and s^2_i = \frac{1}{n_i-1} \sum_{j=1}^{n_i} \left(y_{ij} - \overline{y_i} \right)^2 is the unbiased variance of the i-th sample, with y_{ij} the j-th observation and \overline{y_i} the sample mean for population i. The use of (n_i-1) weighting factors instead of n_i comes from Bessel's correction.

Under the assumption of equal population variances, the pooled sample variance provides a higher precision estimate of variance than the individual sample variances. This higher precision can lead to increased statistical power when used in statistical tests that compare the populations, such as the t-test.

The square root of a pooled variance estimator is known as a pooled standard deviation (also known as combined, composite, or overall standard deviation).
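
As an illustration, the pooled variance and pooled standard deviation can be computed with a short Python sketch of the weighted-average formula above (the function name and the two small samples are illustrative):

from statistics import variance  # unbiased sample variance, i.e. divisor n_i - 1

def pooled_variance(samples):
    # Weight each unbiased sample variance by its degrees of freedom, n_i - 1.
    numerator = sum((len(s) - 1) * variance(s) for s in samples)
    denominator = sum(len(s) - 1 for s in samples)
    return numerator / denominator

samples = [[31, 30, 29], [42, 41, 40, 39]]  # two illustrative samples
s_p2 = pooled_variance(samples)             # pooled variance s_p^2 (1.4 here)
s_p = s_p2 ** 0.5                           # pooled standard deviation s_p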

Motivation

In statistics, data are often collected for a dependent variable, y, over a range of values of an independent variable, x. For example, fuel consumption might be studied as a function of engine speed while the engine load is held constant. If numerous repeated tests were required at each value of x in order to achieve a small variance in y, the expense of testing could become prohibitive. Reasonable estimates of variance can instead be obtained by applying the principle of pooled variance after repeating each test at a particular x only a few times.

Unbiased least-squares estimate vs. biased maximum likelihood estimate

Both

s_p^2=\frac{\sum_{i=1}^k (n_i - 1)s_i^2}{\sum_{i=1}^k (n_i - 1)}

and

s_p^2=\frac{\sum_{i=1}^k (n_i - 1)s_i^2}{\sum_{i=1}^k n_i }

are used in different contexts. The former gives an unbiased estimate of \sigma^2 when the populations share a common variance, and corresponds to the least-squares estimate. The latter, which corresponds to the maximum likelihood estimate, is biased but more efficient. Note that the quantities s_i^2 on the right-hand sides of both equations are the unbiased sample variances.
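
For concreteness, a minimal Python sketch of the two estimators, which differ only in their denominators (the function names are illustrative):

from statistics import variance  # unbiased s_i^2, divisor n_i - 1

def pooled_unbiased(samples):
    # Denominator sum of (n_i - 1): unbiased under a common population variance.
    return (sum((len(s) - 1) * variance(s) for s in samples)
            / sum(len(s) - 1 for s in samples))

def pooled_biased(samples):
    # Denominator sum of n_i: the biased, maximum-likelihood form.
    return (sum((len(s) - 1) * variance(s) for s in samples)
            / sum(len(s) for s in samples))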

Example

Consider the following set of data for y obtained at various levels of the independent variable x.

x y
1 31, 30, 29
2 42, 41, 40, 39
3 31, 28
4 23, 22, 21, 19, 18
5 21, 20, 19, 18, 17

The number of trials, mean, variance and standard deviation are presented in the next table.

x n \overline{y} S_y^2 S_y
1 3 30.0 1.0 1.0
2 4 40.5 1.67 1.29
3 2 29.5 4.5 2.12
4 5 20.6 4.3 2.07
5 5 19.0 2.5 1.58

These statistics represent the variance and standard deviation for each subset of data at the various levels of x. If we can assume that the same phenomena are generating random error at every level of x, the above data can be “pooled” to express a single estimate of variance and standard deviation. In a sense, this suggests finding a mean variance or standard deviation among the five results above. This mean variance is calculated by weighting each subset's variance by its degrees of freedom, n_i - 1, for each level of x. Thus, the pooled variance is defined by

S_P^2 = \frac{(n_1-1)S_1^2+(n_2-1)S_2^2 + \cdots + (n_k - 1)S_k^2}{(n_1 - 1) + (n_2 - 1) + \cdots +(n_k - 1)}

where n_1, n_2, \ldots, n_k are the sizes of the data subsets at each level of the variable x, and S_1^2, S_2^2, \ldots, S_k^2 are their respective variances.

The pooled variance of the data shown above is therefore:

S_P^2 = \frac{(3-1)(1.0)+(4-1)\left(\tfrac{5}{3}\right)+(2-1)(4.5)+(5-1)(4.3)+(5-1)(2.5)}{(3-1)+(4-1)+(2-1)+(5-1)+(5-1)} = \frac{38.7}{14} \approx 2.764
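
This value can be reproduced directly from the raw data with a short Python sketch using the standard library's unbiased sample variance:

from statistics import variance  # unbiased sample variance, divisor n_i - 1

groups = [
    [31, 30, 29],          # x = 1
    [42, 41, 40, 39],      # x = 2
    [31, 28],              # x = 3
    [23, 22, 21, 19, 18],  # x = 4
    [21, 20, 19, 18, 17],  # x = 5
]
numerator = sum((len(y) - 1) * variance(y) for y in groups)
denominator = sum(len(y) - 1 for y in groups)
print(round(numerator / denominator, 3))  # 2.764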
