Ratio estimator

The ratio estimator is a statistical parameter and is defined to be the ratio of means of two variates. Ratio estimates are biased, and corrections must be made when they are used in experimental or survey work. Because ratio estimates are asymmetrically distributed, symmetrical tests such as the t test should not be used to generate confidence intervals.

The bias is of the order O(1/n) (see big O notation) so as the sample size (n) increases, the bias will asymptotically approach 0. Therefore, the estimator is approximately unbiased for large sample sizes.

Definition

Assume there are two characteristics – x and y – that can be observed for each sampled element in the data set. The ratio R is

R = \bar{\mu}_y / \bar{\mu}_x \,

The ratio estimate of a value of the y variate (θ_y) is

\theta_y = R \theta_x \,

where θ_x is the corresponding value of the x variate. θ_y is known to be asymptotically normally distributed.^[1]
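The definition can be illustrated with a minimal sketch in Python. The function name sample_ratio and the data values are hypothetical and chosen only for illustration; the sample means stand in for the population means in the formula for R.

```python
from statistics import mean

def sample_ratio(x, y):
    """Ratio of the sample mean of y to the sample mean of x (paired data)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return mean(y) / mean(x)

# Hypothetical paired observations, for illustration only.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 5.0]

r = sample_ratio(x, y)   # mean(y) / mean(x)
theta_x = 10.0           # an assumed known value of the x variate
theta_y = r * theta_x    # ratio estimate of the corresponding y value
print(r, theta_y)
```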

Statistical properties

Correction of the mean's bias

The sample ratio r = m_y / m_x, where m_x and m_y are the sample means of the x and y variates, is a biased estimate of R. The available correction methods differ in their efficiency depending on the distributions of the x and y variates, which makes it difficult to recommend a single best method. Because r is biased, a corrected version should be used in all subsequent calculations.

A correction of the bias accurate to the first order is^[3]

r_\mathrm{ corr } = r - \frac{ s_{ [ y / x ] x } }{ m_x }

where m_x is the mean of the variate x and s_ab denotes the covariance between the variates a and b. This notation is used throughout the rest of the article.

Another estimator based on the Taylor expansion is

r_\mathrm{ corr } = r - \frac{ 1 }{ n } \left( 1 - \frac{ n - 1 }{ N - 1 } \right) \frac{ r s_x^2 - \rho s_x s_y }{ m_x^2 }

where n is the sample size, N is the population size, m_x is the mean of the variate x, s_x² and s_y² are the sample variances of the x and y variates respectively and ρ is the sample correlation between the x and y variates.

A computationally simpler but slightly less accurate version of this estimator is

r_\mathrm{ corr } = r - \frac{ N - n }{ N } \frac{ ( r s_x^2 - \rho s_x s_y ) }{ n m_x^2 }

where the symbols are as defined for the previous estimator. The two versions differ only in the finite population factor, which is ( N - n ) / ( N - 1 ) in the first and ( N - n ) / N in the second. For a large N the difference is negligible.
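A minimal sketch of the two first-order corrections follows. It assumes that the estimated bias is subtracted from r in both versions, as in the simpler formula, and that ρ s_x s_y is computed as the sample covariance s_xy. The function names and the data are hypothetical.

```python
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance with the n - 1 divisor."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def taylor_corrected_ratios(x, y, N):
    """Return (r, correction with N - 1 factor, correction with N factor)."""
    n = len(x)
    m_x, m_y = mean(x), mean(y)
    r = m_y / m_x
    # rho * s_x * s_y equals the sample covariance s_xy.
    bias_core = (r * variance(x) - covariance(x, y)) / m_x ** 2
    fpc_exact = 1 - (n - 1) / (N - 1)   # equals (N - n) / (N - 1)
    fpc_simple = (N - n) / N
    return r, r - fpc_exact * bias_core / n, r - fpc_simple * bias_core / n

# Hypothetical sample of n = 4 pairs from an assumed population of N = 100.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 5.0]
r, r_exact, r_simple = taylor_corrected_ratios(x, y, N=100)
print(r, r_exact, r_simple)
```

With these numbers the two corrected values differ only in the fifth decimal place, which illustrates how small the effect of the N - 1 factor is even for a modest population size.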

A second-order correction is^[4]

r_\mathrm{ corr } = r \left[ 1 + \frac{ 1 }{ n } \left( \frac{ 1 }{ m_x } - \frac{ s_{ xy } }{ m_x m_y } \right) + \frac{ 1 }{ n^2 } \left( \frac{ 2 }{ m_x^2 } - \frac{ s_{ xy } }{ m_x m_y } \left[ 2 + \frac{ 3 }{ m_x } \right] + \frac{ s_{ x^2 y } }{ m_x^2 m_y } \right) \right]

Other methods of bias correction have also been proposed. To simplify the notation the following variables will be used

\theta = \frac{ 1 }{ n } - \frac{ 1 }{ N }

c_x^2 = \frac{ s_x^2 }{ m_x^2 }

c_{ xy } = \frac{ s_{xy} }{ m_x m_y }

Pascual's estimator:^[5]

r_\mathrm{ corr } = r + \frac{ N - 1 }{ N } \frac{ m_y - r m_x }{ n - 1 }

Beale's estimator:^[6]

r_\mathrm{ corr } = r \frac{ 1 + \theta c_{ xy } }{ 1 + \theta c_x^2 }

Tin's estimator:^[7]

r_\mathrm{ corr } = r \left( 1 + \theta \left( c_{ xy } - c_x^2 \right) \right)

Sahoo's estimator:^[8]

r_\mathrm{ corr } = \frac{ r }{ 1 + \theta ( c_x^2 - c_{ xy } ) }

Sahoo has also proposed a number of additional estimators:^[9]

r_\mathrm{ corr } = r ( 1 + \theta c_{ xy } ) ( 1 - \theta c_x^2 )

r_\mathrm{ corr } = \frac{ r ( 1 - \theta c_x^2 ) }{ 1 - \theta c_{ xy } }

r_\mathrm{ corr } = \frac{ r }{ ( 1 + \theta c_{ xy } )( 1 + \theta c_x^2 ) }
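The estimators of Beale, Tin and Sahoo (first form) use the same three quantities θ, c_x² and c_xy, so they are easy to compare side by side. The sketch below is illustrative only; the function name ratio_corrections, the data and the population size are assumptions.

```python
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance with the n - 1 divisor."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def ratio_corrections(x, y, N):
    """Naive ratio plus the Beale, Tin and Sahoo corrected ratios."""
    n = len(x)
    m_x, m_y = mean(x), mean(y)
    r = m_y / m_x
    theta = 1 / n - 1 / N
    c_x2 = variance(x) / m_x ** 2
    c_xy = covariance(x, y) / (m_x * m_y)
    return {
        "naive": r,
        "beale": r * (1 + theta * c_xy) / (1 + theta * c_x2),
        "tin": r * (1 + theta * (c_xy - c_x2)),
        "sahoo": r / (1 + theta * (c_x2 - c_xy)),
    }

# Hypothetical sample of n = 4 pairs from an assumed population of N = 100.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 5.0]
estimates = ratio_corrections(x, y, N=100)
for name, value in estimates.items():
    print(name, round(value, 6))
```

The three corrected values agree to first order in θ, which is why they are close to one another whenever θ c_x² and θ c_xy are small.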

If m_x and m_y are both greater than 10, then the following approximation is correct to order O( n⁻³ ).^[4]

r_\mathrm{corr} = r \left[ 1 - \frac{ 2 }{ n^2 m_x } \left( \frac{ 1 }{ m_x } - \frac{ s_{ xy } }{ m_x m_y } \right) \left( 1 + \frac{ 13 }{ 2n } + \frac{ 8 }{ n m_x } \right) \right]

An asymptotically correct estimator is^[10]

r_\mathrm{ corr } = r + c_x^2 \frac{ m_y }{ m_x } - \frac{ s_{ xy } }{ m_x^2 }

Jackknife estimation

A jackknife estimate of the ratio is less biased than the naive form. A jackknife estimator of the ratio is

r_\mathrm{corr} = nr - \frac{ n - 1 }{ n } \sum_{ i = 1 }^n r_i

where n is the size of the sample and the r_i are estimated with the omission of one pair of variates at a time.^[11]
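The leave-one-out calculation can be sketched as follows. The function name jackknife_ratio and the data are hypothetical; each r_i is the ratio of sums recomputed with the i-th pair removed, which equals the ratio of the means of the remaining pairs.

```python
def jackknife_ratio(x, y):
    """Jackknife ratio: n * r - (n - 1) / n * sum of leave-one-out ratios."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    r = sum_y / sum_x
    # r_i: ratio recomputed with the i-th (x, y) pair left out.
    leave_one_out = [(sum_y - yi) / (sum_x - xi) for xi, yi in zip(x, y)]
    return n * r - (n - 1) / n * sum(leave_one_out)

# Hypothetical paired observations, for illustration only.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 5.0]
r_jack = jackknife_ratio(x, y)
print(r_jack)
```

When y is exactly proportional to x every leave-one-out ratio equals r, and the jackknife value reduces to r itself.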

An alternative method is to divide the sample into g groups each of size p with n = pg.^[12] Let r_i be the estimate of the i-th group. Then the estimator

r_\mathrm{corr} = gr - \frac{ g - 1 }{ g } \sum_{ i = 1 }^g r_i

has a bias of at most O( n⁻² ).

Other estimators based on the division of the sample into g groups are:^[13]

r_\mathrm{ corr } = \frac{ g }{ g + 1 } r - \frac{ 1 }{ g ( g - 1 ) } \sum_{ i = 1 }^g r_i

r_\mathrm{ corr } = \bar{ r } +\frac{ n }{ n - 1 } \frac{ m_y - \bar{ r } m_x }{ m_x }

r_\mathrm{ corr } = \bar{ r_g } + \frac{ g ( m_y - \bar{ r_g } m_x ) }{ m_x }

where r̄ is the mean of the ratios r_g of the g groups and

\bar{ r_g } = \sum \frac{ r_i^{'} }{ g }

where r_i′ is the value of the sample ratio with the i-th group omitted.

Other methods of estimation

Other methods of estimating the ratio include maximum likelihood and bootstrapping.^[11]

Estimate of total

The estimated total of the y variate ( τ_y ) is

\tau_y = r \tau_x

where τ_x is the total of the x variate.

Variance estimates

The variance of the sample ratio is approximately:

\operatorname{ var }( r ) = \frac{ 1 }{ s_x^2 + m_x^2 } \left[ ( s_y^2 - s_{ x^2 [ y^2 / x^2 ] } ) - ( s_{ x [ y / x ] } )^2 +2 m_y s_{ x[ y / x ] } - \frac{ s_x^2 }{ m_x^2 }( m_y - s_{ x[ y / x ] }^2) \right]

where s_x² and s_y² are the variances of the x and y variates respectively, m_x and m_y are the means of the x and y variates respectively and s_ab is the covariance of a and b.

Although the approximate variance estimator of the ratio given below is biased, if the sample size is large, the bias in this estimator is negligible.

\operatorname{ var }( r ) = \frac{ N - n }{ N } \frac{ 1 }{ n m_x^2 } \frac{ \sum_{ i = 1 }^n ( y_i - r x_i )^2 }{ n - 1 }

where N is the population size, n is the sample size and m_x is the mean of the x variate.

Another estimator of the variance based on the Taylor expansion is

\operatorname{ var }( r ) = \frac{ 1 }{ n } ( 1 - \frac{ n - 1 }{ N - 1 } ) \frac{ r^2 s_x^2 + s_y^2 - 2 r \rho s_x s_y }{ m_x^2 }

where n is the sample size, N is the population size and ρ is the correlation coefficient between the x and y variates.
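The residual-based and Taylor-expansion variance estimators can be compared numerically. The sketch below assumes the residual form uses the sum of squared residuals ( y_i - r x_i )² divided by n - 1 together with the factor 1 / ( n m_x² ), which is the usual survey-sampling form (the unsquared residuals sum to zero by construction). Function names and data are hypothetical.

```python
from statistics import mean, variance

def covariance(a, b):
    """Sample covariance with the n - 1 divisor."""
    m_a, m_b = mean(a), mean(b)
    return sum((ai - m_a) * (bi - m_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def var_ratio_residual(x, y, N):
    """Variance of r from squared residuals y_i - r * x_i."""
    n = len(x)
    m_x = mean(x)
    r = mean(y) / m_x
    ss_resid = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y))
    return (N - n) / N * ss_resid / (n * (n - 1) * m_x ** 2)

def var_ratio_taylor(x, y, N):
    """Variance of r from the Taylor expansion; rho * s_x * s_y is s_xy."""
    n = len(x)
    m_x = mean(x)
    r = mean(y) / m_x
    spread = r ** 2 * variance(x) + variance(y) - 2 * r * covariance(x, y)
    return (1 - (n - 1) / (N - 1)) * spread / (n * m_x ** 2)

# Hypothetical sample of n = 4 pairs from an assumed population of N = 100.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.0, 2.0, 5.0]
v_resid = var_ratio_residual(x, y, N=100)
v_taylor = var_ratio_taylor(x, y, N=100)
print(v_resid, v_taylor)
```

Because the residuals have mean zero, the sum of squared residuals divided by n - 1 equals s_y² + r² s_x² - 2 r s_xy, so the two estimators differ only in the finite population factor.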

An estimate accurate to O( n⁻² ) is^[10]

\operatorname{ var }( r ) = \frac{ 1 }{ n }\left[ \frac{ s_y^2 }{ m_x^2 } + \frac{ m_y^2 s_x^2 }{ m_x^4 } - \frac{ 2m_y s_{ xy } }{ m_x^3 } \right]
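Substituting r = m_y / m_x shows that this expression has the same form as the Taylor-expansion estimator given earlier, without the finite population factor. The short derivation is a sketch that assumes s_xy = ρ s_x s_y, as elsewhere in this article.

```latex
\frac{ 1 }{ n } \left[ \frac{ s_y^2 }{ m_x^2 } + \frac{ m_y^2 s_x^2 }{ m_x^4 } - \frac{ 2 m_y s_{ xy } }{ m_x^3 } \right]
= \frac{ 1 }{ n m_x^2 } \left[ s_y^2 + \left( \frac{ m_y }{ m_x } \right)^2 s_x^2 - 2 \frac{ m_y }{ m_x } s_{ xy } \right]
= \frac{ 1 }{ n } \frac{ s_y^2 + r^2 s_x^2 - 2 r \rho s_x s_y }{ m_x^2 }
```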

An estimator accurate to O( n⁻³ ) is^[4]

\operatorname{ var }( r ) = r^2 \left[ \frac{ 1 }{ n } \left( \frac{ 1 }{ m_x } + \frac{ 1 }{ m_y } - \frac{ 2 s_{ xy } }{ m_x m_y } \right) + \frac{ 1 }{ n^2 } \left( \frac{ 6 }{ m_x^2 } + \frac{ 3 }{ m_x m_y } + s_{ xy }\left[ \frac{ 4 }{ m_y^2 } - \frac{ 8 }{ m_x m_y } - \frac{ 16 }{ m_x^2 m_y } + \frac{ 5 s_{ xy } }{ m_x^2 m_y^2 } \right] + \frac{ 4 s_{ x^2 y } }{ m_x^2 m_y } - \frac{ 2 s_{ x y^2 }}{ m_x m_y^2 } \right) \right]

A jackknife estimator of the variance is

\operatorname{ var }( r ) = \frac{ 1 }{ n ( n - 1 ) } \sum_{ i = 1 }^n ( r_i - r_J )^2

where r_i is the ratio with the i-th pair of variates omitted and r_J is the jackknife estimate of the ratio.^[11]

Variance of total

The variance of the estimated total is

\operatorname{ var }( \tau_y ) = \tau_x^2 \operatorname{ var }( r )

Variance of mean

The variance of the estimated mean of the y variate is

\operatorname{ var }( \bar{ y } ) = m_x^2 \operatorname{ var }( r ) = \frac{ N - n }{ N } \frac{ \sum_{ i = 1 }^n ( y_i - r x_i )^2 }{ n ( n - 1 ) } = \frac{ N - n }{ N } \frac{ ( s_y^2 + r^2 s_x^2 - 2 r \rho s_x s_y ) }{ n }

where m_x is the mean of the x variate, s_x² and s_y² are the sample variances of the x and y variates respectively and ρ is the sample correlation between the x and y variates.

Skewness

The skewness and the kurtosis of the ratio depend on the distributions of the x and y variates. Estimates of these parameters have been made for normally distributed x and y variates, but no expressions have yet been derived for other distributions. In general, ratio variables have been found to be skewed to the right and leptokurtic, and their nonnormality increases as the coefficient of variation of the denominator increases.

For normally distributed x and y variates the skewness of the ratio is approximately^[7]

\gamma = \left( \frac{ m_y \omega }{ \sqrt{ n m_x m_y \omega^2 + m_x^2 m_y } } \right)\left( 6 + \frac{ 1 }{ n m_x } \left[ 44 + \frac{ 1 }{ 1 + \omega^2 m_y / m_x } \right] \right)

where

\omega = 1 - m_x \operatorname{cov}( x, y ) \,

Effect on confidence intervals

Because the ratio estimate is generally skewed, confidence intervals created with the variance and symmetrical tests such as the t test are incorrect.^[11] These intervals tend to overestimate the width of the left side of the confidence interval and underestimate the width of the right side.

If the ratio estimator is unimodal (which is frequently the case) then a conservative estimate of the 95% confidence intervals can be made with the Vysochanskiï–Petunin inequality.
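For a unimodal distribution with finite variance the inequality states that P(|X - μ| ≥ kσ) ≤ 4 / (9k²) for k > √(8/3), so a 95% bound corresponds to k ≈ 2.98 standard deviations. The sketch below applies this to an estimated ratio and variance; the function name vp_interval and the input values are hypothetical, and plugging in estimates for the true mean and variance is itself an approximation.

```python
import math

def vp_interval(estimate, variance, coverage=0.95):
    """Conservative interval for a unimodal estimator with finite variance."""
    alpha = 1 - coverage
    # Solve 4 / (9 k^2) = alpha for k; the bound holds for k > sqrt(8 / 3).
    k = math.sqrt(4 / (9 * alpha))
    if k <= math.sqrt(8 / 3):
        raise ValueError("coverage too low for this form of the inequality")
    half_width = k * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

# Hypothetical ratio estimate and variance estimate.
lower, upper = vp_interval(0.5, 0.0064)
print(lower, upper)
```

The resulting interval is symmetric and wider than the usual normal-theory interval (k ≈ 1.96), which is the price of making no assumption beyond unimodality.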

Alternative methods of bias reduction

An alternative method of reducing or eliminating the bias in the ratio estimator is to alter the method of sampling. The variance of the ratio using these methods differs from the estimates given previously.

Lahiri's method

Lahiri introduced the first of these sampling schemes in 1951.^[14]

Choose a number M ≥ max( x₁, ..., x_N ), where N is the population size. Choose one of the population elements (x_i) at random, and choose u at random from the uniform distribution U(0, 1). If uM ≤ x_i, then x_i is retained in the sample; if not, it is rejected and a new element is chosen. Repeat this process N times. The same process is carried out with the y variate. The ratio of the sum of the y variates to the sum of the x variates chosen in this fashion is then an unbiased estimate of the ratio.

In symbols we have

r = \frac { \sum y_i }{ \sum x_i }

where x_i and y_i are chosen according to the scheme described above.
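The accept/reject step of the scheme can be sketched as follows. This is only an illustration of selection with probability proportional to x by rejection sampling: the function name lahiri_select, the data and the number of draws are assumptions, the y value paired with each accepted x is used in the ratio, and no claim about unbiasedness is made for this simplified version.

```python
import random

def lahiri_select(x, n_draws, rng):
    """Select indices by rejection: keep index i when u * M <= x[i]."""
    M = max(x)                        # any M >= max(x) would do
    selected = []
    while len(selected) < n_draws:
        i = rng.randrange(len(x))     # candidate element, chosen uniformly
        u = rng.random()              # u drawn from U(0, 1)
        if u * M <= x[i]:
            selected.append(i)        # accepted; otherwise draw again
    return selected

# Hypothetical population of paired values, for illustration only.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 2.0, 2.0, 5.0, 4.0]
indices = lahiri_select(x, n_draws=6, rng=random.Random(0))
r = sum(y[i] for i in indices) / sum(x[i] for i in indices)
print(indices, r)
```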

Midzuno-Sen's method

In 1952 Midzuno and Sen independently described a sampling scheme that provides an unbiased estimator of the ratio.^[15]^[16]

The first unit of the sample is chosen with probability proportional to the size of its x variate. The remaining n - 1 units are chosen at random without replacement from the remaining N - 1 members of the population. The probability of selecting a given sample under this scheme is

P = \frac{ \sum x_i } { { N - 1 \choose n - 1 } X }

where X is the sum of the N x variates and the x_i are the n members of the sample.

The ratio estimator given by this scheme is unbiased.
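A minimal sketch of the scheme and of the selection probability P is given below. The function names and the integer-valued population are hypothetical; exact fractions are used so that the probabilities of all possible samples can be checked to sum to one.

```python
import math
import random
from fractions import Fraction
from itertools import combinations

def midzuno_sen_sample(x, n, rng):
    """First unit drawn proportional to x, the other n - 1 by simple random sampling."""
    N = len(x)
    first = rng.choices(range(N), weights=x, k=1)[0]
    rest = rng.sample([i for i in range(N) if i != first], n - 1)
    return sorted([first] + rest)

def selection_probability(sample, x):
    """P = (sum of x in the sample) / (C(N - 1, n - 1) * X)."""
    n, N = len(sample), len(x)
    return Fraction(sum(x[i] for i in sample), math.comb(N - 1, n - 1) * sum(x))

# Hypothetical population of N = 5 integer x values; samples of size n = 3.
x = [2, 4, 6, 8, 10]
sample = midzuno_sen_sample(x, 3, random.Random(0))
total_probability = sum(
    selection_probability(s, x) for s in combinations(range(len(x)), 3)
)
print(sample, selection_probability(sample, x), total_probability)
```

Each unit appears in C(N - 1, n - 1) of the possible samples, so the numerators sum to C(N - 1, n - 1) X and the probabilities total exactly one.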

Ordinary least squares regression

If a linear relationship between the x and y variates exists and the regression equation passes through the origin then the estimated variance of the regression equation is always less than that of the ratio estimator. The precise relationship between the variances depends on the linearity of the relationship between the x and y variates: when the relationship is other than linear the ratio estimate may have a lower variance than that estimated by regression.

Uses

Although the ratio estimator may be of use in a number of settings it is of particular use in two cases:

when the variates x and y are highly correlated through the origin
when the total population size is unknown

History

The first known use of the ratio estimator was by John Graunt in England who in 1662 was the first to estimate the ratio y/x where y represented the total population and x the known total number of registered births in the same areas during the preceding year.

Later Messance (~1765) and Moheau (1778) published very carefully prepared estimates for France based on enumeration of population in certain districts and on the count of births, deaths and marriages as reported for the whole country. The districts from which the ratio of inhabitants to birth was determined only constituted a sample.

In 1802, Laplace wished to estimate the population of France. No population census had been carried out and Laplace lacked the resources to count every individual. Instead he sampled 30 parishes whose total number of inhabitants was 2,037,615. The parish baptismal registrations were considered to be reliable estimates of the number of live births so he used the total number of births over a three-year period. The sample estimate was 71,866.333 baptisms per year over this period giving a ratio of one registered baptism for every 28.35 persons. The total number of baptismal registrations for France was also available to him and he assumed that the ratio of live births to population was constant. He then used the ratio from his sample to estimate the population of France.

Karl Pearson said in 1897 that the ratio estimates are biased and cautioned against their use.^[17]

References

[1] Scott AJ, Wu CFJ (1981) On the asymptotic distribution of ratio and regression estimators. JASA 76: 98–102
[2] Cochran WG (1977) Sampling techniques. New York: John Wiley & Sons
[3] Hartley HO, Ross A (1954) Unbiased ratio estimators. Nature 174: 270–271
[4] Ogliore RC, Huss GR, Nagashima K (2011) Ratio estimation in SIMS analysis. Nuclear Instruments and Methods in Physics Research Section B: Beam Interactions with Materials and Atoms 269 (17): 1910–1918
[5] Pascual JN (1961) Unbiased ratio estimators in stratified sampling. JASA 56 (293): 70–87
[6] Beale EML (1962) Some use of computers in operational research. Industrielle Organization 31: 27–28
[7] Tin M (1965) Comparison of some ratio estimators. JASA 60: 294–307
[8] Sahoo LN (1983) On a method of bias reduction in ratio estimation. J Statist Res 17: 1–6
[9] Sahoo LN (1987) On a class of almost unbiased estimators for population ratio. Statistics 18: 119–121
[10] van Kempen GMP, van Vliet LJ (2000) Mean and variance of ratio estimators used in fluorescence ratio imaging. Cytometry 39: 300–305
[11] Choquet D, L'Ecuyer P, Léger C (1999) Bootstrap confidence intervals for ratios of expectations. ACM Transactions on Modeling and Computer Simulation - TOMACS 9 (4): 326–348 DOI: 10.1145/352222.352224
[12] Durbin J (1959) A note on the application of Quenouille's method of bias reduction to estimation of ratios. Biometrika 46: 477–480
[13] Mickey MR (1959) Some finite population unbiased ratio and regression estimators. JASA 54: 596–612
[14] Lahiri DB (1951) A method of sample selection providing unbiased ratio estimates. Bull Int Stat Inst 33: 133–140
[15] Midzuno H (1952) On the sampling system with probability proportional to the sum of the sizes. Ann Inst Stat Math 3: 99–107
[16] Sen AR (1952) Present status of probability sampling and its use in the estimation of a characteristic. Econometrica 20: 103
[17] Pearson K (1897) On a form of spurious correlation that may arise when indices are used for the measurement of organs. Proc Roy Soc Lond 60: 498

This article is issued from Wikipedia - version of the Tuesday, March 01, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.