Rank correlation

In statistics, a rank correlation is any of several statistics that measure the relationship between rankings of different ordinal variables or different rankings of the same variable, where a "ranking" is the assignment of the labels "first", "second", "third", etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them. For example, two common nonparametric significance tests that use rank correlation are the Mann–Whitney U test and the Wilcoxon signed-rank test.

Context

If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.

If there is only one variable, the identity of a college football program, but it is subject to two different poll rankings (say, one by coaches and one by sportswriters), then the similarity of the two different polls' rankings can be measured with a rank correlation coefficient.

Correlation coefficients

Some of the more popular rank correlation statistics include Spearman's $\rho$ and Kendall's $\tau$, both of which are treated below.

An increasing rank correlation coefficient implies increasing agreement between rankings. The coefficient is inside the interval [−1, 1] and assumes the value:

1 if the agreement between the two rankings is perfect; the two rankings are the same.
0 if the rankings are completely independent.
−1 if the disagreement between the two rankings is perfect; one ranking is the reverse of the other.

Following Diaconis (1988), a ranking can be seen as a permutation of a set of objects. Thus we can look at observed rankings as data obtained when the sample space is (identified with) a symmetric group. We can then introduce a metric, making the symmetric group into a metric space. Different metrics will correspond to different rank correlations.
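The idea of putting a metric on permutations can be sketched in a few lines of Python. The two distance functions below are illustrative choices, not taken from the text above: one counts the pairs of objects that the two rankings order differently, the other sums the squared rank differences (its square root is a Euclidean metric). The function names are hypothetical, and rankings are assumed to be tie-free lists in which position i holds the rank of object i.

```python
from itertools import combinations


def kendall_distance(p, q):
    """Count the object pairs that rankings p and q order differently."""
    return sum(
        1
        for i, j in combinations(range(len(p)), 2)
        if (p[i] - p[j]) * (q[i] - q[j]) < 0
    )


def squared_rank_distance(p, q):
    """Sum of squared differences between the ranks given by p and q."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


identity = [1, 2, 3, 4]
swapped = [2, 1, 3, 4]   # only the first two objects trade places
reverse = [4, 3, 2, 1]   # the complete reversal

for other in (identity, swapped, reverse):
    print(other, kendall_distance(identity, other),
          squared_rank_distance(identity, other))
```

Under both distances the reversed ranking is the farthest from the identity, which matches the idea that a coefficient of −1 corresponds to maximal disagreement.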

General correlation coefficient

Kendall (1944) showed that his $\tau$ (tau) and Spearman's $\rho$ (rho) are particular cases of a general correlation coefficient.

Suppose we have a set of $n$ objects, which are being considered in relation to two properties, represented by $x$ and $y$, forming the sets of values $\{x_i\}_{i\le n}$ and $\{y_i\}_{i\le n}$. To any pair of individuals, say the $i$-th and the $j$-th, we assign an $x$-score, denoted by $a_{ij}$, and a $y$-score, denoted by $b_{ij}$. The only requirement on these functions is antisymmetry, so $a_{ij}=-a_{ji}$ and $b_{ij}=-b_{ji}$ (in particular, $a_{ii}=b_{ii}=0$). Then the generalised correlation coefficient $\Gamma$ is defined by

\Gamma = \frac{\sum_{i,j = 1}^n a_{ij}b_{ij}}{\sqrt{\sum_{i,j = 1}^n a_{ij}^2 \sum_{i,j = 1}^n b_{ij}^2}}
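As a minimal sketch, the definition of $\Gamma$ can be evaluated directly from two score matrices. The helper names `pair_scores` and `general_correlation` are hypothetical, and the score function used in the example (a plain difference of values) is just one antisymmetric choice among many.

```python
import math


def pair_scores(values, score):
    """Build the n-by-n matrix whose (i, j) entry is score(values[i], values[j])."""
    return [[score(vi, vj) for vj in values] for vi in values]


def general_correlation(a, b):
    """Gamma = sum(a_ij * b_ij) / sqrt(sum(a_ij^2) * sum(b_ij^2))."""
    n = len(a)
    cells = [(i, j) for i in range(n) for j in range(n)]
    numerator = sum(a[i][j] * b[i][j] for i, j in cells)
    sum_a2 = sum(a[i][j] ** 2 for i, j in cells)
    sum_b2 = sum(b[i][j] ** 2 for i, j in cells)
    return numerator / math.sqrt(sum_a2 * sum_b2)


def difference(u, v):
    """An antisymmetric score: difference(u, v) == -difference(v, u)."""
    return v - u


x = [1, 2, 3]
a = pair_scores(x, difference)
same = pair_scores([1, 2, 3], difference)
reversed_order = pair_scores([3, 2, 1], difference)

print(general_correlation(a, same))            # perfect agreement
print(general_correlation(a, reversed_order))  # perfect disagreement
```

The denominator is zero when all scores of one variable vanish, so a production version would need to handle that case explicitly.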

Kendall's $\tau$ as a particular case

If $r_i$ is the rank of the $i$-th member according to the $x$-quality, we can define

a_{ij} = \operatorname{sgn}(r_j-r_i)

and similarly for $b$. The sum $\sum a_{ij}b_{ij}$ is twice the number of concordant pairs minus twice the number of discordant pairs, because each unordered pair is counted once as $(i,j)$ and once as $(j,i)$ (see Kendall tau rank correlation coefficient). The sum $\sum a_{ij}^2$ is just the number of nonzero terms $a_{ij}$, equal to $n(n-1)$, and likewise for $\sum b_{ij}^2$. It follows that $\Gamma$ is equal to Kendall's $\tau$ coefficient.
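The equality can be checked numerically. The sketch below assumes tie-free rankings and uses hypothetical function names: one function evaluates $\Gamma$ with sign scores, the other counts concordant and discordant pairs directly.

```python
import math
from itertools import combinations


def sgn(v):
    """Sign of v as -1, 0 or 1."""
    return (v > 0) - (v < 0)


def gamma_with_sign_scores(r, s):
    """General coefficient with a_ij = sgn(r_j - r_i), b_ij = sgn(s_j - s_i)."""
    n = len(r)
    cells = [(i, j) for i in range(n) for j in range(n)]
    numerator = sum(sgn(r[j] - r[i]) * sgn(s[j] - s[i]) for i, j in cells)
    sum_a2 = sum(sgn(r[j] - r[i]) ** 2 for i, j in cells)
    sum_b2 = sum(sgn(s[j] - s[i]) ** 2 for i, j in cells)
    return numerator / math.sqrt(sum_a2 * sum_b2)


def kendall_tau_by_counting(r, s):
    """(concordant - discordant) / (number of unordered pairs), no ties."""
    n = len(r)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (r[j] - r[i]) * (s[j] - s[i])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


r = [1, 2, 3, 4, 5]
s = [2, 1, 4, 3, 5]   # two adjacent swaps: 8 concordant, 2 discordant pairs

print(gamma_with_sign_scores(r, s), kendall_tau_by_counting(r, s))
```

Both routes give 0.6 for this example, since the numerator of $\Gamma$ is 2(8 − 2) = 12 and the denominator is n(n − 1) = 20.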

Spearman's $\rho$ as a particular case

If $r_i$ and $s_i$ are the ranks of the $i$-th member according to the $x$-quality and the $y$-quality respectively, we can simply define

a_{ij} = r_j-r_i

b_{ij} = s_j-s_i

The sums $\sum a_{ij}^2$ and $\sum b_{ij}^2$ are equal, since both $r_i$ and $s_i$ range from $1$ to $n$ . Then we have:

\Gamma = \frac{\sum (r_j-r_i)(s_j-s_i)}{\sum(r_j-r_i)^2}

Now

\sum_{i,j = 1}^n (r_j-r_i)(s_j-s_i)= \sum_{i=1}^n \sum_{j=1}^n r_is_i + \sum_{i=1}^n \sum_{j=1}^n r_js_j - \sum_{i=1}^n \sum_{j=1}^n (r_is_j+r_js_i)

=2n\sum_{i=1}^n r_is_i - 2 \sum_{i=1}^n r_i \sum_{j=1}^n s_j

=2n\sum_{i=1}^n r_is_i - \frac12 n^2(n+1)^2

since $\sum r_i$ and $\sum s_j$ are both equal to the sum of the first $n$ natural numbers, namely $\frac12n(n+1)$ .

We also have

S = \sum_{i=1}^n (r_i-s_i)^2 = 2 \sum r_i^2 - 2\sum r_is_i

and hence

\sum(r_j-r_i)(s_j-s_i) = 2n\sum r_i^2 - \frac12n^2(n+1)^2 - nS

Since $\sum r_i^2$ is the sum of the squares of the first $n$ natural numbers, it equals $\frac16n(n+1)(2n+1)$. Thus, the last equation reduces to

\sum(r_j-r_i)(s_j-s_i) = \frac16n^2(n^2-1) - nS

Further

\sum(r_j-r_i)^2 = 2n\sum r_i^2-2\sum r_ir_j

= 2n\sum r_i^2-2(\sum r_i)^2 = \frac16n^2(n^2-1)

and thus, substituting these results into the original formula, we get

\Gamma_R = 1-\frac{6\sum d_i^2}{n^3-n}

where $d_i = r_i - s_i$ is the difference between the two ranks of the $i$-th member, so that $\sum d_i^2 = S$. This is exactly Spearman's rank correlation coefficient $\rho$.
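The result of the derivation can be verified numerically. This is a minimal sketch assuming tie-free ranks from 1 to n; the function names are hypothetical. One function evaluates $\Gamma$ with rank-difference scores, the other applies the closed formula.

```python
import math


def gamma_with_rank_differences(r, s):
    """General coefficient with a_ij = r_j - r_i and b_ij = s_j - s_i."""
    n = len(r)
    cells = [(i, j) for i in range(n) for j in range(n)]
    numerator = sum((r[j] - r[i]) * (s[j] - s[i]) for i, j in cells)
    sum_a2 = sum((r[j] - r[i]) ** 2 for i, j in cells)
    sum_b2 = sum((s[j] - s[i]) ** 2 for i, j in cells)
    return numerator / math.sqrt(sum_a2 * sum_b2)


def spearman_rho(r, s):
    """Closed form 1 - 6 * sum(d_i^2) / (n^3 - n), valid without ties."""
    n = len(r)
    sum_d2 = sum((ri - si) ** 2 for ri, si in zip(r, s))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


r = [1, 2, 3, 4, 5]
s = [2, 1, 4, 3, 5]   # sum of squared rank differences is 4

print(gamma_with_rank_differences(r, s), spearman_rho(r, s))
```

For this example both functions give 0.8: the sum of squared pair differences is $\frac16 n^2(n^2-1) = 100$ and the numerator is $100 - nS = 80$.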

Rank-biserial correlation

Gene Glass (1965) noted that the rank-biserial can be derived from Spearman's $\rho$: "One can derive a coefficient defined on X, the dichotomous variable, and Y, the ranking variable, which estimates Spearman's rho between X and Y in the same way that biserial r estimates Pearson's r between two normal variables" (p. 91). The rank-biserial correlation had been introduced nine years earlier by Edward Cureton (1956) as a measure of rank correlation when the ranks are in two groups.

Kerby simple difference formula

Dave Kerby (2014) recommended the rank-biserial as the measure to introduce students to rank correlation, because the general logic can be explained at an introductory level. The rank-biserial is the correlation used with the Mann–Whitney U test, a method commonly covered in introductory college courses on statistics. The data for this test consist of two groups, and each member's outcome is ranked within the study as a whole.

Kerby showed that this rank correlation can be expressed in terms of two concepts: the percentage of data that support a stated hypothesis, and the percentage of data that do not support it. The Kerby simple difference formula states that the rank correlation is the difference between the proportion of favorable evidence (f) and the proportion of unfavorable evidence (u).

r = f - u

Example and interpretation

To illustrate the computation, suppose a coach trains long-distance runners for one month using two methods. Group A has 5 runners, and Group B has 4 runners. The stated hypothesis is that method A produces faster runners. The race to assess the results finds that the runners from Group A do indeed run faster, with the following ranks: 1, 2, 3, 4, and 6. The slower runners from Group B thus have ranks of 5, 7, 8, and 9.

The analysis is conducted on pairs, defined as a member of one group compared to a member of the other group. For example, the fastest runner in the study is a member of four pairs: (1,5), (1,7), (1,8), and (1,9). All four of these pairs support the hypothesis, because in each pair the runner from Group A is faster than the runner from Group B. There are a total of 5 × 4 = 20 pairs, and 19 pairs support the hypothesis. The only pair that does not support the hypothesis is the pair of runners with ranks 5 and 6, because in this pair the runner from Group B had the faster time. By the Kerby simple difference formula, 95% of the data support the hypothesis (19 of 20 pairs), and 5% do not support it (1 of 20 pairs), so the rank correlation is r = .95 - .05 = .90.
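The pair counting in this example can be reproduced with a short sketch. The function name `simple_difference` is hypothetical, a lower rank is taken to mean a faster runner, and tied pairs (which do not occur here) are assumed to count as neither favorable nor unfavorable.

```python
def simple_difference(favored_ranks, other_ranks):
    """Return (f, u, r) for the hypothesis that the favored group ranks better.

    A pair is favorable when the favored member has the lower (better) rank,
    unfavorable when it has the higher rank; ties count for neither side.
    """
    pairs = [(a, b) for a in favored_ranks for b in other_ranks]
    favorable = sum(1 for a, b in pairs if a < b)
    unfavorable = sum(1 for a, b in pairs if a > b)
    total = len(pairs)
    return favorable / total, unfavorable / total, (favorable - unfavorable) / total


group_a = [1, 2, 3, 4, 6]   # ranks of the runners trained with method A
group_b = [5, 7, 8, 9]      # ranks of the runners trained with method B

f, u, r = simple_difference(group_a, group_b)
print(f, u, r)
```

With these ranks the function finds 19 favorable pairs and 1 unfavorable pair out of 20, matching the hand computation.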

The maximum value for the correlation is r = 1, which means that 100% of the pairs favor the hypothesis. A correlation of r = 0 indicates that half the pairs favor the hypothesis and half do not; in other words, the sample groups do not differ in ranks, so there is no evidence that they come from two different populations. An effect size of r = 0 can be said to describe no relationship between group membership and the members' ranks.

References

Cureton, E. E. (1956). Rank-biserial correlation. Psychometrika 21, 287-290. doi:10.1007/BF02289138
Everitt, B. S. (2002), The Cambridge Dictionary of Statistics, Cambridge: Cambridge University Press, ISBN 0-521-81099-X
Diaconis, P. (1988), Group Representations in Probability and Statistics, Lecture Notes-Monograph Series, Hayward, CA: Institute of Mathematical Statistics, ISBN 0-940600-14-5
Glass, G. V. (1965). A ranking variable analogue of biserial correlation: implications for short-cut item analysis. Journal of Educational Measurement, 2(1), 91–95. doi:10.1111/j.1745-3984.1965.tb00396.x
Kendall, M. G. (1970), Rank Correlation Methods, London: Griffin, ISBN 0-85264-199-0
Kerby, D. S. (2014). The simple difference formula: An approach to teaching nonparametric correlation. Innovative Teaching, volume 3, article 1. doi:10.2466/11.CP.3.1

External links

Brief guide by experimental psychologist Karl L. Wuensch - Nonparametric effect sizes (Copyright 2015 by Karl L. Wuensch)

This article is issued from Wikipedia - version of the Monday, November 16, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.