Misunderstandings of p-values
Misunderstandings of p-values are an important problem in scientific research and scientific education. P-values are often used or interpreted incorrectly.[1] Comparing a p-value to a pre-specified significance level yields one of two results: either the null hypothesis is rejected, or the null hypothesis cannot be rejected at that significance level (which, however, does not imply that the null hypothesis is true). In Fisher's formulation, a low p-value presents a disjunction: either the null hypothesis is true and a highly improbable event has occurred, or the null hypothesis is false.
Common misunderstandings
Common misunderstandings about p-values include:[1][2][3]
- The p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false. It is not connected to either. In fact, frequentist statistics does not and cannot attach probabilities to hypotheses. Comparing Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null hypothesis is very close to one (when no alternative hypothesis has a large enough prior probability to explain the results more easily); this is Lindley's paradox (see the numerical sketch after this list). There are also prior probability distributions under which the posterior probability and the p-value have similar or equal values.[4]
- The p-value is not the probability that a finding is "merely a fluke." Calculating the p-value is based on the assumption that every finding is a fluke, the product of chance alone. The phrase "the results are due to chance" is used to mean that the null hypothesis is probably correct. However, that is merely a restatement of the inverse probability fallacy, since the p-value cannot be used to determine the probability of a hypothesis being true.
- The p-value is not the probability of falsely rejecting the null hypothesis. That error is a version of the so-called prosecutor's fallacy.
- The p-value is not the probability that replicating the experiment would yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of p-rep.
- The significance level, such as 0.05, is not determined by the p-value. Rather, the significance level is decided by the person conducting the experiment (with the value 0.05 widely used by the scientific community) before the data are viewed, and it is compared against the calculated p-value after the test has been performed. (However, reporting a p-value is more useful than simply saying that the results were or were not significant at a given level and allows readers to decide for themselves whether to consider the results significant.)
- The p-value does not indicate the size or importance of the observed effect. The two vary together, however, and the larger the effect, the smaller the sample size that will be required to get a significant p-value (see effect size).
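The gap between a small p-value and the posterior probability of the null hypothesis (Lindley's paradox) can be made concrete with a small calculation. The following Python sketch uses assumed, purely illustrative data (501,100 heads in 1,000,000 coin tosses) and a simple illustrative model: a point null θ = 0.5 against an alternative in which θ is uniform on (0, 1), with equal prior weight on the two hypotheses.

```python
# Numerical sketch of Lindley's paradox with assumed data and priors.
# H0: the coin is fair (theta = 0.5); H1: theta ~ Uniform(0, 1);
# prior probability 1/2 on each hypothesis.
from scipy.stats import binom, norm

n, x = 1_000_000, 501_100                 # assumed data: 501,100 heads in 1,000,000 tosses

# Two-sided p-value for H0, using the normal approximation to the binomial.
z = (x - 0.5 * n) / (0.25 * n) ** 0.5
p_value = 2 * (1 - norm.cdf(abs(z)))

# Marginal likelihood of the data under each hypothesis.
like_h0 = binom.pmf(x, n, 0.5)            # P(data | theta = 0.5)
like_h1 = 1 / (n + 1)                     # integral of P(data | theta) over Uniform(0, 1)

posterior_h0 = like_h0 / (like_h0 + like_h1)      # equal prior odds on H0 and H1

print(f"two-sided p-value = {p_value:.4f}")       # about 0.028, "significant" at 0.05
print(f"posterior P(H0)   = {posterior_h0:.3f}")  # about 0.99, H0 strongly favored
```

Although the p-value falls below 0.05, the posterior probability of the null hypothesis is close to one under these priors, which is why a small p-value cannot be read as the probability that the null hypothesis is true.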
Representing probabilities of hypotheses
The p-value does not in itself allow reasoning about the probabilities of hypotheses. That requires multiple hypotheses or a range of hypotheses, with a prior distribution over them, as in Bayesian statistics; there, one uses a likelihood function over all candidate hypotheses rather than a p-value for a single null hypothesis. The p-value describes a property of the data relative to a specific null hypothesis; it is not a property of the hypothesis itself. For the same reason, p-values do not give the probability that the data were produced by random chance alone.[1]
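As a further illustration of the distinction (with assumed data of 7 successes in 10 trials and a flat prior, both purely illustrative), the sketch below computes a likelihood over a grid of candidate hypotheses and normalizes it into a posterior distribution, the kind of statement about hypotheses that a single p-value cannot provide.

```python
# Posterior over a range of hypotheses (grid approximation), with assumed data.
import numpy as np
from scipy.stats import binom

x, n = 7, 10                                  # assumed data: 7 successes in 10 trials
theta = np.linspace(0.001, 0.999, 999)        # grid of candidate parameter values

likelihood = binom.pmf(x, n, theta)           # likelihood of each candidate hypothesis
prior = np.ones_like(theta) / theta.size      # flat prior over the grid
posterior = likelihood * prior
posterior /= posterior.sum()                  # normalize to a probability distribution

# A statement about hypotheses, e.g. the posterior probability that theta > 0.5,
# which no p-value for a single null hypothesis can supply.
print(f"P(theta > 0.5 | data) = {posterior[theta > 0.5].sum():.3f}")
```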
False discovery rate
The misinterpretation of p-values as frequentist error rates occurs because the false discovery rate is not taken into account.[5] The false discovery rate (FDR) refers to the probability that the p-value will indicate that a result is significant when it actually is not, i.e. the probability of incorrectly rejecting the null hypothesis (also known as a type I error). The FDR increases with the number of tests performed.[5][6] In general, for n independent hypotheses tested with the criterion p < p0, the probability of obtaining at least one false positive is

1 − (1 − p0)^n.
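A short calculation makes the growth of this probability concrete; the function below is a minimal sketch of the formula above, not taken from any of the cited sources.

```python
# Probability of at least one false positive among n independent tests at
# threshold p0, assuming every null hypothesis is true: 1 - (1 - p0)^n.
def prob_false_positive(n_tests: int, p0: float = 0.05) -> float:
    return 1 - (1 - p0) ** n_tests

for n in (1, 20, 100):
    print(f"{n:>3} independent tests: {prob_false_positive(n):.1%} chance of a false positive")
# prints 5.0%, 64.2%, and 99.4% respectively
```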
Multiple comparisons problem
The multiple comparisons problem occurs when one considers a set of statistical inferences simultaneously[7] or infers a subset of parameters selected based on the observed values.[8] It is also known as the look-elsewhere effect. Errors in inference, including confidence intervals that fail to include their corresponding population parameters or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the set as a whole. Several statistical techniques have been developed to prevent this from happening, allowing significance levels for single and multiple comparisons to be directly compared. These techniques generally require a more stringent significance threshold for individual comparisons, so as to compensate for the number of inferences being made.
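One of the simplest such techniques is the Bonferroni correction, which divides the overall significance level by the number of comparisons. The sketch below, using an assumed family-wise level of 0.05 and 20 comparisons, illustrates the idea; it is not drawn from the cited sources.

```python
# Minimal sketch of a Bonferroni correction: to keep the family-wise error rate
# at most alpha over m independent comparisons, test each one at alpha / m.
def bonferroni_threshold(alpha: float, m: int) -> float:
    return alpha / m

alpha, m = 0.05, 20
per_test = bonferroni_threshold(alpha, m)     # 0.0025 per comparison
family_wise = 1 - (1 - per_test) ** m         # chance of any false positive if all nulls are true
print(f"per-test threshold = {per_test}, family-wise error rate = {family_wise:.4f}")
```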
Webcomic artist and science popularizer Randall Munroe of xkcd parodied the mainstream media's misunderstanding of p-values by portraying scientists who investigate the claim that eating jellybeans causes acne.[5][6][9][10] The scientists test the claim and find no link between the consumption of jellybeans and the prevalence of acne (p > 0.05), failing to meet the usual 1-in-20 threshold for concluding that the results reflect a true effect rather than chance. Then, when a new claim is made that only jellybeans of certain colors cause acne, they proceed to investigate 20 different colors of jellybeans, one of which (green) is found to correlate with acne, with p < 0.05. The general media then runs the sensationalistic headline "Green jellybeans linked to acne! 95% confidence! Only 5% chance of coincidence!", ignoring that this is exactly the kind of statistical fluke one would expect when performing 20 tests at the p < 0.05 level.
When doing 20 tests with a criterion of p < 0.05, as in the xkcd comic, there is a 64.2% chance of obtaining at least one false positive result (assuming there are no real effects). If the number of tests is increased to 100, there is a 99.4% chance of at least one false positive result.[6]
Application to the alternative hypothesis
The p-value refers only to the null hypothesis, and does not make reference to or allow conclusions about any other hypotheses, such as the alternative hypothesis in Neyman–Pearson statistical hypothesis testing. In that approach, one instead has a decision function between two alternatives, often based on a test statistic, and computes the rate of type I and type II errors as α and β. However, the p-value of a test statistic cannot be directly compared to these error rates α and β. Instead, it is fed into a decision function.
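As a toy illustration of that approach (the hypotheses, sample size, and standard deviation below are assumed for illustration, not taken from the sources), a Neyman–Pearson style test fixes α in advance, derives a decision rule from it, and then reports β for a specific alternative:

```python
# Toy Neyman–Pearson test of H0: mu = 0 versus H1: mu = 1, based on the mean of
# n normal observations with known sigma. Alpha is fixed first; beta follows.
from scipy.stats import norm

n, sigma, alpha = 25, 1.0, 0.05
se = sigma / n ** 0.5

critical = norm.ppf(1 - alpha, loc=0, scale=se)   # reject H0 if the sample mean exceeds this
beta = norm.cdf(critical, loc=1, scale=se)        # type II error rate under H1: mu = 1

def decide(sample_mean: float) -> str:
    """Decision function: choose between H0 and H1, without reporting a p-value."""
    return "reject H0" if sample_mean > critical else "do not reject H0"

print(f"critical value = {critical:.3f}, alpha = {alpha}, beta = {beta:.4f}")
print(decide(0.42))                               # an assumed observed sample mean
```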
The p-value fallacy
The p-value fallacy is a common misinterpretation of the meaning of a p-value whereby a binary classification of experimental results as true or false is made, based on whether or not they are statistically significant. It derives from the assumption that a p-value can be used to summarize an experiment's results, rather than being a heuristic that is not always useful.[11][12] The term "p-value fallacy" was coined in 1999 by Steven N. Goodman.[12][13]
In the p-value fallacy, a single number is used to represent both the false positive rate under the null hypothesis H0 and also the strength of the evidence against H0. However, there is a trade-off between these factors, and it is not logically possible to do both at once.[12] Neyman and Pearson described the trade-off as between being able to control error rates over the long term and being able to evaluate conclusions of specific experiments in the short term, but a common misinterpretation of p-values is that the trade-off can be avoided.[12] Another way to view the error is that studies in medical research are often designed using a Neyman–Pearson statistical approach but analyzed with a Fisherian approach.[14] However, this is not a contradiction between frequentist and Bayesian reasoning, but a basic property of p-values that applies in both cases.[13]
This fallacy is contrary to the intent of the statisticians who originally supported the use of p-values in research.[12][2] As described by Sterne and Smith, "An arbitrary division of results, into 'significant' or 'non-significant' according to the P value, was not the intention of the founders of statistical inference."[2] In contrast, common interpretations of p-values make it harder to distinguish statistical results from scientific conclusions, and discourage the consideration of background knowledge such as previous experimental results.[12] The correct use of p-values is to guide behavior, not to classify results;[11] that is, to inform a researcher's choice of which hypothesis to accept, not to provide an inference about which hypothesis is true.[12]
References
1. Wasserstein, Ronald L.; Lazar, Nicole A. (2016). "The ASA's statement on p-values: context, process, and purpose". The American Statistician. doi:10.1080/00031305.2016.1154108.
2. Sterne JA, Smith GD (2001). "Sifting the evidence – what's wrong with significance tests?". BMJ 322 (7280): 226–231. doi:10.1136/bmj.322.7280.226. PMC 1119478. PMID 11159626.
3. Schervish MJ (1996). "P Values: What They Are and What They Are Not". The American Statistician 50 (3): 203. doi:10.2307/2684655. JSTOR 2684655.
4. Casella, George; Berger, Roger L. (1987). "Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem". Journal of the American Statistical Association 82 (397): 106–111. doi:10.1080/01621459.1987.10478396.
5. Colquhoun, David (19 November 2014). "An investigation of the false discovery rate and the misinterpretation of p-values". Royal Society Open Science 1 (3): 140216. doi:10.1098/rsos.140216.
6. Reinhart, A. (2015). Statistics Done Wrong: The Woefully Complete Guide. No Starch Press. pp. 47–48. ISBN 9781593276201.
7. Miller, R.G. (1981). Simultaneous Statistical Inference (2nd ed.). Springer Verlag New York. ISBN 0-387-90548-0.
8. Benjamini, Y. (2010). "Simultaneous and selective inference: Current successes and future challenges". Biometrical Journal 52 (6): 708–721. doi:10.1002/bimj.200900299. PMID 21154895.
9. Munroe, R. "Significant". xkcd. Retrieved 2016-02-22.
10. Barsalou, M. (2 June 2014). "Hypothesis Testing and P Values". Minitab blog. Retrieved 2016-02-22.
11. Dixon P (2003). "The p-value fallacy and how to avoid it". Canadian Journal of Experimental Psychology 57 (3): 189–202. PMID 14596477.
12. Goodman SN (1999). "Toward evidence-based medical statistics. 1: The P value fallacy". Annals of Internal Medicine 130 (12): 995–1004. PMID 10383371.
13. Sellke T, Bayarri MJ, Berger JO (2001). "Calibration of p values for testing precise null hypotheses". The American Statistician 55 (1): 62–71. doi:10.1198/000313001300339950.
14. de Moraes AC, Cassenote AJ, Moreno LA, Carvalho HB (2014). "Potential biases in the classification, analysis and interpretations in cross-sectional study: commentaries – surrounding the article 'resting heart rate: its correlations and potential for screening metabolic dysfunctions in adolescents'". BMC Pediatrics 14: 117. doi:10.1186/1471-2431-14-117. PMC 4012522. PMID 24885992.
Further reading
- Moran, JL; Solomon, PJ (June 2004). "A farewell to P-values?". Critical Care and Resuscitation 6 (2): 130–137. PMID 16566700.
- Lew, Michael J (July 2012). "Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P". British Journal of Pharmacology 166 (5): 1559–1567. doi:10.1111/j.1476-5381.2012.01931.x.