To p or not to p—The question of statistical significance

The validity of p < 0.05 as a measure of a true effect has been under increasing scrutiny for more than a decade. Numerous articles since 2010 have questioned hypothesis testing via p-values; one of the most amusing is a 2014 comment by Tom Siegfried of Science News, who observed that “…statistical techniques for testing hypotheses have more flaws than Facebook’s privacy policies.”

Studies of the reproducibility and replicability of scientific conclusions have estimated false discovery rates of 11% to 50%. In response to this chatter, and to the decision by some journals to ban null hypothesis significance testing (that is, p-values) from published articles, the American Statistical Association (ASA) convened a committee in 2015 to address the issue; the committee published the “ASA Statement on Statistical Significance and P-Values” in 2016.
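
How can false discovery rates climb that high when every individual test uses p < 0.05? The arithmetic is straightforward once you account for statistical power and the base rate of true hypotheses. Here is a minimal sketch; the alpha, power, and prior values below are illustrative assumptions, not figures taken from the studies above.

```python
# Illustrative arithmetic only: alpha, power, and the share of tested
# hypotheses that are actually true are all assumed values, not data.
alpha = 0.05   # significance threshold
power = 0.80   # probability a real effect yields p < alpha

for prior_true in (0.50, 0.25, 0.10):
    true_positives = power * prior_true
    false_positives = alpha * (1 - prior_true)
    fdr = false_positives / (false_positives + true_positives)
    print(f"share of true hypotheses = {prior_true:.2f} -> FDR = {fdr:.1%}")

# share of true hypotheses = 0.50 -> FDR = 5.9%
# share of true hypotheses = 0.25 -> FDR = 15.8%
# share of true hypotheses = 0.10 -> FDR = 36.0%
```

With less plausible hypotheses or lower-powered studies, the rate climbs further, which is one way estimates in the 11% to 50% range can arise.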

In the introduction to that statement, the ASA laid the problem on the table—“While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted”—and then explained its belief that “the scientific community could benefit from a formal statement clarifying several widely agreed upon principles underlying the proper use and interpretation of the p-value….This statement…articulates in nontechnical terms a few select principles that could improve the conduct or interpretation of quantitative science, according to widespread consensus in the statistical community.”

The six principles in the ASA statement are an engaging read. All six are important, but three have particular applicability to the work we do:

P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither.
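
A small simulation makes the distinction concrete. In the hypothetical setup below, 80% of tested hypotheses are null and real effects are modest; those numbers, the effect size, and the sample size are all assumptions chosen for illustration. Even so, the probability that the null is true given a significant result lands nowhere near 0.05:

```python
# Hypothetical simulation: the null share (80%), effect size (0.5 SD),
# and sample size (30 per group) are assumptions for illustration.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_experiments, n = 10_000, 30

significant, significant_and_null = 0, 0
for _ in range(n_experiments):
    is_null = rng.random() < 0.8          # most tested hypotheses are null
    effect = 0.0 if is_null else 0.5      # real effects are 0.5 SD
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    if ttest_ind(a, b).pvalue < 0.05:
        significant += 1
        significant_and_null += is_null

# Roughly 30% of "significant" results here come from true nulls --
# a probability the p < 0.05 threshold never measured.
print(f"P(null true | p < 0.05) ~ {significant_and_null / significant:.2f}")
```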

Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. A conclusion does not immediately become “true” on one side of the divide and “false” on the other.
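
To see how little separates the two sides of the line, compare two hypothetical studies with essentially the same estimated effect. The estimates and standard errors below are invented for illustration:

```python
# Two invented studies with nearly identical effects and standard errors;
# one falls just under the 0.05 line and one just over it.
from scipy.stats import norm

for label, estimate, se in [("study A", 1.98, 1.0), ("study B", 1.94, 1.0)]:
    z = estimate / se
    p = 2 * norm.sf(abs(z))       # two-sided p-value from a z-test
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label}: estimate = {estimate:.2f}, p = {p:.3f} ({verdict})")

# study A: estimate = 1.98, p = 0.048 (significant)
# study B: estimate = 1.94, p = 0.052 (not significant)
```

The evidence in the two studies is practically indistinguishable; only the bright-line rule treats them differently.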

A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
Statistical significance is not equivalent to scientific, human, or economic significance. Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect.
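
A pair of hypothetical t-tests computed from summary statistics illustrates the point; the means, standard deviations, and sample sizes are invented:

```python
# Invented summary statistics: a tiny effect in a huge sample versus a
# large effect in a small one.
from scipy.stats import ttest_ind_from_stats

# 0.02-SD effect, 100,000 subjects per group
tiny = ttest_ind_from_stats(mean1=0.00, std1=1.0, nobs1=100_000,
                            mean2=0.02, std2=1.0, nobs2=100_000)
# 0.80-SD effect, 10 subjects per group
large = ttest_ind_from_stats(mean1=0.00, std1=1.0, nobs1=10,
                             mean2=0.80, std2=1.0, nobs2=10)

print(f"tiny effect, huge sample:   p = {tiny.pvalue:.6f}")   # ~0.000008
print(f"large effect, small sample: p = {large.pvalue:.3f}")  # ~0.090
```

The first result is wildly “significant” but trivially small; the second is a large effect that a bright-line reading would dismiss.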

Not everyone agrees. We’ll continue this discussion in future posts.