What a nerdy debate about p-values shows about science — and how to fix it

The case for, and against, redefining “statistical significance.” 

There’s a huge debate going on in social science right now. The question is simple, and strikes near the heart of all research: What counts as solid evidence?

The answer matters because many disciplines are currently in the midst of a “replication crisis” where even textbook studies aren’t holding up against rigorous retesting. The list includes ego depletion (the idea that willpower is a finite resource), the facial feedback hypothesis (which suggested that if we activate muscles used in smiling, we become happier), and many, many more.

Scientists are now figuring out how to right the ship, to ensure scientific studies published today won’t be laughed at in a few years.

One of the thorniest issues on this question is statistical significance. It’s one of the most influential metrics used to determine whether a result is published in a scientific journal.

Most casual readers of scientific research know that for results to be declared “statistically significant,” they need to pass a simple test. The answer to this test is called a p-value. And if your p-value is less than .05 — bingo! — you got yourself a statistically significant result.

Now, a group of 72 prominent statisticians, psychologists, economists, biomedical researchers, and others want to disrupt the status quo. A forthcoming paper in the journal Nature Human Behaviour argues that results should only be deemed “statistically significant” if they pass a higher threshold.

“We propose a change to P< 0.005,” the authors write. “This simple step would immediately improve the reproducibility of scientific research in many fields.”

This may sound nerdy, but it’s important. If the change is accepted, the hope is that fewer false positives will corrupt the scientific literature. It’s become too easy — using shady techniques known as p-hacking, and outcome switching — to find some publishable result that reaches the .05 significance level.

“There’s a major problem using p-values the way we have been using them,” says John Ioannidis, a Stanford professor of health research and one of the authors of the paper. “It’s causing a flood of misleading claims in the literature.”

Don’t be mistaken: This proposal won’t solve all the problems in science. “I see it as a dam to contain the flood until we make sure we have the more permanent fixes,” Ioannidis says. He calls it a “quick fix.” Though not everyone agrees it’s the best course of action.

At best, the proposal is an easy change to implement to protect academic literature from faulty findings. At worst, it’s a patronizing decree that avoids addressing the real problem at the heart of science’s woes.

There is a lot to unpack and understand here. So we’re going to take it slow.

What is a p-value?

Even the simplest definitions of p-values tend to get complicated. So bear with me as I break it down.

When researchers calculate a p-value, they’re putting to the test what’s known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.

Let’s say the experimenter really wants to know if eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are commanded to abstain from the delicious stuff. Both groups are weighed before the experiment, and then after. And their average weight change is compared.

The null hypothesis is the devil’s advocate argument. It states: There is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.

Rejecting the null is a major hurdle scientists need to clear to prove their theory. If the null stands, it means they haven’t eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?

So how do they rule out the null? They calculate some statistics.

This test basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we’re seeing?

Rejecting the null is kind of like the “innocent until proven guilty” principle in court cases, Regina Nuzzo, a mathematics professor at Gallaudet University, explains. In court, you start off with the assumption the defendant is innocent. Then, you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naïve. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.

Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and chocolate abstainers, the null hypothesis — that there are no weight differences — starts to look silly. And you can reject it.

You might be thinking: Isn’t this a pretty roundabout way to prove an experiment worked?

You are correct!

Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.

Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!). So they skimped on other meals.

Rejecting the null doesn’t tell you anything about the mechanism by which chocolate causes weight loss. It doesn’t tell you if the experiment is well designed, or well controlled for, or if the results have been cherry picked.

It just helps you understand how rare the results are.

But — and this is a tricky, tricky point — it’s not how rare the results of your experiment are. It’s how rare the results would be in the world where the null hypothesis is true. That is: It’s how rare the results would be if nothing in your experiment worked, and the difference in weight was due to random chance alone.

Here’s where the p-value comes in: The p-value quantifies this rareness. It tells you how often you’d see the numerical results of an experiment — or even more extreme results — if the null hypothesis is true, and there’s no difference between the groups.
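To make that definition concrete, here is a minimal Python sketch of a permutation test, one common way (not the only one) to estimate a p-value. It shuffles the group labels many times to mimic the world where the null hypothesis is true, then counts how often the shuffled difference is at least as big as the observed one. The weight-change numbers are made up for illustration, and the function names are hypothetical.

```python
import random


def average(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Estimate a two-sided p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(average(group_a) - average(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # If the null is true, the group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(average(pooled[:n_a]) - average(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical weight changes in kilograms (invented data, 50 people per group).
data_rng = random.Random(42)
chocolate = [data_rng.gauss(-1.0, 2.0) for _ in range(50)]
abstainers = [data_rng.gauss(0.0, 2.0) for _ in range(50)]

p = permutation_p_value(chocolate, abstainers)
print(f"estimated p-value: {p:.4f}")
```

A small number here means a gap this large rarely turns up when the labels are meaningless. It still says nothing about why the gap exists.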

If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. And so, when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude “their [experimental] data are pretty unlikely to be due to random chance,” Nuzzo explains.

And here’s another tricky point: Researchers can never completely rule out the null (just like jurors are not first-hand witnesses to a crime). So scientists instead pick a threshold where they feel pretty confident that they can reject the null. That’s now set at less than .05.

Ideally, a p of .05 means if you ran the experiment 100 times — again, assuming the null hypothesis is true — you’d see these same numbers (or more extreme results) five times.
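You can check the “five times in 100” idea with a quick simulation. The sketch below runs thousands of pretend experiments in which the null hypothesis really is true (both groups are drawn from the same distribution) and counts how many still come out with p below .05. To keep it dependency-free it uses a simple z-test that treats the standard deviation as known, which is a simplifying assumption; the function names and defaults are mine.

```python
import random
from statistics import NormalDist


def z_test_p_value(group_a, group_b, sd):
    """Two-sided p-value for a difference in means, treating sd as known."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    standard_error = sd * (1 / n_a + 1 / n_b) ** 0.5
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_significant_under_null(n_experiments=4000, n_per_group=50,
                                 alpha=0.05, seed=1):
    """Fraction of null experiments that land below the alpha threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: no real effect.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if z_test_p_value(a, b, sd=1.0) < alpha:
            hits += 1
    return hits / n_experiments


print(share_significant_under_null())
```

The printed share should hover around 0.05: even when nothing is going on, roughly one experiment in 20 clears the bar.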

And one last, super thorny concept that almost everyone gets wrong: A p<.05 does not mean there’s less than a 5 percent chance your experimental results are due to random chance. It does not mean there’s only a 5 percent chance you’ve landed on a false positive. Nope. Not at all.

Again: A p of less than .05 means that, in the world where the null hypothesis is true, there’s a less than 5 percent chance of seeing results at least as extreme as yours through random chance alone. This sounds nit-picky, but it’s critical. It’s the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p=.05 can be much, much higher than 5 percent.
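Here is some back-of-the-envelope arithmetic showing how that can happen. The share of “significant” findings that are false depends on two things the p-value knows nothing about: how many of the hypotheses being tested are actually true, and how often a study detects a true effect (its power). The 10 percent and 80 percent figures below are illustrative assumptions, not numbers from the proposal, and the function name is hypothetical.

```python
def false_positive_share(prior_true, power, alpha):
    """Share of significant results that come from hypotheses that are false."""
    false_positives = (1 - prior_true) * alpha  # true nulls that pass anyway
    true_positives = prior_true * power         # real effects that get detected
    return false_positives / (false_positives + true_positives)


# Assumed: 1 in 10 tested hypotheses is true, and studies have 80 percent power.
at_05 = false_positive_share(prior_true=0.10, power=0.80, alpha=0.05)
at_005 = false_positive_share(prior_true=0.10, power=0.80, alpha=0.005)
print(f"threshold .05:  {at_05:.0%} of significant results are false")
print(f"threshold .005: {at_005:.0%} of significant results are false")
```

Under these assumptions, 36 percent of results that pass the .05 threshold are false positives, compared with about 5 percent at .005. Change the assumptions and the numbers move, which is the point.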

Okay. Still with me? It’s okay if you need to take a break. Grab a soda. Catch up with mom. She’s wondering why you haven’t called in a while. Tell her about your summer plans.

Cause now we’re going to dive into….

The case against p<.05

“Generally, p-values should not be used to make conclusions, but rather to identify possibilities — like a sniff test,” Rebecca Goldin, the director for Stats.org and a math professor at George Mason University, explains in an email.

And for a long while, a sniff of p that’s less than .05 smelled pretty good. But over the last several years researchers and statisticians have realized that a p<.05 is not as strong of evidence as they once thought.

And be sure: Evidence for this is abundant.

Here’s the most obvious, easy-to-understand piece of evidence: Many papers that have used the .05 significance threshold have not replicated with more methodologically rigorous designs.

A famous 2015 paper in Science attempted to replicate 100 findings published in a prominent psychological journal. Only 39 percent passed. Other disciplines have fared somewhat better. A similar replication effort in economic papers found 60 percent of findings replicated. There’s a reproducibility “crisis” in biomedicine too, but it hasn’t been so specifically quantified.

The 2015 Science paper on psych studies offered some clues to which papers were more likely to replicate. Studies that yielded highly significant results (less than p=.01) are more likely to reproduce than those that are just barely significant at the .05 level.

“Reporting effects that really aren’t there undermine the credibility of science,” says Valen Johnson, a co-author of the Nature Human Behaviour proposal who heads the statistics department at Texas A&M. “It’s important that science adopt these higher standards, before they claim they have made a discovery.”

Elsewhere, researchers find evidence of “an epidemic” of statistical significance. “Practically everything that you read in a published paper has a nominally statistically significant result,” says Ioannidis. “The large majority of these p-values of less than .05 do not correspond to some true effect.”

For a long while scientists thought p<.05 represented something rare. New work in statistics shows that it’s not.

In a 2013 PNAS paper, Johnson used more advanced statistical techniques to test the assumption researchers commonly make: that a p of .05 means there’s a 5 percent chance the null hypothesis is true. His analysis revealed that it doesn’t. “In fact there’s a 25 percent to 30 percent chance the null hypothesis is true when the p-value is .05,” Johnson said.

Remember: The p-value is supposed to assure researchers that their results are rare. Twenty-five percent is not rare.
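For a rough sense of where a number like that can come from, here is a sketch using a different, simpler technique than Johnson’s: a well-known bound from Sellke, Bayarri, and Berger that converts a p-value into a minimum Bayes factor, -e * p * ln(p). This is not the analysis in the PNAS paper. It assumes the null and the alternative start out equally likely (the `prior_null=0.5` default is my assumption), and the function name is made up.

```python
import math


def min_null_probability(p_value, prior_null=0.5):
    """Lower bound on P(null is true | data), via the -e * p * ln(p) bound."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    # Smallest Bayes factor in favor of the null that the p-value allows.
    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = prior_odds * bayes_factor_bound
    return posterior_odds / (1 + posterior_odds)


print(f"p = .05:  null still at least {min_null_probability(0.05):.0%} likely")
print(f"p = .005: null still at least {min_null_probability(0.005):.0%} likely")
```

With even starting odds, a p of .05 leaves the null at least about 29 percent likely under this bound, which lands in the same range Johnson describes. At .005 the floor drops to roughly 7 percent.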

For another way to think about all this, let’s flip the question around: What if, instead of assuming the null hypothesis is true, we assume an experimental hypothesis is true?

Scientists and statisticians have shown that if experimental hypotheses are true, it should actually be somewhat uncommon for studies to keep churning out p-values of around .05. More often, assuming an effect is real, the p-value should come in lower.

Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.

Let’s say there’s a study where the actual difference between two groups is equal to half a standard deviation. (Yes, this is a nerdy way of putting it. But think of it like this: It means 69 percent of those in the experimental group show results higher than the mean of the control group. Researchers call this a “medium-sized” effect.) And let’s say there are 50 people each in the experimental group and the control group.

In this scenario, you should only be able to obtain a p-value between .03 and .05 around 7.62 percent of the time.

If you ran this experiment over and over and over again, you’d actually expect to see a lot more p-values with a much lower number. That’s what the following chart shows. The x-axis shows the specific p-values, and the y-axis shows how frequently you’d find them when repeating this experiment. Look how many p-values you’d find below .001.

(And from this chart you’ll see: Yes, you can obtain a p-value of greater than .05 while your experimental hypothesis is true. It just shouldn’t happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.)
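The scenario above can be roughly reproduced in a few lines of Python. This is not Magnusson’s calculator; it is a sketch that swaps in a normal (z-test) approximation, so its numbers come out close to, but not exactly, the ones quoted. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(p_threshold, effect_size, n_per_group):
    """Chance of a two-sided p below the threshold, given a true effect."""
    z_crit = STD_NORMAL.inv_cdf(1 - p_threshold / 2)
    # Expected z-score when the true gap is effect_size standard deviations.
    shift = effect_size * (n_per_group / 2) ** 0.5
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def prob_p_between(low, high, effect_size, n_per_group):
    return (prob_p_below(high, effect_size, n_per_group)
            - prob_p_below(low, effect_size, n_per_group))


# Half a standard deviation of true difference, 50 people per group.
print(f"above control mean: {STD_NORMAL.cdf(0.5):.1%}")
print(f"p in (.03, .05):    {prob_p_between(0.03, 0.05, 0.5, 50):.1%}")
print(f"p in (.05, .10):    {prob_p_between(0.05, 0.10, 0.5, 50):.1%}")
print(f"p below .001:       {prob_p_below(0.001, 0.5, 50):.1%}")
```

The approximation gives about 7.6 percent for the .03 to .05 band and about 9.8 percent for .05 to .1, in line with the calculator’s figures. It also shows a noticeably larger share of p-values landing below .001 than in either of those bands.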

This is a specific, hypothetical scenario. But in general, it’s weird when so many p-values in the published literature don’t match this distribution. Sure, a few studies on a question should get a p-value of .05. But more should find lower numbers.

A change in the definition of statistical significance could nudge researchers into adopting more rigorous methods

The biggest change the paper is advocating for is rhetorical: Results that currently meet the .05 level will be called “suggestive” and those that reach the stricter standard of .005 will be called statistically significant.

“Journals can still publish weak (and of course null) results just like they always could,” says Simine Vazire, a personality psychologist who edits Social Psychological and Personality Science (though she is not speaking on behalf of the journal). The language tweak will hopefully trickle down to press releases and news reports, which might avoid buzzwords such as “breakthroughs.”

The change, Vazire says, “should make it so that authors need stronger results before they can make strong claims. That’s all.”

Historians of science are always quick to point out that Ronald Fisher, the UK statistician who invented the p-value, never intended it to be the final word on scientific evidence. Rather, “statistical significance” meant the hypothesis is worthy of a follow-up investigation. “In a way, we’re proposing to return to his original vision of what statistical significance means,” says Daniel Benjamin, a behavioral economist at the University of California and the lead author of the proposal.

If labs do want to publish “statistically significant” results, it’s going to be much harder.

Most concretely, it means labs will need to increase the number of participants in their studies by 70 percent. “The change essentially requires six times stronger evidence,” Benjamin says.
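Where might a figure like 70 percent come from? One standard back-of-the-envelope route: for a simple two-sided z-test, the sample size needed scales with the square of (the z-score for the threshold plus the z-score for the power). The sketch below assumes 80 percent power, a conventional choice that is my assumption here, and the function name is hypothetical.

```python
from statistics import NormalDist


def relative_sample_size(alpha, power=0.80):
    """Sample size, up to a constant factor, for a two-sided z-test."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) ** 2


ratio = relative_sample_size(0.005) / relative_sample_size(0.05)
print(f"participants needed at .005 vs .05: {ratio:.2f}x")
```

Under those assumptions the ratio comes out at about 1.7, which matches the 70 percent increase.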

The increased burden of proof, the proposal authors hope, would nudge labs into adopting other practices science reformers have been calling for, such as sharing data with other labs to reach consensus conclusions and thinking more long-term about their work. Perhaps their first experiment doesn’t reach this new threshold. But a second experiment might. The higher threshold encourages labs to reproduce their own work before submitting to a publication.

The case against p<.005

The proposal has critics. One of them is Daniel Lakens, a psychologist at Eindhoven University of Technology in the Netherlands, who is currently organizing a rebuttal paper with dozens of authors.

Mainly, he says the significance proposal might work to stifle scientific progress.

“A good metaphor is driving a car and setting a maximum speed,” Lakens says. “You can set the maximum speed in your country to 20 miles an hour, and no one is going to get killed. You hit someone, they won’t die. So that’s pretty good, right? But we don’t do this. We set the maximum speed a little higher, because then we actually get somewhere a little bit quicker. … The same is for science.”

Ideally, Lakens says, the level of statistical significance needed to prove a hypothesis depends on how outlandish the hypothesis is.

Yes, you’d want a very low p-value in a study that claims mental telepathy is possible. But do you need such an extreme level when testing out a well-worn idea? The high standards could impede young PhDs with low budgets from testing out their ideas.

Again, a p-value of .05 doesn’t necessarily mean the experiment will be a false positive. A good researcher would know how to follow up, and suss out the truth.

Another critique of the proposal: It keeps scientific communities fixated on p-values. Which, as discussed in the sections above, don’t really tell you much about the merits of a hypothesis.

There are better, more nuanced approaches to evaluating science.

Such as:

  • Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful?)
  • Confidence intervals (what’s the range of doubt built into any given answer?)
  • Whether a result is a novel study or a replication (put some more weight into a theory many labs have looked into)
  • Whether a study’s design was preregistered (so that authors can’t manipulate their results post-test), and whether the underlying data is freely accessible (so anyone can check the math)
  • There are also new, advanced statistical techniques — like Bayesian analysis — that, in some ways, more directly evaluate a study’s outcome.
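The first two items on that list are easy to compute. Here is a minimal sketch that reports an effect size (Cohen’s d, using a pooled standard deviation) and a 95 percent confidence interval for the difference in means. The interval uses a normal approximation, which is rough for small samples, and the weight-change data and function names are invented for illustration.

```python
from statistics import NormalDist, mean, stdev


def cohens_d(group_a, group_b):
    """Difference in means, measured in pooled standard deviations."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def mean_difference_ci(group_a, group_b, level=0.95):
    """Approximate confidence interval for the difference in means."""
    diff = mean(group_a) - mean(group_b)
    standard_error = (stdev(group_a) ** 2 / len(group_a)
                      + stdev(group_b) ** 2 / len(group_b)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * standard_error, diff + z * standard_error


# Hypothetical weight changes in kilograms (invented data).
chocolate = [-2.1, -0.4, -1.8, 0.3, -1.2, -0.9, -2.5, 0.1]
abstainers = [-0.3, 0.8, -1.0, 0.5, 0.2, -0.6, 1.1, -0.2]

low, high = mean_difference_ci(chocolate, abstainers)
print(f"effect size d = {cohens_d(chocolate, abstainers):.2f}")
print(f"95% CI for the difference: {low:.2f} to {high:.2f} kg")
```

Unlike a bare p-value, these two numbers tell you how big the difference is and how much uncertainty surrounds it.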

Ioannidis admits that “statistical significance [alone] doesn’t convey much about the meaning, the importance, the clinical value, utility [of research].”

Ideally, he says, scientists would retrain themselves not to rely on null-hypothesis testing. But we don’t live in the ideal world. In the real world, p-values are a quick and easy tool any scientist can use to run their tests. And in our real world, p-values still carry a lot of weight in deciding what gets published.

With the proposal, “you don’t need to train all these millions of people in heavy statistics,” Ioannidis says. “And it would work. It would help.”

Redefining statistical significance is not an ideal solution to the problem of replication. It’s a solution that nudges people to adopt the ideal solution.

Though no one I spoke to said it directly, I wouldn’t be surprised if some scientists find that a bit patronizing. Why couldn’t they learn advanced statistics? Or come to appreciate more nuanced ways of evaluating results?

The real problem isn’t with statistical significance, it’s with the culture of science

There’s a critique of the proposal that the authors I spoke to completely agree with.

It’s this: Changing the definition of statistical significance doesn’t address the real problem. And the real problem is the culture of science.

In 2015, Vox sent out a survey to more than 200 scientists, asking “If you could change one thing about how science works today, what would it be and why?” One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.

One young scientist told us: “I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter.”

The biggest problem in science isn’t statistical significance. It’s the culture. She felt torn because young scientists need publications to get jobs. Under the status quo, in order to get publications, you need statistically significant results. Statistical significance alone didn’t lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.

Keep in mind, this is all just a proposal; something to spark debate. To my knowledge, journals are not rushing to change their editorial standards overnight.

This will continue to be debated.

But if it’s still hard to publish “suggestive” results, and if it’s still difficult to secure grant money off of “suggestive” results, then the institutions of science will not have learned their lesson. Yes, a lot of this is just tweaking the language of how we talk about science. But we have to make “suggestive” and “null” results matter.

“‘Failures,’ on average, are more valuable than positive studies,” Ioannidis says.

Scientific institutions and journals know this. They don’t always act like they do.