What is the replication crisis and why did it happen?
If you haven’t already heard – there’s a replication crisis going on in psychology. At the heart of this crisis is the spreading realization that many of the foundational studies that you learned about in your undergrad psych class or read about in the news are probably wrong. Willpower is a limited resource? Probably not. Thinking about old age will make you walk slower? Not so much. The list goes on and on. It’s easy to lose faith in psychology if you read too many of these failed replications, especially when some of the most serious attempts to get a broad perspective on the problem show that fewer than half of the studies replicate: 39%, in fact. Yikes. Should we just scrap the whole enterprise and start from scratch? In this series of posts, I’ll argue that some old philosophers of science have some fresh insights that can help us stay clear-headed about what’s going on and retain a tempered confidence in psychology’s capacity to tell us about ourselves.
Before diving into those insights, it’s important that we talk about what the replication crisis is and why it emerged. Being able to repeat a study and find the same (or at least a similar) result is fundamental to our trust in science. Imagine your doctor prescribing a drug that had only been successful in one study where she gave it to thirty of her interns. Would you take that pill? You want treatments that have been tested lots of different times on loads of different people. It’s the same for any scientific theory – the more times it’s been tested with the same results, the more confident we can be that the effect or relationship it’s describing is real. As Karl Popper put it back in 1934:
“non-reproducible single occurrences are of no significance to science”
At the core of the replication crisis is the simple fact that lots of the foundational studies in psychology were not being repeated, and once people started trying to replicate these findings, they weren’t finding the same thing. Crisis. That’s the basic “what” of the replication crisis. But why did this happen?
There are lots of answers to this question and some are more technical than others. But it’s helpful to start by thinking about some of the incentive structures that guide research. I’m talking about publishing articles in the most prestigious journals you can. This is the primary way that psychologists get grants, get hired, and get tenure. These tangible concerns are important because publishing a replication was really hard and would almost never happen in the fancy journals. Same goes for getting grants – pitching a funding agency to support a project that’s already been done wasn’t a good strategy for getting money. In other words, the whole system was (and still largely is) set up against doing replications.
This isn’t to say that researchers were all working on totally novel theories with each study. Instead, the typical strategy would be to take the basic idea from one study and then shift it slightly. For example, take the study about priming people with ideas about old age and watching how slowly they walk, and instead test whether being exposed to guilty feelings will make people want to wash their hands (also, probably not the case). This is called a conceptual replication – in this example the broad concept of social priming is being replicated instead of the specific association between old age and walking slowly. I’ll talk more about the proliferation of theories that conceptual replications cause in the next post. For now though, note how easy it would be to feel as if findings were being replicated, when in fact something totally different was being tested. These conceptual replications aren’t a direct cause of the crisis, but they do tend to muddy the waters.
To understand a more direct cause, imagine you’re a young psychologist, recently hired and trying to get a lab running. Building on work you did in grad school, you decide to run a study that tests how close relationships impact self-control. But, when you run the study and do the analyses, hmmm, there doesn’t seem to be any relationship. Now you’re faced with a choice – do you trust those findings (or lack thereof) or do you think that maybe something just went wrong with this study? In light of all the other research you’ve read that suggests there should be a relationship here, it’s quite understandable to just assume something went wrong, shelve the results, and try again. No worries? Not exactly.
The problem is that you don’t know how many other labs have done the same thing. This is called the file-drawer problem and it’s a major cause of the replication crisis. If there are too many null-results like this that never see the light of day, this can severely undermine the positive results that are reported. To see why, you’ve got to know that the statistical tests psychologists tend to use accept a 1 in 20 chance of a false positive: even when there is no real effect, about one study in 20 will come out statistically significant anyway. Those seem like good odds until you remember all the other labs out there who have tried to do similar studies. At the heart of this problem is the fact that journals tend to only publish positive results. For each of those studies you read to make you think there should be a relationship, there might be 19 null-results sitting in various hard drives (though hard-drive-problem doesn’t have quite the same poetic ring). If that’s the case, then all those published results are probably just the result of chance – in other words, they’re not real.
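If you like to see this sort of thing in numbers, here is a rough sketch of the file-drawer problem in Python. Everything in it is an assumption made for illustration: the 2,000 hypothetical labs, the 30 people per group, and the simple z-test with a known standard deviation standing in for whatever test a real lab would run. The true effect is set to zero, so every "significant" study is a false positive by construction.

```python
import random
from statistics import mean


def run_null_study(rng, n_per_group=30):
    """Simulate one study where the true effect is zero.

    Returns True if the study comes out 'statistically significant'.
    """
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
    # z-test for a difference in means, assuming a known SD of 1
    # (a simplification chosen to keep the sketch short).
    z = (mean(treatment) - mean(control)) / (2 / n_per_group) ** 0.5
    return abs(z) > 1.96  # two-sided test at the 0.05 level


def simulate_labs(n_labs=2000, seed=1):
    """Return (published, shelved) counts for labs all studying a null effect."""
    rng = random.Random(seed)
    results = [run_null_study(rng) for _ in range(n_labs)]
    published = sum(results)        # significant results get written up
    shelved = n_labs - published    # null results go in the file drawer
    return published, shelved


published, shelved = simulate_labs()
print(f"published (all false positives): {published}")
print(f"left in the file drawer: {shelved}")
print(f"share of studies that were significant: {published / (published + shelved):.3f}")
```

The share of significant studies should land near 0.05. Anyone who only reads the published pile sees a run of positive findings for an effect that does not exist.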
One of the other main causes of the replication crisis is what are known as questionable research practices (QRPs if you wanna sound cool). Put on the imaginary lab coat of that young researcher once more. You’re doing a second-round attempt at the self-control study and some of your new students are interested in depression and attachment styles. “Great,” you think, “there are short surveys for those, so I’ll just tag them on.” You don’t really have hypotheses that you’re trying to test, but figure it’s worth including so they can hit the ground running with some data.
In and of itself this isn’t really a bad thing. At least it wouldn’t be if you had specific stuff you wanted to test. The potential problems emerge when you start analyzing your data. Say you went through and looked at all the relationships between your variables. If you already had 5 variables in the study, a really modest amount, then adding these other 2 will give you at least 21 different relationships (and that’s not counting any multi-variable interactions you might test). It turns out that once again self-control isn’t related to anything, bummer. But aha! The relationship between depressive symptoms and people’s religious attendance was statistically significant. That’s cool and you can think of a solid background literature that would support this finding. Do you write it up and try to publish?
For decades the answer was almost certainly yes. But, remember – each test you run has a 1 in 20 chance of coming up significant through randomness alone. That’s a problem because you had at least 21 relationships you could’ve tested. Imagine I was trying to hit the 20 bed on a dartboard. I throw 21 darts and most of them miss the board altogether (I’m really bad at darts). But one of them hits the 16 bed. If I pretend that I’d been trying to hit that bed all along, in this case by not reporting my initial intention (hitting the 20) or any of the other tests I performed (all the other darts I threw), then I’m presenting what is likely just chance as if it were a solid result backed up by rigorous stats.
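The arithmetic behind the dartboard can be sketched in a few lines of Python. This assumes the 21 tests are independent and that none of the relationships is real, which is a simplification (correlations among the same handful of variables are not fully independent), and the function name is just one I made up for illustration.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests is
    'significant' by chance alone, when no real effects exist."""
    return 1 - (1 - alpha) ** n_tests


n_tests = 21  # the number of relationships in the example above

print(f"one test:  {chance_of_false_positive(1):.2f}")
print(f"{n_tests} tests:  {chance_of_false_positive(n_tests):.2f}")
# Expected number of chance 'hits' among the 21 darts:
print(f"expected false positives: {n_tests * 0.05:.2f}")
```

Under those assumptions, the chance of at least one spurious hit across 21 tests is roughly two in three, and you would expect about one false positive on average. Finding a single significant relationship is pretty much what randomness alone would hand you.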
This selective reporting is just one example of QRPs. Others include removing data that doesn’t fit your predictions, collecting more data if you don’t have the results you want (or stopping collection once you do), or just straight up falsifying data. The last of these is pretty rare and in most cases the use of QRPs is not a deliberately devious thing, even if it is pretty prevalent (check out this study where researchers admit, anonymously, to doing these things). For decades, training in psychology labs included these practices as the norm. Plus, remember the incentives above – we’re talking about keeping your job. With that weighing on you, it’s really easy to give yourself and your students some liberty here. After all, how bad could the consequences be?
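One way to get a feel for the answer is to simulate just one of these practices: stopping data collection as soon as the result looks significant. The sketch below is hypothetical throughout. The batch size of 10, the cap of 100 people per group, and the z-test with a known standard deviation are all assumptions picked to keep it short. The true effect is zero, so any "finding" is a false positive.

```python
import random


def peeking_study(rng, batch=10, max_n=100):
    """One null study where the researcher runs a test after every batch
    of participants and stops as soon as the result is significant."""
    sum_control = sum_treatment = 0.0
    n = 0  # participants per group so far
    while n < max_n:
        for _ in range(batch):
            sum_control += rng.gauss(0, 1)
            sum_treatment += rng.gauss(0, 1)
        n += batch
        # z-test for a difference in means, assuming a known SD of 1
        z = (sum_treatment / n - sum_control / n) / (2 / n) ** 0.5
        if abs(z) > 1.96:
            return True   # stop collecting and report an "effect"
    return False          # reached the cap without a significant result


def false_positive_rate(n_studies=2000, seed=1, **kwargs):
    """Share of simulated null studies that end up 'significant'."""
    rng = random.Random(seed)
    hits = sum(peeking_study(rng, **kwargs) for _ in range(n_studies))
    return hits / n_studies


print(f"testing after every 10 people: {false_positive_rate():.3f}")
print(f"testing once at 100 people:    {false_positive_rate(batch=100):.3f}")
```

Testing once should give a rate near the advertised 1 in 20. Peeking after every batch pushes it well above that, even though each individual test still uses the same 0.05 threshold.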
Well, pretty bad. Think back to the big project mentioned above that found only 39% of psychological studies could be replicated. That sort of finding has cast an air of mistrust over the whole enterprise. In light of how prevalent these practices are, it’s reasonable to be skeptical. But, recognizing the extent of this problem has spurred some radical and substantial reforms, such as changing statistical practices, more transparent data sharing, and pre-registering of studies. These practices will help create a more robust psychological science and are a cause for hope in and of themselves. But, what to do with the past four decades of psychological research? Are we best off just trashing all of that? I don’t think so, and in the coming posts I’ll explain why.
 This example is purely hypothetical – it’s not a sign that I think the work on religiosity and depression is ill-founded.
 It’s worth noting that this percentage varies pretty radically by field. On the whole, cognitive psychology and personality psychology are weathering the replication crisis better than social psychology. Many of these differences may boil down to method – experiments that use a between-subjects design (comparing a control group with a treatment group) dominate social psychology and are really hard to replicate. Within-subjects designs (where individuals are exposed to multiple treatments and act as their own controls) tend to fare better: https://replicationindex.com/tag/cognitive-psychology/