The Limits of Statistical Methodology: Why A “Statistically Significant” Number of Published Scientific Research Findings are False, #1.

(Ioannidis, 2005a)


TABLE OF CONTENTS

1. Introduction

2. Troubles in Statistical Paradise

3. A Critique of Bayesianism

4. The Limits of Probability Theory

5. Conclusion


The essay that follows will be published in four installments; this is the first.

But you can also download and read or share a .pdf of the complete text of this essay, including the REFERENCES, by scrolling down to the bottom of this post and clicking on the Download tab.


The Limits of Statistical Methodology: Why A “Statistically Significant” Number of Published Scientific Research Findings are False

1. Introduction

In 2005, John Ioannidis published a now widely cited paper, “Why Most Published Research Findings are False,” in which he pointed out that in many sciences there is a high rate of non-replication and also a high rate of failure of confirmation, due to a number of factors (Ioannidis, 2005a, 2005b). In the next section, I’ll discuss the basic problems with the methodology of statistical significance. One principal reason for holding that most published research findings are false is the reliance on a single study assessed by the methodology of statistical significance, with a p-value of less than 0.05. But Ioannidis also mentioned other factors that collectively would lead one to conclude that most published research findings are false, such as the use of unreasonably small samples, and outright fraud, which has been found to be more common than one would have suspected, or feared (Smith & Smith, 2023a).
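To make the underlying reasoning concrete, here is a minimal sketch, in Python, of a positive predictive value (PPV) calculation of the kind Ioannidis (2005a) employs. The numerical inputs (the prior odds that a tested relationship is real, the study’s power, and the significance threshold) are illustrative assumptions of mine, not figures drawn from any particular study.

# A minimal sketch (not Ioannidis's own code) of the positive predictive
# value argument in Ioannidis (2005a). All numbers are illustrative assumptions.

def ppv(prior_odds, power, alpha):
    """Probability that a 'statistically significant' finding is true,
    given prior odds R that the tested relationship is real, the study's
    power (1 - beta), and the significance level alpha."""
    return (power * prior_odds) / (power * prior_odds + alpha)

# Assume only 1 in 10 tested hypotheses is true (odds R = 1/9),
# a modest power of 0.5, and the conventional alpha = 0.05:
print(ppv(prior_odds=1/9, power=0.5, alpha=0.05))  # about 0.53

# With the low power typical of many fields (0.2), matters worsen:
print(ppv(prior_odds=1/9, power=0.2, alpha=0.05))  # about 0.31

On such assumptions, a single “statistically significant” result is only slightly better than a coin toss as evidence that the tested relationship is real, and under low power the finding is more likely to be false than true.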

Since the publication of Ioannidis’s paper, and even before it (see, e.g., De Long & Lang, 1992), other papers have also been published arguing that “most published research findings are false” (Tabarrok, 2005; Moonesinghe et al., 2007; Diekmann, 2011; Freedman, 2010). Running parallel to this issue, there has been deep concern in the literature about a “reproducibility crisis” in psychology and other sciences (Yong, 2015; Simmons, 2011). For example, in an attempt to replicate results from 98 original papers in three psychology journals, one research team found only 39 of 100 replication attempts successful (with two replication attempts duplicated by separate research teams) (Open Science Collaboration, 2015). While 97 percent of the original studies reported statistically significant results, only 36 percent of the replications did (Open Science Collaboration, 2015).

Matters are even worse in cancer biology research, where only six of 53 high-profile peer-reviewed papers could be replicated, a problem arising in part because the basic cell-line and animal models themselves were inadequate (Begley & Ellis, 2012). Similar problems have been found in neuroscience and genetics research. Button et al. conclude:

[T]he average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. (Button et al., 2013: p. 365)

The same is true of genetics research (Ioannidis & Trikalinos, 2005). In general, “the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%,” with estimates ranging from 51 to 89 percent (Ioannidis, 2008; Freedman et al., 2015; Hartshorne et al., 2012; Everett & Earp, 2015).
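Button et al.’s point about overestimated effect sizes can be illustrated with a small simulation, sketched in Python below. It is only a sketch under assumed values: a true standardized effect of 0.2, thirty participants per group, and the conventional 0.05 significance threshold; none of these figures comes from the studies cited above.

# Sketch of the "winner's curse" under low statistical power.
# All parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.2      # assumed true standardized mean difference
n = 30                 # per-group sample size (yields low power)
sims = 20000

significant_effects = []
for _ in range(sims):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    if abs(diff / se) > 1.96:          # "significant" at roughly p < 0.05
        significant_effects.append(diff)

# Only the significant runs get "published"; their average effect
# is roughly 0.6, about three times the assumed true effect of 0.2.
print(len(significant_effects) / sims)   # power: roughly 0.12
print(np.mean(significant_effects))

With power this low, the effects that do reach significance, and hence publication, look far larger than they really are, which is one mechanism behind the poor reproducibility noted above.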

There is no doubt, as Button et al. (2013) note, that small sample sizes are one factor undermining the reliability of such research. But as Higginson and Munafò have argued, the “institutional incentive structure of academia” and the “publish or perish” mentality, especially the desire for publications in journals with a high Impact Factor (IF), lead researchers to favor small samples in order to get publishable results quickly and maintain the continuity of their careers (Higginson & Munafò, 2016). Using an ecological model, they show that scientists seeking to maintain their “fitness” (academic survivability) will conduct small, novel studies with only 10-40 percent statistical power, so as to publish quickly and reduce research costs. On this model, roughly half of published studies in the sciences will be false, reaching erroneous conclusions (Higginson & Munafò, 2016).

Richard Horton, writing in The Lancet, lamented the precarious state of scientific research:

The case against science is straightforward: much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness. (Horton, 2015: p. 1380)

Horton observes that scientists no longer have an incentive to be “right” in the disinterested pursuit of truth, since academic incentives reward only those who are innovative and productive, however wrong they might be. Ironically, as the “Matthew effect” (Merton, 1968) suggests, the professional academic establishment may be implicitly aware of this problem: in one experiment, papers that had previously been published were resubmitted to journals under different titles. The majority were rejected, not because prior publication was detected, but because of the poor quality of the papers; yet those defects had not been detected when the papers were originally accepted (Peters & Ceci, 1982).

Worse still, reviewers in one study failed to detect all of the errors deliberately inserted into a paper sent out for review, even though the reviewers were peer-review experts in the field (Godlee et al., 1998; Jefferson et al., 2002). R. Smith, commenting in the Journal of the Royal Society of Medicine, concluded that scientific peer review, that is, review by experts, is a process based merely on “belief” (faith) rather than strict rationality:

So, peer review is a flawed process, full of easily identified defects with little evidence that it works. Nevertheless, it is likely to remain central to science and journals because there is no obvious alternative, and scientists and editors have a continuing belief in peer review. How odd that science should be rooted in belief. (Smith, 2006)

Moreover, scientific experts are biased in many ways, including selectively reporting data (Ioannidis et al., 2014); and even outright fraud and the use of “false” data are more frequent than mainstream scientists often suppose (Martin, 1992; Vogel, 2011). I have discussed the limitations of peer review and the reproducibility crisis in other papers (Smith, 2023; Smith & Smith, 2023a, 2023b), and an ingenious response to the reproducibility/replication crisis has been given by Robert Hanna (Hanna, 2023a, 2023b): namely, that properly conducted empirical scientific research does not require reproducibility/replication anyway. It remains to be seen whether mainstream social and biological scientists take up this idea.

Beyond these larger issues, however, the focus of the rest of the paper will be on another issue mentioned by Ioannidis: problems with statistical methodology. It will be argued that critics such as Gigerenzer are right to suppose that problems such as the reproducibility/replication crisis, assuming that reproducibility/replication is even required for properly conducted empirical scientific research, which, as I’ve mentioned, is open to question (Hanna, 2023a, 2023b), can be viewed as a product of “statistical ritual and associated delusions” (Gigerenzer, 2018). We will now explore these “delusions.”


Against Professional Philosophy is a sub-project of the online mega-project Philosophy Without Borders, which is home-based on Patreon here.

Please consider becoming a patron!