The Limits of Statistical Methodology: Why A “Statistically Significant” Number of Published Scientific Research Findings are False, #2.

(Ioannidis, 2005a)


TABLE OF CONTENTS

1. Introduction

2. Troubles in Statistical Paradise

3. A Critique of Bayesianism

4. The Limits of Probability Theory

5. Conclusion


The essay that follows is being published in four installments; this is the second.

But you can also download and read or share a .pdf of the complete text of this essay, including the REFERENCES, by scrolling down to the bottom of this post and clicking on the Download tab.


2. Troubles in Statistical Paradise

There is considerable debate about the correct interpretation and epistemological merits of significance testing, with many methodologists maintaining that the current approach is not scientific. For example, a paper published in Nature on 20 March 2019 with 800 signatories (Amrhein et al., 2019) summarized the most recent debates and criticisms; indeed, the literature suggests that there has been a recurring cycle of criticism and doubt about this statistical method over the decades. Amrhein et al. proposed that the very idea of statistical significance should be retired:

[W]e are calling for a stop to the use of P values in the conventional, dichotomous way … to decide whether a result refutes or supports a scientific hypothesis. (Amrhein et al., 2019).

The reason for this, which has often been given in the critical literature, is

that all statistics, including P values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in P values, far beyond falling just to either side of the 0.05 threshold. (Amrhein et al., 2019)

On theoretical grounds, statistical significance can be spurious, arising from pure noise (McShane et al., 2019: p. 235). One reason this is possible is that a result with a p-value below 0.05 is not necessarily evidence of a causal relationship (Holmon et al., 2001), and indeed,

researchers typically take the rejection of the sharp point null hypothesis of zero effect and zero systematic error as positive or even definitive evidence in favour of some preferred alternative hypothesis—a logical fallacy. (McShane et al., 2019: p. 237)

Thus, many statisticians and methodologists believe that the very idea of statistical significance should “expire” (Hurlbert et al., 2019). These problems will now be examined in more depth.

There are deep, troublesome problems with statistics, related to the philosophical foundations of statistical inference, which cast doubt on the objectivity of statistical evidence (Kaye, 1986; Fienberg et al., 1995). A battle of epochal scale has been, and is still being, fought between Bayesian hypothesis testing and the traditional approach. That traditional approach is a hybrid of Ronald Fisher’s significance testing and the hypothesis-testing method of Jerzy Neyman and Egon Pearson, and the hybrid is usually known as the Null Hypothesis Significance Test (NHST). Stated simply, two hypotheses are formulated. The first is a statistical hypothesis called the null hypothesis (or restricted hypothesis) H0, and the second is called the alternative or research hypothesis Hr. The null hypothesis states that there is no difference between the populations from which the two samples are taken, so that observed differences arise by chance alone. The alternative or research hypothesis is a proposition in probabilistic form about aspects of the data, which is operationalized through a parameter θ. The null hypothesis might posit that θ = 0 and the research hypothesis that θ ≠ 0. Under the assumption that the null hypothesis is true, a test statistic, such as a chi-square statistic or a t statistic in linear regression analysis, which is a function of θ and the collected data, is then computed. A p value is then determined; as Nickerson summarizes:

Application of NHST to the difference between the two means yields a value of p, the theoretical probability that if two samples of the size of those used had been drawn at random from the same population, the statistical test would have yielded a statistic (e.g., t) as large or larger than the one obtained. (Nickerson, 2000: p. 242)

A significance level α is specified, often set at 0.05, and the null hypothesis is rejected only if the p value is not greater than α, in which case the result is statistically significant at the 0.05 level. Thus, either the null hypothesis is rejected or there is a failure to reject the null hypothesis.
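To make the procedure concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the two samples, their sizes, and the choice of a two-sample t test as the test statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical samples from two populations; under H0 they share a mean.
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.2, scale=1.0, size=50)

# Two-sided two-sample t test: p is the probability, computed under the
# assumption that H0 is true, of a t statistic at least as extreme as
# the one actually obtained.
t_stat, p_value = stats.ttest_ind(a, b)

alpha = 0.05  # the conventional significance level
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject H0 at the 0.05 level")
else:
    print(f"p = {p_value:.3f} > {alpha}: fail to reject H0")
```

Note that the final dichotomous verdict in this sketch is precisely the conventional usage that Amrhein et al. call on researchers to abandon.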

Rejecting the null hypothesis is conventionally taken to be indirect evidence for the research hypothesis, since chance has supposedly been eliminated as an explanation for sample differences. Nevertheless, it is a fallacy to treat failure to disconfirm as confirmation, and to suppose that if H0 is rejected the theory is thereby established as true: it may still be false, but not by chance (Oakes, 1986: p. 83). NHST does not answer the question, “Given these data, what is the probability that H0 is true?” (Cohen, 1994: p. 997). Rather, it answers the question, “Given that H0 is true, what is the probability of these (or more extreme) data?” (Cohen, 1990).

NHST has been subjected to searching criticism (Selvin, 1957; Nunnally, 1960; Rozeboom, 1960; Bakan, 1966; Lykken, 1968; Morrison & Henkel eds, 1970; Carver, 1978; Glass et al., 1981; Guttman, 1985; Rosenthal & Rubin, 1985; McCloskey, 1986; Pratt, 1987; Chow, 1988, 1996, 1998; Loftus, 1991; Schmidt, 1991; Goodman, 1993; Frick, 1996; Kirk, 1996; Abelson, 1997; Berger et al., 1997; Harlow et al. eds, 1997; Hagen, 1997; Harris, 1997; Hunter, 1997; Shrout, 1997; Johnson, 1997; Albert, 2002; Gliner et al., 2002; Hubbard & Bayarri, 2003; Morgan, 2003; Fidler et al., 2004; Banasiewicz, 2005; Gelman & Stern, 2006). Many critics argue that the method lacks a sound scientific basis (Bakan, 1966; Carver, 1978; Gigerenzer, 1998; Anderson et al., 2000; Sterne & Hunter, 2001; Schmidt & Hunter, 2002; Armstrong, 2007; Wasserstein & Lazar, 2016; Hurlbert et al., 2019). More generally, the criticisms are many and fundamental (Hurlbert & Lombardi, 2009).

For example, if the sample size is large enough, statistical significance can occur for trivial effects (Ziliak & McCloskey, 2008). P-values depend upon sample size, and with a large enough sample the null hypothesis may be rejected (Berkson, 1938; Rozeboom, 1960; Grant, 1962; Bakan, 1966; Johnson, 1999, 2005; Shrader-Frechette, 2008). As I noted above, statistical significance can be generated by “pure noise” (Carney et al., 2010; Bem, 2011). Most null hypotheses are known to be false before any data are collected: indeed, nearly all null hypotheses are false a priori (Ziliak & McCloskey, 2008).
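The dependence of significance on sample size is easy to simulate. In the following sketch (all numbers are hypothetical), the true difference between the two population means is fixed at a negligible 0.02 standard deviations, yet the p value falls below 0.05 once the samples are large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
effect = 0.02  # a trivially small true difference in means (hypothetical)

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n = {n:>9,}: p = {p:.4f}")

# As n grows the test gains power against even a negligible effect, so
# "statistically significant" need not mean practically significant.
```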

Further, it is not necessarily the case that a small p value shows strong evidence against the null; according to statisticians Berger and Sellke:

[A]ctual evidence against a null (as measured, say, by posterior probability or comparative likelihood) can differ by an order of magnitude from the P value. For instance, data that yield a P value of .05, when testing a normal mean, result in a posterior probability of the null of at least .30 for any objective prior distribution. (Berger & Sellke, 1987: p. 112)

Correspondingly, they conclude: “P values can be highly misleading measures of the evidence provided by the data against the null hypothesis” (Berger & Sellke, 1987: p. 112; see also Berger & Berry, 1988; Simberloff, 1990). Moreover, the difference between “significant” and “not significant” has been shown to be not itself statistically significant, since no sharp conceptual demarcation is possible (Rosnow & Rosenthal, 1989; Gelman & Stern, 2006).
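Berger and Sellke’s own calculation is not reproduced here, but their order-of-magnitude point can be illustrated with a standard lower bound on the Bayes factor for a point null, B ≥ −e·p·ln(p) for p < 1/e. The following sketch assumes that bound and equal prior odds on the two hypotheses; it is an illustration, not their exact derivation:

```python
import math

def min_posterior_null(p, prior_odds=1.0):
    """Lower bound on P(H0 | data) via the -e*p*ln(p) Bayes-factor bound.

    Valid only for 0 < p < 1/e. Equal prior odds on H0 and H1 are assumed
    by default; this is an illustration, not Berger & Sellke's own proof.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    min_bayes_factor = -math.e * p * math.log(p)  # minimum odds favouring H0
    posterior_odds = prior_odds * min_bayes_factor
    return posterior_odds / (1.0 + posterior_odds)

print(f"{min_posterior_null(0.05):.2f}")  # ~0.29, near the 'at least .30' figure
```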

Bayesians Colin Howson and Peter Urbach, in Scientific Reasoning: The Bayesian Approach, are highly critical of significance testing (Howson & Urbach, 2006). For example, they point out that the chi-square test is “used to test theories asserting that some population has a particular, continuous probability distribution, such as the normal distribution” and to test such a theory,

the range of possible results of some sampling trial would be divided into several intervals and the number of subjects falling into each would be compared with the ‘expected’ number. (Howson & Urbach, 2006: p. 139)
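The partition-dependence they describe is easy to exhibit. In this sketch (the sample, its size, and the cut points are all hypothetical), the same data are tested against the same normal hypothesis under two different partitions, and the resulting p values generally differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical data

def chi_square_p(sample, cuts):
    """Pearson goodness-of-fit p value for H0: N(0, 1), given interior cut points."""
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    observed = np.bincount(np.searchsorted(cuts, sample), minlength=len(cuts) + 1)
    expected = len(sample) * np.diff(stats.norm.cdf(edges))  # model cell counts
    return stats.chisquare(observed, f_exp=expected).pvalue

# Two partitions of the same outcome space, applied to the same sample:
p_coarse = chi_square_p(sample, np.array([-0.5, 0.5]))
p_fine = chi_square_p(sample, np.array([-1.5, -0.5, 0.5, 1.5]))
print(f"coarse partition: p = {p_coarse:.3f}")
print(f"fine partition:   p = {p_fine:.3f}")  # generally differs from the coarse value
```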

However, the test is “vitiated by the absence of any principled rule for partitioning the outcomes into separate intervals or cells,” for not all partitions “lead to the same inferences when the significance test is applied” (Howson & Urbach, 2006: p. 139). They conclude that there is no epistemic basis for the chi-square test. Furthermore, Lindley’s paradox, which shows that a well-supported hypothesis can be rejected in significance tests (Lindley, 1957; Loftus, 1996), indicates that

the classical thesis that a null hypothesis may be rejected with greater confidence, the greater the power of the test is not borne out; indeed, the reverse trend is signalled. (Howson & Urbach, 2006: p. 154) 
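A minimal numerical sketch conveys the reversal. Assume a test of H0: μ = 0 against an alternative on which μ is drawn from N(0, τ²), with the sample mean pinned to the conventional 0.05 rejection boundary (z = 1.96) as the sample size grows; the prior scale τ and the sample sizes are hypothetical, and the normal–normal setup is my illustration rather than Howson and Urbach’s own example:

```python
import math

def bayes_factor_for_null(z, n, tau=1.0, sigma=1.0):
    """B01 for H0: mu = 0 vs H1: mu ~ N(0, tau^2), given xbar = z*sigma/sqrt(n).

    B01 > 1 means the data favour the null. Illustrative only; the prior
    under H1 is an assumption.
    """
    v0 = sigma**2 / n          # variance of the sample mean under H0
    v1 = tau**2 + v0           # marginal variance of the sample mean under H1
    xbar = z * math.sqrt(v0)   # data pinned to the p = 0.05 boundary
    return math.sqrt(v1 / v0) * math.exp(-0.5 * xbar**2 * (1 / v0 - 1 / v1))

for n in (10, 1_000, 100_000):
    print(f"n = {n:>7,}: B01 = {bayes_factor_for_null(1.96, n):.1f}")

# The p value stays at 0.05 (a 'rejection'), yet as n grows the Bayes
# factor increasingly favours the very hypothesis being rejected.
```

Holding the significance verdict fixed, larger samples, and hence more powerful tests, make the “rejected” null ever better supported, which is the reverse trend the quotation describes.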

In their opinion, Lindley’s paradox “shows unanswerably and decisively that inferences drawn from significance tests have no inductive significance whatsoever” (Howson & Urbach, 2006: p. 154). Likewise, they are skeptical about the epistemic cogency of classical estimates:

classical ‘estimates’ are not estimates in any normal or scientific sense, and, like judgments of ‘significance’ and ‘non-significance’, they carry no inductive meaning at all.  Therefore, they cannot be used to arbitrate between rival theories or to determine practical policy. (Howson & Urbach, 2006: p. 182)

In conclusion, they reject frequentism in favor of Bayesianism, as “classical methods are set altogether on the wrong lines, and are based on ideas inimical to scientific method” (Howson & Urbach, 2006: p. 182).

McShane et al. also believe that the problems with NHST remain unresolved, and are not remedied by proposed fixes such as modified p-value thresholds, Bayes factors, or confidence intervals, holding that

it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures. (McShane et al., 2019: p. 236)

These criticisms do not depend on Bayesian assumptions, and will hold even if Bayesianism is itself rejected on independent grounds, as I will argue below.

There is a further challenging critique of this field of statistics, alleging that scientific inference makes only limited use of formal statistical inference, which applies the statistical toolkit to random samples of data (Guttman, 1985; Gigerenzer, 2004; Hubbard et al., 2019). Sciences harder than psychology and the social sciences, such as physics, astrophysics, cosmology, and chemistry, construct mathematical models of physical phenomena, usually by means of ordinary and partial differential equations, in order to produce testable hypotheses; these are then subjected to carefully designed experiments and/or observational studies, which are most often not random at all, the aim being to obtain empirically replicable and generalizable data (Harman, 1965). As Hubbard et al. state:

Scientific inference is better viewed as being grounded in abductive (explanatory) reasoning. Abduction—sometimes termed inference to the best explanation … takes as its locus the studying of facts and proposing a theory about the causal mechanisms generating them. Thus, abduction is a method of scientific inference geared toward the development of feasible and best explanations for the stubborn facts we possess. Like detective work, this approach mirrors the behavior of practicing scientists. And it is not beholden to methods of formal statistical inference. (Hubbard et al., 2019: p. 96)

Others agree: “Much of causal inference is beyond statistics” (Shadish & Cook, 1999: p. 298); “Statistical inference … is fundamentally incompatible with ‘most’ science” (Gunter & Tong, 2016-2017: p. 1). More generally, statistical methods have only limited application in most physical sciences, since “the estimation of fixed population parameters from random samples is limited” (Guttman, 1985; Gigerenzer, 2004).


Against Professional Philosophy is a sub-project of the online mega-project Philosophy Without Borders, which is home-based on Patreon here.

Please consider becoming a patron!