The previous example was designed to show the logic of confidence interval theory in an accessible way. It also shows that confidence procedures cannot be assumed to have the properties that analysts desire.

When presenting confidence intervals, CI proponents almost always focus on estimation of the mean of a normal distribution. In this simple case, frequentist and Bayesian (with a “non-informative” prior) answers numerically coincide.^{6} However, the proponents of confidence intervals suggest the use of confidence intervals for many other quantities: for instance, standardized effect size Cohen’s \(d\) (G. Cumming & Finch, 2001), medians (Bonett & Price, 2002; Olive, 2008), correlations (Zou, 2007), ordinal association (Woods, 2007), and many others. Quite often authors of such articles provide no analysis of the properties of the proposed confidence procedures beyond showing that they contain the true value in the correct proportion of samples: that is, that they are confidence procedures. Sometimes the authors provide an analysis of the frequentist properties of the procedures, such as average width. The developers of new confidence procedures do not, as a rule, examine whether their procedures allow for valid post-data reasoning.

As the first example showed, a sole focus on frequentist properties of procedures is potentially disastrous for users of these confidence procedures because a confidence procedure has no guarantee of supporting reasonable inferences about the parameter of interest. Casella (1992) underscores this point with confidence intervals, saying that “we must remember that practitioners are going to make conditional (post-data) inferences. Thus, we must be able to assure the user that any inference made, either pre-data or post-data, possesses some definite measure of validity” (p. 10). Any development of an interval procedure that does not, at least in part, focus on its post-data properties is incomplete at best and extremely misleading at worst: *caveat emptor*.

Can such misleading inferences occur using procedures suggested by proponents of confidence intervals, and in use by researchers? The answer is yes, which we will show by examining a confidence interval for \(\omega^2\), the proportion of variance accounted for in ANOVA designs. The parameter \(\omega^2\) serves as a measure of effect size when there are more than two levels in a one-way design. This interval was suggested by Steiger (2004; see also Steiger & Fouladi, 1997), cited approvingly by G. Cumming (2014), implemented in software for social scientists (e.g., Kelley, 2007a, 2007b), and evaluated, solely for its frequentist properties, by W. H. Finch & French (2012). The problems we discuss here are shared by other related confidence intervals, such as confidence intervals for \(\eta^2\), partial \(\eta^2\), the noncentrality parameter of the \(F\) distribution, the signal-to-noise ratio \(f\), RMSSE \(\Psi\), and others discussed by Steiger (2004).

Steiger (2004) introduces confidence intervals by emphasizing a desire to avoid significance tests, and to focus more on the precision of estimates. Steiger says that “the scientist is more interested in knowing how large the difference between the two groups is (and how precisely it has been determined) than whether the difference between the groups is 0” (pp. 164-165). Steiger and Fouladi (1997) say that “[t]he advantage of a confidence interval is that the width of the interval provides a ready indication of the precision of measurement…” (p. 231). Given our knowledge of the precision fallacy these statements should raise a red flag.

Steiger then offers a confidence procedure for \(\omega^2\) by inverting a significance test. Given the strange behavior of the UMP procedure in the submersible example, this too should raise a red flag. A confidence procedure based on a test — even a good, high-powered test — will not in general yield a procedure that provides for reasonable inferences. We will outline the logic of building a confidence interval by inverting a significance test before showing how Steiger’s confidence interval behaves with data.

To understand how a confidence interval can be built by inverting a significance test, consider that a two-sided significance test of size \(\alpha\) can be thought of as a combination of two one-sided tests at size \(\alpha/2\): one for each tail. The two-sided test rejects when one of the one-tailed tests rejects. To build a 68% confidence interval (i.e., an interval that covers the true value as often as the commonly-used standard error for the normal mean), we can use two one-sided tests of size \((1 - .68)/2 = .16\). Suppose we have a one-way design with three groups and \(N=10\) participants in each group. The effect size \(\omega^2\) in such a design indexes how large \(F\) will be: larger \(\omega^2\) values tend to lead to larger \(F\) values. The distribution of \(F\) given the effect size \(\omega^2\) is called the noncentral \(F\) distribution. When \(\omega^2=0\) (that is, when there is no effect), the familiar central \(F\) distribution is obtained.

Consider first a one-sided test that rejects when \(F\) is large. Figure 6A shows that a test of the null hypothesis that \(\omega^2=.1\) would yield \(p=.16\) when \(F(2,27)=5\). If we tested larger values of \(\omega^2\), the \(F\) value would not lead to a rejection; if we tested smaller values of \(\omega^2\), they would be rejected because their \(p\) values would be below \(.16\). The gray dashed line in Figure 6A shows the noncentral \(F(2,27)\) distribution for \(\omega^2=.2\); it is apparent that the \(p\) value for this test would be greater than .16, and hence \(\omega^2=.2\) would not be rejected by the upper-tailed test of size .16. Now consider the one-sided test that rejects when \(F\) is small. Figure 6B shows that a test of the null hypothesis that \(\omega^2=.36\) would yield \(p=.16\) when \(F(2,27)=5\); any \(\omega^2\) value greater than .36 would be rejected with \(p<.16\), and any \(\omega^2\) value less than .36 would not.
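The two one-sided \(p\) values described above can be checked numerically. The following is a minimal sketch, not the code used to produce Figure 6. It assumes the relation \(\lambda = N_{\text{total}}\,\omega^2/(1-\omega^2)\) between \(\omega^2\) and the noncentrality parameter \(\lambda\) of the \(F\) distribution; this relation is our assumption rather than something stated in the text, although it appears consistent with the values quoted for \(F(2,27)=5\). The function name `one_sided_p_values` is hypothetical.

```python
from scipy import stats

def one_sided_p_values(F, omega2, dfn, dfd, n_total):
    """Upper- and lower-tailed p values for the hypothesis omega^2 = omega2.

    Assumes lambda = n_total * omega2 / (1 - omega2) (an assumption of
    this sketch, not a formula given in the text).
    """
    lam = n_total * omega2 / (1.0 - omega2)
    if lam == 0:
        # No effect: the noncentral F reduces to the central F.
        dist = stats.f(dfn, dfd)
    else:
        dist = stats.ncf(dfn, dfd, lam)
    # sf(F) = P(F' > F): test that rejects when F is large.
    # cdf(F) = P(F' < F): test that rejects when F is small.
    return dist.sf(F), dist.cdf(F)

# Three groups of 10 participants: F(2, 27) = 5, total sample size 30.
p_upper_at_10, _ = one_sided_p_values(5.0, 0.10, 2, 27, 30)
p_upper_at_20, _ = one_sided_p_values(5.0, 0.20, 2, 27, 30)
_, p_lower_at_36 = one_sided_p_values(5.0, 0.36, 2, 27, 30)
print(p_upper_at_10, p_upper_at_20, p_lower_at_36)
```

Under the stated assumption, the first and third printed values should both be close to .16, and the second should exceed .16, matching the description of the two panels.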

Considering the two one-tailed tests together, for any \(\omega^2\) value in \([.1, .36]\), the \(p\) value for both one-sided tests will be greater than \(.16\) and hence will not lead to a rejection. A 68% confidence interval for when \(F(2,27)=5\) can be defined as all \(\omega^2\) values that are not rejected by either of the two one-sided tests, and so \([.1,.36]\) is taken as a 68% confidence interval. A complication arises, however, when the \(p\) value from the ANOVA \(F\) test is greater than \(\alpha/2\); by definition, the \(p\) value is computed under the hypothesis that there is no effect, that is \(\omega^2=0\). Values of \(\omega^2\) cannot be any lower than 0, and hence there are no \(\omega^2\) values that would be rejected by the upper tailed test. In this case the lower bound on the CI does not exist. A second complication arises when the \(p\) value is greater than \(1-\alpha/2\): all lower-tailed tests will reject, and hence the upper bound of the CI does not exist. If a bound does not exist, Steiger (2004) arbitrarily sets it at 0.
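The inversion logic, including the rule of setting a nonexistent bound to 0, can be sketched in a few lines. This is an illustration under assumptions, not Steiger's software or the code behind our figures: it again assumes \(\lambda = N_{\text{total}}\,\omega^2/(1-\omega^2)\) (equivalently \(\omega^2 = \lambda/(\lambda + N_{\text{total}})\)), and the names `tail_probs`, `find_root`, and `omega2_ci` are hypothetical.

```python
from scipy import stats
from scipy.optimize import brentq

def tail_probs(F, dfn, dfd, lam):
    """Return (P(F' > F), P(F' < F)) under noncentrality parameter lam."""
    dist = stats.f(dfn, dfd) if lam == 0 else stats.ncf(dfn, dfd, lam)
    return dist.sf(F), dist.cdf(F)

def find_root(g):
    """Root of an increasing function g on lam >= 0, or None if g(0) >= 0.

    g(0) >= 0 means no lam would be rejected by the corresponding
    one-sided test, so the bound does not exist.
    """
    if g(0.0) >= 0:
        return None
    hi = 1.0
    while g(hi) < 0:  # expand the bracket until the sign changes
        hi *= 2.0
    return brentq(g, 0.0, hi)

def omega2_ci(F, dfn, dfd, n_total, conf=0.68):
    """Interval for omega^2 from two one-sided tests of size (1 - conf) / 2.

    Assumes omega^2 = lam / (lam + n_total); a nonexistent bound is set to 0.
    """
    a = (1.0 - conf) / 2.0
    # Lower limit: the upper-tail p value equals a.
    lam_lo = find_root(lambda lam: tail_probs(F, dfn, dfd, lam)[0] - a)
    # Upper limit: the lower-tail p value equals a.
    lam_hi = find_root(lambda lam: a - tail_probs(F, dfn, dfd, lam)[1])

    def to_omega2(lam):
        return 0.0 if lam is None else lam / (lam + n_total)

    return to_omega2(lam_lo), to_omega2(lam_hi)

ci_lower, ci_upper = omega2_ci(5.0, 2, 27, 30, conf=0.68)
print(ci_lower, ci_upper)
```

For \(F(2,27)=5\), the printed limits should fall near the \([.1,.36]\) interval described above, provided the assumed mapping between \(\lambda\) and \(\omega^2\) holds. For a very small \(F\), both root searches return `None` and the function returns `(0.0, 0.0)`.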

To see how this CI works in practice, suppose we design a three-group, between-subjects experiment with \(N=10\) participants in each group and obtain an \(F(2,27)=0.18, p = 0.84\). Following recommendations for good analysis practices (e.g., Psychonomics Society, 2012; Wilkinson & Task Force on Statistical Inference, 1999), we would like to compute a confidence interval on the standardized effect size \(\omega^2\). Using software to compute Steiger’s CI, we obtain the 68% confidence interval \([0,0.01]\).

Figure 7A (top interval) shows the resulting 68% interval. If we were not aware of the fallacies of confidence intervals, we might publish this confidence interval thinking it provides a good measure of the precision of the estimate of \(\omega^2\). Note that the lower limit of the confidence interval is exactly 0, because the lower bound did not exist. In discussing this situation, Steiger & Fouladi (1997) say:

“[Arbitrarily setting the confidence limit at 0] maintains the correct coverage probability for the confidence interval, but the width of the confidence interval may be suspect as an index of the precision of measurement when either or both ends of the confidence interval are at 0. In such cases, one might consider obtaining alternative indications of precision of measurement, such as an estimate of the standard error of the statistic.” (Steiger and Fouladi, 1997, p. 255)

Steiger (2004) further notes that the “relationship [between CI width and precision] is less than perfect and is seriously compromised in some situations for several reasons” (p. 177). This is a rather startling admission: a major part of the justification for confidence intervals, including the one computed here, is that confidence intervals supposedly allow an assessment of the precision with which the parameter is estimated. The confidence interval fails to meet the purpose for which it was advocated in the first place, but Steiger does not indicate why, or under what conditions the CI will successfully track precision.

We can confirm the need for Steiger’s caution — essentially, a warning about the precision fallacy — by looking at the likelihood, which is the probability density of the observed \(F\) statistic computed for all true values of \(\omega^2\). Notice how narrow the confidence interval is compared to the likelihood of \(\omega^2\). The likelihood falls much more slowly as \(\omega^2\) gets larger than the confidence interval would appear to imply, if we believed the precision fallacy. We can also compare the confidence interval to a 68% Bayesian credible interval, computed assuming standard “noninformative” priors on the means and the error variance.^{7} The Bayesian credible interval is substantially wider, revealing the imprecision with which \(\omega^2\) is estimated.
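The likelihood comparison can be reproduced approximately with a short sketch. It is not the code used for Figure 7: it assumes the mapping \(\lambda = N_{\text{total}}\,\omega^2/(1-\omega^2)\) between \(\omega^2\) and the noncentrality parameter (our assumption), and `omega2_likelihood` is a hypothetical helper name. The likelihood is scaled by its maximum over the grid so that values are easy to compare.

```python
import numpy as np
from scipy import stats

def omega2_likelihood(F, omega2, dfn, dfd, n_total):
    """Density of the observed F statistic, viewed as a function of omega^2.

    Assumes lambda = n_total * omega2 / (1 - omega2) (an assumption of
    this sketch).
    """
    lam = n_total * omega2 / (1.0 - omega2)
    if lam == 0:
        return stats.f.pdf(F, dfn, dfd)  # central F when there is no effect
    return stats.ncf.pdf(F, dfn, dfd, lam)

# Observed F(2, 27) = 0.18 with 30 participants in total.
grid = np.linspace(0.0, 0.30, 31)  # omega^2 values 0, .01, ..., .30
lik = np.array([omega2_likelihood(0.18, w, 2, 27, 30) for w in grid])
rel_lik = lik / lik.max()

for w, r in zip(grid[::5], rel_lik[::5]):
    print(f"omega^2 = {w:.2f}   relative likelihood = {r:.2f}")
```

If the assumed mapping is appropriate, the relative likelihood should still be a sizeable fraction of its maximum at \(\omega^2\) values several times larger than the upper confidence limit, which is the pattern the figure is meant to convey.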

Figure 7B shows the same case, but for a slightly smaller \(F\) value. The precision with which \(\omega^2\) is estimated has not changed to any substantial degree; yet now the confidence interval contains only the value \(\omega^2=0\): or, more accurately, the confidence interval is empty because this \(F\) value would always be rejected by one of the pair of one-sided tests that led to the construction of the confidence interval. As Steiger points out, a “zero-width confidence interval obviously does not imply that effect size was determined with perfect precision,” (p. 177), nor can it imply that there is a 68% probability that \(\omega^2\) is exactly 0. This can be clearly seen by examining the likelihood and Bayesian credible interval.

Some authors (e.g., Dufour, 1997) interpret empty confidence intervals as indicative of model misfit. In the case of this one-way design, if the confidence interval is empty then the means are more similar than would be expected even under the null hypothesis \(\alpha/2\) of the time; that is, \(p>1-\alpha/2\), and hence \(F\) is small. If this significance-test logic of model rejection is used, the confidence interval itself becomes uninterpretable as the model gets close to rejection, because it appears to indicate false precision (Gelman, 2011). Moreover, in this case the \(p\) value is certainly more informative than the CI; the \(p\) value provides graded information that does not depend on the arbitrary choice of \(\alpha\), while the CI is simply empty for all values of \(p>1-\alpha/2\).

Figure 7C shows what happens when we increase the confidence coefficient slightly to 70%. Again, the precision with which the parameter is estimated has not changed, yet the confidence interval now again has nonzero width.
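The dependence of the interval's form on the \(p\) value and the confidence coefficient follows directly from the two rules given earlier: the lower bound does not exist when \(p>\alpha/2\), and the upper bound does not exist when \(p>1-\alpha/2\). A minimal sketch of that bookkeeping follows; the function name `interval_status` and the example \(p\) value of .845 are illustrative choices, not values taken from the figures.

```python
def interval_status(p, conf):
    """Classify the test-inversion interval from the ANOVA p value.

    Uses only the two rules in the text, with alpha = 1 - conf:
      p > 1 - alpha/2  -> neither bound exists (empty interval)
      p > alpha/2      -> lower bound does not exist and is set to 0
    """
    alpha = 1.0 - conf
    if p > 1.0 - alpha / 2.0:
        return "empty"
    if p > alpha / 2.0:
        return "lower bound set to 0"
    return "both bounds exist"

# Same data, two confidence coefficients (hypothetical p value).
print(interval_status(0.845, conf=0.68))
print(interval_status(0.845, conf=0.70))
```

With these inputs the classification changes from an empty interval to one with nonzero width as the confidence coefficient rises from 68% to 70%, even though nothing about the data has changed.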

Figure 7D shows the results of an analysis with \(F(2,27)=4.24, p = 0.03\), and using a 95% confidence interval. Steiger’s interval has now encompassed most of the likelihood, but the lower bound is still “stuck” at 0. In this situation, Steiger and Fouladi advise us that the width of the CI is “suspect” as an indication of precision, and that we should “obtain[] [an] alternative indication[] of precision of measurement.” As it turns out, here the confidence interval is not too different from the credible interval, though the confidence interval is longer and is unbalanced. However, we would not know this if we did not examine the likelihood and the Bayesian credible interval; the only reason we know the confidence interval has a reasonable width in this particular case is its agreement with the actual measures of precision offered by the likelihood and the credible interval.

How often will Steiger’s confidence procedure yield a “suspect” confidence interval? This will occur whenever the \(p\) value for the corresponding \(F\) test is \(p>\alpha/2\); for a 95% confidence interval, this means that whenever \(p>0.025\), Steiger and Fouladi recommend against using the confidence interval for precisely the purpose that they, and other proponents of confidence intervals, recommend it for. This is not a mere theoretical issue; moderately-sized \(p\) values often occur. In a cursory review of papers citing Steiger (2004), we found many that obtained and reported, without note, suspect confidence intervals bounded at 0 (e.g., S. P. Cumming, Sherar, Gammon, Standage, & Malina, 2012; Gilroy & Pearce, 2014; Hamerman & Morewedge, 2015; Lahiri, Maloney, Rogers, & Ge, 2013; Todd, Vurbic, & Bouton, 2014; Winter et al., 2014). Others did not use confidence intervals, instead relying on point estimates of effect size and \(p\) values (e.g., Hollingdale & Greitemeyer, 2014); but from the \(p\) values it could be inferred that if they had followed “good practice” and computed such confidence intervals, they would have obtained intervals that according to Steiger could not be interpreted as anything but an inverted \(F\) test.

It makes sense, however, that authors using confidence intervals would not note that the interpretation of their confidence intervals is problematic. If confidence intervals truly contained the most likely values, or if they were indices of the precision, or if the confidence coefficient indexed the uncertainty we should have that the parameter is in an interval, then it would seem that a CI is a CI: what you learn from one is the same as what you learn from another. The idea that the \(p\) value can determine whether the interpretation of a confidence interval is possible is not intuitive in light of the way CIs are typically presented.

We see no reason why our ability to interpret an interval *should* be compromised simply because we obtained a \(p\) value that was not low enough. Certainly, the confidence coefficient is arbitrary; if the width is suspect for one confidence coefficient, it makes little sense that the CI width would become acceptable just because we changed the confidence coefficient so the interval bounds did not include 0. Also, if the width is too narrow with moderate \(p\) values, such that it is not an index of precision, it seems that the interval will be too wide in other circumstances, possibly threatening the interpretation as well. This was evident with the UMP procedure in the submersible example: the UMP interval was too narrow when the data provided little information, and was too wide when the data provided substantial information.

Steiger and Fouladi (1997) summarize the central problem with confidence intervals when they say that in order to maintain the correct coverage probability — a frequentist pre-data concern — they sacrifice the very thing researchers want confidence intervals to be: a post-data index of the precision of measurement. If our goal is to move away from significance testing, we should not use methods which cannot be interpreted except as inversions of significance tests. We agree with Steiger and Fouladi that researchers should consider obtaining alternative indications of precision of measurement; luckily, Bayesian credible intervals fit the bill rather nicely, rendering confidence intervals unnecessary.

^{6} This should not be taken to mean that inference by confidence intervals is not problematic even in this simple case; see, e.g., Brown (1967) and Buehler & Feddersen (1963).

^{7} See supplement for details. We do not generally advocate non-informative priors on parameters of interest (Rouder, Morey, Speckman, & Province, 2012; Wetzels, Grasman, & Wagenmakers, 2012); in this instance we use them as a comparison because many people believe, incorrectly, that confidence intervals numerically correspond to Bayesian credible intervals with noninformative priors.