The four statisticians report their confidence procedures to the rescue team, who are understandably bewildered that there appear to be at least four ways to summarize the data about the hatch location from two bubbles. Just after the statisticians present their procedures, two bubbles appear at locations \(x_1=1\) and \(x_2=1.5\). The resulting likelihood and the four confidence intervals are shown in Figure 1A.

After using the observed bubbles to compute the four confidence intervals, the rescuers wonder how to interpret them. It is clear, first of all, why the fundamental confidence fallacy is a fallacy. As Fisher pointed out in the discussion of CI theory mentioned above, for any given problem — as for this one — there are many possible confidence procedures. These confidence procedures will lead to different confidence intervals. In the case of our submersible confidence procedures, all confidence intervals are centered around \(\bar{x}\), and so the intervals will be nested within one another.
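To make the nesting concrete, the four intervals can be computed directly. The sketch below assumes the setup of this example — each bubble drifts independently and uniformly up to 5 meters to either side of the hatch of a 10-meter craft — and the closed-form half-widths are our reconstruction of the four procedures, not code from the original analysis:

```python
import math

def intervals(x1, x2, half=5.0):
    """50% intervals for the hatch location from two bubble locations.

    Assumes bubbles are uniform within `half` meters of the hatch; the
    half-width formulas are a hypothetical reconstruction of the four
    procedures discussed in the text.
    """
    xbar = (x1 + x2) / 2          # common center of all four intervals
    b = abs(x2 - x1)              # distance between the bubbles
    L = 2 * half                  # length of the craft
    w = {
        "sampling dist.": half * (1 - math.sqrt(0.5)),  # fixed half-width
        "nonparametric":  b / 2,                        # (min, max) of the bubbles
        "UMP":            min(b, L - b) / 2,            # nonparametric, truncated
        "Bayes":          (L - b) / 4,                  # central 50% of the posterior
    }
    return {name: (xbar - hw, xbar + hw) for name, hw in w.items()}

for name, (lo, hi) in intervals(1.0, 1.5).items():
    print(f"{name:>14}: [{lo:.3f}, {hi:.3f}]")
```

Under these formulas all four intervals are centered at \(\bar{x}=1.25\); the nonparametric and UMP intervals coincide at [1.0, 1.5] and sit inside the wider sampling distribution and Bayes intervals.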

If we mistakenly interpret these observed intervals as having a 50% probability of containing the true value, a logical problem arises. First, there must always be a 50% probability that the *shortest* interval contains the parameter. The reason is basic probability theory: the narrowest interval would have probability 50% of including the true value, and the widest interval would have probability 50% of excluding the true value. According to this reasoning, there must be a 0% probability that the true value is outside the narrower, nested interval yet inside the wider interval. If we believed the FCF, we would always come to the conclusion that the shortest of a set of nested \(X\%\) intervals has an \(X\%\) probability of containing the true value. Of course, the confidence procedure “always choose the shortest of the nested intervals” will tend to have a lower than \(X\%\) probability of including the true value. If we believed the FCF, then we must come to the conclusion that the shortest interval simultaneously has an \(X\%\) probability of containing the true value, and a less than \(X\%\) probability. Believing the FCF results in logical contradiction.
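The claim that "always choose the shortest of the nested intervals" under-covers can be checked by simulation. The sketch below assumes the uniform bubble model of this example (bubbles within 5 meters of the hatch) and half-width formulas that are our reconstruction of the four procedures; it estimates the long-run coverage of the shortest-interval strategy:

```python
import random

random.seed(1)
HALF = 5.0                          # bubbles drift at most 5 m from the hatch (assumed model)
L = 2 * HALF                        # length of the craft
SD_W = HALF * (1 - 0.5 ** 0.5)      # fixed half-width of the sampling-dist. interval

def half_widths(b):
    """Half-widths of the four nested 50% intervals, given |x2 - x1| = b
    (reconstructed formulas: sampling dist., nonparametric, UMP, Bayes)."""
    return (SD_W, b / 2, min(b, L - b) / 2, (L - b) / 4)

trials, hits = 200_000, 0
theta = 0.0                         # true hatch location (without loss of generality)
for _ in range(trials):
    x1 = random.uniform(theta - HALF, theta + HALF)
    x2 = random.uniform(theta - HALF, theta + HALF)
    xbar, b = (x1 + x2) / 2, abs(x2 - x1)
    w = min(half_widths(b))         # "always pick the shortest interval"
    hits += (xbar - w <= theta <= xbar + w)

print(f"coverage of the shortest nested 50% interval: {hits / trials:.2f}")
```

Each individual procedure covers 50% of the time, but under these assumptions the shortest-interval strategy covers only about a third of the time, which is precisely the contradiction described above.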

This point regarding the problem of interpreting nested CIs is not, by itself, a critique of confidence interval theory *proper*; it is rather a critique of the folk theory of confidence. Neyman himself was very clear that this interpretation was not permissible, using similarly nested confidence intervals to demonstrate the fallacy (Neyman, 1941, pp. 213–215). It is a warning that the improper interpretations of confidence intervals used throughout the scientific literature lead to mutually contradictory inferences, just as Fisher warned.

Even without nested confidence procedures, one can see that the FCF must be a fallacy. Consider Figure 1B, which shows the resulting likelihood and confidence intervals when \(x_1=0.5\) and \(x_2=9.5\). When the bubbles are far apart, as in this case, the hatch can be localized very precisely: the bubbles are far enough apart that they must have come from the bow and stern of the submersible. The sampling distribution, nonparametric, and UMP confidence intervals all encompass the likelihood, meaning that there is 100% certainty that these 50% confidence intervals contain the hatch. Reporting 50% certainty, 50% probability, or 50% confidence in a specific interval that surely contains the parameter would clearly be a mistake.
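This can be checked directly by conditioning on bubbles 9 meters apart, as in Figure 1B. The simulation below uses the uniform bubble model of this example and half-width formulas that are our reconstruction of the procedures; under these assumptions the three classical intervals cover the hatch every time, while the Bayes interval covers half the time:

```python
import random

random.seed(2)
HALF, L, theta = 5.0, 10.0, 0.0     # assumed model; true hatch at 0
b = 9.0                             # bubbles 9 m apart, as in Figure 1B
n = 50_000

classical = [0, 0, 0]               # sampling dist., nonparametric, UMP
bayes = 0
for _ in range(n):
    # Given |x2 - x1| = b, xbar - theta is uniform on +/-(HALF - b/2).
    xbar = theta + random.uniform(-(HALF - b / 2), HALF - b / 2)
    for i, w in enumerate([HALF * (1 - 0.5 ** 0.5), b / 2, min(b, L - b) / 2]):
        classical[i] += (xbar - w <= theta <= xbar + w)
    w_bayes = (L - b) / 4
    bayes += (xbar - w_bayes <= theta <= xbar + w_bayes)

print("classical coverage given b = 9:", [c / n for c in classical])  # all 1.0
print("Bayes coverage given b = 9:    ", bayes / n)                   # about 0.5
```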

The fact that we can have 100% certainty that a 50% CI contains the true value is a specific case of a more general problem flowing from the FCF. The shaded regions in Figure 2, left column, show when the true value is contained in the various confidence procedures for all possible pairs of observations. The top, middle, and bottom rows correspond to the sampling distribution, nonparametric/UMP, and the Bayes procedures, respectively. Because each procedure is a 50% confidence procedure, in each plot the shaded area occupies 50% of the larger square delimiting the possible observations. The points ‘a’ and ‘b’ are the bubble patterns in Figure 1A and B, respectively; point ‘b’ is in the shaded region for all intervals because the true value is included in all the intervals, as shown in Figure 1B; likewise, ‘a’ is outside the shaded region because all CIs exclude the true value for these bubbles.

Instead of considering the bubbles themselves, we might also translate their locations into the mean location \(\bar{x}\) and the difference between them, \(b=x_2-x_1\). We can do this without loss of any information: \(\bar{x}\) contains the point estimate of the hatch location, and \(b\) contains the information about the precision of that estimate. Figure 2, right column, shows the same information as in the left column, except as a function of \(\bar{x}\) and \(b\). The figures in the right column are \(45^\circ\) clockwise rotations of those in the left. Although the two columns show the same information, the rotated right column reveals a critical fact: the various confidence procedures have different probabilities of containing the true value when the distance between the bubbles varies.

To see this, examine the horizontal line under point ‘a’ in Figure 2B. The horizontal line is the subset of all bubble pairs that show the same difference between the bubbles as those in Figure 1A: \(0.5\) meters. About 31% of this line falls under the shaded region, meaning that in the long run, 31% of sampling distribution intervals will contain the true value when the bubbles are \(0.5\) meters apart. For the nonparametric and UMP intervals (middle row), this percentage is only about 5%. For the Bayes interval (bottom row), it is exactly 50%.
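These conditional percentages can be reproduced by simulation. The sketch below conditions on bubbles 0.5 meters apart and uses the uniform bubble model of this example together with half-width formulas that are our reconstruction of the procedures; under these assumptions it recovers the 31%, 5%, and 50% figures:

```python
import random

random.seed(3)
HALF, L, theta = 5.0, 10.0, 0.0   # assumed model; true hatch at 0
b = 0.5                           # bubbles 0.5 m apart, as in Figure 1A
n = 200_000

procedures = {
    "sampling dist.": HALF * (1 - 0.5 ** 0.5),  # fixed half-width
    "nonparam/UMP":   b / 2,                    # identical when bubbles are close
    "Bayes":          (L - b) / 4,
}
hits = {name: 0 for name in procedures}
for _ in range(n):
    # Given |x2 - x1| = b, xbar - theta is uniform on +/-(HALF - b/2).
    xbar = theta + random.uniform(-(HALF - b / 2), HALF - b / 2)
    for name, w in procedures.items():
        hits[name] += (xbar - w <= theta <= xbar + w)

for name, h in hits.items():
    print(f"{name:>14}: {h / n:.3f}")   # about 0.31, 0.05, 0.50
```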

Believing the FCF implies believing that we can use the long-run probability that a procedure contains the true value as an index of our post-data certainty that a particular interval contains the true value. But in this case, we have identified *two* long-run probabilities for each interval: the average long-run probability *not* taking into account the observed difference — that is, 50% — and the long-run probability taking into account \(b\), which for the sampling distribution interval is 31% and for the nonparametric/UMP intervals is 5%. Both are valid long-run probabilities; which do we use for our inference? Under the FCF, both are legitimate indexes of our certainty, and hence the FCF leads to contradiction.

The existence of relevant subsets brings back into focus the confusion between what we know before the experiment and what we know after the experiment. For any of these confidence procedures, we know before the experiment that 50% of future CIs will contain the true value. After observing the results, conditioning on an aspect of the data — such as, in this case, the difference between the bubbles — can radically alter our assessment of the probability.

The problem of contradictory inferences arising from multiple applicable long-run probabilities is an example of the “reference class” problem (Reichenbach, 1949; Venn, 1888). Fisher noted that when there are identifiable subsets of the data that have different probabilities of containing the true value — such as those subsets with a particular value of \(b\), in our confidence interval example — those subsets are relevant to the inference (Fisher, 1959). The existence of relevant subsets means that one can assign more than one probability to an interval. Relevant subsets are identifiable in many confidence procedures, including the common classical Student’s \(t\) interval, where wider CIs have a greater probability of containing the true value (Buehler, 1959; Buehler & Feddersen, 1963; Casella, 1992; Robinson, 1979). There are, as far as we know, only two general strategies for eliminating the threat of contradiction from relevant subsets: Neyman’s strategy of avoiding any assignment of probabilities to particular intervals, and the Bayesian strategy of always conditioning on the observed data, to be discussed subsequently.
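The relevant-subset behavior of the Student’s \(t\) interval is easy to demonstrate in a minimal simulation (our illustration, not an analysis from the cited papers). With two normal observations the 50% \(t\) interval is exactly the interval between the two observations, since \(t_{0.75}\) with one degree of freedom equals 1; splitting the simulated intervals at their median width shows the wider half covering the mean far more often than the narrower half:

```python
import random
import statistics

random.seed(4)
mu, sigma, n = 0.0, 1.0, 100_000

results = []
for _ in range(n):
    x1 = random.gauss(mu, sigma)
    x2 = random.gauss(mu, sigma)
    # 50% Student's t interval for two observations: xbar +/- t * s / sqrt(2)
    # with t = 1, which simplifies to (min(x1, x2), max(x1, x2)).
    width = abs(x2 - x1)
    covered = min(x1, x2) <= mu <= max(x1, x2)
    results.append((width, covered))

median_width = statistics.median(w for w, _ in results)
wide = [c for w, c in results if w > median_width]
narrow = [c for w, c in results if w <= median_width]
print(f"coverage, wider half:    {sum(wide) / len(wide):.2f}")      # about 0.75
print(f"coverage, narrower half: {sum(narrow) / len(narrow):.2f}")  # about 0.25
```

Both halves together cover 50% of the time, but the width of the observed interval is clearly relevant to how confident we should be.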

This set of confidence procedures also makes clear why the precision fallacy is a fallacy. Consider Figure 3, which shows how the width of the intervals produced by the four confidence procedures changes as a function of the width of the likelihood. The Bayes procedure tracks the uncertainty in the data: when the likelihood is wide, the Bayes CI is wide. The reason for this necessary correspondence between the likelihood and the Bayes interval will be discussed later.
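The qualitative relationships in Figure 3 can be tabulated directly. The sketch below assumes the uniform bubble model of this example, with the likelihood width equal to \(10-|b|\), and uses interval-width formulas that are our reconstruction of the four procedures:

```python
HALF, L = 5.0, 10.0   # assumed model: 10 m craft, bubbles within 5 m of the hatch

print(f"{'b':>5} {'likelihood':>11} {'sampling':>9} {'nonpar':>7} {'UMP':>6} {'Bayes':>6}")
for b in (0.5, 2.5, 5.0, 7.5, 9.5):
    lik_width = L - b
    widths = (
        2 * HALF * (1 - 0.5 ** 0.5),  # sampling dist.: constant
        b,                            # nonparametric: shrinks as uncertainty grows
        min(b, L - b),                # UMP: non-monotone in the uncertainty
        (L - b) / 2,                  # Bayes: always half the likelihood width
    )
    print(f"{b:5.1f} {lik_width:11.1f} " + " ".join(f"{w:6.2f}" for w in widths))
```

Only the Bayes width moves with the likelihood width; the sampling distribution width is constant, and the nonparametric width moves in the opposite direction.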

Intervals from the sampling distribution procedure, in contrast, have a fixed width, and so cannot reveal any information about the precision of the estimate. The sampling distribution interval is of the commonly-seen CI form \[ \bar{x}\pm C\times SE, \] where \(C\) is a constant. As with the CI for a normal population mean with known population variance, the standard error — defined as the standard deviation of the sampling distribution of \(\bar{x}\) — is known and fixed; here, it is approximately 2.04 (see the supplement for details). This indicates that the long-run standard error — and hence, confidence intervals based on the standard error — cannot always be used as a guide to the uncertainty we should have in a parameter estimate.
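The 2.04 figure follows from the assumed uniform bubble model: a uniform distribution of width 10 has standard deviation \(10/\sqrt{12}\), and the mean of two independent bubbles therefore has standard error \(10/\sqrt{12}/\sqrt{2}\). A quick check:

```python
import math

# Assumed model: each bubble is uniform over a 10 m range centered on the hatch.
sd_bubble = 10 / math.sqrt(12)        # standard deviation of one bubble
se_mean = sd_bubble / math.sqrt(2)    # standard error of the mean of two bubbles
print(round(se_mean, 2))              # 2.04
```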

Strangely, the nonparametric procedure generates intervals whose widths are *inversely* related to the uncertainty in the parameter estimates. Even more strangely, intervals from the UMP procedure initially increase in width with the uncertainty in the data, but when the width of the likelihood is greater than 5 meters, the width of the UMP interval is inversely related to the uncertainty in the data, like the nonparametric interval. This can lead to bizarre situations. Consider observing the UMP 50% interval \([1,1.5]\). This is consistent with two possible sets of observations: \((1,1.5)\), and \((-3.5,6)\). Both of these sets of bubbles will lead to the same CI. Yet the second data set indicates high precision, and the first very low precision! The UMP and sampling distribution procedures share the dubious distinction that their CIs cannot be used to work backwards to the observations. In spite of being the “most powerful” procedure, the UMP procedure clearly throws away important information.
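The non-invertibility of the UMP interval can be verified with a hypothetical reconstruction of the procedure: the nonparametric interval truncated to values within 5 meters of both bubbles (our formula, consistent with the numbers in the text):

```python
def ump_interval(x1, x2, half=5.0):
    """Reconstructed 50% UMP interval: xbar +/- min(b, 2*half - b)/2,
    i.e., the nonparametric interval truncated to the possible values."""
    xbar, b = (x1 + x2) / 2, abs(x2 - x1)
    w = min(b, 2 * half - b) / 2
    return (xbar - w, xbar + w)

# Two very different bubble patterns yield the same interval:
print(ump_interval(1.0, 1.5))    # bubbles close together: low precision
print(ump_interval(-3.5, 6.0))   # bubbles far apart: high precision
```

Both calls return the interval from 1.0 to 1.5, so the interval alone cannot distinguish a high-precision data set from a low-precision one.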

To see how the likelihood fallacy is manifest in this example, consider again Figure 3. When the uncertainty is high, the likelihood is wide; yet the nonparametric and UMP intervals are extremely narrow, simultaneously indicating false precision and excluding almost all likely values. Furthermore, the sampling distribution procedure and the nonparametric procedure can contain impossible values.^{4}
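For the bubbles of Figure 1B this is easy to see: any location more than 5 meters from either bubble is impossible, yet the nonparametric and sampling distribution intervals (computed here from half-width formulas that are our reconstruction of the procedures) extend well beyond the possible range:

```python
x1, x2 = 0.5, 9.5                 # the bubbles from Figure 1B
half = 5.0                        # a bubble can be at most 5 m from the hatch (assumed)

possible = (max(x1, x2) - half, min(x1, x2) + half)   # hatch must lie in (4.5, 5.5)
nonparametric = (min(x1, x2), max(x1, x2))            # (0.5, 9.5)

xbar = (x1 + x2) / 2
sd_w = half * (1 - 0.5 ** 0.5)                        # reconstructed fixed half-width
sampling = (xbar - sd_w, xbar + sd_w)                 # about (3.54, 6.46)

print("possible hatch locations:", possible)
print("nonparametric interval:  ", nonparametric)
print("sampling dist. interval: ", sampling)
```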

In order to construct a better interval, a frequentist would typically truncate the interval to only the possible values, as was done in generating the UMP procedure from the nonparametric procedure (Spanos, 2011). This is guaranteed to lead to a better procedure. Our point here is that it is a mistake to naively assume that a procedure has good properties merely because it is a confidence procedure. However, see Velicer et al. (2008) for an example of CI proponents including impossible values in confidence intervals, and Fidler and Thompson (2001) for a defense of this practice.