The rescuers who have been offered the four intervals above have a choice to make: which confidence procedure should they use? We have shown that several of the confidence procedures have counter-intuitive properties, but thus far we have made no firm commitment about which procedures should be preferred to the others. For the sake of our rescue team, we now compare the four procedures directly. We begin by evaluating the procedures from the perspective of confidence interval theory, then evaluate them according to Bayesian theory.

As previously mentioned, confidence interval theory specifies that better intervals will include false values less often. Figure 4 shows the probability that each of the procedures includes a value $$\theta'$$ at a specified distance from the hatch $$\theta$$. All procedures are 50% confidence procedures, and so they include the true value $$\theta$$ 50% of the time. Importantly, however, the procedures include particular false values $$\theta'\neq\theta$$ at different rates. See the interactive versions of Figures 1 and 4 linked in the figure captions for a hands-on demonstration.

The trivial procedure (T; gray horizontal line) is obviously a bad procedure because it includes every false value with the same frequency as the true value. This is analogous to a hypothesis test with power equal to its Type I error rate. The trivial procedure will be worse than any other procedure, unless that procedure is specifically constructed to be pathological. The UMP procedure (UMP), on the other hand, is better than every other procedure for every value of $$\theta'$$. This is because it was created by inverting a most-powerful test. No other confidence procedure can be better.

The ordering among the three remaining procedures can be seen by comparing their curves. The sampling distribution procedure (SD) is always superior to the Bayes procedure (B), but not to the nonparametric procedure (NP). The nonparametric procedure and the Bayes procedure curves overlap, so one is not preferred to the other. Welch (1939) remarked that the Bayes procedure is “not the best way of constructing confidence limits” using precisely the frequentist comparison shown in Figure 4 with the UMP interval (see note 1).

The frequentist comparison between the procedures is instructive, because we have arrived at an ordering of the procedures employing the criteria suggested by Neyman and used by the modern developers of new confidence procedures: coverage and power. The UMP procedure is the best, followed by the sampling distribution procedure. The sampling distribution procedure is better than the Bayes procedure. The nonparametric procedure is not preferred to any interval, but neither is it the worst.
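The frequentist comparison described above can be approximated by simulation. The following is a minimal sketch, not the computation behind Figure 4. It assumes that the two bubbles rise independently and uniformly within 5 meters of the hatch, and that each of the four non-trivial procedures is centered on the mean of the two bubbles with the half-widths coded in `half_width`. These forms, and the names `half_width` and `inclusion_probability`, are assumptions of the sketch and should be checked against the procedure definitions given earlier.

```python
import math
import random

HALF_LENGTH = 5.0  # assumption: bubbles rise uniformly within 5 m of the hatch


def half_width(procedure, d):
    """Assumed half-width of each 50% interval, centered on the bubble mean.

    d is the distance between the two bubbles.
    """
    if procedure == "SD":   # sampling distribution: fixed width
        return HALF_LENGTH - HALF_LENGTH / math.sqrt(2)
    if procedure == "NP":   # nonparametric: spans the two bubbles
        return d / 2
    if procedure == "UMP":  # inverted most-powerful test
        return d / 2 if d < HALF_LENGTH else HALF_LENGTH - d / 2
    if procedure == "B":    # central half of a flat-prior posterior
        return (2 * HALF_LENGTH - d) / 4
    raise ValueError(f"unknown procedure: {procedure}")


def inclusion_probability(procedure, offset, n_sims=100_000, seed=1):
    """Estimate how often the interval includes the value theta + offset."""
    rng = random.Random(seed)
    theta = 0.0  # true hatch location; its value does not affect the result
    hits = 0
    for _ in range(n_sims):
        y1 = rng.uniform(theta - HALF_LENGTH, theta + HALF_LENGTH)
        y2 = rng.uniform(theta - HALF_LENGTH, theta + HALF_LENGTH)
        center = (y1 + y2) / 2
        if abs(center - (theta + offset)) <= half_width(procedure, abs(y1 - y2)):
            hits += 1
    return hits / n_sims
```

With `offset=0.0` the function estimates coverage of the true value, which should be close to 0.5 for every procedure. With a nonzero offset such as `inclusion_probability("UMP", 2.0)` it estimates how often a false value 2 meters from the hatch is included; under the stated assumptions the estimates at that offset are smallest for UMP, larger for SD, and larger again for B.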

We can also examine the procedures from a Bayesian perspective, which is primarily concerned with whether the inferences are reasonable in light of the data and what was known before the data were observed (Howson & Urbach, 2006). We have already seen that interpreting the non-Bayesian procedures in this way leads to trouble, and that the Bayesian procedure, unsurprisingly, has better properties in this regard. We will show how the Bayesian interval was derived in order to provide insight into why it has good properties.

Consider the left column of Figure 5, which shows Bayesian reasoning from prior and likelihood to posterior and so-called credible interval. The prior distribution in the top panel shows that prior to observing the data, all the locations in this region are equally probable. Upon observing the bubbles shown in Figure 1A — also shown in the top of the “likelihood” panel — the likelihood is a function that is 1 for all possible locations for the hatch, and 0 otherwise. To combine our prior knowledge with the new information from the two bubbles, we condition what we knew before on the information in the data by multiplying by the likelihood — or, equivalently, excluding values we know to be impossible — which results in the posterior distribution in the bottom row. The central 50% credible interval contains all values in the central 50% of the area of the posterior, shown as the shaded region. The right column of Figure 5 shows a similar computation using an informative prior distribution that does not assume that all locations are equally likely, as might occur if some other information about the location of the submersible were available.
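The reasoning from prior and likelihood to posterior can be sketched on a grid. The code below is an illustration under stated assumptions, not the computation used for Figure 5. It assumes that the bubbles can rise at most 5 meters from the hatch, so the likelihood is 1 on the interval of locations consistent with both bubbles and 0 elsewhere. The function names, the bubble locations in the usage note, and the normal-shaped `informative_prior` are all hypothetical.

```python
import math

HALF_LENGTH = 5.0  # assumption: bubbles rise within 5 m of the hatch


def flat_prior(t):
    """All locations equally probable before seeing the data."""
    return 1.0


def informative_prior(t):
    """Hypothetical prior favoring locations near t = 2 (unnormalized)."""
    return math.exp(-0.5 * (t - 2.0) ** 2)


def central_credible_interval(y1, y2, prior=flat_prior, mass=0.5, n_grid=20_001):
    """Central credible interval for the hatch location, computed on a grid."""
    # The likelihood is 1 only where both bubbles are possible.
    lo = max(y1, y2) - HALF_LENGTH
    hi = min(y1, y2) + HALF_LENGTH
    if lo > hi:
        raise ValueError("bubbles are too far apart under the assumed model")
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    # Posterior is proportional to prior times likelihood (1 on this grid).
    weights = [prior(t) for t in grid]
    total = sum(weights)
    lower_q, upper_q = (1 - mass) / 2, (1 + mass) / 2
    cumulative = 0.0
    lower = upper = None
    for t, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= lower_q:
            lower = t
        if cumulative >= upper_q:
            upper = t
            break
    if upper is None:  # guard against rounding when mass is 1
        upper = grid[-1]
    return lower, upper
```

For hypothetical bubbles at -1 and 1, `central_credible_interval(-1.0, 1.0)` returns an interval close to (-2, 2), the central half of the possible locations between -4 and 4. Passing `prior=informative_prior` shifts the interval toward the locations that prior favors, while keeping it inside the region the data allow.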

It is now obvious why the Bayesian credible interval has the properties typically ascribed to confidence intervals. The credible interval can be interpreted as having a 50% probability of containing the true value, because the values within it account for 50% of the posterior probability. It reveals the precision of our knowledge of the parameter, in light of the data and prior, through its relationship with the posterior and likelihood.

Of the five procedures considered, intervals from the Bayesian procedure are the only ones that can be said to have 50% probability of containing the true value, upon observing the data. Importantly, the ability to interpret the interval in this way arises from Bayesian theory and not from confidence interval theory. Also importantly, it was necessary to stipulate a prior to obtain the desired interval; the interval should be interpreted in light of the stipulated prior. Of the other four intervals, none can be justified as providing a “reasonable” inference or conclusion from the data, both because of their strange properties and because no possible prior distribution could lead to these procedures. In this light, it is clear why Neyman's rejection of “conclusions” and “reasoning” from data naturally flowed from his theory: the theory, after all, does not support such ideas. It is also clear that if they care about making reasonable inferences from data, scientists might want to reject confidence interval theory as a basis for evaluating procedures.

We can now review what we know concerning the four procedures. Only the Bayesian procedure, when its intervals are interpreted as credible intervals, allows the interpretation that there is a 50% probability that the hatch is located in the interval. Only the Bayesian procedure properly tracks the precision of the estimate. Only the Bayesian procedure covers the plausible values in the expected way: the other procedures produce intervals that are known with certainty (by simple logic) to contain the true value, but still are “50%” intervals. The non-Bayesian intervals have undesirable, even bizarre properties, which would lead any reasonable analyst to reject them as a means to draw inferences. Yet the Bayesian procedure is judged by frequentist CI theory as inferior.
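The contrast between average and post-data behavior can be illustrated by estimating coverage separately for samples whose bubbles are close together and samples whose bubbles are far apart. This is a sketch under the same assumptions as before: bubbles independent and uniform within 5 meters of the hatch, and intervals centered on the bubble mean with the half-widths listed in `HALF_WIDTHS`. Those forms and the function name `coverage_given_distance` are assumptions, not definitions taken from this section.

```python
import math
import random

HALF_LENGTH = 5.0  # assumption: bubbles rise uniformly within 5 m of the hatch

# Assumed half-widths as functions of the distance d between the bubbles.
HALF_WIDTHS = {
    "SD": lambda d: HALF_LENGTH - HALF_LENGTH / math.sqrt(2),
    "NP": lambda d: d / 2,
    "UMP": lambda d: d / 2 if d < HALF_LENGTH else HALF_LENGTH - d / 2,
    "B": lambda d: (2 * HALF_LENGTH - d) / 4,
}


def coverage_given_distance(procedure, d_min, d_max, n_sims=100_000, seed=1):
    """Estimate coverage of the true value among samples with d_min <= d < d_max."""
    rng = random.Random(seed)
    width = HALF_WIDTHS[procedure]
    hits = kept = 0
    for _ in range(n_sims):
        # The true hatch location is fixed at 0 without loss of generality.
        y1 = rng.uniform(-HALF_LENGTH, HALF_LENGTH)
        y2 = rng.uniform(-HALF_LENGTH, HALF_LENGTH)
        d = abs(y1 - y2)
        if not d_min <= d < d_max:
            continue  # keep only samples in the requested distance bin
        kept += 1
        if abs((y1 + y2) / 2) <= width(d):
            hits += 1
    if kept == 0:
        raise ValueError("no simulated samples fell in the requested bin")
    return hits / kept
```

Under these assumptions, `coverage_given_distance("B", 0.0, 1.0)` and `coverage_given_distance("B", 8.0, 10.0)` should both be close to 0.5. For the NP and UMP procedures, the estimate is far below 0.5 when the bubbles are within 1 meter of each other and essentially 1 when they are at least 8 meters apart, even though each procedure covers the true value 50% of the time on average.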

The disconnect between frequentist theory and Bayesian theory arises from the different goals of the two theories. Frequentist theory is a “pre-data” theory. It looks forward, devising procedures that will have particular average properties in repeated sampling (Jaynes, 2003; Mayo, 1981, 1982) in the future (see also Neyman, 1937, p. 349). This thinking can be clearly seen in Neyman (1942) as quoted above: reasoning ends once the procedure is derived. Confidence interval theory is vested in the average frequency of including or excluding true and false parameter values, respectively. Any given inference may — or may not — be reasonable in light of the observed data, but this is not Neyman’s concern; he disclaims any conclusions or beliefs on the basis of data. Bayesian theory, on the other hand, is a post-data theory: a Bayesian analyst uses the information in the data to determine what is reasonable to believe, in light of the model assumptions and prior information.

Using an interval justified by a pre-data theory to make post-data inferences can lead to unjustified, and possibly arbitrary, inferences. This problem is not limited to the pedagogical submersible example (J. O. Berger & Wolpert, 1988; E.-J. Wagenmakers et al., 2014), though this simple example is instructive for identifying these issues. In the next section we show how a commonly used confidence interval leads to similarly flawed post-data inferences.

1. Several readers of a previous draft of this manuscript noted that frequentists use the likelihood as well, and so may prefer the Bayesian procedure in this example. However, as Neyman (1977) points out, the likelihood has no special status to a frequentist; what matters is the frequentist properties of the procedure, not how it was constructed.