In a classic paper, Neyman (1937) laid the formal foundation for confidence intervals. It is easy to describe the practical problem that Neyman saw CIs as solving. Suppose a researcher is interested in estimating a parameter \(\theta\). Neyman suggests that researchers perform the following three steps:

1. Perform an experiment, collecting the relevant data.
2. Compute two numbers, the smaller of which we can call \(L\) and the greater \(U\), forming an interval \((L,U)\) according to a specified procedure.
3. State that \(L<\theta<U\); that is, that \(\theta\) is in the interval.

This recommendation is justified by choosing a procedure for step (2) such that, in the long run, the researcher’s claim in step (3) will be correct, on average, \(X\%\) of the time. A confidence interval is any interval computed using such a procedure.
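The long-run property can be checked by simulation. The following is a minimal sketch, not part of Neyman's presentation. It assumes normally distributed data with a known standard deviation and uses the familiar \(z\)-interval for the mean as the procedure in the second step. The function names (`z_interval`, `coverage`) and all parameter values are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean


def z_interval(sample, sigma, level=0.95):
    """Compute (L, U) for a normal mean when sigma is known (assumed setup)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # e.g. about 1.96 for 95%
    half_width = z * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width


def coverage(theta, sigma=1.0, n=10, level=0.95, reps=20000, seed=1):
    """Fraction of repeated experiments in which the claim L < theta < U is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]  # run the experiment
        lower, upper = z_interval(sample, sigma, level)        # compute (L, U)
        hits += lower < theta < upper                          # was the claim correct?
    return hits / reps


if __name__ == "__main__":
    # The printed proportion should be close to the nominal level of 0.95.
    print(coverage(theta=3.0))
```

Note that the simulation only evaluates the procedure across many repetitions; it says nothing about whether any single computed interval contains \(\theta\).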

We first focus on the meaning of the statement that \(\theta\) is in the interval, in step (3). As we have seen, according to CI theory, what happens in step (3) is not a belief, a conclusion, or any sort of reasoning from the data. Furthermore, it is not associated with any level of uncertainty about whether \(\theta\) is, actually, in the interval. It is merely a dichotomous statement that is meant to have a specified probability of being true in the long run.

Frequentist evaluation of confidence procedures is based on what can be called the “power” of the procedures, which is the frequency with which false values of a parameter are excluded. Better intervals are shorter on average, excluding false values more often (Lehmann, 1959; Neyman, 1937, 1941; Welch, 1939). Consider a particular false value of the parameter, \(\theta'\neq\theta\). Different confidence procedures will include that false value at different rates. If some confidence procedure CP \(A\) excludes \(\theta'\), on average, more often than some CP \(B\), then CP \(A\) is better than CP \(B\) for that value.

Sometimes we find that one CP excludes *every* false value at a rate greater than some other CP; in this case, the first CP is uniformly more powerful than the second. There may even be a “best” CP: one that excludes every false \(\theta'\) value at a rate greater than any other possible CP. This is analogous to a most-powerful test. Although a best confidence procedure does not always exist, we can always compare one procedure to another to decide whether one is better in this way (Neyman, 1952). Confidence procedures are therefore closely related to hypothesis tests: confidence procedures control the rate of including the true value, and better confidence procedures have more power to exclude false values.
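To make the comparison concrete, here is a small simulation sketch under assumptions that are not in the text: normal data with known standard deviation, a hypothetical CP \(A\) that builds a 95% \(z\)-interval from all observations, and a hypothetical CP \(B\) that builds one from only the first half of them. Both procedures include the true value about 95% of the time, so they can be compared on how often they include false values \(\theta'\).

```python
import random
from statistics import NormalDist, mean

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile


def cp_a(sample, sigma):
    """Hypothetical CP A: 95% z-interval using every observation."""
    half_width = Z95 * sigma / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width


def cp_b(sample, sigma):
    """Hypothetical CP B: 95% z-interval using only the first half of the sample."""
    return cp_a(sample[: len(sample) // 2], sigma)


def inclusion_rates(procedure, theta, candidates, sigma=1.0, n=10,
                    reps=20000, seed=2):
    """Proportion of repeated samples whose interval contains each candidate value."""
    rng = random.Random(seed)
    counts = [0] * len(candidates)
    for _ in range(reps):
        sample = [rng.gauss(theta, sigma) for _ in range(n)]
        lower, upper = procedure(sample, sigma)
        for i, value in enumerate(candidates):
            counts[i] += lower < value < upper
    return [count / reps for count in counts]


theta = 0.0
# The first candidate is the true value; the others are false values theta'.
candidates = [0.0, 0.25, 0.5, 1.0]
rates_a = inclusion_rates(cp_a, theta, candidates)
rates_b = inclusion_rates(cp_b, theta, candidates)

for value, a, b in zip(candidates, rates_a, rates_b):
    print(f"value={value:4.2f}  included by A: {a:.3f}  included by B: {b:.3f}")
```

In this setup, CP \(A\) should include each false value less often than CP \(B\) while matching it at the true value, which is the sense in which \(A\) is the better procedure. The simulation checks only the listed candidate values, so it illustrates rather than proves uniform superiority.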

Skepticism about the usefulness of confidence intervals arose as soon as Neyman first articulated the theory (Neyman, 1934).^{2} In the discussion of Neyman (1934), Bowley, pointing out what we call the fundamental confidence fallacy, expressed skepticism that the confidence interval answers the right question:

“I am not at all sure that the ‘confidence’ is not a ‘confidence trick.’ Does it really lead us towards what we need – the chance that in the universe which we are sampling the proportion is within these certain limits? I think it does not. I think we are in the position of knowing that either an improbable event has occurred or the proportion in the population is within the limits. To balance these things we must make an estimate and form a judgment as to the likelihood of the proportion in the universe [that is, a prior probability] – the very thing that is supposed to be eliminated.” (discussion of Neyman, 1934, p. 609)

In the same discussion, Fisher critiqued the theory for possibly leading to mutually contradictory inferences: “The [theory of confidence intervals] was a wide and very handsome one, but it had been erected at considerable expense, and it was perhaps as well to count the cost. The first item to which he [Fisher] would call attention was the loss of uniqueness in the result, and the consequent danger of apparently contradictory inferences.” (discussion of Neyman, 1934, p. 618; see also Fisher, 1935). Although, as we will see, these critiques are accurate, in a broader sense they missed the mark. Like modern proponents of confidence intervals, the critics failed to understand that Neyman’s goal was different from theirs: Neyman had developed a behavioral theory designed to control error rates, not a theory for reasoning from data (Neyman, 1941).

In spite of the critiques, confidence intervals have grown in popularity to become the most widely used interval estimators. The alternatives, such as Bayesian credible intervals and Fisher’s fiducial intervals, are not as commonly used. We suggest that this is largely because people do not understand the differences among confidence interval, Bayesian, and fiducial theories, and that the resulting intervals cannot be interpreted in the same way. In the next section, we will demonstrate the logic of confidence interval theory by building several confidence procedures and comparing them to one another. We will also show how the three fallacies affect inferences with these intervals.

^{2} Neyman first articulated the theory in another paper before his major theoretical paper in 1937.