In scientific practice, it is frequently desirable to estimate some quantity of interest, and to express uncertainty in this estimate. If our goal were to estimate the true mean \(\mu\) of a normal population, we might choose the sample mean \(\bar{x}\) as an estimate. Informally, we expect \(\bar{x}\) to be close to \(\mu\), but *how* close depends on the sample size and the observed variability in the sample. To express uncertainty in the estimate, CIs are often used.

If there is one thing that everyone who writes about confidence intervals agrees on, it is the basic definition: A confidence interval for a parameter — which we generically call \(\theta\), and which might represent a population mean, median, variance, probability, or any other unknown quantity — is an interval generated by a procedure that, on repeated sampling, has a fixed probability of containing the parameter. If the probability that the process generates an interval including \(\theta\) is .5, it is a 50% CI; likewise, the probability is .95 for a 95% CI.

Confidence interval

An \(X\%\) confidence interval for a parameter \(\theta\) is an interval \((L,U)\) generated by a procedure that in repeated sampling has an \(X\%\) probability of containing the true value of \(\theta\), for all possible values of \(\theta\) (Neyman, 1937).^{1}

The confidence coefficient of a confidence interval derives from the procedure that generated it. It is therefore helpful to differentiate a confidence *procedure* (CP) from a confidence *interval*: an X% confidence procedure is any procedure that generates intervals containing \(\theta\) in X% of repeated samples, and a confidence interval is a specific interval generated by such a process. A confidence procedure is a random process; a confidence interval is observed and fixed.
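The distinction can be made concrete with a small simulation. The sketch below (with hypothetical values: a normal population with known standard deviation, so the 95% interval is simply \(\bar{x} \pm 1.96\,\sigma/\sqrt{n}\)) shows that the 95% figure is a property of the *procedure* across repeated samples; any single interval the procedure outputs either contains the true mean or does not.

```python
import random
import statistics

def z_interval(sample, sigma, z=1.96):
    """95% CI for a normal mean with known sigma: x_bar +/- 1.96 * sigma / sqrt(n)."""
    m = statistics.mean(sample)
    half = z * sigma / len(sample) ** 0.5
    return (m - half, m + half)

def coverage(mu=10.0, sigma=5.0, n=20, reps=10_000, seed=1):
    """Fraction of repeated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma)
        hits += lo <= mu <= hi  # True counts as 1, False as 0
    return hits / reps

print(coverage())  # close to 0.95: a property of the procedure, not of any one interval
```

All numerical settings here (\(\mu = 10\), \(\sigma = 5\), \(n = 20\)) are arbitrary choices for illustration; the long-run coverage near 95% holds regardless.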

It seems clear how to interpret a confidence *procedure*: it is any procedure that generates intervals that will contain the true value in a fixed proportion of samples. However, when we compute a specific interval from the data and must interpret it, we are faced with difficulty. It is not obvious how to move from our knowledge of the properties of the confidence procedure to the interpretation of some observed confidence interval.

Textbook authors and proponents of confidence intervals bridge the gap seamlessly by claiming that confidence intervals have three desirable properties: first, that the confidence coefficient can be read as a measure of the uncertainty one should have that the interval contains the parameter; second, that the CI width is a measure of estimation uncertainty; and third, that the interval contains the “likely” or “reasonable” values for the parameter. These all involve reasoning about the parameter from the observed data: that is, they are “post-data” inferences.

For instance, with respect to 95% confidence intervals, Masson & Loftus (2003) state that “there is a 95% probability that the obtained confidence interval includes the population mean.” G. Cumming (2014) writes that “[w]e can be 95% confident that our interval includes [the parameter] and can think of the lower and upper limits as likely lower and upper bounds for [the parameter].”

These interpretations of confidence intervals are not correct. We call the mistake these authors have made the “Fundamental Confidence Fallacy” (FCF) because it seems to flow naturally from the definition of the confidence interval:

The Fundamental Confidence Fallacy

If the probability that a random interval contains the true value is \(X\%\), then the plausibility or probability that a particular observed interval contains the true value is also \(X\%\); or, alternatively, we can have \(X\%\) confidence that the observed interval contains the true value.

The reasoning behind the Fundamental Confidence Fallacy seems plausible: on a given sample, we could get any one of the possible confidence intervals. If 95% of the possible confidence intervals contain the true value, without any other information it seems reasonable to say that we have 95% certainty that we obtained one of the confidence intervals that contain the true value. This interpretation is suggested by the name “confidence interval” itself: the word “confident”, in lay use, is closely related to concepts of plausibility and belief. The name “confidence interval” — rather than, for instance, the more accurate “coverage procedure” — encourages the Fundamental Confidence Fallacy.

The key confusion underlying the FCF is the confusion of what is known *before* observing the data — that the CI, whatever it will be, has a fixed chance of containing the true value — with what is known *after* observing the data. Frequentist CI theory says nothing at all about the probability that a particular, observed confidence interval contains the true value; it is either 0 (if the interval does not contain the parameter) or 1 (if the interval does contain the true value).

We offer several examples in this paper to show that what is known before computing an interval and what is known after computing it can be different. For now, we give a simple example, which we call the “trivial interval.” Consider the problem of estimating the mean of a continuous population with two independent observations, \(y_1\) and \(y_2\). If \(y_1>y_2\), we construct a confidence interval that contains all real numbers, \((-\infty, \infty)\); otherwise, we construct an empty confidence interval. The first interval is guaranteed to include the true value; the second is guaranteed not to. It is obvious that before observing the data, there is a 50% probability that any sampled interval will contain the true mean. After observing the data, however, we know definitively whether the interval contains the true value. Applying the pre-data probability of 50% to the post-data situation, where we know for certain whether the interval contains the true value, would represent a basic reasoning failure.
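The trivial interval is easy to simulate. In this sketch (the true mean of 3.0 and the standard normal noise are arbitrary, hypothetical choices), the procedure covers the true mean in about half of repeated samples, yet for any single observed interval the coverage is simply 0 or 1:

```python
import random

def trivial_interval(y1, y2):
    """Return (-inf, inf) if y1 > y2, else None (the empty interval)."""
    return (float("-inf"), float("inf")) if y1 > y2 else None

def contains(interval, mu):
    """Post-data, this is simply True (probability 1) or False (probability 0)."""
    return interval is not None and interval[0] < mu < interval[1]

rng = random.Random(0)
mu, reps = 3.0, 10_000
hits = sum(
    contains(trivial_interval(rng.gauss(mu, 1), rng.gauss(mu, 1)), mu)
    for _ in range(reps)
)
print(hits / reps)  # close to 0.5: a valid 50% confidence procedure
```

Because the population is continuous, \(P(y_1 > y_2) = .5\) exactly, so the procedure meets the definition of a 50% CP even though no individual interval is ever 50% likely to contain the mean.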

Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [\(X\%\)]? The answer is obviously in the negative” (Neyman, 1937, p. 349). According to frequentist philosopher Mayo (1981) “[the misunderstanding] seems rooted in a (not uncommon) desire for [...] confidence intervals to provide something which they cannot legitimately provide; namely, a measure of the degree of probability, belief, or support that an unknown parameter value lies in a specific interval.” Recent work has shown that this misunderstanding is pervasive among researchers, who likely learned it from textbooks, instructors, and confidence interval proponents (Hoekstra, Morey, Rouder, & Wagenmakers, 2014).

If confidence intervals cannot be used to assess the certainty with which a parameter is in a particular range, what can they be used for? Proponents of confidence intervals often claim that confidence intervals are useful for assessing the precision with which a parameter can be estimated. This is cited as one of the primary reasons confidence procedures should be used over null hypothesis significance tests (e.g., G. Cumming, 2014; G. Cumming & Finch, 2005; Fidler & Loftus, 2009; Loftus, 1993, 1996). For instance, G. Cumming (2014) writes that “[l]ong confidence intervals (CIs) will soon let us know if our experiment is weak and can give only imprecise estimates” (p. 10). Young & Lewis (1997) state that “[i]t is important to know how precisely the point estimate represents the true difference between the groups. The width of the CI gives us information on the precision of the point estimate” (p. 309). This is the second fallacy of confidence intervals, the “precision fallacy”:

The Precision fallacy

The width of a confidence interval indicates the precision of our knowledge about the parameter. Narrow confidence intervals show precise knowledge, while wide confidence intervals show imprecise knowledge.

There is no necessary connection between the precision of an estimate and the size of a confidence interval. One way to see this is to imagine that two researchers — a senior researcher and a PhD student — are analyzing data from 50 participants in an experiment. As an exercise for the PhD student's benefit, the senior researcher decides to randomly divide the participants into two sets of 25 so that they can each separately analyze half the data set. In a subsequent meeting, the two share with one another their Student's \(t\) confidence intervals for the mean. The PhD student's 95% CI is \(52\pm2\), and the senior researcher's 95% CI is \(53\pm4\). The senior researcher notes that their results are broadly consistent, and that they could use the equally-weighted mean of their two respective point estimates, 52.5, as an overall estimate of the true mean.

The PhD student, however, argues that their two means should not be evenly weighted: she notes that her CI is half as wide and argues that her estimate is more precise and should thus be weighted more heavily. Her advisor notes that this cannot be correct, because the estimate from unevenly weighting the two means would be different from the estimate from analyzing the complete data set, which must be 52.5. The PhD student's mistake is assuming that CIs directly indicate post-data precision. Later, we will provide several examples where the width of a CI and the uncertainty with which a parameter is estimated are in one case inversely related, and in another not related at all.
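The advisor's arithmetic can be checked directly. In this sketch (simulated data standing in for the hypothetical experiment; the population mean of 52.5 and standard deviation of 10 are invented), the full-sample mean equals the equally weighted average of the two half-sample means whenever the halves are the same size, regardless of how the two CI widths happen to compare:

```python
import random
import statistics

rng = random.Random(7)
data = [rng.gauss(52.5, 10.0) for _ in range(50)]  # hypothetical experiment
half1, half2 = data[:25], data[25:]                # random split into two sets of 25

m1, m2 = statistics.mean(half1), statistics.mean(half2)

# Student's t 95% half-widths, using t(0.975, df=24) ~= 2.064; the two
# widths typically differ by chance, as in the anecdote.
hw1 = 2.064 * statistics.stdev(half1) / 25 ** 0.5
hw2 = 2.064 * statistics.stdev(half2) / 25 ** 0.5

# With equal-sized halves, the full-sample mean is exactly the
# equally weighted average of the two half-sample means.
grand = statistics.mean(data)
print(grand, (m1 + m2) / 2, hw1, hw2)
```

Since each half contributes 25 observations, \(\bar{x} = (25\bar{x}_1 + 25\bar{x}_2)/50 = (\bar{x}_1 + \bar{x}_2)/2\); weighting by inverse CI width would contradict this identity, which is the advisor's point.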

We cannot interpret observed confidence intervals as containing the true value with some probability; we also cannot interpret confidence intervals as indicating the precision of our estimate. There is a third common interpretation of confidence intervals: Loftus (1996), for instance, says that the CI gives an “indication of how seriously the observed pattern of means should be taken as a reflection of the underlying pattern of population means.” This logic is used when confidence intervals are used to test theory (Velicer et al., 2008) or to argue for the null (or practically null) hypothesis (Loftus, 1996). This is another fallacy of confidence intervals that we call the “likelihood fallacy”.

The Likelihood fallacy

A confidence interval contains the likely values for the parameter. Values inside the confidence interval are more likely than those outside. This fallacy exists in several varieties, sometimes involving plausibility, credibility, or reasonableness of beliefs about the parameter.

A confidence procedure may have a fixed *average* probability of including the true value, but whether on any given sample it includes the “reasonable” values is a different question. As we will show, confidence intervals — even “good” confidence intervals, from a CI-theory perspective — can exclude almost all reasonable values, and can be empty or infinitesimally narrow, excluding all possible values (Blaker & Spjøtvoll, 2000; Dufour, 1997; Steiger, 2004; Steiger & Fouladi, 1997; Stock & Wright, 2000). But Neyman (1941) writes,

“it is not suggested that we can ‘conclude’ that [the interval contains \(\theta\)], nor that we should ‘believe’ that [the interval contains \(\theta\)]…[we] decide to behave as if we actually knew that the true value [is in the interval]. This is done as a result of our decision and has nothing to do with ‘reasoning’ or ‘conclusion’. The reasoning ended when the [confidence procedure was derived]. The above process [of using CIs] is also devoid of any ‘belief’ concerning the value [...] of [\(\theta\)].” (Neyman, 1941, pp. 133–134)

It may seem strange to the modern user of CIs, but Neyman is quite clear that CIs do not support any sort of reasonable belief about the parameter. Even from a frequentist testing perspective where one accepts and rejects specific parameter values, Mayo & Spanos (2006) note that just because a specific value is in an interval does not mean it is warranted to accept it; they call this the “fallacy of acceptance.” This fallacy is analogous to accepting the null hypothesis in a classical significance test merely because it has not been rejected.

If confidence procedures do not allow an assessment of the probability that an interval contains the true value, if they do not yield measures of precision, and if they do not yield assessments of the likelihood or plausibility of parameter values, then what are they?

^{1} The modern definition of a confidence interval allows the probability to be *at least* \(X\%\), rather than exactly \(X\%\). This detail does not affect any of the points we will make; we mention it for completeness.