Frequentist theory can be counter-intuitive at times; as Fisher was fond of pointing out, frequentist theorists often seemed disconnected from the concerns of scientists, developing methods that did not suit their needs (e.g., Fisher, 1955, p. 70). This has led to confusion, with practitioners assuming that methods designed for one purpose were really meant for another. To help mitigate such confusion, we offer readers a clear guide to interpreting and reporting confidence intervals.
Once one has collected data and computed a confidence interval, how does one then interpret the interval? The answer is quite straightforward: one does not – at least not within confidence interval theory.8 As Neyman and others pointed out repeatedly, and as we have shown, confidence limits cannot be interpreted as anything besides the result of a procedure that will contain the true value in a fixed proportion of samples. Unless an interpretation of the interval can be specifically justified by some other theory of inference, confidence intervals must remain uninterpreted, lest one make arbitrary inferences or inferences that are contradicted by the data. This applies even to “good” confidence intervals, as these are often built by inverting significance tests and may have strange properties (e.g., Steiger, 2004).
To help mitigate confusion in the scientific literature, we suggest the following guidelines for reporting intervals, informed by our discussion in this manuscript.
Report credible intervals instead of confidence intervals. We believe any author who chooses to use confidence intervals should ensure that the intervals correspond numerically with credible intervals under some reasonable prior. Many confidence intervals cannot be so interpreted, but if the authors know they can be, they should be called “credible intervals”. This signals to readers that they can interpret the interval as they have been (incorrectly) told they can interpret confidence intervals. Of course, the corresponding prior must also be reported. This is not to say that one cannot also refer to credible intervals as confidence intervals, if indeed they are; however, readers are likely more interested in knowing that the procedure allows valid post-data inference (not pre-data inference) if they are interested in arriving at substantive conclusions from the computed interval.
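As a minimal sketch of what such a numerical check might look like, the Python snippet below compares a classical interval with a credible interval in one simple case. It assumes normally distributed data with a known standard deviation `sigma` and a flat prior on the mean; both assumptions, the function names, and the data values are chosen here purely for illustration and do not come from the paper or its supplement.

```python
from statistics import NormalDist, mean

def z_confidence_interval(data, sigma, level=0.95):
    """Classical z interval for a normal mean; sigma is assumed known."""
    se = sigma / len(data) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    xbar = mean(data)
    return xbar - z * se, xbar + z * se

def flat_prior_credible_interval(data, sigma, level=0.95):
    """Central credible interval for the mean under an assumed flat prior.

    With known sigma and a flat prior, the posterior for the mean is
    Normal(sample mean, sigma / sqrt(n)).
    """
    posterior = NormalDist(mean(data), sigma / len(data) ** 0.5)
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)

data = [4.1, 5.3, 4.8, 5.9, 5.0]  # made-up illustrative values
ci = z_confidence_interval(data, sigma=1.0)
cri = flat_prior_credible_interval(data, sigma=1.0)
print("confidence interval:", ci)
print("credible interval:  ", cri)
```

In this special case the two intervals agree numerically, so the interval could be reported as a credible interval together with its prior. The agreement is a property of this model and prior, not of confidence procedures in general.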
Do not use confidence procedures whose Bayesian properties are not known. As Casella (1992) pointed out, the post-data properties of a procedure are necessary for understanding what can be inferred from an interval. Any procedure whose Bayesian properties have not been explored may have properties that make it unsuitable for post-data inference. Procedures whose properties have not been adequately studied are inappropriate for general use.
Warn readers if the confidence procedure does not correspond to a Bayesian procedure. If it is known that a confidence interval does not correspond to a Bayesian procedure, warn readers that the confidence interval cannot be interpreted as having an X% probability of containing the parameter, that it cannot be interpreted in terms of the precision of measurement, and that it cannot be said to contain the values that should be taken seriously: the interval is merely an interval that, prior to sampling, had an X% probability of containing the true value. Authors choosing to report CIs have a responsibility to keep their readers from invalid inferences, because it is almost certain that, without a warning, readers will misinterpret them (Hoekstra et al., 2014).
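The gap between the pre-sampling probability and what can be said after the data are seen can be illustrated with a small simulation. The sketch below assumes the setup of the submersible example from earlier in the paper, which is not restated in this section: two bubbles distributed uniformly within 5 meters on either side of the hatch location `theta`, with the interval running from the lower bubble to the upper bubble. The function name, the cutoffs for "far apart" and "close together", and the simulation settings are illustrative choices.

```python
import random

def simulate_coverage(n_sims=100_000, theta=0.0, seed=1):
    """Coverage of [min(y1, y2), max(y1, y2)] overall and conditional on d."""
    rng = random.Random(seed)
    counts = {"all": [0, 0], "far": [0, 0], "close": [0, 0]}  # [hits, trials]
    for _ in range(n_sims):
        # Assumed model: bubbles uniform within 5 m of the hatch.
        y1 = rng.uniform(theta - 5, theta + 5)
        y2 = rng.uniform(theta - 5, theta + 5)
        covered = min(y1, y2) <= theta <= max(y1, y2)
        d = abs(y1 - y2)
        groups = ["all"]
        if d > 5:
            groups.append("far")    # bubbles more than 5 m apart
        if d < 1:
            groups.append("close")  # bubbles less than 1 m apart
        for g in groups:
            counts[g][0] += covered
            counts[g][1] += 1
    return {g: hits / trials for g, (hits, trials) in counts.items()}

coverage = simulate_coverage()
print(coverage)
```

Across all samples the interval contains the hatch about half the time, as a 50% procedure should. Once the data are in hand, however, the picture changes: bubbles more than 5 meters apart must lie on opposite sides of the hatch, so those intervals always contain it, while intervals from bubbles that land close together rarely do.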
Never report a confidence interval without noting the procedure and the corresponding statistics. As we have described, there are many different ways to construct confidence intervals, and they will have different properties. Some will have better frequentist properties than others; some will correspond to credible intervals, and others will not. It is unfortunately common for authors to report confidence intervals without noting how they were constructed or even citing a source. As can be seen from the examples we have presented, this is a terrible practice: without knowing which confidence procedure was used, it is unclear what can be inferred. In the submersible example, consider a 50% confidence interval 0.5 meters wide. This could correspond to very precise information (Bayesian interval) or very imprecise information (UMP and nonparametric intervals). Not knowing which procedure was used could lead to absurd inferences. In addition, enough information should be presented so that any reader can compute a different confidence interval or credible interval. In many cases, this is covered by standard reporting practices, but in other cases more information may need to be given.
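To make the submersible comparison concrete, the following sketch works backwards from a reported interval width to the data that would have produced it. It assumes the model from the submersible example as set out earlier in the paper (two bubbles uniform within 5 meters of the hatch, at distance `d` from each other), a uniform prior for the Bayesian interval, and hypothetical function names. The UMP interval is left out because a given width does not map back to a single value of `d` under that procedure.

```python
# Assumed model: y1, y2 ~ Uniform(theta - 5, theta + 5), d = |y1 - y2|.

def bubble_distance(width, procedure):
    """Distance d between the bubbles implied by a 50% interval of this width."""
    if procedure == "nonparametric":
        # The interval runs from the lower bubble to the upper bubble.
        return width
    if procedure == "bayes":
        # Central half of a uniform posterior of width 10 - d
        # (uniform prior assumed), so width = (10 - d) / 2.
        return 10.0 - 2.0 * width
    raise ValueError(f"unknown procedure: {procedure!r}")

def likelihood_width(d):
    """Width of the range of hatch locations consistent with both bubbles."""
    return 10.0 - d

for procedure in ("nonparametric", "bayes"):
    d = bubble_distance(0.5, procedure)
    print(procedure, "d =", d, "consistent range =", likelihood_width(d), "m")
```

Under these assumptions, the same 0.5 meter interval leaves a range of 9.5 meters of possible hatch locations if it came from the nonparametric procedure, but only 1 meter if it came from the Bayesian one.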
Consider reporting likelihoods or posteriors instead. An interval provides fairly impoverished information. Just as proponents of confidence intervals argue that CIs provide more information than a significance test (although this is debatable for many CIs), a likelihood or a posterior provides much more information than an interval. Recently, G. Cumming (2014) has proposed so-called “cat’s eye” intervals, which correspond to Bayesian posteriors under a “non-informative” prior for normally distributed data. With modern scientific graphics so easy to create, we see no reason why likelihoods and posteriors cannot augment or even replace intervals in most circumstances (e.g., Kruschke, 2010). With a likelihood or a posterior, the arbitrariness of the confidence or credibility coefficient is avoided altogether.
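As a rough illustration of how little is needed to produce such a curve, the sketch below evaluates a posterior density on a grid of candidate values. It assumes normally distributed data with a known standard deviation and a flat prior on the mean, a simplification chosen for brevity; the function name, data values, and grid are hypothetical. The resulting list of densities could be passed to any plotting tool.

```python
from statistics import NormalDist, mean

def grid_posterior(data, sigma, grid):
    """Posterior density of the mean on an evenly spaced grid.

    Assumes known sigma and a flat prior, so the posterior is
    proportional to the likelihood of the sample mean.
    """
    se = sigma / len(data) ** 0.5
    xbar = mean(data)
    likelihood = [NormalDist(mu, se).pdf(xbar) for mu in grid]
    step = grid[1] - grid[0]
    area = sum(likelihood) * step  # normalize so the curve integrates to 1
    return [value / area for value in likelihood]

data = [4.1, 5.3, 4.8, 5.9, 5.0]             # made-up illustrative values
grid = [3.0 + 0.01 * i for i in range(401)]  # candidate means from 3 to 7
density = grid_posterior(data, sigma=1.0, grid=grid)
peak = grid[density.index(max(density))]
print("posterior peaks near", round(peak, 2))
```

Any interval, at any coefficient, can be read off a curve like this one, which is why reporting the curve itself sidesteps the choice of coefficient.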
A complete account of Bayesian statistics is beyond the scope of this paper (and indeed, can fill entire courses). In recent years, a number of good resources have been developed for readers wishing to learn more about applied Bayesian statistics, including estimation of posterior distributions and credible intervals: on the less technical side, there are texts by Bolstad (2007), M. D. Lee & Wagenmakers (2013), and Lynch (2007); on the more technical side are texts by Jackman (2009), Ntzoufras (2009), and Gelman, Carlin, Stern, & Rubin (2004). There are also numerous resources on the world wide web to help beginners. For readers wishing to try some simple examples, the supplement to this article contains R code to estimate posterior distributions and credible intervals for the examples in this paper.
Some recent writers have suggested the replacement of Neyman’s behavioral view on confidence intervals with a frequentist view focused on tests at various levels of “stringency” (Mayo & Cox, 2006; Mayo & Spanos, 2006). Readers who prefer a frequentist paradigm may wish to explore this approach; however, we are unaware of any comprehensive account of CIs within this paradigm, and regardless, it does not offer the properties desired by CI proponents. This is not to be read as an argument against it, but rather a warning that one must make a choice.