Bayes: the science of evidence



Suppose that you are conducting an opinion poll in which you ask 20 people about whether they believe access to medical euthanasia for terminally ill patients, often called "physician-assisted suicide", is acceptable. Your poll finds that of those 20 people, 11 endorse physician-assisted suicide. What can be inferred about the population from these data? In this article, we discuss the "likelihood" portion of Bayes' theorem, which is the contribution of the data to belief.

We've discussed Bayes' theorem, and how it can be used to modify belief in light of data. It is composed of three parts — the prior, the likelihood, and the posterior: \[ \mbox{Posterior} \propto \mbox{Likelihood}\times\mbox{Prior}, \] which can be phrased in words as

"Weighting our prior belief by the evidence provided by the data yields our new belief, after observing the data."

An essential part of using Bayes' theorem is, then, figuring out the proper weighting from the data. The likelihood portion of Bayes' theorem is \(p(y\mid\theta)\), so we know it is related to the probability distribution of the data, but how?

In our example, 11 out of 20 people we sampled endorsed the right to physician-assisted suicide. Assuming a binomial setting, we can use the binomial distribution as a model for these data. The binomial has an unknown parameter, \(\theta\), which describes the true proportion of people in the population who endorse physician-assisted suicide. If we knew the true value of \(\theta\), we could use the binomial distribution to describe which values of the data we would expect, and which we would not.

Suppose by some stroke of luck we had access to an oracle that we could ask about the true population. However, this oracle always gives us two possible values of \(\theta\), one of which is true, and one of which is not. We ask the oracle about the true proportion of the population that endorses physician-assisted suicide, and the oracle replies with the values "0.5" and "0.9". Which of these two values are our data more consistent with?

You probably have an intuition that \(\theta=0.5\) is more consistent with the data, because it is closer to our observed proportion, \(y/N = 11/20 = 0.55\), than is \(\theta=0.9\). But how much better is it, and how can we formalize the idea?

One way to approach this problem is to assume, in turn, that one or the other of the values is true, and ask what our predictions about the samples we would obtain would be. This involves drawing two binomial distributions: one for which \(\theta=0.5\), and one for which \(\theta=0.9\). The two plots below show the two corresponding binomial distributions. The data we observed, \(y=11\), is highlighted in red.

The plots show the probability of the data we observed, \(y=11\), under each of the two values the oracle gave us. The left plot shows the probable values our sample would take if we assume that \(\theta=0.5\). As we would expect, if \(\theta=0.5\), then the probable samples are around 10 (give or take 4). In fact, the data we observed are reasonably probable if \(\theta=0.5\); we would expect to see \(y=11\) about 16% of the time.

On the other hand, if we assume that \(\theta=0.9\) (right plot), the samples we would expect to see would be clustered near the maximum possible value, \(y=20\). As we might expect, the most probable value is \(y=18\), but all values from 15 to 20 are relatively probable. Importantly, the value we observed, \(y=11\), is highly improbable if \(\theta=0.9\). If 90% of the population actually endorsed physician-assisted suicide, then we would expect to sample 11 endorsements out of 20 only about 0.005% of the time, or about 5 in every one hundred thousand samples.

We can now compare the two values that the oracle gave us directly. If \(\theta=0.5\) then the data we saw would be fairly probable: we would observe 11 endorsements out of 20 about 16% of the time. If \(\theta=0.9\) we would observe 11 endorsements only exceedingly rarely, about 0.005% of the time. These numbers agree with the intuition that we should favor \(\theta=0.5\) over \(\theta=0.9\).
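The two probabilities quoted above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name `binomial_pmf` and the variable names are illustrative choices, not part of the original analysis.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Probability of y successes in n trials when each succeeds with probability theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

N, y_obs = 20, 11  # the poll: 11 endorsements out of 20 respondents

# Probability of the observed data under each of the oracle's two values.
p_half = binomial_pmf(y_obs, N, 0.5)
p_nine = binomial_pmf(y_obs, N, 0.9)

print(f"Pr(y=11 | theta=0.5) = {p_half:.4f}")   # roughly 16%
print(f"Pr(y=11 | theta=0.9) = {p_nine:.7f}")   # roughly 0.005%
```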

Let's try to generalize what we've done so far. Recall that we used the binomial probability mass function to generate the values 16% and 0.005%. Typically, when we look at a probability distribution, we are looking at how probable different values of the data are, given some value of the population parameter (in the case of the binomial, \(\theta\)). In the preceding analysis, however, we kept the data fixed, and varied the parameter \(\theta\) across the two values the oracle gave us. Keeping the data fixed is reasonable; after all, when we collect data, we know exactly what its value is. In actual analyses, the uncertain value is the parameter, not the data. We therefore computed the value of the probability mass function, \(Pr(y\mid\theta,N)\), for different values of \(\theta\) while holding the observed data constant. In the case of our example, we evaluated the function \[\binom{20}{11}\theta^{11}(1-\theta)^{9}\] for two different values of \(\theta\), 0.5 and 0.9.

When we consider a probability distribution function "backwards" (that is, for different values of a parameter, while holding the data constant) it is called the "likelihood". We say, for instance, that \(\theta=0.5\) is about 3000 times more "likely" than \(\theta=0.9\), because \[\frac{Pr(y=11\mid\theta=0.5)}{Pr(y=11\mid\theta=0.9)}=\frac{\binom{20}{11}0.5^{11}(1-0.5)^{9}}{\binom{20}{11}0.9^{11}(1-0.9)^{9}} = \frac{0.5^{20}}{0.9^{11}\times 0.1^{9}} \approx 3039.\]
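The ratio can be reproduced numerically. The sketch below (function and variable names are illustrative) computes it twice, once with the full binomial expression and once with the binomial coefficient left out, to show that the coefficient cancels and has no effect on the comparison.

```python
from math import comb

def likelihood(theta, y=11, n=20):
    """Binomial probability of the observed data, viewed as a function of theta."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

# Likelihood ratio for the oracle's two candidate values.
ratio = likelihood(0.5) / likelihood(0.9)

# The same ratio without the binomial coefficient, which cancels top and bottom.
kernel_ratio = (0.5**11 * 0.5**9) / (0.9**11 * 0.1**9)

print(round(ratio))         # about 3039
print(round(kernel_ratio))  # same value
```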

We need not, however, stop at comparing the two values that the oracle gave us. In a real analysis, of course, we would not have only two values to compare. We should look at all possible values of \(\theta\). Some of these values will have relatively high likelihood (like \(\theta=0.5\)) and some will have relatively low likelihood (like \(\theta=0.9\)). If we consider the likelihood function for all possible values of \(\theta\), it will trace out a curve that will tell us the likelihood of each \(\theta\), for the particular data we observed.

Using the two plots below, we can see how the likelihood function works. At the initial settings for the two plots, the left plot shows the probability distribution for our samples \(y\) given that \(\theta=0.5\). Notice that the data we observed, \(y=11\), is highlighted. The plot on the right shows the likelihood function. Notice that the maximum occurs at \(\theta=11/20\), which is perhaps not surprising. Values of \(\theta\) closer to 11/20 are highly likely, while values further away (especially near 0 or 1) are very unlikely, relative to the most likely values.
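The likelihood curve can also be traced numerically by evaluating the binomial expression on a grid of \(\theta\) values while holding \(y=11\) fixed. The following is a rough sketch; the grid resolution of 0.001 and the names `grid`, `curve`, and `theta_hat` are arbitrary illustrative choices.

```python
from math import comb

N, y_obs = 20, 11

def likelihood(theta):
    """Likelihood of theta for the fixed observed data (11 of 20)."""
    return comb(N, y_obs) * theta**y_obs * (1 - theta)**(N - y_obs)

# Evaluate the likelihood on an evenly spaced grid from 0 to 1.
grid = [i / 1000 for i in range(1001)]
curve = [likelihood(t) for t in grid]

# Locate the grid point where the likelihood is largest.
best = max(curve)
theta_hat = grid[curve.index(best)]
print("maximum at theta =", theta_hat)

# Likelihood relative to the maximum at a few values of theta.
for t in (0.1, 0.3, 0.5, 0.55, 0.7, 0.9):
    print(f"theta={t:.2f}  relative likelihood={likelihood(t) / best:.5f}")
```

Dividing by the maximum puts every value on a common relative scale, so the peak is 1 and values near 0 or 1 come out close to 0.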

If you move your mouse over the plot on the right, you'll see that you change the value of \(\theta\) for which the probability distribution on the left is drawn. If you move your mouse over to \(\theta=0.5\), the probability distribution should look like the binomial in the first figure above, on the left. If you move your mouse to \(\theta=0.9\), the distribution on the left should look like the distribution in the first figure above, on the right. The horizontal gray line shows how the probability distribution (left) gives rise to the likelihood (right). Values of \(\theta\) for which the observed data are highly probable yield high likelihood values; values of \(\theta\) for which the observed data are highly improbable yield low likelihood values.

At this point, we note two important things about likelihood.

  • Likelihood is not probability. Although it was created using probabilities (in this case, the probabilities of binomial outcomes), likelihood values are not probabilities (or densities). The likelihood need not sum or integrate to 1, as probability must. One way to think about likelihood is as a "weight" of evidence. The likelihood function uses the data to determine what weight to apply to our prior belief in order to transform it to our posterior belief.
  • Values of likelihood are to be interpreted relative to other values. The absolute magnitude of the likelihood is unimportant. It is meaningless to say, for instance, that "the likelihood for \(\theta=0.5\) is 0.16", without further context. The likelihood function tells us how likely particular values of a parameter are relative to all other values, so a comparative statement such as "the likelihood for \(\theta=0.5\) is 3000 times the likelihood for \(\theta=0.9\)" is perfectly OK.
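The first point can be checked numerically. In the sketch below (names and the number of integration steps are illustrative), summing the binomial probabilities over all possible data values for a fixed \(\theta\) gives 1, whereas integrating the likelihood over \(\theta\) for the fixed data \(y=11\), \(N=20\) gives \(1/(N+1)=1/21\), not 1.

```python
from math import comb

N, y_obs = 20, 11

def likelihood(theta):
    """Binomial probability of the observed data as a function of theta."""
    return comb(N, y_obs) * theta**y_obs * (1 - theta)**(N - y_obs)

# Probability distribution: fix theta, sum over every possible data value.
total_over_data = sum(comb(N, y) * 0.5**y * 0.5**(N - y) for y in range(N + 1))

# Likelihood: fix the data, integrate over theta (simple midpoint rule).
steps = 10_000
area_over_theta = sum(likelihood((i + 0.5) / steps) for i in range(steps)) / steps

print("sum over data values:", total_over_data)
print("integral over theta: ", area_over_theta)
```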

We can now return to our initial statement of Bayes' theorem and concentrate on the role of the likelihood. Recall Bayes' theorem: \[ \mbox{Posterior} \propto \mbox{Likelihood}\times\mbox{Prior}. \] Our initial beliefs about an unknown parameter are represented by the prior distribution. The likelihood function, which is a function of the unknown parameter, serves as a weight, weighting some values higher and some values lower. This weighting is determined by the observed data; values of the parameter under which the observed data are probable will be weighted higher, and values under which the observed data are improbable will be weighted lower. The result of weighting the prior distribution by the likelihood is the posterior, which represents our new beliefs about the parameter, properly informed by the data.
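The weighting can be sketched on a grid of \(\theta\) values. Note that the prior used here, proportional to \(\theta(1-\theta)\), is a hypothetical choice made purely for illustration and is not taken from the example; any other prior could be substituted. The grid size and variable names are likewise assumptions.

```python
N, y_obs = 20, 11
steps = 2000
grid = [(i + 0.5) / steps for i in range(steps)]  # midpoints between 0 and 1

# Hypothetical prior (an assumption for illustration): mild preference for middle values.
prior = [t * (1 - t) for t in grid]

# Likelihood of each theta for the observed data. Constant factors are left out
# because they cancel when the posterior is normalized.
like = [t**y_obs * (1 - t)**(N - y_obs) for t in grid]

# Posterior is proportional to likelihood times prior; normalize so weights sum to 1.
unnormalized = [p * l for p, l in zip(prior, like)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

prior_total = sum(prior)
prior_mean = sum(t * p for t, p in zip(grid, prior)) / prior_total
post_mean = sum(t * w for t, w in zip(grid, posterior))

print(f"prior mean     = {prior_mean:.3f}")
print(f"posterior mean = {post_mean:.3f}")
```

Under this assumed prior, the posterior mean lands between the prior mean of 0.5 and the observed proportion of 0.55, which is the weighting at work.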

One issue that we haven't addressed is how the likelihood function changes as we increase the amount of data we have. It is critical for any statistical method to be sensitive to the data: the more data we have, the greater precision we should be able to attribute to our estimates of parameters. Recall that our likelihood function for \(\theta\), the binomial parameter, is \[\theta^y(1-\theta)^{N-y}.\] Notice that we've dropped the binomial coefficient \(\binom{N}{y}\) from the front, because it is unnecessary: it does not change when we change \(\theta\), so dropping it affects only the absolute magnitudes of the likelihood, not the relative values. It may not be obvious from the likelihood function, but as the amount of data increases, the likelihood function narrows further and further. The plot below shows how this happens. Try changing \(N\) using the slider; as you do so the plot will keep the relative value of \(y\) as close as possible to 11/20.

Notice that the likelihood narrows around a single value as \(N\) gets larger. Another way to say this is that more and more values become implausible (unlikely, relative to other values) as we collect more data. When we multiply our prior distribution by our (increasingly narrow) likelihood function, the resulting posterior will also become increasingly narrow. We can now see that the likelihood represents the change in our beliefs from before we saw the data to after we saw the data. This is what defines the likelihood's role as representing the evidence from the data.
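The narrowing can be put into numbers. The sketch below measures the width of the range of \(\theta\) values whose likelihood is at least 1/8 of the maximum, for increasing \(N\) with \(y/N\) held at 11/20. The 1/8 cutoff, the grid size, and the function names are arbitrary illustrative assumptions. Logarithms are used because \(\theta^y(1-\theta)^{N-y}\) underflows to zero in floating point for large \(N\).

```python
from math import log

def log_kernel(theta, y, n):
    """Log of theta^y * (1 - theta)^(n - y), the likelihood without the binomial coefficient."""
    return y * log(theta) + (n - y) * log(1 - theta)

def support_width(n, proportion=0.55, cutoff=1 / 8, steps=10_000):
    """Width of the set of theta values whose relative likelihood is at least `cutoff`."""
    y = round(proportion * n)
    grid = [(i + 0.5) / steps for i in range(steps)]  # midpoints, avoids log(0)
    logs = [log_kernel(t, y, n) for t in grid]
    peak = max(logs)
    inside = [t for t, ll in zip(grid, logs) if ll - peak >= log(cutoff)]
    return max(inside) - min(inside)

sample_sizes = (20, 200, 2000)
widths = [support_width(n) for n in sample_sizes]
for n, w in zip(sample_sizes, widths):
    print(f"N={n:5d}  width of plausible range = {w:.3f}")
```

Each tenfold increase in \(N\) shrinks the range of values that remain plausible under this cutoff.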