LearnBayes

Bayes: the science of evidence

Probability of counts: the Binomial distribution

Suppose we are conducting an opinion poll for a political candidate, whom we'll call Mr. Q. Mr. Q would like to know how many people support their candidacy over that of the other candidate, Ms. K. For the sake of simplicity, suppose that everyone supports one candidate or the other, and that we randomly sample people from the population to determine their opinions. We decide to sample 50 people; of those 50 people, 36 support Mr. Q. If, in reality, public opinion were evenly split, what would be the probability that exactly 36 people in our sample support Mr. Q?

Questions like the one above occur quite often in research: estimating ability from the number of items correct on a test, estimating the proportion of the population that holds a particular belief, estimating the true proportion of females and males in the population, and many more. With respect to Mr. Q's poll, there are several important aspects of the situation that we will focus on.

  1. The number of people sampled was fixed, and was not affected by the outcome of the sampling. We did not, for instance, decide to stop sampling people because we found 36 votes for Mr. Q.
  2. The observations are independent of one another. If we decided to take a shortcut and asked a person and their friend for convenience, we would be violating independence, because friends are more likely to share opinions than two random people in the population. But because we randomly sampled, the outcome of one observation was not predictable from the last except through the information that the last observation gave us about the population as a whole.
  3. The observations are from the same "population". Every person is sampled from the same population. This would not be true if, say, we decided to sample 25 people from one neighborhood and 25 people from a different neighborhood; in that case, there are two populations, each (possibly) with different overall opinions about Mr. Q. In our case, we sampled randomly from the whole population.

These three assumptions about our situation are very important. Together, they define what is called a "binomial setting." A binomial setting allows us to model the population with something called the binomial distribution. Knowing the binomial distribution allows us to answer such questions as the one posed above: if we know the race is tied — that is, that each candidate has 50% support — what is the probability of sampling 36 out of 50 in favor of Mr. Q?

To begin answering this question, we note that each sample can be thought of like a coin flip; the probability of someone saying they support Mr. Q is 0.5, as is the probability of someone saying they support Ms. K. Let's first consider a plausible way the samples could have turned out:

QQQQKQQQQQQQQQKKQKQKQKQQKQQKQKKQQKQQKQQQQQQQKQKQQQ

This row represents the samples in order, from the first sample to the last. There are 36 Qs, each representing support for Mr. Q, and thus 14 Ks. What was the probability that the samples would occur in this order? Basic probability rules allow us to compute the answer. Because the observations are independent, the multiplication rule for independent events lets us multiply the individual probabilities of each of the events to obtain the probability of the whole ordering. Let's denote the probability that a person says they support Mr. Q by the Greek letter \(\theta\); this implies that the probability that someone supports Ms. K is \(1-\theta\), because everyone supports one candidate or the other and the two probabilities must sum to 1.

Multiplying 36 \(\theta\)s and 14 \((1-\theta)\)s together gives us \[ \theta^{36}(1-\theta)^{14}\] or, if we write it more generally, \[\theta^y(1-\theta)^{N-y}\] where \(y\) is the number of votes that Mr. Q got and \(N\) is the total number of people sampled (which means that \(N-y=14\) is the number of votes Ms. K got).

In the case at hand, we are assuming that the race is tied; that is, that \(\theta=0.5\). The probability of the result we obtained is thus

\[ \begin{aligned}\theta^{36}(1-\theta)^{14} &= 0.5^{36}\times 0.5^{14}\\ &\approx 8.9\times10^{-16}, \end{aligned}\]

which is a very small number. This very small number is the probability of the exact, ordered sequence of samples we obtained. 
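As a quick check of the arithmetic above, the following Python sketch multiplies the per-sample probabilities for an ordered sequence. The function name `sequence_probability` is ours, chosen only for illustration; the sequence is the one shown earlier.

```python
def sequence_probability(sequence, theta):
    """Probability of one specific ordered sequence of independent samples.

    `sequence` is a string of 'Q' (supports Mr. Q) and 'K' (supports Ms. K);
    `theta` is the probability that a single sampled person supports Mr. Q.
    """
    probability = 1.0
    for vote in sequence:
        # Multiplication rule for independent events: one factor per sample.
        probability *= theta if vote == "Q" else 1.0 - theta
    return probability


# The ordered sequence of samples shown earlier in the text.
observed = "QQQQKQQQQQQQQQKKQKQKQKQQKQQKQKKQQKQQKQQQQQQQKQKQQQ"

# With theta = 0.5, every factor is 0.5, so the order of Qs and Ks does not matter.
print(sequence_probability(observed, 0.5))
```

Trying other values of `theta` shows how the probability of this particular ordering changes when the race is not tied.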

Note, however, that this number does not answer our original question. Our original question was "What is the probability of obtaining 36 votes for Mr. Q out of 50 random samples?" This question makes no mention of the ordering of the samples; so the intermediate answer we obtained above is far too specific. There are many more ways of obtaining 36 out of 50 votes than the sequence of Ks and Qs above. If we rearranged the letters in some other way, the resulting sequence would be different, but would have exactly the same probability as the one we obtained. Our question can be slightly rephrased: "What is the probability of obtaining any sequence of samples in which there are 36 votes for Mr. Q, and 14 votes for Ms. K?"

Using our basic probability rules again, we know that the probability that one of a set of mutually exclusive events occurs is equal to the sum of the probabilities of the individual events. The sequences are mutually exclusive: if we obtained one sequence of votes, we could not have obtained another. And all of the sequences with 36 Qs and 14 Ks have equal probability. If we knew the total number of sequences that have 36 Qs and 14 Ks, and we called this number \(C\), then the probability of obtaining 36 votes for Mr. Q out of 50 total votes would be \[ C \times \theta^{y}(1-\theta)^{N-y}.\] So, what is \(C\)? How many ways are there to get 36 Qs and 14 Ks out of 50?

Luckily, the answer to this question is well known in probability and statistics. It even has a special name, the choose function: \[\binom{N}{y} = \frac{N!}{y!(N-y)!}\] (The \(!\) denotes the factorial of a number.) The choose function can be found on many calculators, and it tells you how many distinct ways there are of selecting \(y\) out of \(N\) total objects. In our case, the "objects" are votes, but we can still use the choose function to compute \(C\): \[C = \binom{50}{36} = 937{,}845{,}656{,}300,\] a very large number, because there are many different ways to place our 36 Q samples in a sequence of 50.
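The choose function is easy to evaluate on a computer as well. The sketch below, in Python, computes it from the factorial formula and compares the result with the built-in `math.comb` (available in Python 3.8 and later). The helper name `choose` is our own.

```python
from math import comb, factorial


def choose(n, y):
    """Number of distinct ways to select y objects out of n, via factorials."""
    # Integer division is exact here because the result is always a whole number.
    return factorial(n) // (factorial(y) * factorial(n - y))


C = choose(50, 36)
print(f"{C:,}")           # 937,845,656,300
print(C == comb(50, 36))  # True
```

Working with exact integers avoids the rounding problems that can arise when very large factorials are handled as floating-point numbers.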

Now that we know how many different ways there are of ordering the Qs and Ks, we can compute the probability of obtaining 36 votes for Mr. Q when \(N=50\) and \(\theta=0.5\), which we can denote \(Pr(y=36; N=50, \theta=0.5)\): \[Pr(y=36; N=50, \theta=0.5) = 937{,}845{,}656{,}300 \times 8.9\times10^{-16} \approx 0.0008.\] That is, the probability of obtaining a total of 36 votes for Mr. Q out of 50, if the race were tied, is about 0.0008. We don't have to restrict ourselves to a particular number of votes for Mr. Q, a specific number of total votes, or a specific value for the population support for Mr. Q, however; we can write a general formula for the probability, given any possible values: \[Pr(y; N, \theta) = \binom{N}{y}\theta^y(1-\theta)^{N-y}\] This is known as the binomial distribution; it allows us to compute the probabilities of specific outcomes whenever we have a binomial setting.
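The general formula translates almost directly into code. Here is a minimal Python sketch; the function name `binomial_probability` is our own choice, and the arguments mirror \(y\), \(N\), and \(\theta\) from the formula.

```python
from math import comb


def binomial_probability(y, n, theta):
    """Pr(y; N, theta): probability of exactly y successes in n samples."""
    # (number of orderings) x (probability of any one ordering)
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


p = binomial_probability(36, 50, 0.5)
print(f"{p:.4f}")  # 0.0008
```

Because the function takes all three quantities as arguments, the same code works for any number of votes, any sample size, and any level of population support.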

With the equation defined above, we can compute the probabilities of all the possible outcomes when taking 50 samples with \(\theta=0.5\). The plot below shows the probabilities of all the outcomes. On the x-axis are the possible outcomes, from 0 votes for Mr. Q to 50 votes for Mr. Q. The y-axis shows the corresponding probability.

As might be expected, the most probable outcome is that Mr. Q will get half the votes, or \(y=25\). However, there is a broad range of probable outcomes other than 25 votes. The number of votes we obtained, 36, lies far from the most probable outcomes; its probability is only about 0.0008.
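The probabilities of all 51 possible outcomes can also be tabulated directly. The following Python sketch does so for \(N=50\) and \(\theta=0.5\) and draws a rough text bar for each outcome; the scaling factor of 400 is arbitrary and only controls the width of the bars.

```python
from math import comb

N, theta = 50, 0.5

# Binomial probability of each possible number of votes for Mr. Q, from 0 to N.
probabilities = [
    comb(N, y) * theta**y * (1 - theta) ** (N - y) for y in range(N + 1)
]

# The outcome with the highest probability.
most_probable = max(range(N + 1), key=lambda y: probabilities[y])
print("most probable outcome:", most_probable)  # 25

# One row per outcome: votes, probability, and a bar proportional to it.
for y, p in enumerate(probabilities):
    bar = "#" * round(p * 400)
    print(f"{y:2d} {p:.4f} {bar}")
```

Changing `N` and `theta` at the top of the sketch shows how the shape of the distribution depends on the sample size and on the level of support.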

The plot below allows you to change the parameters \(N\) and \(\theta\) to see how the distribution changes. Change the values using the slider to see how this affects the probability of specific values of \(y\).

By manipulating the plot above, you should be able to answer the following questions:

  1. If the candidates were tied, and you sampled 20 people instead of 50, what is the probability of obtaining 13 votes for Mr. Q?
  2. If Mr. Q truly had the support of 60% of the population and you sampled 20 people, what would be the probability that you obtained exactly 10 votes for Mr. Q?
  3. If Mr. Q truly had the support of 60% of the population and you sampled 5 people, what would be the probability of Mr. Q obtaining a majority of those 5 votes? (Hint: add up the probabilities of the outcomes that represent a majority.)
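The hint in the last question, adding up the probabilities of several outcomes, can be sketched in Python as follows. The function names are our own, and the example uses the 50-person poll from earlier rather than the numbers in the questions, so the questions are left for you to work out.

```python
from math import comb


def binomial_probability(y, n, theta):
    """Probability of exactly y votes for Mr. Q out of n samples."""
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)


def probability_of_range(y_min, y_max, n, theta):
    """Probability that the count falls anywhere from y_min to y_max, inclusive."""
    # The outcomes are mutually exclusive, so their probabilities add.
    return sum(binomial_probability(y, n, theta) for y in range(y_min, y_max + 1))


# A majority of 50 samples means 26 or more votes for Mr. Q.
print(probability_of_range(26, 50, 50, 0.5))
```

The same function answers any question about a range of outcomes once the sample size and the level of support are substituted.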