Completing the square

Several of the explanations on this website require a bit of algebra to understand. "Completing the square" [wikipedia] refers to taking an expression of the form \(ax^2 + bx + c\) and, though algebraic manipulation, ending up with an expression of the form \(a(x + d)^2 + e\). Completing the square is an useful algebra trick to know since it arises often when multiplying likelihoods by priors, especially in the context of the Normal distribution. 

In this reference tutorial, we will demonstrate how to complete the square in both univariate (scalar) and multivariate (matrix) contexts. Incidentally, this article also contains the derivation of the posterior distribution of the mean of the normal distribution when the variance is known in both univariate and multivariate contexts.

Completing the square: univariate


Consider the case of observing \(N\) independent samples from a \(\mbox{Normal}(\mu,\sigma^2)\) distribution. Suppose we know the true value of \(\sigma^2\), and we are interested in determining the posterior distribution of \(\mu\). It is conventional to place a conjugate normal prior on \(\mu\). Our model is:


In Bayesian statistics, the posterior is proportional to the likelihood times the prior. Because all observations are independent, our likelihood is the product of \(N\) Normal density functions: one for each \(y_i\). The prior then provides another Normal density function term. After simplifying and dropping terms that are not functions of \(\mu\), we end up with a posterior distribution for \(\mu\) proportional to

\(\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N(y_i - \mu)^2\right\}\exp\left\{-\frac{1}{2\tau^2}(\mu - \mu_0)^2\right\}\)

The first term is the likelihood, and the second term is the prior. Because we desire a posterior that is a simple function of \(\mu\), we need to gather all the terms that include \(\mu\) together; as the posterior is written above, \(\mu\) is scattered across \(N+1\) terms. Completing the square is the trick that will allow us to gather all the \(\mu\) terms into one.

The first step is to expand the squares containing \(\mu\). This yields

\(\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^N\left(y_i^2 - 2\mu y_i - \mu^2\right)\right\}\exp\left\{-\frac{1}{2\tau^2}(\mu^2 - 2\mu\mu_0 + \mu_0^2)\right\}\)

We now distribute the sum, yielding
\(\exp\left\{-\frac{1}{2\sigma^2}\left(\color{red}{\sum_{i=1}^Ny_i^2} - 2\mu \sum_{i=1}^Ny_i + N\mu^2\right)\right\}\exp\left\{-\frac{1}{2\tau^2}(\mu^2 - 2\mu\mu_0 + \color{red}{\mu_0^2})\right\}\)
There are several terms above that do not involve \(\mu\); these are highlighted in red. When we drop those terms, combine all remaining terms within one exponent, and then focus only on what is in the exponent, what remains is
\(-\frac{1}{2}\left(\frac{N}{\sigma^2}\mu^2 + \frac{1}{\tau^2}\mu^2 - 2\mu \frac{\sum_{i=1}^Ny_i}{\sigma^2} - 2\mu\frac{1}{\tau^2}\mu_0\right) \)
We can combine the \(\mu^2\) terms together, and the \(\mu\) terms together:
\(-\frac{1}{2}\left(\left(\frac{N}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 - 2\left(\frac{\sum_{i=1}^Ny_i}{\sigma^2} + \frac{1}{\tau^2}\mu_0\right)\mu\right)\)
Finally, for ease of notation, we can use the fact that \(\sum_{i=1}^Ny_i = N\bar{y}\):
\(-\frac{1}{2}\left(\left(\frac{N}{\sigma^2} + \frac{1}{\tau^2}\right)\mu^2 - 2\left(\frac{N}{\sigma^2}\bar{y} + \frac{1}{\tau^2}\mu_0\right)\mu\right)\)
After all of this algebraic simplification, inside the parentheses what we have looks something like \(ax^2 - 2bx\). We can now "complete the square" to obtain something of the form \((x - c)^2\). 

Competing the square

Our first step is to make the notation easier to follow. Let \(a = \frac{N}{\sigma^2} + \frac{1}{\tau^2}\) and \(b = \frac{N}{\sigma^2}\bar{y} + \frac{1}{\tau^2}\mu_0\). Using the new, simplified notation, we have

\(-\frac{1}{2}\left(a\mu^2 - 2b\mu\right)\)

We can move the coefficient \(a\) on \(\mu\) outside the parentheses:

\(-\frac{a}{2}\left(\mu^2 - 2\frac{b}{a}\mu\right)\)

We now add and subtract the same value inside the parentheses. This doesn't change the value at all, since the terms sum to 0:

\(-\frac{a}{2}\left(\mu^2 - 2\frac{b}{a}\mu + \frac{b^2}{a^2} \color{red}{- \frac{b^2}{a^2}}\right)\)

In fact, neither term is a function of \(\mu\), so we can simply drop the term colored in red.

\(-\frac{a}{2}\left(\mu^2 - 2\frac{b}{a}\mu + \frac{b^2}{a^2}\right)\)

The terms within the parentheses are of the form \(x^2 - 2xc + c^2\), which, from the rules learned in algebra, can be simplified to \((x - c)^2\). Applying this to our terms, we obtain:

\(-\frac{a}{2}\left(\mu - \frac{b}{a}\right)^2\)

We have thus completed the square. We are not done, however: this was only the portion of the posterior distribution that was in the exponent. Replacing the terms in the exponent yields:

\(\exp\left\{-\frac{a}{2}\left(\mu - \frac{b}{a}\right)^2\right\}\)

By completing the square, we have revealed that the posterior distribution of \(\mu\) has the form of a normal distribution with a mean of \(b/a\) and a variance of \(1/a\), or

\(\mu\mid y \sim \mbox{Normal}\left(\mu_n, \sigma^2_n\right)\)


\(\begin{eqnarray}\sigma^2_n = \frac{1}{a}&=&\left(\frac{N}{\sigma^2}+\frac{1}{\tau^2}\right)^{-1},\\\mu_n = \frac{b}{a}&=&\sigma^2_n\left(\frac{N}{\sigma^2}\bar{y} + \frac{1}{\tau^2}\mu_0\right).\end{eqnarray}\)

 Completing the square: multivariate

If we are working with multivariate distributions, it is often customary to represent them as functions of vectors and matrices. The density of the \(p\)-variate multivariate normal distribution, for instance, is defined as

\(p(\boldsymbol y\mid \boldsymbol\mu, \boldsymbol\Sigma) = \left(2\pi\right)^{-\frac{p}{2}}\left|\boldsymbol\Sigma\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}\left(\boldsymbol y - \boldsymbol\mu\right)'\boldsymbol\Sigma^{-1}\left(\boldsymbol y - \boldsymbol\mu\right)\right\}\) 

where data \(\boldsymbol y\) and mean \(\boldsymbol \mu\) are \(p\times 1\) vectors, and \(\boldsymbol \Sigma\) is a \(p\times p\) covariance matrix. Consider the case that we have \(N\) observations from a multivariate normal with known covariance, and we wish to find the posterior distribution for the mean vector \(\boldsymbol\mu\). This case is analogous to the univariate case above, and, analogously, we will use a multivariate Normal prior on \(\boldsymbol\mu\). Our model is:

\(\begin{eqnarray}\boldsymbol y_i&\stackrel{iid}{\sim}&\mbox{Multivariate Normal}_p(\boldsymbol\mu,\boldsymbol\Sigma)\\\boldsymbol\mu&\sim&\mbox{Multivariate Normal}_p(\boldsymbol\mu_0,\boldsymbol V)\end{eqnarray}\)

In the previous section, we detailed a lot of the preliminary algebra in deriving the posterior distribution. Since the algebra is analogous in the multivariate case, there is no reason to reproduce it here. After the preliminary algebra and dropping terms that are irrelevant, the terms inside exponent are

\(-\frac{1}{2}\left(\boldsymbol\mu'\left(N\boldsymbol\Sigma^{-1} + \boldsymbol V^{-1}\right)\boldsymbol\mu - \boldsymbol\mu'\left(N\boldsymbol\Sigma^{-1}\bar{\boldsymbol y} + \boldsymbol V^{-1}\boldsymbol\mu_0\right) - \left(N\boldsymbol\Sigma^{-1}\bar{\boldsymbol y} + \boldsymbol V^{-1}\boldsymbol\mu_0\right)'\boldsymbol\mu\right)\)

where \(\bar{\boldsymbol y} = \frac{1}{N}\sum_{i=1}^N\boldsymbol y\). At this point, we would like to complete the square in a fashion analogous to the way we completed the square in the univariate case. To facilitate this, we introduce analogous notation. Let \(\boldsymbol A = N\boldsymbol\Sigma^{-1} + \boldsymbol V^{-1}\), and let \(\boldsymbol b = N\boldsymbol\Sigma^{-1}\bar{\boldsymbol y} + \boldsymbol V^{-1}\boldsymbol\mu_0\). With this notation, the previous expression is equal to

\(-\frac{1}{2}\left(\boldsymbol\mu'\boldsymbol A\boldsymbol\mu - \boldsymbol\mu'\boldsymbol b - \boldsymbol b'\boldsymbol\mu\right).\)

In the univariate case, our next two steps were to bring \(a\) — analogous to \(\boldsymbol A\) here — outside the parentheses, and then to introduce new terms to complete the square. However, we cannot do that here. \(\boldsymbol A\) is between the two vector terms \(\boldsymbol\mu'\) and \(\boldsymbol\mu\); because matrix multiplication does not commute, we cannot multiply the term by anything to move \(\boldsymbol A\). We can, though, introduce a term that is not a function of \(\boldsymbol\mu\) to complete the square:

\(-\frac{1}{2}\left(\boldsymbol\mu'\boldsymbol A\boldsymbol\mu - \boldsymbol\mu'\boldsymbol b - \boldsymbol b'\boldsymbol\mu + \boldsymbol b'\boldsymbol A^{-1}\boldsymbol b\right).\)

In order to deal with the difficultly with \(\boldsymbol A\), we note three things: first, that \(\boldsymbol A\) is symmetric and invertible, because it is the weighted sum of two symmetric, full-rank covariance matrices; and second, that the identity matrix \(\boldsymbol I = \boldsymbol A^{-1}\boldsymbol A = \boldsymbol A\boldsymbol A^{-1}\). Since multiplying by \(\boldsymbol I\) does not change a matrix expression as long as the matrix is the proper dimension, we can rewrite our expression above:

\(-\frac{1}{2}\left(\boldsymbol\mu'\boldsymbol A\boldsymbol\mu - \boldsymbol\mu'\boldsymbol A\boldsymbol A^{-1}\boldsymbol b - \boldsymbol b'\boldsymbol A^{-1}\boldsymbol A\boldsymbol\mu + \boldsymbol b'\boldsymbol A^{-1}\boldsymbol A\boldsymbol A^{-1}\boldsymbol b\right).\)

 We  can now introduce new notation, analogous to the notation at the end of the univariate competing the square section above. Let \(\boldsymbol\Sigma_n = \boldsymbol A^{-1}\) and \(\boldsymbol\mu_n = \boldsymbol A^{-1}\boldsymbol b\). Then the expression above becomes

\(-\frac{1}{2}\left(\boldsymbol\mu'\boldsymbol \Sigma_n^{-1}\boldsymbol\mu - \boldsymbol\mu'\boldsymbol \Sigma_n^{-1}\boldsymbol \mu_n - \boldsymbol \mu_n'\boldsymbol \Sigma_n^{-1}\boldsymbol\mu + \boldsymbol \mu_n'\boldsymbol \Sigma_n^{-1}\boldsymbol \mu_n\right).\)

 It is now obvious that we can factor the above expression into

\(-\frac{1}{2}\left(\boldsymbol\mu - \boldsymbol\mu_n\right)'\boldsymbol\Sigma_n^{-1}\left(\boldsymbol\mu - \boldsymbol\mu_n\right).\)

If you don't see why, try expanding the expression above. Replacing the terms into the exponential function, we see that the posterior distribution is proportional to

\(\exp\left\{-\frac{1}{2}\left(\boldsymbol\mu - \boldsymbol\mu_n\right)'\boldsymbol\Sigma_n^{-1}\left(\boldsymbol\mu - \boldsymbol\mu_n\right)\right\}.\)

Because this is proportional to the probability density function for a multivariate Normal distribution with mean vector \(\boldsymbol\mu_n\) and covariance matrix \(\boldsymbol\Sigma_n\), we know the exact form of the posterior distribution for the multivariate Normal mean vector \(\boldsymbol\mu\) when the prior is also multivariate Normal:

\(\boldsymbol\mu\mid \boldsymbol Y \sim \mbox{Multivariate Normal}_p\left(\boldsymbol\mu_n,\boldsymbol\Sigma_n\right),\)


\(\begin{eqnarray}\boldsymbol\Sigma_n = \boldsymbol A^{-1}&=&\left(N\boldsymbol\Sigma^{-1} + \boldsymbol V^{-1}\right)^{-1},\\\boldsymbol\mu_n = \boldsymbol A^{-1}\boldsymbol b&=&\boldsymbol\Sigma_n\left(N\boldsymbol\Sigma^{-1}\bar{\boldsymbol y} + \boldsymbol V^{-1}\boldsymbol\mu_0\right).\end{eqnarray}\)

This is analogous to the posterior distribution in the univariate case, as can be easily seen by comparing the parameters of the respective posteriors.