Laplace's approximation

Laplace's approximation or the quadratic approximation (QUAP)^[1] provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.^[2]^[3] The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the error of the approximation tends to 0 as the number of data points tends to infinity.^[4]^[5]

For example, consider a regression or classification model with data set $\{x_{n},y_{n}\}_{n=1,\ldots ,N}$ comprising inputs $x$ and outputs $y$ with (unknown) parameter vector $\theta$ of length $D$ . The likelihood is denoted $p({\bf {y}}|{\bf {x}},\theta )$ and the parameter prior $p(\theta )$ . Suppose one wants to approximate the joint density of outputs and parameters $p({\bf {y}},\theta |{\bf {x}})$ . Bayes' formula reads:

$p({\bf {y}},\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}},\theta )p(\theta |{\bf {x}})\;=\;p({\bf {y}}|{\bf {x}})p(\theta |{\bf {y}},{\bf {x}})\;\simeq \;{\tilde {q}}(\theta )\;=\;Zq(\theta ).$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, also equal to the product of the marginal likelihood $p({\bf {y}}|{\bf {x}})$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$ . Here the prior is assumed not to depend on the inputs, so that $p(\theta |{\bf {x}})=p(\theta )$ . Seen as a function of $\theta$ , the joint is an un-normalised density.

In Laplace's approximation, we approximate the joint by an un-normalised Gaussian ${\tilde {q}}(\theta )=Zq(\theta )$ , where $q$ denotes the approximate density, ${\tilde {q}}$ the un-normalised density and $Z$ the normalisation constant of ${\tilde {q}}$ (independent of $\theta$ ). Since the marginal likelihood $p({\bf {y}}|{\bf {x}})$ does not depend on the parameter $\theta$ and the posterior $p(\theta |{\bf {y}},{\bf {x}})$ normalises over $\theta$ , we can immediately identify them with $Z$ and $q(\theta )$ of our approximation, respectively.

Laplace's approximation is

$p({\bf {y}},\theta |{\bf {x}})\;\simeq \;p({\bf {y}},{\hat {\theta }}|{\bf {x}})\exp {\big (}-{\tfrac {1}{2}}(\theta -{\hat {\theta }})^{\top }S^{-1}(\theta -{\hat {\theta }}){\big )}\;=\;{\tilde {q}}(\theta ),$

where we have defined

${\begin{aligned}{\hat {\theta }}&\;=\;\operatorname {argmax} _{\theta }\log p({\bf {y}},\theta |{\bf {x}}),\\S^{-1}&\;=\;-\left.\nabla _{\theta }\nabla _{\theta }\log p({\bf {y}},\theta |{\bf {x}})\right|_{\theta ={\hat {\theta }}},\end{aligned}}$

Here ${\hat {\theta }}$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D\times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta ={\hat {\theta }}$ . Thus, the Gaussian approximation matches the value and the log-curvature of the un-normalised target density at the mode. The value of ${\hat {\theta }}$ is usually found using a gradient-based method.
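As an illustration of how the mode and the matrix $S$ might be computed in practice, the following is a minimal sketch in Python. The model (Bayesian logistic regression with a zero-mean isotropic Gaussian prior), the function names, the simulated data and the use of Newton's method are all assumptions made for this example and are not taken from the cited sources.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_log_joint(theta, X, y, prior_var):
    """Gradient of log p(y, theta | x) for logistic regression
    with an assumed N(0, prior_var * I) prior on theta."""
    return X.T @ (y - sigmoid(X @ theta)) - theta / prior_var

def neg_hessian_log_joint(theta, X, prior_var):
    """S^{-1}: negative Hessian of the log joint (positive definite for this model)."""
    p = sigmoid(X @ theta)
    return X.T @ (X * (p * (1.0 - p))[:, None]) + np.eye(X.shape[1]) / prior_var

def laplace_fit(X, y, prior_var=1.0, max_iter=100, tol=1e-10):
    """Find the MAP point by Newton's method and return (theta_hat, S)."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        step = np.linalg.solve(neg_hessian_log_joint(theta, X, prior_var),
                               grad_log_joint(theta, X, y, prior_var))
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    # Covariance of the Gaussian approximation: inverse curvature at the mode.
    S = np.linalg.inv(neg_hessian_log_joint(theta, X, prior_var))
    return theta, S

# Simulated data (hypothetical): 200 inputs of dimension D = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -2.0, 0.5]))).astype(float)

theta_hat, S = laplace_fit(X, y)
```

The approximate posterior is then the Gaussian with mean `theta_hat` and covariance `S`. Newton's method is only one possible optimiser; any method that reaches a mode can be used, with the Hessian evaluated once at the end.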

In summary, we have

${\begin{aligned}q(\theta )&\;=\;{\cal {N}}(\theta |\mu ={\hat {\theta }},\Sigma =S),\\\log Z&\;=\;\log p({\bf {y}},{\hat {\theta }}|{\bf {x}})+{\tfrac {1}{2}}\log |S|+{\tfrac {D}{2}}\log(2\pi ),\end{aligned}}$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood respectively.
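The two summary formulas can be checked on a model whose marginal likelihood is known in closed form. The sketch below assumes a Bernoulli likelihood with a Beta prior (so $D=1$ and there are no inputs); this model and all function names are chosen for illustration only. It evaluates $\log Z$ from the formula above and compares it with the exact log marginal likelihood.

```python
import math

def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def log_joint(theta, k, n, a, b):
    """log p(y, theta): k successes in a fixed sequence of n Bernoulli
    trials, with an assumed Beta(a, b) prior on theta."""
    return ((k + a - 1) * math.log(theta)
            + (n - k + b - 1) * math.log(1.0 - theta)
            - log_beta(a, b))

def laplace_summary(k, n, a, b):
    """Return (theta_hat, S, log_Z) of Laplace's approximation with D = 1."""
    A = k + a - 1          # exponent of theta in the joint; must be > 0
    B = n - k + b - 1      # exponent of (1 - theta); must be > 0
    theta_hat = A / (A + B)                                   # mode of the joint
    precision = A / theta_hat**2 + B / (1.0 - theta_hat)**2   # S^{-1}
    S = 1.0 / precision
    log_Z = (log_joint(theta_hat, k, n, a, b)
             + 0.5 * math.log(S)
             + 0.5 * math.log(2.0 * math.pi))
    return theta_hat, S, log_Z

def exact_log_evidence(k, n, a, b):
    """Exact log marginal likelihood: log B(k + a, n - k + b) - log B(a, b)."""
    return log_beta(k + a, n - k + b) - log_beta(a, b)

theta_hat, S, log_Z = laplace_summary(k=30, n=100, a=2.0, b=2.0)
exact = exact_log_evidence(k=30, n=100, a=2.0, b=2.0)
print(theta_hat, S, log_Z, exact)
```

With this many observations the two log marginal likelihoods should agree closely, in line with the asymptotic justification given earlier; with very few observations, or a mode near the boundary of the interval, the agreement can be expected to degrade.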

The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density. Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,^[6] and for Gaussian processes by Williams and Barber.^[7]

Integrated nested Laplace approximation

Integrated nested Laplace approximation (INLA) is a method for approximate Bayesian inference based on Laplace's approximation.^[8] It is designed for a class of models called latent Gaussian models (LGMs), for which it can be a fast and accurate alternative to Markov chain Monte Carlo methods for computing posterior marginal distributions.^[9]^[10]^[11] Due to its relative speed even with large data sets for certain problems and models, INLA has been a popular inference method in applied statistics, in particular spatial statistics, ecology, seismology, and epidemiology.^[12]^[13]^[14] It is also possible to combine INLA with a finite element method solution of a stochastic partial differential equation to study e.g. spatial point processes and species distribution models.^[15]^[16] The INLA method is implemented in the R-INLA R package.^[17]

Latent Gaussian models

Let ${\boldsymbol {y}}=(y_{1},\dots ,y_{n})$ denote the response variable (that is, the observations) which belongs to an exponential family, with the mean $\mu _{i}$ (of $y_{i}$ ) being linked to a linear predictor $\eta _{i}$ via an appropriate link function. The linear predictor can take the form of a (Bayesian) additive model. All latent effects (the linear predictor, the intercept, coefficients of possible covariates, and so on) are collectively denoted by the vector ${\boldsymbol {x}}$ . The hyperparameters of the model are denoted by ${\boldsymbol {\theta }}$ . As per Bayesian statistics, ${\boldsymbol {x}}$ and ${\boldsymbol {\theta }}$ are random variables with prior distributions.

The observations are assumed to be conditionally independent given ${\boldsymbol {x}}$ and ${\boldsymbol {\theta }}$ : $\pi ({\boldsymbol {y}}|{\boldsymbol {x}},{\boldsymbol {\theta }})=\prod _{i\in {\mathcal {I}}}\pi (y_{i}|\eta _{i},{\boldsymbol {\theta }}),$ where ${\mathcal {I}}$ is the set of indices for observed elements of ${\boldsymbol {y}}$ (some elements may be unobserved, and for these INLA computes a posterior predictive distribution). Note that the linear predictor ${\boldsymbol {\eta }}$ is part of ${\boldsymbol {x}}$ .

For the model to be a latent Gaussian model, it is assumed that ${\boldsymbol {x}}|{\boldsymbol {\theta }}$ is a Gaussian Markov Random Field (GMRF)^[8] (that is, a multivariate Gaussian with additional conditional independence properties) with probability density $\pi ({\boldsymbol {x}}|{\boldsymbol {\theta }})\propto \left|{\boldsymbol {Q_{\theta }}}\right|^{1/2}\exp \left(-{\frac {1}{2}}{\boldsymbol {x}}^{T}{\boldsymbol {Q_{\theta }}}{\boldsymbol {x}}\right),$ where ${\boldsymbol {Q_{\theta }}}$ is a ${\boldsymbol {\theta }}$ -dependent sparse precision matrix and $\left|{\boldsymbol {Q_{\theta }}}\right|$ is its determinant. The precision matrix is sparse due to the GMRF assumption. The prior distribution $\pi ({\boldsymbol {\theta }})$ for the hyperparameters need not be Gaussian. However, the number of hyperparameters, $m=\mathrm {dim} ({\boldsymbol {\theta }})$ , is assumed to be small (say, less than 15).
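To make the sparsity of the precision matrix concrete, the following sketch builds the precision matrix of a stationary first-order autoregressive process, a standard example of a GMRF, and evaluates the zero-mean Gaussian log density from the formula above (with the normalising constant $(2\pi )^{-n/2}$ included). The choice of an AR(1) model, the parameter names `phi` and `sigma2`, and the use of dense NumPy arrays are assumptions made for illustration; practical implementations rely on sparse matrix storage and factorisations.

```python
import numpy as np

def ar1_precision(n, phi, sigma2=1.0):
    """Tridiagonal precision matrix of a stationary AR(1) process
    x_t = phi * x_{t-1} + e_t, e_t ~ N(0, sigma2), with |phi| < 1."""
    Q = np.zeros((n, n))
    idx = np.arange(n)
    Q[idx, idx] = 1.0 + phi**2
    Q[0, 0] = Q[-1, -1] = 1.0
    Q[idx[:-1], idx[1:]] = -phi
    Q[idx[1:], idx[:-1]] = -phi
    return Q / sigma2

def gmrf_logpdf(x, Q):
    """log pi(x | theta) for a zero-mean GMRF with precision matrix Q."""
    n = x.size
    _, logdet = np.linalg.slogdet(Q)
    return -0.5 * n * np.log(2.0 * np.pi) + 0.5 * logdet - 0.5 * x @ Q @ x

Q = ar1_precision(6, phi=0.7)
Sigma = np.linalg.inv(Q)          # the covariance matrix is dense

nonzeros_Q = np.count_nonzero(Q)          # only the three central diagonals
nonzeros_Sigma = np.count_nonzero(Sigma)  # every entry is non-zero
logp = gmrf_logpdf(np.zeros(6), Q)
```

Each variable here is conditionally independent of all others given its two neighbours, which is why only the three central diagonals of `Q` are non-zero even though every pair of variables is correlated.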

Approximate Bayesian inference with INLA

In Bayesian inference, one wants to solve for the posterior distribution of the latent effects ${\boldsymbol {x}}$ and the hyperparameters ${\boldsymbol {\theta }}$ . Applying Bayes' theorem

$\pi ({\boldsymbol {x}},{\boldsymbol {\theta }}|{\boldsymbol {y}})={\frac {\pi ({\boldsymbol {y}}|{\boldsymbol {x}},{\boldsymbol {\theta }})\pi ({\boldsymbol {x}}|{\boldsymbol {\theta }})\pi ({\boldsymbol {\theta }})}{\pi ({\boldsymbol {y}})}},$

the joint posterior distribution of ${\boldsymbol {x}}$ and ${\boldsymbol {\theta }}$ is given by

${\begin{aligned}\pi ({\boldsymbol {x}},{\boldsymbol {\theta }}|{\boldsymbol {y}})&\propto \pi ({\boldsymbol {\theta }})\pi ({\boldsymbol {x}}|{\boldsymbol {\theta }})\prod _{i\in {\mathcal {I}}}\pi (y_{i}|\eta _{i},{\boldsymbol {\theta }})\\&\propto \pi ({\boldsymbol {\theta }})\left|{\boldsymbol {Q_{\theta }}}\right|^{1/2}\exp \left(-{\frac {1}{2}}{\boldsymbol {x}}^{T}{\boldsymbol {Q_{\theta }}}{\boldsymbol {x}}+\sum _{i\in {\mathcal {I}}}\log \left[\pi (y_{i}|\eta _{i},{\boldsymbol {\theta }})\right]\right).\end{aligned}}$

Obtaining the exact posterior is generally a very difficult problem. In INLA, the main aim is to approximate the posterior marginals

${\begin{array}{rcl}\pi (x_{i}|{\boldsymbol {y}})&=&\int \pi (x_{i}|{\boldsymbol {\theta }},{\boldsymbol {y}})\pi ({\boldsymbol {\theta }}|{\boldsymbol {y}})d{\boldsymbol {\theta }}\\\pi (\theta _{j}|{\boldsymbol {y}})&=&\int \pi ({\boldsymbol {\theta }}|{\boldsymbol {y}})d{\boldsymbol {\theta }}_{-j},\end{array}}$

where ${\boldsymbol {\theta }}_{-j}=\left(\theta _{1},\dots ,\theta _{j-1},\theta _{j+1},\dots ,\theta _{m}\right)$ .

A key idea of INLA is to construct nested approximations given by

${\begin{array}{rcl}{\widetilde {\pi }}(x_{i}|{\boldsymbol {y}})&=&\int {\widetilde {\pi }}(x_{i}|{\boldsymbol {\theta }},{\boldsymbol {y}}){\widetilde {\pi }}({\boldsymbol {\theta }}|{\boldsymbol {y}})d{\boldsymbol {\theta }}\\{\widetilde {\pi }}(\theta _{j}|{\boldsymbol {y}})&=&\int {\widetilde {\pi }}({\boldsymbol {\theta }}|{\boldsymbol {y}})d{\boldsymbol {\theta }}_{-j},\end{array}}$

where ${\widetilde {\pi }}(\cdot |\cdot )$ is an approximated posterior density. The approximation to the marginal density $\pi (x_{i}|{\boldsymbol {y}})$ is obtained in a nested fashion by first approximating $\pi ({\boldsymbol {\theta }}|{\boldsymbol {y}})$ and $\pi (x_{i}|{\boldsymbol {\theta }},{\boldsymbol {y}})$ , and then numerically integrating out ${\boldsymbol {\theta }}$ as

${\begin{aligned}{\widetilde {\pi }}(x_{i}|{\boldsymbol {y}})=\sum _{k}{\widetilde {\pi }}\left(x_{i}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)\times {\widetilde {\pi }}({\boldsymbol {\theta }}_{k}|{\boldsymbol {y}})\times \Delta _{k},\end{aligned}}$

where the summation is over the values of ${\boldsymbol {\theta }}$ , with integration weights given by $\Delta _{k}$ . The approximation of $\pi (\theta _{j}|{\boldsymbol {y}})$ is computed by numerically integrating ${\boldsymbol {\theta }}_{-j}$ out from ${\widetilde {\pi }}({\boldsymbol {\theta }}|{\boldsymbol {y}})$ .
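The weighted sum over hyperparameter values can be sketched as follows for a single scalar hyperparameter. The sketch assumes that each conditional ${\widetilde {\pi }}(x_{i}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}})$ is Gaussian with a known mean and standard deviation, and that the hyperparameter posterior is available only up to a constant on a grid; the function names and the numerical values in the usage part are hypothetical placeholders, not output of a fitted model.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def marginal_over_theta(x_grid, log_post_theta, deltas, cond_mean, cond_sd):
    """Approximate pi(x_i | y) on x_grid as
    sum_k pi(x_i | theta_k, y) * pi(theta_k | y) * Delta_k.

    log_post_theta : unnormalised log pi(theta_k | y) at each grid point
    deltas         : integration weights Delta_k
    cond_mean/sd   : mean and sd of the Gaussian conditional at each theta_k
    """
    w = np.exp(log_post_theta - np.max(log_post_theta)) * deltas
    w = w / w.sum()   # normalise so that the weights sum to one
    dens = normal_pdf(x_grid[:, None], cond_mean[None, :], cond_sd[None, :])
    return dens @ w

# Hypothetical inputs on an equally spaced grid of 11 hyperparameter values.
theta_grid = np.linspace(0.5, 3.0, 11)
deltas = np.full(theta_grid.size, theta_grid[1] - theta_grid[0])
log_post = -(theta_grid - 1.5) ** 2            # placeholder log posterior
cond_mean = 1.0 / (1.0 + theta_grid)           # placeholder conditional means
cond_sd = 1.0 / np.sqrt(1.0 + theta_grid)      # placeholder conditional sds

x_grid = np.linspace(-5.0, 5.0, 2001)
marginal = marginal_over_theta(x_grid, log_post, deltas, cond_mean, cond_sd)
```

The result is a finite mixture of the conditional densities, so it can be skewed or heavy-tailed even when every component is Gaussian.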

To get the approximate distribution ${\widetilde {\pi }}({\boldsymbol {\theta }}|{\boldsymbol {y}})$ , one can use the relation

${\begin{aligned}{\pi }({\boldsymbol {\theta }}|{\boldsymbol {y}})={\frac {\pi \left({\boldsymbol {x}},{\boldsymbol {\theta }},{\boldsymbol {y}}\right)}{\pi \left({\boldsymbol {x}}|{\boldsymbol {\theta }},{\boldsymbol {y}}\right)\pi ({\boldsymbol {y}})}},\end{aligned}}$

as the starting point. Then ${\widetilde {\pi }}({\boldsymbol {\theta }}|{\boldsymbol {y}})$ is obtained at a specific value of the hyperparameters ${\boldsymbol {\theta }}={\boldsymbol {\theta }}_{k}$ with Laplace's approximation^[8]

${\begin{aligned}{\widetilde {\pi }}({\boldsymbol {\theta }}_{k}|{\boldsymbol {y}})&\propto \left.{\frac {\pi \left({\boldsymbol {x}},{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)}{{\widetilde {\pi }}_{G}\left({\boldsymbol {x}}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)}}\right\vert _{{\boldsymbol {x}}={\boldsymbol {x}}^{*}({\boldsymbol {\theta }}_{k})},\\&\propto \left.{\frac {\pi ({\boldsymbol {y}}|{\boldsymbol {x}},{\boldsymbol {\theta }}_{k})\pi ({\boldsymbol {x}}|{\boldsymbol {\theta }}_{k})\pi ({\boldsymbol {\theta }}_{k})}{{\widetilde {\pi }}_{G}\left({\boldsymbol {x}}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)}}\right\vert _{{\boldsymbol {x}}={\boldsymbol {x}}^{*}({\boldsymbol {\theta }}_{k})},\end{aligned}}$

where ${\widetilde {\pi }}_{G}\left({\boldsymbol {x}}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)$ is the Gaussian approximation to ${\pi }\left({\boldsymbol {x}}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)$ whose mode at a given ${\boldsymbol {\theta }}_{k}$ is ${\boldsymbol {x}}^{*}({\boldsymbol {\theta }}_{k})$ . The mode can be found numerically, for example with the Newton–Raphson method.
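The evaluation of ${\widetilde {\pi }}({\boldsymbol {\theta }}_{k}|{\boldsymbol {y}})$ can be sketched for a small toy model. The sketch assumes Poisson observations with a log link, $y_{i}\sim \mathrm {Poisson} (e^{x_{i}})$ , a precision matrix of the form ${\boldsymbol {Q_{\theta }}}=\theta R$ with a fixed structure matrix $R$ , and a Gamma prior on the single hyperparameter $\theta$ . These modelling choices, the function names and the dense linear algebra are illustrative assumptions; this is not the R-INLA implementation.

```python
import numpy as np
from math import lgamma

def find_mode(y, Q, max_iter=100, tol=1e-10):
    """Newton-Raphson for the mode x*(theta) of pi(x | theta, y)
    under the assumed Poisson likelihood with log link."""
    x = np.log(y + 0.5)                       # starting point near the data
    for _ in range(max_iter):
        grad = y - np.exp(x) - Q @ x          # gradient of the log full conditional
        H = Q + np.diag(np.exp(x))            # negative Hessian
        step = np.linalg.solve(H, grad)
        x = x + step
        if np.max(np.abs(step)) < tol:
            break
    return x

def log_post_theta(theta, y, R, a=2.0, b=1.0):
    """Unnormalised log of the Laplace approximation to pi(theta | y)."""
    Q = theta * R
    x = find_mode(y, Q)
    H = Q + np.diag(np.exp(x))                # precision of the Gaussian approximation
    log_lik = np.sum(y * x - np.exp(x)) - sum(lgamma(v + 1.0) for v in y)
    # log pi(x | theta); its (2*pi)^(-n/2) factor cancels against the denominator.
    log_prior_x = 0.5 * np.linalg.slogdet(Q)[1] - 0.5 * x @ Q @ x
    log_prior_theta = (a - 1.0) * np.log(theta) - b * theta   # Gamma(a, b), up to a constant
    # Gaussian approximation evaluated at its own mode: only the determinant remains.
    log_gauss_at_mode = 0.5 * np.linalg.slogdet(H)[1]
    return log_lik + log_prior_x + log_prior_theta - log_gauss_at_mode

# Hypothetical data and a tridiagonal (sparse-pattern) structure matrix.
y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])
R = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)

theta_grid = np.linspace(0.2, 4.0, 20)
log_post = np.array([log_post_theta(t, y, R) for t in theta_grid])
weights = np.exp(log_post - log_post.max())
weights = weights / weights.sum()             # normalised over the grid
```

Each grid point requires one mode search and two log-determinants. In this sketch they are computed densely, whereas for a large latent field the sparsity of ${\boldsymbol {Q_{\theta }}}$ is what keeps these operations affordable.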

The key to the Laplace approximation above is that the Gaussian approximation is applied to the full conditional of ${\boldsymbol {x}}$ in the denominator, which is usually close to a Gaussian due to the GMRF property of ${\boldsymbol {x}}$ . Applying the approximation there improves the accuracy of the method, since the posterior ${\pi }({\boldsymbol {\theta }}|{\boldsymbol {y}})$ itself need not be close to a Gaussian, and so the Gaussian approximation is not applied directly to ${\pi }({\boldsymbol {\theta }}|{\boldsymbol {y}})$ . The second important property of a GMRF, the sparsity of the precision matrix ${\boldsymbol {Q}}_{{\boldsymbol {\theta }}_{k}}$ , is required for efficient computation of ${\widetilde {\pi }}({\boldsymbol {\theta }}_{k}|{\boldsymbol {y}})$ for each value ${{\boldsymbol {\theta }}_{k}}$ .^[8]

Obtaining the approximate distribution ${\widetilde {\pi }}\left(x_{i}|{\boldsymbol {\theta }}_{k},{\boldsymbol {y}}\right)$ is more involved, and the INLA method provides three options for this: the Gaussian approximation, the Laplace approximation, or the simplified Laplace approximation.^[8] For the numerical integration to obtain ${\widetilde {\pi }}(x_{i}|{\boldsymbol {y}})$ , three options are also available: grid search, central composite design, or empirical Bayes.^[8]
