Calculus by computer

As we saw in the previous section, almost any time that we want to maximise a likelihood, we need to differentiate it. Very often, merely evaluating a likelihood involves integration (to obtain the marginal distribution of the observable data). Unsurprisingly, there are times when it is useful to perform these tasks by computer. Broadly speaking, numerical integration is computationally costly but stable, while numerical differentiation is cheap but potentially unstable. We will also consider the increasingly important topic of ‘automatic differentiation’.

4.1 Cancellation error

Consider the following R code

a <- 1e16
b <- 1e16 + pi
d <- b - a
so d should be $\pi$, right? Actually
  d; pi
## [1] 4
## [1] 3.141593
i.e. d is out by 27% as the following shows…
  (d-pi)/pi  ## relative error in difference is huge...
## [1] 0.2732395
This is an example of cancellation error. Almost any time we take the difference of two floating point numbers of similar magnitude (and sign) we find that the accuracy of the result is substantially reduced. The problem is rounding error. You can see what happens by considering computation to roughly 16–17 significant decimal digits. The stored value of a is exactly 10000000000000000, while the closest value of that precision to $b = 10000000000000003.141593\ldots$ is 10000000000000003,

hence $d = b - a = 3$. (d was 4, above, because the computer actually uses a binary, rather than a decimal, representation.)

As we will see, derivative approximation is an area where we cannot possibly ignore cancellation error, but it is actually a pervasive issue. For example, cancellation error is the reason why $\sum_i x_i^2 - n\bar{x}^2$ is not a good basis for a 'one-pass' method for calculating $\sum_i (x_i - \bar{x})^2$. The updating formula $\sum_i (x_i - \bar{x}_{i-1})(x_i - \bar{x}_i)$, where $\bar{x}_i$ denotes the mean of the first $i$ data points, is a simple one-pass alternative, and generally much better. Cancellation error is also a substantial issue in numerical linear algebra. For example, orthogonal matrix decompositions (used for QR, eigen and singular value decompositions) usually involve an apparently arbitrary choice at each successive rotation involved in the decomposition, about whether some element of the result of rotation should be positive or negative: one of the choices almost always involves substantially lower cancellation error than the other, and is the one taken.
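To make the one-pass problem concrete, here is a small R illustration (my own, not from the original notes): with a mean that is large relative to the spread, the 'naive' one-pass formula loses almost all accuracy to cancellation, while the two-pass version is fine.

set.seed(1)
x <- rnorm(1000) + 1e8                    ## large mean, variance roughly 1
naive <- sum(x^2) - length(x)*mean(x)^2   ## one-pass formula, cancellation-prone
two.pass <- sum((x - mean(x))^2)          ## stable two-pass version
c(naive, two.pass)                        ## the naive result is typically far off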

Since approximation of derivatives inevitably involves subtracting similar numbers, it is self-evident that cancellation error will cause some difficulty. The basic reason why numerical integration is a more stable process (at least for statisticians, who usually integrate only positive functions) is that it is fundamentally about adding things up rather than differencing them, and addition is well behaved here. For example:

f <- a + b
(f-2e16-pi)/(2e16+pi) ## relative error in sum, is tiny...
## [1] 4.292037e-17

4.2 Approximate derivatives by finite differencing

Consider differentiating a sufficiently smooth function $f(\mathbf{x})$ with respect to the elements of its vector argument $\mathbf{x}$. $f$ might be something simple, or something complicated such as the mean global temperature predicted by an atmospheric GCM, given an atmospheric composition, forcing conditions etc.

A natural way to approximate the derivatives is to use

$$\frac{\partial f}{\partial x_i} \simeq \frac{f(\mathbf{x} + \Delta \mathbf{e}_i) - f(\mathbf{x})}{\Delta},$$

where $\Delta$ is a small constant and $\mathbf{e}_i$ is a vector of the same dimension as $\mathbf{x}$, with zeroes for each element except the $i^{\rm th}$, which is 1. In a slightly sloppy notation, Taylor's theorem tells us that

$$f(\mathbf{x} + \Delta \mathbf{e}_i) = f(\mathbf{x}) + \nabla f(\mathbf{x})^{\sf T} \mathbf{e}_i \Delta + \frac{\Delta^2}{2} \mathbf{e}_i^{\sf T} \nabla^2 f \, \mathbf{e}_i.$$

Re-arranging, while noting that $\nabla f(\mathbf{x})^{\sf T}\mathbf{e}_i = \partial f/\partial x_i$, we have

$$\frac{f(\mathbf{x} + \Delta \mathbf{e}_i) - f(\mathbf{x})}{\Delta} - \frac{\partial f}{\partial x_i} = \frac{\Delta}{2} \mathbf{e}_i^{\sf T} \nabla^2 f \, \mathbf{e}_i.$$

Now suppose that $L$ is an upper bound on the magnitude of the second derivatives of $f$; then

$$\left| \frac{f(\mathbf{x} + \Delta \mathbf{e}_i) - f(\mathbf{x})}{\Delta} - \frac{\partial f}{\partial x_i} \right| \le \frac{L\Delta}{2}.$$

That is to say, we have an upper bound on the finite difference truncation¹ error, assuming that the FD approximation is calculated exactly, without round-off error.

In reality we have to work in finite precision arithmetic and can only calculate $f$ to some accuracy. Suppose that we can calculate $f$ to one part in $\epsilon^{-1}$ (the best we could hope for here is that $\epsilon$ is the machine precision). Now let $L_f$ be an upper bound on the magnitude of $f$, and denote the computed value of $f$ by ${\rm comp}(f)$. We have

$$|{\rm comp}\{f(\mathbf{x})\} - f(\mathbf{x})| \le \epsilon L_f$$

and

$$|{\rm comp}\{f(\mathbf{x} + \Delta\mathbf{e}_i)\} - f(\mathbf{x} + \Delta\mathbf{e}_i)| \le \epsilon L_f;$$

combining the preceding two inequalities we get

$$\left| \frac{{\rm comp}\{f(\mathbf{x} + \Delta\mathbf{e}_i)\} - {\rm comp}\{f(\mathbf{x})\}}{\Delta} - \frac{f(\mathbf{x} + \Delta\mathbf{e}_i) - f(\mathbf{x})}{\Delta} \right| \le \frac{2\epsilon L_f}{\Delta},$$

an upper bound on the cancellation error that results from differencing two very similar quantities using finite precision arithmetic. (It should now be clear what the problem was in section 1.1 example 2.)

Now clearly the dependence of the two types of error on $\Delta$ is rather different. We want $\Delta$ as small as possible to minimise truncation error, and as large as possible to minimise cancellation error. Given that the total error is bounded as follows,

$${\rm err}_{\rm FD} \le \frac{L\Delta}{2} + \frac{2\epsilon L_f}{\Delta},$$

it makes sense to choose $\Delta$ to minimise the bound. That is, we should choose

$$\Delta = \sqrt{\frac{4\epsilon L_f}{L}}.$$

So if the typical size of $f$ and its second derivatives are similar, then

$$\Delta \simeq \sqrt{\epsilon}$$

will not be too far from optimal. This is why the square root of the machine precision is often used as the finite difference interval. However, note the assumptions: we require that $L_f$ and $L$ are comparable, and that $f$ is calculable to a relative accuracy that is a small multiple of the machine precision, for such a choice to work well. In reality problems are not always well scaled, and a complicated function may have a relative accuracy substantially lower than is suggested by the machine precision. In such cases, see section 8.6 of Gill, Murray and Wright (1981), or section 4.3.3 below.
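As a small illustration (my own, not from the notes), here is forward differencing in R with the interval set to the square root of the machine precision; the test function and evaluation point are arbitrary choices.

fd.grad <- function(f, x, delta = sqrt(.Machine$double.eps)) {
## forward difference approximation to the gradient of f at x
  f0 <- f(x); g <- numeric(length(x))
  for (i in seq_along(x)) {
    x1 <- x; x1[i] <- x1[i] + delta
    g[i] <- (f(x1) - f0)/delta
  }
  g
}
f.test <- function(x) exp(x[1]*x[2]) + sin(x[2])  ## arbitrary test function
x0 <- c(0.5, 1.2)
fd.grad(f.test, x0)
c(x0[2]*exp(x0[1]*x0[2]), x0[1]*exp(x0[1]*x0[2]) + cos(x0[2]))  ## exact gradient, for comparison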

4.2.1 Other FD formulae

The FD approach considered above is forward differencing. Centred differences are more accurate, but more costly:

$$\frac{\partial f}{\partial x_i} \simeq \frac{f(\mathbf{x} + \Delta\mathbf{e}_i) - f(\mathbf{x} - \Delta\mathbf{e}_i)}{2\Delta}.$$

In the well scaled case $\Delta \approx \epsilon^{1/3}$ is about right.

Higher order derivatives can also be useful. For example

$$\frac{\partial^2 f}{\partial x_i \partial x_j} \simeq \frac{f(\mathbf{x} + \Delta\mathbf{e}_i + \Delta\mathbf{e}_j) - f(\mathbf{x} + \Delta\mathbf{e}_i) - f(\mathbf{x} + \Delta\mathbf{e}_j) + f(\mathbf{x})}{\Delta^2},$$

which in the well scaled case will be most accurate for $\Delta \approx \epsilon^{1/4}$.
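For comparison with the forward difference sketch above (again my own illustration, not from the notes), here is a centred difference version with $\Delta$ of order $\epsilon^{1/3}$.

cd.grad <- function(f, x, delta = .Machine$double.eps^(1/3)) {
## centred difference approximation to the gradient of f at x
  g <- numeric(length(x))
  for (i in seq_along(x)) {
    xp <- xm <- x
    xp[i] <- xp[i] + delta; xm[i] <- xm[i] - delta
    g[i] <- (f(xp) - f(xm))/(2*delta)
  }
  g
}
f.test <- function(x) exp(x[1]*x[2]) + sin(x[2])  ## same test function as before
cd.grad(f.test, c(0.5, 1.2))  ## typically closer to the exact gradient than fd.grad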

4.3 Automatic Differentiation: code that differentiates itself

Differentiation of complex models or other expressions is tedious, but ultimately routine. One simply applies the chain rule and some known derivatives, repeatedly. The main problem with doing this is the human error rate and the limits on human patience. Given that differentiation is just the repeated application of the chain rule and known derivatives, couldn’t it be automated? The answer is yes. There are two approaches. The first is symbolic differentiation using a computer algebra package such as Maple or Mathematica. This works, but can produce enormously unwieldy expressions when applied to a complex function. The problem is that the number of terms in a symbolic derivative can be very much larger than for the expression being evaluated. It is also very difficult to apply to a complex simulation model, for example.

The second approach is automatic differentiation. This operates by differentiating a function based directly on the computer code that evaluates the function. There are several different approaches to this, but many use the features of object oriented programming languages to achieve the desired end. The key feature of an object oriented language, from the AD perspective, is that every data structure, or object, in such a language has a class, and the meaning of operators such as +, -, * etc. depends on the class of the objects to which they are applied. Similarly, the action of a function depends on the class of its arguments.

Suppose then, that we would like to differentiate

$$f(x_1, x_2, x_3) = \frac{x_1 x_2 \sin(x_3) + e^{x_1 x_2}}{x_3}$$

w.r.t. its real arguments $x_1$, $x_2$ and $x_3$.² In R the code

(x1*x2*sin(x3) + exp(x1*x2))/x3
would evaluate the function, if x1, x2 and x3 were initialised to be floating point numbers.

Now define a new type of object of class "ad" which has a value (a floating point number) and a "grad" attribute. In the current case this "grad" attribute will be a 3-vector containing the derivatives of the value w.r.t. x1, x2 and x3. We can now define versions of the arithmetic operators and mathematical functions which will return class "ad" results with the correct value and "grad" attribute, whenever they are used in an expression.

Here is an R function to create and initialise a simple class "ad" object

ad <- function(x,diff = c(1,1)) {
## create class "ad" object. diff[1] is length of grad
## diff[2] is element of grad to set to 1.
  grad <- rep(0,diff[1])
  if (diff[2]>0 && diff[2]<=diff[1]) grad[diff[2]] <- 1
  attr(x,"grad") <- grad
  class(x) <- "ad"
  x
}
Here it is in use, initialising x1 to the value 1, giving it a 3 dimensional "grad" attribute, and setting the first element of grad to 1, since $\partial x_1 / \partial x_1 = 1$.
  x1 <- ad(1,c(3,1))
  x1
## [1] 1
## attr(,"grad")
## [1] 1 0 0
## attr(,"class")
## [1] "ad"
Now the interesting part. I can define versions of mathematical functions and operators that are specific to class "ad" objects and correctly propagate derivatives alongside values. Here is a sin function for class "ad".
sin.ad <- function(a) {
  grad.a <- attr(a,"grad")
  a <- as.numeric(a) ## avoid infinite recursion - change class
  d <- sin(a)
  attr(d,"grad") <- cos(a) * grad.a ## chain rule
  class(d) <- "ad"
  d
}

and here is what happens when it is applied to x1

sin(x1)
## [1] 0.841471
## attr(,"grad")
## [1] 0.5403023 0.0000000 0.0000000
## attr(,"class")
## [1] "ad"
i.e. the value of the result is $\sin(1)$, while the first element of its "grad" contains the derivative of $\sin(x_1)$ w.r.t. $x_1$ evaluated at $x_1 = 1$, namely $\cos(1)$.

Operators can also be overloaded in this way. For example, here is a multiplication operator for class "ad":

"*.ad" <- function(a,b) { ## ad multiplication
  grad.a <- attr(a,"grad")
  grad.b <- attr(b,"grad")
  a <- as.numeric(a)
  b <- as.numeric(b)
  d <- a*b    ## evaluation
  attr(d,"grad") <- a * grad.b + b * grad.a ## product rule
  class(d) <- "ad"
  d
}
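To actually run the full example below, class "ad" methods are also needed for exp, + and /. These are not given in the notes, but minimal sketches following exactly the same pattern might look like this (they assume both operands are class "ad" objects, which is all the example requires):

exp.ad <- function(a) { ## exp for class "ad"
  grad.a <- attr(a,"grad")
  a <- as.numeric(a)
  d <- exp(a)
  attr(d,"grad") <- exp(a) * grad.a ## chain rule
  class(d) <- "ad"
  d
}

"+.ad" <- function(a,b) { ## ad addition
  d <- as.numeric(a) + as.numeric(b)
  attr(d,"grad") <- attr(a,"grad") + attr(b,"grad") ## sum rule
  class(d) <- "ad"
  d
}

"/.ad" <- function(a,b) { ## ad division
  grad.a <- attr(a,"grad"); grad.b <- attr(b,"grad")
  a <- as.numeric(a); b <- as.numeric(b)
  d <- a/b
  attr(d,"grad") <- (grad.a*b - a*grad.b)/b^2 ## quotient rule
  class(d) <- "ad"
  d
}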

Continuing in the same way we can provide a complete library of mathematical functions and operators for the "ad" class. Given such a library, we can obtain the derivatives of a function directly from the code that would simply evaluate it, given ordinary floating point arguments. For example, here is some code evaluating a function.

  x1 <- 1; x2 <- 2; x3 <- pi/2
  (x1*x2*sin(x3)+ exp(x1*x2))/x3
## [1] 5.977259
and here is the same code with the arguments replaced by "ad" objects
  x1 <- ad(1, c(3,1))
  x2 <- ad(2, c(3,2))
  x3 <- ad(pi/2, c(3,3))
  (x1*x2*sin(x3) + exp(x1*x2))/x3
## [1] 5.977259
## attr(,"grad")
## [1] 10.681278  5.340639 -3.805241
## attr(,"class")
## [1] "ad"
you can check that these results are correct (actually to machine accuracy).

This simple propagation of derivatives alongside the evaluation of a function is known as forward mode auto-differentiation. R is not the best language in which to try to do this, and if you need AD it is much better to use existing software libraries in C++, for example, which have done all the function and operator re-writing for you.

4.3.1 Reverse mode AD

If you require many derivatives of a scalar valued function then forward mode AD will have a theoretical computational cost similar to finite differencing, since at least as many operations are required for each derivative as are required for function evaluation. In reality the overheads associated with operator overloading make AD more expensive (alternative strategies also carry overheads). Of course the benefit of AD is higher accuracy, and in many applications the cost is not critical.

An alternative with the potential for big computational savings over finite differencing is reverse mode AD. To understand this it helps to think of the evaluation of a function in terms of a computational graph. Again concentrate on the example function given in Nocedal and Wright (2006):

$$f(x_1, x_2, x_3) = \frac{x_1 x_2 \sin(x_3) + e^{x_1 x_2}}{x_3}.$$

Any computer evaluating this must break the computation down into a sequence of elementary operations on one or two floating point numbers. This can be thought of as a computational graph, in which the nodes following the inputs are the intermediate quantities that have to be produced en route from the input values $x_1$, $x_2$ and $x_3$ to the final answer, $f$. The arrows run from parent nodes to child nodes. No child can be evaluated until all its parents have been evaluated.

Simple left to right evaluation of this graph produces the function value: each node is computed from its parents, working from the inputs through to the final node.

Now, forward mode AD carries derivatives forward through the graph, alongside values. e.g. the derivative of a node $x_k$ with respect to input variable $x_j$ is computed using

$$\frac{\partial x_k}{\partial x_j} = \sum_{i \ {\rm parent\ of}\ k} \frac{\partial x_k}{\partial x_i}\frac{\partial x_i}{\partial x_j}$$

(the r.h.s. being evaluated by overloaded functions and operators, in the object oriented approach). This process runs through the graph in exactly the same way as the evaluation itself, propagating, say, the derivative w.r.t. $x_1$ alongside each node value.

Computation again runs left to right, with evaluation of a node only possible once all its parent values are known.

Now, if we require derivatives w.r.t. several input variables, then each node will have to evaluate derivatives w.r.t. each of these variables, and this becomes expensive (each node would have to carry multiple evaluated derivatives). Reverse mode therefore does something ingenious. It first executes a forward sweep through the graph, evaluating the function and all the derivatives of nodes w.r.t. their parents.

The reverse sweep then works backwards through the graph, evaluating the derivative of $f$ w.r.t. each node, using

$$\frac{\partial f}{\partial x_k} = \sum_{j \ {\rm child\ of}\ k} \frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial x_k}$$

and the fact that at the terminal node $\partial f/\partial f = 1$.

Only one new derivative has to be evaluated at each node on the reverse sweep, yet at the end of it we know the derivative of $f$ w.r.t. every input variable. Reverse mode can therefore save a large number of operations relative to finite differencing or forward mode. Once again, general purpose AD libraries automate the process for you, so that all you need to be able to write is the evaluation code.

Unfortunately reverse mode efficiency comes at a heavy price. In forward mode we could discard the values and derivatives associated with a node as soon as all its children were evaluated. In reverse mode the values of all nodes and the evaluated derivatives associated with every connection have to be stored during the forward sweep, in order to be used in the reverse sweep. This is a heavy storage requirement. For example, if $f$ involved the inversion of a $1000 \times 1000$ matrix then we would have to store some $2 \times 10^9$ intermediate node values plus a similar number of evaluated derivatives. That is some 32 Gigabytes of storage before we even consider the requirements for storing the structure of the graph.

4.3.2 A caveat

For AD to work it is not sufficient that the function being evaluated has properly defined derivatives at the evaluated function value. We require that every function/operator used in the evaluation has properly defined derivatives at its evaluated argument(s). This can create a problem with code that executes conditional on the value of some variable. For example, the Box-Cox transformation of a positive datum $y$ is

$$B(y;\lambda) = \begin{cases} (y^\lambda - 1)/\lambda & \lambda \neq 0 \\ \log(y) & \lambda = 0. \end{cases}$$

If you code this up in the obvious way then AD will never get the derivative of $B$ w.r.t. $\lambda$ right if $\lambda = 0$.
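To see why, consider the obvious coding sketched below (my own illustration, not from the notes). When lambda is exactly zero the branch actually executed is log(y), an expression in which lambda plays no part, so AD of that branch reports a zero derivative w.r.t. lambda, whereas the correct limiting derivative is $\log(y)^2/2$.

bc <- function(y, lambda) {
## 'obvious' coding of the Box-Cox transformation
  if (lambda == 0) log(y) else (y^lambda - 1)/lambda
}
## differentiating the lambda == 0 branch w.r.t. lambda gives 0,
## not the correct value log(y)^2/2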

4.3.3 Using AD to improve FD

When fitting complicated or computer intensive models it can be that AD is too expensive to use for routine derivative calculation during optimisation. However, with the gradual emergence of highly optimised libraries, this situation is changing, and in any case it can still provide a useful means for calibrating FD intervals. A ‘typical’ model run can be auto-differentiated, and the finite difference intervals adjusted to achieve the closest match to the AD derivatives. As optimisation progresses one or two further calibrations of the FD intervals can be carried out as necessary.

4.4 Numerical Integration

Integration is at least as important as optimisation in statistics. Whether integrating random effects out of a joint distribution to get a likelihood, or just evaluating a simpler expectation, it occurs everywhere. Unfortunately integration is generally numerically expensive, and much statistical research is devoted to clever ways of avoiding it. When it can’t be avoided then the game is typically to find ways of approximating an integral of a function with the minimum number of function evaluations.

  • Integration w.r.t. 1 or 2 (maybe even up to 4) variables is not too problematic. There are two ‘deterministic’ lines of attack.

    (a) Use differential equation solving methods since, for example, $\int_a^b f(x)\,dx$ is the solution of

      $$\frac{dy}{dx} = f(x), \quad y(a) = 0,$$

      integrated to $x = b$.

    (b) Quadrature rules. These evaluate the function to be integrated at some set of points in the integration domain and combine the resulting values to obtain an approximation to the integral. For example,

      $$\int_a^b f(x)\,dx \approx h \sum_{i=1}^N f\{a + (i - 0.5)h\}, \quad h = \frac{b-a}{N},$$

      is the Midpoint rule for approximate integration. Simpson's Rule is a higher order variant, while Gaussian Quadrature rules aim for substantially higher order accuracy by careful choice of the locations at which to evaluate.

  • Integration with respect to several (or many) variables is usually of more interest in statistics, and it is difficult. Maintenance of acceptable accuracy with quadrature rules or differential equation methods typically requires an infeasible number of function evaluations for high dimensional domains of integration. There are two main approaches.

    (a) Approximate the function to be integrated by one that can be integrated analytically.

    (b) Take a statistical view of the integration process, and construct statistical estimators of the integral. Note that, for a fixed number of function evaluations $N$,

      i. any bias (such as the error in a quadrature rule) scales badly with dimension;

      ii. any estimator variance need not depend on dimension at all;

      iii. as usual, we can completely eliminate either bias or variance, but not both.

      These considerations suggest using unbiased estimators of integrals, based on random samples from the function being integrated. The main challenge is then to design statistical sampling schemes and estimators which minimise the estimator variance.

4.4.1 Quadrature rules

In one dimension we might choose to approximate the integral of a continuous function $f(x)$ by the area under a set of step functions, with the midpoint of each matching $f$:

$$\int_a^b f(x)\,dx = h\sum_{i=1}^N f\{a + (i - 0.5)h\} + O(h^2),$$

where $h$ in the final error term is $(b-a)/N$. Let's see what the error bound means in practice by approximating $\int_0^1 e^x\,dx$.

  I.true <- exp(1) - 1
  N <- 10
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
## [1] 1.7175660865 1.7182818285 0.0004165452
An error of around 0.04% may be good enough for many purposes, but suppose we need higher accuracy…
  N <- 100
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
## [1] 1.718275e+00 1.718282e+00 4.166655e-06
  N <- 1000
  I.mp <- sum(exp((1:N-.5)/N))/N
  c(I.mp,I.true,(I.true-I.mp)/I.true)
## [1] 1.718282e+00 1.718282e+00 4.166667e-08
Exactly as the error bound indicated, each 10-fold increase in the number of function evaluations buys us a 100-fold reduction in the approximation error. Incidentally, notice that in statistical terms this error is not a random error; it is really a bias.

If $f$ is cheap to evaluate, accuracy is not paramount and we are only interested in one dimensional integrals, then we might stop there (or more likely with the slightly more accurate but equally straightforward 'Simpson's Rule'), but life is not so simple. If one is very careful about the choice of points at which to evaluate $f$ and about the weights used in the integral-approximating summation, then some very impressive results can be obtained, at least for smooth enough functions, using Gauss(ian) Quadrature. The basic idea is to find a set $\{x_i, w_i: i = 1, \ldots, N\}$ such that if $w(x)$ is a fixed weight function and $g(x)$ is any polynomial up to order $2N - 1$ then

$$\int_a^b g(x) w(x)\,dx = \sum_{i=1}^N w_i g(x_i).$$

Note that the set $\{x_i, w_i\}$ depends on $w(x)$, $a$ and $b$, but not on $g(x)$. The hope is that the r.h.s. will still approximate the l.h.s. very accurately if $g(x)$ is not a polynomial of the specified type, and there is some impressive error analysis to back this up, for any function with well behaved derivatives (see e.g. Monahan, 2001, section 10.3).

The R package statmod has a nice gauss.quad function for producing the weights, $w_i$, and nodes, $x_i$, for some standard $w(x)$s. It assumes $a = -1$, $b = 1$, but we can just linearly rescale our actual interval to use this. Here is the previous example using Gauss Quadrature (assuming $w(x)$ is a constant).

library(statmod)
N <- 10
gq <- gauss.quad(N) # get the x_i, w_i
gq # take a look
## $nodes
##  [1] -0.9739065 -0.8650634 -0.6794096 -0.4333954 -0.1488743  0.1488743
##  [7]  0.4333954  0.6794096  0.8650634  0.9739065
## 
## $weights
##  [1] 0.06667134 0.14945135 0.21908636 0.26926672 0.29552422 0.29552422
##  [7] 0.26926672 0.21908636 0.14945135 0.06667134
sum(gq$weights) # for an interval of width 2
## [1] 2
I.gq <- sum(gq$weights*exp((gq$nodes+1)/2))/2
c(I.gq, I.true, (I.true-I.gq)/I.true)
## [1]  1.718282e+00  1.718282e+00 -1.292248e-15
For convenience, we can pack the rescaling logic into a simple function:
gqi <- function(g, N = 10, a = -1, b = 1) {
    gq = gauss.quad(N)
    w = gq$weights*(b-a)/2
    n = a + (b-a)*(gq$nodes + 1)/2
    sum(w * g(n))
    }

gqi(exp, 10, 0, 1)
## [1] 1.718282
Clearly 10 evaluation points was a bit wasteful here, so consider
I.gq <- gqi(exp, 5, 0, 1)
c(I.gq, I.true, (I.true-I.gq)/I.true)
## [1] 1.718282e+00 1.718282e+00 3.805670e-13
even 5 seems slightly wasteful:
I.gq <- gqi(exp, 3, 0, 1)
c(I.gq, I.true, (I.true-I.gq)/I.true)
## [1] 1.718281e+00 1.718282e+00 4.795992e-07
… still more accurate than the midpoint rule with 100 evaluations. This example should be treated with caution, however. $e^x$ is a very smooth function, and such impressive results will not hold for less smooth cases.

4.4.2 Quadrature rules for multidimensional integrals

We can integrate w.r.t. several variables by recursive application of quadrature rules. For example, consider integrating $f(x, z)$ over a square region, and assume we use basically the same node and weight sequence for both dimensions. We have

$$\int f(x,z)\,dx\,dz \simeq \int \sum_i w_i f(x_i, z)\,dz$$

and so

$$\int f(x,z)\,dx\,dz \simeq \sum_j \sum_i w_j w_i f(x_i, z_j).$$

Clearly, in principle we can repeat this for as many dimensions as we like. In practice, however, to maintain a given level of accuracy the number of function evaluations will have to rise as $N^d$, where $N$ is the number of nodes for each dimension and $d$ is the number of dimensions. Even if we can get away with a 3 point Gaussian quadrature, we will need around 1.5 million evaluations by the time we reach a 13 dimensional integral.

Here is a function which takes the node sequence x and weight sequence w of some quadrature rule and uses them to produce the d dimensional set of nodes, and the corresponding weight vector resulting from recursive application of the 1D rule.

mesh <- function(x,d,w=1/length(x)+x*0) {
  n <- length(x)
  W <- X <- matrix(0,n^d,d)
  for (i in 1:d) {
   X[,i] <- x;W[,i] <- w
   x <- rep(x,rep(n,length(x)))
   w <- rep(w,rep(n,length(w)))
  }
 w <- exp(rowSums(log(W))) ## column product of W gives weights
 list(X=X,w=w) ## each row of X gives co-ordinates of a node
}
Here’s how it works.
mesh(c(1,2),3)
## $X
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    1    1
## [3,]    1    2    1
## [4,]    2    2    1
## [5,]    1    1    2
## [6,]    2    1    2
## [7,]    1    2    2
## [8,]    2    2    2
## 
## $w
## [1] 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
Let's look at it in action on $f(\mathbf{x}) = \exp\left(\sum_i x_i\right)$, integrated over $[-1, 1]^d$.
fd <- function(x) {exp(rowSums(x))}
First try the midpoint rule over a two dimensional domain
  N <- 10; d <- 2; N^d # number of function evaluations
## [1] 100
  I.true <- (exp(1)-exp(-1))^d
  mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
  I.mp <- sum(mmp$w*fd(mmp$X))
  c(I.mp,I.true,(I.true-I.mp)/I.true)
## [1] 5.506013515 5.524391382 0.003326677
Now over a 5D domain
  N <- 10; d <- 5; N^d # number of function evaluations
## [1] 1e+05
  I.true <- (exp(1)-exp(-1))^d
  mmp <- mesh((1:N-.5)/N*2-1,d,rep(2/N,N))
  I.mp <- sum(mmp$w*fd(mmp$X))
  c(I.mp,I.true,(I.true-I.mp)/I.true)
## [1] 71.136612879 71.731695754  0.008295954
…which illustrates the enormous rate of increase in function evaluations required to achieve a rather disappointing accuracy.

Of course Gauss Quadrature is a better bet

  N <- 4; d <- 2; N^d
## [1] 16
  I.true <- (exp(1)-exp(-1))^d
  gq <- gauss.quad(N)
  mgq <- mesh(gq$nodes,d,gq$weights)
  I.gq <- sum(mgq$w*fd(mgq$X))
  c(I.gq,I.true,(I.true-I.gq)/I.true)
## [1] 5.524390e+00 5.524391e+00 2.511325e-07
with the increase in dimension being slightly less painful …
  N <- 4; d <- 5; N^d
## [1] 1024
  I.true <- (exp(1)-exp(-1))^d
  gq <- gauss.quad(N)
  mgq <- mesh(gq$nodes,d,gq$weights)
  I.gq <- sum(mgq$w*fd(mgq$X))
  c(I.gq,I.true,(I.true-I.gq)/I.true)
## [1] 7.173165e+01 7.173170e+01 6.278311e-07
…even so, by the time we reach 20 dimensions we would be requiring $4^{20} \approx 10^{12}$ function evaluations, and most functions we might want to integrate will not be so smooth and well behaved that 4 quadrature points give us anything like the sort of accuracy seen in this example.

4.4.3 Approximating the integrand

Beyond a few dimensions, quadrature rules really run out of steam. In statistical work, where we are often interested in integrating probability density functions, or related quantities, it is sometimes possible to approximate the integrand by a function with a known integral. Here, just one approach will be illustrated: Laplace approximation. The typical application of the method is when we have a joint density $f(\mathbf{y}, \mathbf{b})$ and want to evaluate the marginal p.d.f. of $\mathbf{y}$,

$$f_y(\mathbf{y}) = \int f(\mathbf{y}, \mathbf{b})\,d\mathbf{b}.$$

For a given $\mathbf{y}$ let $\hat{\mathbf{b}}_y$ be the value of $\mathbf{b}$ maximising $f(\mathbf{y},\mathbf{b})$. Then Taylor's theorem implies that

$$\log f(\mathbf{y},\mathbf{b}) \simeq \log f(\mathbf{y},\hat{\mathbf{b}}_y) - \frac{1}{2}(\mathbf{b}-\hat{\mathbf{b}}_y)^{\sf T}\mathbf{H}(\mathbf{b}-\hat{\mathbf{b}}_y),$$

where $\mathbf{H}$ is the Hessian of $-\log f(\mathbf{y},\mathbf{b})$ w.r.t. $\mathbf{b}$, evaluated at $\hat{\mathbf{b}}_y$. i.e.

$$f(\mathbf{y},\mathbf{b}) \simeq f(\mathbf{y},\hat{\mathbf{b}}_y)\exp\left\{-\frac{1}{2}(\mathbf{b}-\hat{\mathbf{b}}_y)^{\sf T}\mathbf{H}(\mathbf{b}-\hat{\mathbf{b}}_y)\right\}$$

and so

$$f_y(\mathbf{y}) \simeq f(\mathbf{y},\hat{\mathbf{b}}_y)\int \exp\left\{-\frac{1}{2}(\mathbf{b}-\hat{\mathbf{b}}_y)^{\sf T}\mathbf{H}(\mathbf{b}-\hat{\mathbf{b}}_y)\right\}d\mathbf{b}.$$

From the properties of normal p.d.f.s we have

$$\int \frac{|\mathbf{H}|^{1/2}}{(2\pi)^{d/2}}\exp\left\{-\frac{1}{2}(\mathbf{b}-\hat{\mathbf{b}}_y)^{\sf T}\mathbf{H}(\mathbf{b}-\hat{\mathbf{b}}_y)\right\}d\mathbf{b} = 1,$$

so

$$f_y(\mathbf{y}) \simeq f(\mathbf{y},\hat{\mathbf{b}}_y)\frac{(2\pi)^{d/2}}{|\mathbf{H}|^{1/2}}, \quad d = \dim(\mathbf{b}).$$
Notice that the basic ingredients of this approximation are the Hessian of the log integrand w.r.t. $\mathbf{b}$, and the $\hat{\mathbf{b}}_y$ values. Calculating the latter is an optimisation problem, which can usually be solved by a Newton type method, given that we can presumably obtain gradients as well as the Hessian. The highest cost here is the Hessian calculation, which scales with the square of the dimension, at worst, or, for cheap integrands, the determinant evaluation, which scales as the cube of the dimension. i.e. when it works, this approach is much cheaper than brute force quadrature.

4.4.4 Monte-Carlo integration

Consider integrating $f(\mathbf{x})$ over some region $\Omega$ of volume $V$. Clearly

$$I = \int_\Omega f(\mathbf{x})\,d\mathbf{x} = V\,\mathbb{E}\{f(\mathbf{X})\},$$

where $\mathbf{X}$ is uniform over $\Omega$. Generating $N$ uniform random vectors, $\mathbf{x}_i$, over $\Omega$, we have the obvious estimator

$$\hat{I} = \frac{V}{N}\sum_{i=1}^N f(\mathbf{x}_i),$$

which defines basic Monte-Carlo integration.³ Clearly $\hat I$ is unbiased, and if the $\mathbf{x}_i$ are independent then

$$\mathrm{var}(\hat I) = \frac{V^2}{N}\mathrm{var}\{f(\mathbf{X})\},$$

i.e. the variance scales as $N^{-1}$, and the standard deviation scales as $N^{-1/2}$ (it's $O(N^{-1/2})$). This is not brilliantly fast convergence to $I$, but compare it to quadrature. If we take the midpoint rule then the (bias) error is $O(h^2)$, where $h$ is the node spacing. In $d$ dimensions then at best we can arrange $h \propto N^{-1/d}$, implying that the bias is $O(N^{-2/d})$. i.e. once we get above $d = 4$ we would prefer the stochastic convergence rate of Monte-Carlo integration to the deterministic convergence rate of the midpoint rule.

Of course using Simpson's rule would buy us a few more dimensions, and for smooth functions the spectacular convergence rates of Gauss quadrature may keep it as the best option up to a rather higher $d$, but the fact remains that its bias is still dependent on $d$, while the sampling error of the Monte Carlo method is dimension independent, so Monte Carlo will win in the end.
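A minimal R illustration (my own, not from the notes), estimating the now-familiar $\int_0^1 e^x\,dx$ (so $V = 1$), together with its Monte-Carlo standard error:

set.seed(1)
N <- 10000
x <- runif(N)             ## uniform draws over [0,1]
fx <- exp(x)
I.mc <- mean(fx)          ## V * mean of f at the sampled points, V = 1
se.mc <- sd(fx)/sqrt(N)   ## estimated standard error
c(I.mc, exp(1) - 1, se.mc)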

4.4.5 Stratified Monte-Carlo Integration

One tweak on basic Monte-Carlo is to divide (a region containing) $\Omega$ into $N$ equal volumed sub-regions, $\Omega_i$, and to estimate the integral within each $\Omega_i$ from a single $\mathbf{x}_i$ drawn from a uniform distribution over $\Omega_i$. The estimator has the same form as before, but its variance is now

$$\mathrm{var}(\tilde I) = \frac{V^2}{N^2}\sum_{i=1}^N \mathrm{var}\{f(\mathbf{x}_i)\},$$

which is usually lower than previously, because the range of variation of $f$ within each $\Omega_i$ is lower than within the whole of $\Omega$.
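Continuing the $\int_0^1 e^x\,dx$ illustration (again my own sketch, not from the notes), stratifying $[0,1]$ into $N$ equal sub-intervals with one uniform draw in each typically reduces the variability of the estimate considerably:

set.seed(1)
N <- 10000
x.strat <- ((1:N) - runif(N))/N   ## one uniform draw in each of N strata of [0,1]
I.strat <- mean(exp(x.strat))
c(I.strat, exp(1) - 1)            ## typically far closer than the unstratified estimate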

The underlying idea of trying to obtain the nice properties of uniform random variables, but with more even spacing over the domain of integration, is pushed to the limit by Quasi-Monte-Carlo integration, which uses deterministic space filling sequences of numbers as the basis for integration. For fixed $N$, the bias of such schemes grows much less quickly with dimension than that of the quadrature rules.

4.4.6 Importance sampling

In statistics, an obvious variant on Monte-Carlo integration occurs when the integrand factors into the product of a p.d.f. $f(\mathbf{x})$ and another term, in which case

$$\int \phi(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x} \simeq \frac{1}{N}\sum_{i=1}^N \phi(\mathbf{x}_i),$$

where the $\mathbf{x}_i$ are observations on r.v.s with p.d.f. $f(\mathbf{x})$. A difficulty with this approach, as with simple MC integration, is that a high proportion of the $\mathbf{x}_i$s often falls in places where $\phi(\mathbf{x})$ makes no, or negligible, contribution to the integral. For example, in Bayesian modelling, it is quite possible for $\phi(\mathbf{x})$ to be non-negligible over only a tiny volume of the parameter space over which $f(\mathbf{x})$ has non-vanishing support.

One approach to this problem is importance sampling. Rather than generate the $\mathbf{x}_i$ from $f(\mathbf{x})$, we generate from a distribution with the same support as $f$, but with a p.d.f., $g(\mathbf{x})$, which is designed to generate points concentrated in the places where the integrand is non-negligible. Then the estimator is

$$\hat I_{is} = \frac{1}{N}\sum_{i=1}^N \phi(\mathbf{x}_i)\frac{f(\mathbf{x}_i)}{g(\mathbf{x}_i)}.$$

The expected value of the r.h.s. of the above is its l.h.s., so this estimator is unbiased.

What would make a good $g$? Suppose that $g \propto \phi f$, i.e. $g(\mathbf{x}) = \phi(\mathbf{x})f(\mathbf{x})/k$ where $k = \int \phi(\mathbf{x})f(\mathbf{x})\,d\mathbf{x}$. Then every term in the sum equals $k$, so $\hat I_{is} = k$ and the estimator has zero variance. Clearly this is impractical: if we knew $k$ then we would not need to estimate it. But there are useful insights here anyway. Firstly, the more nearly constant $\phi f/g$ is, the lower the variance of $\hat I_{is}$. Secondly, the idea that the density of evaluation points should reflect the contribution of the integrand in the local neighbourhood is quite generally applicable.
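As a small illustration (my own, not from the notes), consider estimating $\int \phi(x) f(x)\,dx$ with $\phi(x) = \mathbb{1}(x > 4)$ and $f$ the standard normal density, i.e. $P(X > 4)$ for $X \sim N(0,1)$. Almost no $N(0,1)$ draws land where $\phi$ is non-zero, so a proposal $g$ centred where the integrand lives does far better:

set.seed(1)
N <- 10000
x.f <- rnorm(N)
mean(x.f > 4)                         ## naive estimator: almost always exactly 0
x.g <- rnorm(N, mean = 4)             ## draws from g = N(4,1)
w <- dnorm(x.g)/dnorm(x.g, mean = 4)  ## importance weights f/g
I.is <- mean((x.g > 4)*w)             ## importance sampling estimate
c(I.is, pnorm(4, lower.tail = FALSE)) ## compare with the exact value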

4.4.7 Laplace importance sampling

We are left with the question of how to choose $g$. When the integrand is some sort of probability density function, a reasonable approach is to approximate it by the normal density implied by the Laplace approximation. That is, we take $g$ to be the p.d.f. of the multivariate normal $N(\hat{\mathbf{b}}_y, \mathbf{H}^{-1})$, where $\hat{\mathbf{b}}_y$ denotes the value of $\mathbf{b}$ maximising $f(\mathbf{y},\mathbf{b})$, while $\mathbf{H}$ is the Hessian of $-\log f(\mathbf{y},\mathbf{b})$ w.r.t. $\mathbf{b}$, evaluated at $\hat{\mathbf{b}}_y$.

In fact, as we have seen previously, multivariate normal random vectors are generated by transformation of standard normal random vectors $\mathbf{z} \sim N(\mathbf{0},\mathbf{I})$, i.e. $\mathbf{b} = \hat{\mathbf{b}}_y + \mathbf{R}^{-1}\mathbf{z}$, where $\mathbf{R}$ is a Cholesky factor of $\mathbf{H}$, so that $\mathbf{R}^{\sf T}\mathbf{R} = \mathbf{H}$. There is then some computational saving in using a change of variable argument to re-express the importance sampling estimator in terms of the density of $\mathbf{z}$, rather than $g$ itself. Start with

$$g(\mathbf{b}) = \frac{|\mathbf{H}|^{1/2}}{(2\pi)^{d/2}}\exp\left\{-\frac{1}{2}(\mathbf{b}-\hat{\mathbf{b}}_y)^{\sf T}\mathbf{H}(\mathbf{b}-\hat{\mathbf{b}}_y)\right\} = |\mathbf{R}|\,\phi\{\mathbf{R}(\mathbf{b}-\hat{\mathbf{b}}_y)\},$$

where $\phi(\mathbf{z})$ is the standard normal density for $\mathbf{z}$, giving

$$g(\hat{\mathbf{b}}_y + \mathbf{R}^{-1}\mathbf{z}_i) = |\mathbf{R}|\,\phi(\mathbf{z}_i) = \frac{|\mathbf{R}|}{(2\pi)^{d/2}}\exp\left(-\mathbf{z}_i^{\sf T}\mathbf{z}_i/2\right),$$

and leading in turn to the Laplace importance sampling estimator

$$\hat f_y(\mathbf{y}) = \frac{(2\pi)^{d/2}}{N|\mathbf{R}|}\sum_{i=1}^N f(\mathbf{y}, \hat{\mathbf{b}}_y + \mathbf{R}^{-1}\mathbf{z}_i)\exp\left(\mathbf{z}_i^{\sf T}\mathbf{z}_i/2\right).$$
Some tweaks may be required.

  1. It might be desirable to make $g$ a little more diffuse than the Laplace approximation suggests, to avoid excessive importance weights being assigned to values in the tails of the target distribution.

  2. If the integration is in order to obtain a likelihood, which is to be maximised, then it is usually best to condition on a fixed set of $\mathbf{z}_i$ values, in order to avoid random variability in the approximate likelihood, which will make optimisation difficult.

  3. Usually the log of the integrand is evaluated rather than the integrand itself, and it is the log of the integral that is required. To avoid numerical under/overflow problems note that

     $$\log\left\{\sum_i \exp(l_i)\right\} = \log\left[\sum_i \exp\{l_i - \max_j(l_j)\}\right] + \max_j(l_j);$$

     the r.h.s. will not overflow or underflow when computed in finite precision arithmetic (for reasonable $N$).

  4. As with simple Monte-Carlo integration, some variance reduction is achievable by stratification. The $\mathbf{z}_i$ can be obtained by suitable transformation of random vectors from a uniform distribution on $(0,1)^d$. Instead, $(0,1)^d$ can be partitioned into equal volume parts, with one uniform vector generated in each…

Tweak 3. is an example of a general technique for avoiding numerical underflow and overflow which has many applications in statistical computing. It is therefore worth examining more carefully.

4.4.8 The log–sum–exp trick

In statistical computing, we rarely want to work with raw probabilities and likelihoods, but instead work with their logarithms. Log-likelihoods are much less likely to numerically underflow or overflow than raw likelihoods. For likelihood computations involving iid observations, this is often convenient, since a product of raw likelihood terms becomes a sum of log-likelihood terms. However, sometimes we have to evaluate the sum of raw probabilities or likelihoods. The obvious way to do this would be to exponentiate the logarithms, evaluate the sum, and then take the log of the result to get back to our log scale. However, the whole point of working on the log scale is to avoid numerical underflow and overflow, so we know that exponentiating raw likelihoods is a dangerous operation which is likely to fail in many problems. Can we do the log–sum–exp operation safely? Yes, we can.

Suppose that we have log (unnormalised) weights (typically log probabilities or log likelihoods), $l_i = \log w_i$, but we want to compute the (log of the) sum of the raw weights. We can obviously exponentiate the log weights and sum them, but this carries the risk that all of the log weights will underflow (or, possibly, overflow). So, we want to compute

$$L = \log \sum_i \exp(l_i)$$

without the risk of underflow. We can do this by first computing $m = \max_i(l_i)$ and then noting that

$$L = \log\left\{\sum_i \exp(l_i - m)\exp(m)\right\} = m + \log\left\{\sum_i \exp(l_i - m)\right\}.$$

We are now safe, since $l_i - m$ cannot be bigger than zero, and hence no term can overflow. Also, since we know that $l_i - m = 0$ for some $i$, we know that at least one term will not underflow to zero. So we are guaranteed to get a sensible finite answer in a numerically stable way. Note that exactly the same technique works for computing a sample mean (or other weighted sum) as opposed to just a simple sum (as above).
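A minimal R implementation of the trick just described (my own sketch):

log_sum_exp <- function(l) {
## stably compute log(sum(exp(l))) for a vector of log weights l
  m <- max(l)
  m + log(sum(exp(l - m)))
}
l <- c(-1000, -1001, -1002)          ## exp(l) all underflow to zero
c(log(sum(exp(l))), log_sum_exp(l))  ## naive version gives -Inf; stable version about -999.59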

4.4.9 An example

Consider a Generalised Linear Mixed Model:

$$y_i|\mathbf{b} \sim \mathrm{Poi}(\mu_i), \quad \log(\mu_i) = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{Z}_i\mathbf{b}, \quad \mathbf{b} \sim N(\mathbf{0}, \boldsymbol{\Lambda}^{-1}),$$

where $\boldsymbol{\Lambda}$ is diagonal and depends on a small number of parameters. Here, $\boldsymbol{\beta}$ are the fixed effects and $\mathbf{b}$ are the random effects, with covariate matrices $\mathbf{X}$ and $\mathbf{Z}$, respectively. From the model specification it is straightforward to write down the joint probability function of $\mathbf{y}$ and $\mathbf{b}$, based on the factorisation $f(\mathbf{y},\mathbf{b}) = f(\mathbf{y}|\mathbf{b})f(\mathbf{b})$. i.e.

$$\log f(\mathbf{y},\mathbf{b}) = \sum_i \left\{y_i\log\mu_i - \mu_i - \log(y_i!)\right\} - \frac{1}{2}\sum_j \lambda_j b_j^2 + \frac{1}{2}\sum_j \log\lambda_j - \frac{\dim(\mathbf{b})}{2}\log(2\pi),$$

where $\lambda_j$ is the $j^{\rm th}$ diagonal element of $\boldsymbol{\Lambda}$, but likelihood based inference actually requires that we can evaluate the marginal p.f. $f_y(\mathbf{y})$ for any values of the model parameters. Now

$$f_y(\mathbf{y}) = \int f(\mathbf{y},\mathbf{b})\,d\mathbf{b},$$

and evaluation of this requires numerical integration.

To tie things down further, suppose that the fixed effects part of the model is simply $\beta_0 + \beta_1 x$, where $x$ is a continuous covariate, while the random effects part is given by two factor variables, $z_1$ and $z_2$, each with two levels. So a typical row of $\mathbf{Z}$ is a vector of 4 zeroes and ones, in some order. Suppose further that the first two elements of $\mathbf{b}$ relate to $z_1$ and the remainder to $z_2$, and that the first two diagonal elements of $\boldsymbol{\Lambda}$ share one parameter value while the last two share another. The following simulates 50 observations from such a model.

set.seed(0);n <- 50
x <- runif(n)
z1 <- sample(c(0,1),n,replace=TRUE)
z2 <- sample(c(0,1),n,replace=TRUE)
b <- rnorm(4) ## simulated random effects
X <- model.matrix(~x)
Z <- cbind(z1,1-z1,z2,1-z2)
eta <- X%*%c(0,3) + Z%*%b
mu <- exp(eta)
y <- rpois(n, mu)
Here is a function evaluating $\log f(\mathbf{y},\mathbf{b})$ (the log is used here to avoid undesirable underflow to zero)
lf.yb0 <- function(y, b, theta, X, Z, lambda) {
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  eta <- X%*%beta + Z%*%b
  mu <- exp(eta)
  lam <- lambda(theta, ncol(Z))
  sum(y*eta - mu - lfactorial(y)) - sum(lam*b^2)/2 +
      sum(log(lam))/2 - ncol(Z)*log(2*pi)/2
}

theta contains the fixed effect parameters, i.e. $\boldsymbol{\beta}$, and the parameters of $\boldsymbol{\Lambda}$. So in this example, theta is of length 4, with the first 2 elements corresponding to $\boldsymbol{\beta}$ and the last two to the distinct values on the diagonal of $\boldsymbol{\Lambda}$. lambda is the name of a function for turning the parameters of $\boldsymbol{\Lambda}$ into its leading diagonal elements, for example

var.func <- function(theta, nb) c(theta[1], theta[1],
                                    theta[2], theta[2])
Actually, for integration purposes, lf.yb0 can usefully be re-written to accept a matrix b where each column is a different $\mathbf{b}$ at which to evaluate the log p.d.f. Suppose lf.yb is such a function (a possible version is sketched below). Let's try evaluating $f_y(\mathbf{y})$ at the simulated data, and the theta vector used for simulation, using the various integration methods covered above.
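The notes leave lf.yb to the reader; a minimal sketch consistent with how it is called below is:

lf.yb <- function(y, b, theta, X, Z, lambda) {
## vectorised lf.yb0: each column of the matrix b is a value of the
## random effects at which to evaluate log f(y,b)
  beta <- theta[1:ncol(X)]
  theta <- theta[-(1:ncol(X))]
  eta <- Z%*%b + as.vector(X%*%beta) ## columns are linear predictors
  lam <- lambda(theta, ncol(Z))
  colSums(y*eta - exp(eta) - lfactorial(y)) - colSums(lam*b^2)/2 +
    sum(log(lam))/2 - ncol(Z)*log(2*pi)/2
}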

Quadrature rules

First try brute force application of the midpoint rule. We immediately have the slight problem of the integration domain being infinite, but since the integrand will vanish at some distance from $\mathbf{b} = \mathbf{0}$, we can probably safely restrict attention to the domain $[-5, 5]^4$.

  theta <- c(0,3,1,1)
  nm <- 10;m.r <- 10
  bm <- mesh((1:nm-(nm+1)/2)*m.r/nm,4,rep(m.r/nm,nm))
  lf <- lf.yb(y,t(bm$X),theta,X,Z,var.func)
  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -116.5303
Of course, we don’t actually know how accurate this is, but we can get some idea by doubling nm, in which case

  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -118.6234
Clearly we are not doing all that well, even with 10000-160000 function evaluations.

On very smooth functions Gauss quadrature rules were a much better bet, so let’s try a naive application of these again.

  nm <- 10
  gq <- gauss.quad(nm)
  w <- gq$weights*m.r/2 ## rescale to domain
  x <- gq$nodes*m.r/2
  bm <- mesh(x,4,w)
  lf <- lf.yb(y,t(bm$X),theta,X,Z,var.func)
  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -136.95
— this is a long way from the midpoint estimate: what happens if we double nm to 20?

  log(sum(bm$w*exp(lf-max(lf)))) + max(lf)
## [1] -125.3369
Increasing nm further still indicates problems. The difficulty is that the integrand is not very smooth, in the way that $e^x$ was, and it is not well approximated by a tensor product of high order polynomials. The midpoint rule actually works better here. In fact this integral could be performed reasonably accurately by skilful use of Gauss quadrature, but the example illustrates that naive application can be much worse than doing something simpler.

Laplace approximation

To use the Laplace approximation we need to supplement lf.yb with functions glf.yb and hlf.yb which evaluate the gradient vector and Hessian of $\log f(\mathbf{y},\mathbf{b})$ w.r.t. $\mathbf{b}$.
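glf.yb and hlf.yb are not written out in the notes; for the Poisson model above the gradient is $\mathbf{Z}^{\sf T}(\mathbf{y}-\boldsymbol{\mu}) - \boldsymbol{\Lambda}\mathbf{b}$ and the Hessian is $-\mathbf{Z}^{\sf T}{\rm diag}(\boldsymbol{\mu})\mathbf{Z} - \boldsymbol{\Lambda}$, so minimal sketches might be:

glf.yb <- function(y, b, theta, X, Z, lambda) {
## gradient of log f(y,b) w.r.t. b
  beta <- theta[1:ncol(X)]; theta <- theta[-(1:ncol(X))]
  mu <- exp(X%*%beta + Z%*%b)
  lam <- lambda(theta, ncol(Z))
  t(Z)%*%(y - mu) - lam*b
}

hlf.yb <- function(y, b, theta, X, Z, lambda) {
## Hessian of log f(y,b) w.r.t. b
  beta <- theta[1:ncol(X)]; theta <- theta[-(1:ncol(X))]
  mu <- exp(X%*%beta + Z%*%b)
  lam <- lambda(theta, ncol(Z))
  -t(Z)%*%(Z*as.vector(mu)) - diag(lam)
}

Given these functions, we first find $\hat{\mathbf{b}}_y$ by Newton's method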

b <- rep(0,4)
while(TRUE) {
  b0 <- b
  g <- glf.yb(y, b, theta, X, Z, var.func)
  H <- hlf.yb(y, b, theta, X, Z, var.func)
  b <- b - solve(H, g)
  b <- as.vector(b)
  if (sum(abs(b-b0)) < 1e-5*sum(abs(b)) + 1e-4) break
}
after which b contains $\hat{\mathbf{b}}_y$ and H the negative of the required $\mathbf{H}$. It remains to evaluate the approximation

$$\log \hat f_y(\mathbf{y}) = \log f(\mathbf{y},\hat{\mathbf{b}}_y) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{H}|, \quad d = 4.$$
lf.yb0(y,b,theta,X,Z,var.func) + 2*log(2*pi) -
    0.5*determinant(H,logarithm=TRUE)$modulus
## [1] -116.9646
## attr(,"logarithm")
## [1] TRUE
This is reasonably close to the truth, but the method itself provides us with no way of knowing how close.

Monte Carlo

Simple brute force Monte Carlo Integration follows

  N <- 100000
  bm <- matrix((runif(N*4)-.5)*m.r,4,N)
  lf <- lf.yb(y, bm, theta, X, Z, var.func)
  log(sum(exp(lf-max(lf)))) + max(lf) - log(N) + log(m.r^4)
## [1] -119.8479
## [1] -119.3469
The next replicate gave -119.3, and that degree of variability is with $10^5$ function evaluations. Stratification improves things a bit, but let's move on.

Intelligent Monte Carlo

Using the original factorisation of the joint likelihood, we can re-write our required marginal likelihood

$$f_y(\mathbf{y}) = \int f(\mathbf{y}|\mathbf{b})f(\mathbf{b})\,d\mathbf{b} = \mathbb{E}_{\mathbf{b}}\{f(\mathbf{y}|\mathbf{b})\}$$

as the expectation of the complete data likelihood w.r.t. the random effects distribution. This gives us a natural Monte Carlo approach to estimation.

lf.ygb <- function(y, b, theta, X, Z, lambda) {
  beta = theta[1:ncol(X)]
  eta = Z%*%b + as.vector(X%*%beta)
  colSums(y*eta - exp(eta) - lfactorial(y))
}
N = 100000
lam = var.func(theta[3:4])
bm = matrix(rnorm(4*N, 0, 1/sqrt(lam)), nrow=4) ## sd = lambda^(-1/2), since lam is a precision in lf.yb0
llv = lf.ygb(y, bm, theta, X, Z, var.func)
max(llv) + log(mean(exp(llv - max(llv))))
## [1] -119.2791
## [1] -119.0786
The second replicate is consistent with the first, suggesting relatively low variance. This approach is simple and often works well, but it does rely on a degree of consistency between the complete data likelihood and the (prior) random effects distribution. Laplace importance sampling does not need this.

Laplace Importance Sampling

Using Laplace importance sampling allows the construction of an efficient unbiased estimate of the marginal likelihood. It is not strictly necessary here, since the previous Monte Carlo approach is arguably simpler. But it is very useful for more challenging problems.

N <- 10000
  R <- chol(-H)
  z <- matrix(rnorm(N*4),4,N)
  bm <- b + backsolve(R,z)
  lf <- lf.yb(y,bm,theta,X,Z,var.func)
  zz.2 <- colSums(z*z)/2
  lfz <- lf + zz.2
  log(sum(exp(lfz - max(lfz)))) + max(lfz) + 2 * log(2*pi) -
      sum(log(diag(R))) - log(N)
## [1] -119.3229
## [1] -119.2882
## [1] -119.2774
Stratification can reduce this variability still further. Be aware that, while we know that the estimator is unbiased, it is not unbiased on the log scale, of course.

Conclusions

The example emphasises that integration is computationally expensive, and that multi-dimensional integration is best approached by approximating the integrand with something analytically integrable or by constructing unbiased estimators of the integral. The insight that variance is likely to be minimised by concentrating the ‘design points’ of the estimator on where the integrand has substantial magnitude is strongly supported by this example, and leads rather naturally, beyond the scope of this course, to MCMC methods. Note that in likelihood maximisation settings you might also consider avoiding intractable integration via the EM algorithm (see e.g. Lange, 2010 or Monahan, 2001).

4.5 Further reading on computer integration and differentiation

  • Chapter 8 of Nocedal, J. & S.J. Wright (2006) Numerical Optimization (2nd ed.), Springer, offers a good introduction to finite differencing and AD, and the AD examples above are adapted directly from this source.

  • Gill, P.E. , W. Murray and M.H. Wright (1981) Practical Optimization Academic Press, Section 8.6 offers the treatment of finite differencing that you should read before using it on anything but the most benign problems. This is one of the few references that tells you what to do if your objective is not ‘well scaled’.

  • Griewank, A. (2000) Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation SIAM, is the reference for Automatic Differentiation. If you want to know exactly how AD code works then this is the place to start. It is very clearly written and should be consulted before using AD on anything seriously computer intensive.

  • http://www.autodiff.org is a great resource for information on AD and AD tools.

  • If you want to see AD in serious action, take a look at e.g. Kalnay, E (2003) Atmospheric Modelling, Data Assimilation and Predictability.

  • R is not the best language for computational statistics, and automatic differentiation is not particularly well supported. There are better automatic differentiation options in Julia and Python. The JAX library for Python is particularly interesting - https://jax.readthedocs.io/

  • Although most widely-used AD libraries are based on object-oriented programming approaches, arguably the most elegant approaches exploit functional programming languages with strong static type systems. See, for example: Elliott, C. (2018) The simple essence of automatic differentiation. Proc. ACM Program. Lang. 2, ICFP, Article 70. DOI: https://doi.org/10.1145/3236765, and the Dex language - https://github.com/google-research/dex-lang

  • Chapter 10 of Monahan, J.F (2001) Numerical Methods of Statistics, offers more information on numerical integration, including much of the detail skipped over here, although it is not always easy to follow, if you don’t already have quite a good idea what is going on.

  • Lange, K (2010) Numerical Analysis for Statisticians, second edition, Springer. Chapter 18 is good on one dimensional quadrature.

  • Ripley, B.D. (1987, 2006) Stochastic Simulation Chapters 5 and section 7.4 are good discussions of stochastic integration and Quasi Monte Carlo integration.

  • Robert, C.P. and G. Casella (1999) Monte Carlo Statistical Methods Chapter 3 (and beyond) goes into much more depth than these notes on stochastic integration.

  • Press et al. (2007) Numerical Recipes (3rd ed), Cambridge, has a nice chapter on numerical integration, but with emphasis on 1D.



  1. So called because it is the error associated with truncating the Taylor series approximation to the function.
  2. This example is taken from Nocedal and Wright, (2006).
  3. Note that in practice, if $\Omega$ is a complicated shape, it may be preferable to generate uniformly over a simpler enclosing region $\tilde\Omega$, and just set the integrand to zero for points outside $\Omega$ itself.