13  Kriging

13.1 Introduction

Kriging is concerned with using measurements of a latent field at a number of sites in order to make predictions for the value of the field at other (nearby) sites. It is named after D. G. Krige, who devised it for mining applications, though much of the theory was developed later by G. Matheron. Some of the terminology used in kriging, and in spatial statistics more generally, reflects these early mining applications (the “nugget”, for example).

13.2 Simple Kriging

We begin by considering the simplest form of kriging, where both the mean and covariance function of the latent field are known. This is a very similar setup to the problem of Gaussian process regression considered in Section 11.6, but here we do not make any distributional assumptions; instead we consider a linear predictor for the field at an unobserved location of interest, and choose the form of that predictor to be optimal in an appropriate sense. We will use similar notation to Section 11.6, and consider \(z_i=z(\mathbf{s}_i)\) to be the measured value of \(Z_i=Z(\mathbf{s}_i),\ i=1,\ldots,n\), and \(Y=Z(\mathbf{s}^\star)\) to be an unobserved value at a particular site of interest, \(\mathbf{s}^\star\). We will use the notation \(\mathbb{E}[Y]=\mu_U\), \(\mathbb{E}[\mathbf{Z}]=\pmb{\mu}_O\), \(\mathbb{V}\mathrm{ar}[Y]=\sigma^2_U\), \(\mathbb{C}\mathrm{ov}[\mathbf{Z}, Y]=\pmb{\sigma}_{OU}\), and \(\mathbb{V}\mathrm{ar}[\mathbf{Z}]=\mathsf{\Sigma}_O+\sigma^2\mathbb{I}\) (that is, we allow a nugget variance in the observation process). We look to predict \(Y\) using a linear predictor of the form \[ \hat{Y} = \alpha + \pmb{\beta}^\top\mathbf{Z} \] for \(\alpha\) and \(\pmb{\beta}\) minimising the expected quadratic loss function \[ \mathcal{L}(\alpha, \pmb{\beta}) = \mathbb{E}\left[\left(\hat{Y}-Y\right)^2\right]. \] Such an estimator is sometimes known as the best linear unbiased predictor (BLUP).

Expanding the loss function we get \[\begin{align*} \mathcal{L}(\alpha, \pmb{\beta}) &= \mathbb{E}\left[\left(\alpha + \pmb{\beta}^\top\mathbf{Z} - Y\right)^2\right] \\ &= \mathbb{E}\left[\alpha + \pmb{\beta}^\top\mathbf{Z} - Y\right]^2 + \mathbb{V}\mathrm{ar}\left[\alpha + \pmb{\beta}^\top\mathbf{Z} - Y\right] \\ &= \alpha^2 + (\pmb{\beta}^\top\pmb{\mu}_O)^2 + \mu_U^2 + 2\alpha\pmb{\beta}^\top\pmb{\mu}_O - 2\alpha\mu_U - 2\mu_U\pmb{\beta}^\top\pmb{\mu}_O + \\ & \qquad\qquad \pmb{\beta}^\top(\mathsf{\Sigma}_O + \sigma^2\mathbb{I})\pmb{\beta} + \sigma^2_U - 2\pmb{\beta}^\top\pmb{\sigma}_{OU} \end{align*}\] after some algebra. Differentiating wrt \(\alpha\) and equating to zero gives \[ \alpha = \mu_U - \pmb{\beta}^\top\pmb{\mu}_O. \] Differentiating wrt \(\pmb{\beta}\), substituting in for \(\alpha\), and equating to zero gives \[ \pmb{\beta} = (\mathsf{\Sigma}_O + \sigma^2\mathbb{I})^{-1}\pmb{\sigma}_{OU}. \] Consequently, our BLUP takes the form \[ \hat{Y} = \mu_U + \pmb{\sigma}_{OU}^\top(\mathsf{\Sigma}_O + \sigma^2\mathbb{I})^{-1}(\mathbf{Z}-\pmb{\mu}_O). \] We can evaluate the predictive uncertainty associated with this BLUP as \[\begin{align*} \mathbb{V}\mathrm{ar}[\hat{Y}-Y] &= \mathbb{V}\mathrm{ar}[\mu_U + \pmb{\sigma}_{OU}^\top(\mathsf{\Sigma}_O + \sigma^2\mathbb{I})^{-1}(\mathbf{Z}-\pmb{\mu}_O) - Y] \\ &= \sigma_U^2 - \pmb{\sigma}_{OU}^\top(\mathsf{\Sigma}_O + \sigma^2\mathbb{I})^{-1}\pmb{\sigma}_{OU}. \end{align*}\]

We therefore see that simple kriging is exactly equivalent to Gaussian process regression (in the case of known mean and covariance function), but the interpretation is different.
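To make the algebra concrete, here is a minimal sketch of the simple kriging equations in R, assuming a hypothetical exponential covariance function, a known constant mean and a nugget variance; the sites, data and parameter values below are invented purely for illustration.

```r
set.seed(42)
n <- 10
s <- matrix(runif(2 * n), ncol = 2)        # observation sites in [0,1]^2
sstar <- c(0.5, 0.5)                       # prediction site, s*
mu <- 2; sig2 <- 1; phi <- 0.25; nugget <- 0.1
covfn <- function(d) sig2 * exp(-d / phi)  # assumed covariance as a function of distance
z <- rnorm(n, mu, 1)                       # stand-in observations, Z

SigO <- covfn(as.matrix(dist(s))) + nugget * diag(n)  # Var[Z] = Sigma_O + sigma^2 I
sigOU <- covfn(sqrt(colSums((t(s) - sstar)^2)))       # Cov[Z, Y] = sigma_OU

w <- solve(SigO, sigOU)           # (Sigma_O + sigma^2 I)^{-1} sigma_OU
yhat <- mu + sum(w * (z - mu))    # BLUP: mu_U + sigma_OU' (Sigma_O + sigma^2 I)^{-1} (Z - mu_O)
predvar <- sig2 - sum(w * sigOU)  # sigma_U^2 - sigma_OU' (Sigma_O + sigma^2 I)^{-1} sigma_OU
```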

The equivalence between multivariate Gaussian analysis and optimal linear prediction is quite general, and not specific to the example of simple kriging. Many of the results that we have derived in this course using multivariate normal theory can also be derived from the perspective of optimal linear prediction. In particular, the Kalman filter that we derived in Chapter 7 can also be derived from an optimal linear prediction viewpoint, and that is in fact how Rudolf Kalman originally presented it. Either approach can be viewed from a frequentist or a Bayesian perspective. The Gaussian analysis tends to have more of a Bayesian flavour, since it corresponds to conditioning a fully specified model, but it can also be used in frequentist settings. Conversely, the BLUP tends to be given a frequentist (repeated sampling) interpretation, but can also be given a Bayesian interpretation, in which case it is typically called the Bayes linear estimate. The theory of kriging is almost always developed from an optimal linear prediction perspective.

13.3 Ordinary Kriging

Simple kriging relies on the mean function being known. Unfortunately, in most realistic scenarios this is not the case. Further, estimating the mean from data is very problematic, especially in the typical case where there is only one realisation of the process. In this case, the covariance function will also typically be unknown, and difficult to estimate without first estimating the mean. It is therefore desirable to develop a kriging approach for the case of an intrinsically stationary process with unknown mean, relying only on knowledge of the (semi-)variogram. The variogram is much easier to estimate from data than the covariance function, and we will look at methods for this in Section 13.5.

In order to derive our ordinary kriging estimator, we will need the useful identity, \[ \left(z_0 - \sum_{i=1}^n w_iz_i\right)^2 = \sum_{i=1}^n w_i(z_0-z_i)^2 - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n w_iw_j(z_i-z_j)^2, \tag{13.1}\] which is true for all \(z_0,z_1,\ldots,z_n\in\mathbb{R}\) and weights \(w_1,w_2,\ldots,w_n\in\mathbb{R}\) such that \(\sum_{i=1}^n w_i=1\). It can be straightforwardly verified by expanding and simplifying both sides and checking that they are equal.
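The identity is also easy to check numerically; the short R snippet below (with arbitrary made-up values) confirms it for a random choice of \(z_0,z_1,\ldots,z_n\) and weights summing to one.

```r
set.seed(1)
n <- 5
z0 <- rnorm(1); z <- rnorm(n)   # arbitrary z_0 and z_1, ..., z_n
w <- runif(n); w <- w / sum(w)  # weights summing to 1
lhs <- (z0 - sum(w * z))^2
rhs <- sum(w * (z0 - z)^2) - 0.5 * sum(outer(w, w) * outer(z, z, "-")^2)
all.equal(lhs, rhs)             # TRUE
```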

We assume that our field is intrinsically stationary with semi-variogram \(\gamma(\cdot)\) and unknown mean, \(\mu\). Given data \(\mathbf{Z}=(Z_1,\ldots,Z_n)^\top\), we seek an estimate for \(Y=Z(\mathbf{s}^\star)\) of the form \[ \hat{Y} = \mathbf{w}^\top\mathbf{Z}, \] since in the case of unknown mean, unbiasedness for any mean will force the intercept to be zero. Then unbiasedness requires \[ \mu = \mathbb{E}[\hat{Y}] = \mathbb{E}[\mathbf{w}^\top\mathbf{Z}] = \mathbf{w}^\top\mu\mathbf{1}, \] that is \[ \mathbf{w}^\top\mathbf{1}= \sum_{i=1}^n w_i = 1. \] So unbiasedness requires that our estimator is a weighted average of the data. Let us now consider the expected quadratic loss function \[\begin{align*} \mathcal{L}(\mathbf{w}) &= \mathbb{E}[(Y-\hat{Y})^2] \\ &= \mathbb{E}\left[\left(Y-\sum_{i=1}^n w_i Z_i\right)^2\right] \\ &= \mathbb{E}\left[ \sum_{i=1}^n w_i(Y-Z_i)^2 - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n w_iw_j(Z_i-Z_j)^2 \right] \\ &= 2\sum_{i=1}^n w_i\gamma(\mathbf{s}^\star-\mathbf{s}_i) - \sum_{i=1}^n\sum_{j=1}^nw_iw_j\gamma(\mathbf{s}_i-\mathbf{s}_j) \\ &= 2\mathbf{w}^\top\pmb{\gamma} - \mathbf{w}^\top\mathsf{\Gamma}\mathbf{w}, \end{align*}\] where \(\pmb{\gamma}\) has \(i\)th element \(\gamma(\mathbf{s}^\star-\mathbf{s}_i)\), \(\mathsf{\Gamma}\) has \((i,j)\)th element \(\gamma(\mathbf{s}_i-\mathbf{s}_j)\), and we used Equation 13.1 to go from line 2 to line 3. If we directly minimise this wrt \(\mathbf{w}\) in order to obtain our “best” linear predictor, we will violate the unbiasedness condition we assumed in its derivation. We must therefore carefully minimise it subject to this constraint. So we use a Lagrange multiplier with the Lagrangian \[ \mathcal{L}(\mathbf{w}, \lambda) = 2\mathbf{w}^\top\pmb{\gamma} - \mathbf{w}^\top\mathsf{\Gamma}\mathbf{w} -2\lambda(\mathbf{w}^\top\mathbf{1}- 1). \] The gradient wrt \(\mathbf{w}\) gives \[ \mathsf{\Gamma}\mathbf{w} + \lambda\mathbf{1}= \pmb{\gamma} \] and the derivative wrt \(\lambda\) is obviously our unbiasedness constraint, giving a total of \(n+1\) equations in \(n+1\) unknowns. We can solve these for \(\mathbf{w}\) explicitly, but in practice we often write them in the form \[ \begin{pmatrix}\mathsf{\Gamma}&\mathbf{1}\\\mathbf{1}^\top&0\end{pmatrix} \binom{\mathbf{w}}{\lambda} = \binom{\pmb{\gamma}}{1}, \] or \[ \mathsf{\Gamma}_0 \mathbf{w}_0 = \pmb{\gamma}_0, \] and solve as \(\mathbf{w}_0 = \mathsf{\Gamma}_0^{-1}\pmb{\gamma}_0\), with our optimal \(\mathbf{w}\) the first \(n\) elements of \(\mathbf{w}_0\). Alternatively, we can explicitly solve for \(\mathbf{w}\) to get \[ \mathbf{w} = \mathsf{\Gamma}^{-1}\left(\pmb{\gamma} + \frac{1-\mathbf{1}^\top\mathsf{\Gamma}^{-1}\pmb{\gamma}}{\mathbf{1}^\top\mathsf{\Gamma}^{-1}\mathbf{1}}\mathbf{1}\right) \] and hence BLUP \[ \hat{Y} = \left(\pmb{\gamma} + \frac{1-\mathbf{1}^\top\mathsf{\Gamma}^{-1}\pmb{\gamma}}{\mathbf{1}^\top\mathsf{\Gamma}^{-1}\mathbf{1}}\mathbf{1}\right)^\top\mathsf{\Gamma}^{-1}\mathbf{Z}. \] The predictive uncertainty is \[\begin{align*} \mathbb{V}\mathrm{ar}[Y-\hat{Y}] &= \mathbb{E}[(Y-\hat{Y})^2] \\ &= 2\mathbf{w}^\top\pmb{\gamma} - \mathbf{w}^\top\mathsf{\Gamma}\mathbf{w}. \end{align*}\] We can evaluate this by substituting in our computed \(\mathbf{w}\). 
Depending on how we have computed \(\mathbf{w}\), it may be convenient to re-write this as \[\begin{align*} 2\mathbf{w}^\top\pmb{\gamma} - \mathbf{w}^\top\mathsf{\Gamma}\mathbf{w} &= 2\mathbf{w}^\top\pmb{\gamma} - \mathbf{w}^\top(\pmb{\gamma}-\lambda\mathbf{1}) \\ &= \mathbf{w}^\top\pmb{\gamma} + \lambda \\ &= \mathbf{w}_0^\top\pmb{\gamma}_0. \end{align*}\] Substituting in for \(\mathbf{w}_0\) gives \[ \pmb{\gamma}_0^\top\mathsf{\Gamma}_0^{-1}\pmb{\gamma}_0. \]
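As an illustration, the following minimal R sketch computes the ordinary kriging prediction and variance via the augmented system \(\mathsf{\Gamma}_0\mathbf{w}_0=\pmb{\gamma}_0\), assuming a hypothetical exponential semi-variogram and invented sites and data.

```r
set.seed(42)
n <- 10
s <- matrix(runif(2 * n), ncol = 2)        # observation sites
sstar <- c(0.5, 0.5)                       # prediction site, s*
z <- rnorm(n, 2, 1)                        # stand-in observations, Z
sig2 <- 1; phi <- 0.25
sv <- function(d) sig2 * (1 - exp(-d / phi))  # assumed semi-variogram

Gamma <- sv(as.matrix(dist(s)))            # (i,j)th element gamma(s_i - s_j)
gam <- sv(sqrt(colSums((t(s) - sstar)^2))) # ith element gamma(s* - s_i)
Gamma0 <- rbind(cbind(Gamma, 1), c(rep(1, n), 0))  # augmented matrix Gamma_0
gam0 <- c(gam, 1)                          # augmented vector gamma_0

w0 <- solve(Gamma0, gam0)                  # w_0 = (w, lambda)
w <- w0[1:n]; lambda <- w0[n + 1]
yhat <- sum(w * z)                         # ordinary kriging prediction
predvar <- sum(w0 * gam0)                  # kriging variance, w_0' gamma_0
```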

13.4 Co-Kriging

Although we are mainly concerned with univariate random fields, it is instructive to consider how our approach to kriging may change when we have a multivariate observation at a collection of sites, or equivalently, observations on a collection of (coupled) univariate random fields. We assume that we have \(k\) intrinsically stationary random fields with means \(\mu_1,\mu_2,\ldots,\mu_k\), and as for ordinary kriging, we assume that these means are unknown. We consider the scenario of observing all \(k\) fields at \(n\) spatial locations, \(\mathbf{s}_1,\ldots,\mathbf{s}_n\). We store the observations in an \(n\times k\) matrix, \(\mathsf{Z}\). The \(i\)th row, \(\mathbf{Z}_i\), represents the multivariate observation at spatial location \(\mathbf{s}_i\), and the \(j\)th column, \(\mathbf{Z}^{(j)}\), the observations on field \(j\).

We assume, without loss of generality, that we are interested in making a prediction for field 1 at an unobserved location, \(\mathbf{s}^\star\). Call this \(Y=Z_1(\mathbf{s}^\star)\). Clearly we could just use ordinary kriging based only on the data from the first field, \(\mathbf{Z}^{(1)}\). However, if the fields are correlated, this is likely to be sub-optimal, since it ignores the information in the data from the other fields. Utilising all of the data to construct the BLUP is the topic of co-kriging. Similarly to ordinary kriging, we seek a BLUP of the form \[ \hat{Y} = \sum_{i=1}^n\sum_{j=1}^k w_{ij}Z_{ij} = \sum_{j=1}^k {\mathbf{w}^{(j)}}^\top\mathbf{Z}^{(j)}, \] where \(\mathsf{W}\) is an \(n\times k\) matrix of weights with elements \(w_{ij}\) and columns \(\mathbf{w}^{(j)}\). For unbiasedness we want \(\mathbb{E}[\hat{Y}]=\mu_1\), but \[ \mathbb{E}[\hat{Y}] = \sum_{j=1}^k \mathbb{E}[\mathbf{Z}^{(j)}]^\top\mathbf{w}^{(j)}= \sum_{j=1}^k \mu_j\mathbf{1}^\top\mathbf{w}^{(j)}, \] and so the only way that this can be unbiased for any means is to have \[ \mathbf{1}^\top\mathbf{w}^{(1)} =1 \quad \text{and} \quad \mathbf{1}^\top\mathbf{w}^{(j)}=0,\quad j=2,\ldots,k. \] This gives us \(k\) constraints on the weight matrix, \(\mathsf{W}\).

It turns out that co-kriging is much easier to understand in the context of a known covariance function, \(C:\mathcal{D}\times\mathcal{D}\rightarrow M_{k\times k}\), given by \[ C(\mathbf{s},\mathbf{s}') = \mathbb{C}\mathrm{ov}[\mathbf{Z}(\mathbf{s}),\mathbf{Z}(\mathbf{s}')]. \] We therefore present co-kriging in this scenario. We want to minimise the expected quadratic loss function \[ \mathcal{L}(\mathsf{W}) = \mathbb{E}\left[\left(Y-\hat{Y}\right)^2\right] = \mathbb{E}\left[\left(Z_1(\mathbf{s}^\star) - \sum_{i=1}^n\sum_{j=1}^k w_{ij}Z_j(\mathbf{s}_i) \right)^2\right]. \] Again, directly minimising this will violate our unbiasedness constraints, and so we introduce \(k\) Lagrange multipliers and minimise the Lagrangian \[ \mathcal{L}(\mathsf{W}, \pmb{\lambda}) = \mathcal{L}(\mathsf{W}) - 2\lambda_1(\mathbf{1}^\top\mathbf{w}^{(1)} - 1) - 2\sum_{j=2}^k \lambda_j \mathbf{1}^\top\mathbf{w}^{(j)}. \] Clearly, differentiating this wrt \(\lambda_j\), \(j=1,\ldots,k\) gives our \(k\) unbiasedness constraints. On the other hand, differentiating wrt \(w_{lm},\ l=1,\ldots,n,\ m=1,\ldots,k\) and equating to zero gives (after a little algebra), \[ \sum_{i=1}^n\sum_{j=1}^k w_{ij} C_{jm}(\mathbf{s}_i,\mathbf{s}_l) - \lambda_m = C_{1m}(\mathbf{s}^\star,\mathbf{s}_l). \] Together these give us \((n+1)k\) linear equations in \((n+1)k\) unknowns. We can solve these in the usual way.

We will also probably be interested in the predictive variance. A straightforward but rather tedious computation gives \[\begin{align*} \mathbb{V}\mathrm{ar}[Y-\hat{Y}] &= \mathbb{E}[(Y-\hat{Y})^2] \\ &= C_{11}(\mathbf{s}^\star,\mathbf{s}^\star) - \sum_{i=1}^n\sum_{j=1}^k w_{ij}C_{1j}(\mathbf{s}^\star,\mathbf{s}_i) + \lambda_1. \end{align*}\]
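For concreteness, the sketch below assembles and solves the \((n+1)k\) co-kriging equations in R for \(k=2\) fields, assuming a hypothetical intrinsic coregionalisation model, \(C_{jm}(\mathbf{s},\mathbf{s}')=B_{jm}\exp(-\Vert\mathbf{s}-\mathbf{s}'\Vert/\phi)\); the sites, data, \(\mathsf{B}\) and \(\phi\) are all invented for illustration.

```r
set.seed(1)
n <- 8; k <- 2
s <- matrix(runif(2 * n), ncol = 2)        # observation sites
sstar <- c(0.5, 0.5)                       # prediction site, s*
B <- matrix(c(1, 0.7, 0.7, 1.5), 2, 2)     # assumed between-field covariance matrix
phi <- 0.3                                 # assumed correlation length
rho <- function(d) exp(-d / phi)           # spatial correlation function
Z <- matrix(rnorm(n * k), n, k)            # stand-in observations, one column per field

dOO <- as.matrix(dist(s))                  # inter-site distances
dstar <- sqrt(colSums((t(s) - sstar)^2))   # distances from s* to each site
C1 <- outer(rho(dstar), B[1, ])            # (i,j)th element C_{1j}(s*, s_i)

# assemble the (n+1)k linear equations; unknowns ordered as
# (w^{(1)}, ..., w^{(k)}, lambda_1, ..., lambda_k)
idx <- function(i, j) (j - 1) * n + i      # position of w_{ij} in the unknown vector
A <- matrix(0, (n + 1) * k, (n + 1) * k)
b <- numeric((n + 1) * k)
for (l in 1:n) for (m in 1:k) {
  row <- idx(l, m)
  for (j in 1:k) A[row, idx(1:n, j)] <- B[j, m] * rho(dOO[, l])  # C_{jm}(s_i, s_l)
  A[row, n * k + m] <- -1                  # -lambda_m
  b[row] <- C1[l, m]                       # C_{1m}(s*, s_l)
}
for (m in 1:k) {                           # the k unbiasedness constraints
  A[n * k + m, idx(1:n, m)] <- 1
  b[n * k + m] <- as.numeric(m == 1)
}

sol <- solve(A, b)
W <- matrix(sol[1:(n * k)], n, k)          # co-kriging weights, w_{ij}
lambda <- sol[n * k + 1:k]                 # Lagrange multipliers
yhat <- sum(W * Z)                         # co-kriging prediction
predvar <- B[1, 1] - sum(W * C1) + lambda[1]   # predictive variance
```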

13.5 Empirical estimation of (co)variograms

13.5.1 Empirical covariograms and variograms

13.5.2 Fitting variograms to data

13.5.3 Likelihood estimation of covariograms

13.6 Kriging in R