We do not always want to model random fields over a continuously varying spatial domain. Sometimes we just want models for a process at a discrete set of sites. Typically the set of sites will be finite, but sometimes it will be useful to consider the process at a countably infinite set of sites. The sites may be irregularly distributed in space (\(\mathbf{s}\in\mathcal{D}\subset\mathbb{R}^n\)), or may be located on some kind of regular grid. If the sites are located on a regular square grid, we will typically assume that each site \(\mathbf{s}\in\mathcal{D}\subset\mathbb{Z}^d\) for \(d=1,2\) or \(3\). We typically prefer to model on a discrete lattice of spatial locations when interpolation between sites doesn’t make sense. This is often (but not always) because the value of the field at each site is areal, representing the total, average or integral of some quantity over some finite region represented by the site. For example, if the sites represent regions, such as counties, it does not make sense to try to interpolate between the value for one county and the value for an adjacent county. Of course, sites will be correlated, with the degree of correlation depending in some way on how far apart the sites are. But sites may be conditionally independent of other sites given knowledge of nearby (adjacent) sites. In this case some kind of neighbourhood structure, in the form of a graph, may be useful for understanding the conditional independence relationships between the sites. For this, it is helpful to know something about the theory of graphical models.
Variables of multivariate distributions and data sets are often correlated in complex ways. It would be convenient if many variables were independent, leading to a sparse independence structure, as this would greatly simplify much multivariate theory and methodology. Unfortunately variables tend not to be independent in practice. It is more often the case that variables are conditionally independent. Although not quite as strong a property as independence, this too turns out to be enough to lead to considerable simplification of a general dependence structure, and is useful both for conceptual and theoretical understanding and for computational efficiency. Sparse conditional independence structures can be represented using graphs, and this leads to the theory of graphical models. But before we explore graphical models, we must first ensure that we understand the notion of conditional independence.
Let us begin by reminding ourselves of the essential notion of independence of events and random variables. For events \(A\) and \(B\), we say that \(A\) and \(B\) are independent, and write \(A\perp B\), if \[\mathbb{P}[A\cap B]= \mathbb{P}[A]\mathbb{P}[B].\] Note that for \(\mathbb{P}[B]>0\) we get \[\mathbb{P}[A|B] = \frac{\mathbb{P}[A\cap B]}{\mathbb{P}[B]} = \mathbb{P}[A],\] which captures the intuitive notion of independence, that learning the outcome of \(B\) should not affect the probability of \(A\).
The notion extends straightforwardly to (continuous) random quantities, where now we say that variables \(X\) and \(Y\) are independent and write \(X\perp Y\) if their joint density factorises as \[f_{X,Y}(x,y) = f_X(x)f_Y(y).\] Again, for \(f_Y(y)>0\) we have \[f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = f_X(x),\] capturing the intuitive notion that the conditional distribution of \(X\) given \(Y=y\) should not depend on \(y\). That is, the conditional and marginal distributions should be the same.
It turns out that independence is a special case of the more general notion of conditional independence. For events \(A\), \(B\) and \(C\), we say that \(A\) and \(B\) are conditionally independent given \(C\), and write \(A{\perp\hspace{-0.8ex}\perp}B|{C}\), if \[\mathbb{P}[A\cap B|C] = \mathbb{P}[A|C]\mathbb{P}[B|C].\] Now, when \(\mathbb{P}[B|C]>0\) we have \[\mathbb{P}[A|B\cap C] = \frac{\mathbb{P}[A\cap B|C]}{\mathbb{P}[B|C]} = \mathbb{P}[A|C],\] and so conditional on \(C\), learning the outcome of \(B\) does not change the probability of \(A\).
This notion again generalises easily to (continuous) random quantities, so that for random variables \(X\), \(Y\) and \(Z\), we have that \(X\) and \(Y\) are conditionally independent given \(Z\), written \(X{\perp\hspace{-0.8ex}\perp}Y|{Z}\), if the conditional density of \(X\) and \(Y\) given \(Z=z\) factorises as \[f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z)f_{Y|Z}(y|z).\] So then if \(f_{Y|Z}(y|z)>0\) we have \[f_{X|Y,Z}(x|y,z) = \frac{f_{X,Y|Z}(x,y|z)}{f_{Y|Z}(y|z)} = f_{X|Z}(x|z).\] This factorisation has numerous other important consequences for the joint density of \(X\), \(Y\) and \(Z\), \(f_{X,Y,Z}(x,y,z)\).
Proposition
If \(X{\perp\hspace{-0.8ex}\perp}Y|{Z}\) then the joint density factorises as \[f_{X,Y,Z}(x,y,z)=f_Z(z)f_{X|Z}(x|z)f_{Y|Z}(y|z).\]
This factorisation property turns out to be key to understanding directed acyclic graph (DAG) models.
Proof. \[f_{X,Y,Z}(x,y,z)= f_Z(z)f_{X,Y|Z}(x,y|z) =f_Z(z)f_{X|Z}(x|z)f_{Y|Z}(y|z).\] \(\Box\)
Proposition
If \(X{\perp\hspace{-0.8ex}\perp}Y|{Z}\) and \(f_Z(z)>0\), then the joint density factorises as \[ f_{X,Y,Z}(x,y,z) = \frac{f_{X,Z}(x,z)f_{Y,Z}(y,z)}{f_Z(z)}. \tag{14.1}\]
This factorisation property turns out to be important for understanding undirected graphical models.
Proof. \[\begin{aligned} f_{X,Y,Z}(x,y,z) &= f_Z(z)f_{X|Z}(x|z)f_{Y|Z}(y|z) \\ &= f_Z(z)\frac{f_{X,Z}(x,z)}{f_Z(z)}\frac{f_{Y,Z}(y,z)}{f_Z(z)} \\ &= \frac{f_{X,Z}(x,z)f_{Y,Z}(y,z)}{f_Z(z)}. \end{aligned}\] \(\Box\)
Proposition
Assuming that \(f_Z(z)>0\), we have that \(X{\perp\hspace{-0.8ex}\perp}Y|{Z}\) if and only if the joint density factorises in the form \[f_{X,Y,Z}(x,y,z) = h(x,z)k(y,z),\] for some bivariate functions \(h(\cdot,\cdot)\) and \(k(\cdot,\cdot)\).
Proof. First note that the forward implication follows from Equation 14.1. That is, if \(X{\perp\hspace{-0.8ex}\perp}Y|{Z}\), then there exist functions \(h(\cdot,\cdot)\) and \(k(\cdot,\cdot)\) such that \[f_{X,Y,Z}(x,y,z) = h(x,z)k(y,z).\] For example, we could choose \(h(x,z)=f_{X,Z}(x,z)\) and \(k(y,z)=f_{Y,Z}(y,z)/f_Z(z)\), but there are obviously many other possibilities.
The reverse direction is less clear. We assume that we have the factorisation \[f_{X,Y,Z}(x,y,z) = h(x,z)k(y,z),\] and want to prove that \[f_{X,Y|Z}(x,y|z) = f_{X|Z}(x|z)f_{Y|Z}(y|z).\] This can be demonstrated with straightforward algebraic manipulation, but is a little bit fiddly.
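One way to proceed (a sketch, introducing the integrals \(H(z)=\int h(x,z)\,\mathrm{d}x\) and \(K(z)=\int k(y,z)\,\mathrm{d}y\) purely for this argument) is to integrate the assumed factorisation to obtain \[ f_{X,Z}(x,z) = h(x,z)K(z),\quad f_{Y,Z}(y,z) = H(z)k(y,z),\quad f_Z(z) = H(z)K(z), \] so that, for \(f_Z(z)>0\), \[ f_{X|Z}(x|z)f_{Y|Z}(y|z) = \frac{h(x,z)K(z)}{H(z)K(z)}\,\frac{H(z)k(y,z)}{H(z)K(z)} = \frac{h(x,z)k(y,z)}{f_Z(z)} = \frac{f_{X,Y,Z}(x,y,z)}{f_Z(z)} = f_{X,Y|Z}(x,y|z), \] as required. \(\Box\)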
Note that although we may have been thinking about univariate random quantities \(X\), \(Y\) and \(Z\), nothing we have discussed is specific to the univariate case, and all applies similarly to the case of multivariate random quantities \(\mathbf{X}\), \(\mathbf{Y}\) and \(\mathbf{Z}\).
The above factorisation results are very powerful, and can be used to prove a number of general algebraic properties associated with independence and conditional independence of random quantities.
Proposition

For random quantities \(X\), \(Y\) and \(Z\) (each possibly vector-valued), and any (measurable) function \(h(\cdot)\), we have \[X{\perp\hspace{-0.8ex}\perp}Y|{Z} \;\Longleftrightarrow\; Y{\perp\hspace{-0.8ex}\perp}X|{Z},\] \[X{\perp\hspace{-0.8ex}\perp}Y|{Z} \;\Longrightarrow\; h(X){\perp\hspace{-0.8ex}\perp}Y|{Z}, \tag{14.3}\] \[X{\perp\hspace{-0.8ex}\perp}Y|{Z} \;\Longrightarrow\; X{\perp\hspace{-0.8ex}\perp}Y|{(Z,h(Y))}. \tag{14.4}\]
Proof. The proofs of these results are mainly straightforward, and left as an exercise.\(\Box\)
It turns out that when we have a large number of variables with many associated conditional independence statements, it is useful to represent known statements using a graph. The graphs used can be directed or undirected, and can be given several different interpretations. We begin with a reminder of some basic properties of undirected graphs, and then later examine how such graphs may be associated with conditional independence properties.
A graph \(\mathcal{G}\) is a tuple \(\mathcal{G}=(V,E)\), where \(V=\{v_1,v_2,\ldots,v_n\}\) is a finite set of vertices (or nodes), and \(E\) is a finite set of undirected edges, with each edge \(e\in E\) being of the form \(e=\{v_i,v_j\}\), \(v_i,v_j\in V\), \(i\not= j\). There are clearly \(\binom{n}{2}=n(n-1)/2\) possible edges of this form, and we write \(\binom{V}{2}\) for the set of all such edges, so that \(E\subseteq \binom{V}{2}\). A graph is said to be complete if \(E=\binom{V}{2}\).
Consider the graph \(\mathcal{G}=(V,E)\) where \(V=\{A,B,C\}\) and \(E=\{\{A,C\},\{B,C\}\}\). Note that this graph is not complete, since the edge \(\{A,B\}\) is missing.
Vertices \(v_i,v_j\in V\) are said to be adjacent, written \(v_i\sim v_j\), if \(\{v_i,v_j\}\in E\). Any subset of vertices \(U\subseteq V\) induces a subgraph \(\mathcal{G}_U=(U,F)\), where \[F = \{\{u,v\}\in E|u,v\in U\}.\]
For the graph considered in the previous example, nodes \(A\) and \(C\) are adjacent (\(A\sim C\)), but nodes \(A\) and \(B\) are not. Similarly, the subgraph \(\mathcal{G}_{\{A,C\}}\) is complete, but the subgraph \(\mathcal{G}_{\{A,B\}}\) is not.
A subset \(U\subseteq V\) is a (maximal) clique if it is maximally complete. That is, the associated subgraph \(\mathcal{G}_U\) is complete, and for every strict superset \(W\) of \(U\) (so that \(U\subset W\subseteq V\)), the induced subgraph \(\mathcal{G}_W\) is not complete.
Considering again the previous example, the graph \(\mathcal{G}\) has two cliques, \(\{A,C\}\) and \(\{B,C\}\).
The previous example is somewhat uninteresting in that the cliques correspond with edges. So consider now the graph \(\mathcal{G}=(V,E)\) where \(V=\{D,E,F,G\}\) and \[E=\{\{D,E\},\{D,F\},\{E,F\},\{E,G\}\}.\]
Drawing the graph reveals that there are two cliques: \(\{D,E,F\}\) and \(\{E,G\}\).
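This is easy to check computationally. For example, a minimal sketch using the igraph package in R (the use of igraph is an assumption; any graph library would do):

```r
library(igraph)
g <- graph_from_literal(D-E, D-F, E-F, E-G)  # the undirected graph above
plot(g)            # draw the graph
max_cliques(g)     # the maximal cliques: {D, E, F} and {E, G}
```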
Note that the set of cliques defines the graph. That is, complete knowledge of all cliques of the graph allows construction of the graph.
The set of neighbours of a node \(v\in V\), often known as the boundary of \(v\), and written \(\operatorname{bd}(v)\), is \(\{u\in V|u\sim v\}\). The closure of \(v\), written \(\operatorname{cl}(v)\), is given by \(\operatorname{cl}(v)=\{v\}\cup\operatorname{bd}(v)\). A path is a sequence \(x_0,x_1,\ldots,x_m\in V\) such that \(x_{i-1}\sim x_i,\ i=1,2,\ldots,m\). A graph is connected if there exists a path between every pair of vertices. A component is a maximal connected subgraph. A cycle is a (non-trivial) path starting and ending at the same vertex. A connected graph is a tree if it does not contain any cycles. A leaf is any node of a tree connected to just one edge. A forest is a graph whose components are all trees.
For any undirected graph \(\mathcal{G}=(V,E)\) with vertex subsets \(A,B,C\subseteq V\), we say that \(C\) separates \(A\) and \(B\) if all paths from a node in \(A\) to a node in \(B\) pass through a node in \(C\).
Now, given a collection of conditional independence statements relating to a collection of random variables, we can define a graph by associating nodes with variables, and using the edge structure to encode the conditional independence structure. This is useful for both visualisation and computational analysis. There are several different ways in which we can do this, but in most practical cases they all turn out to be equivalent.
Given a set of random variables \(X_1,X_2,\ldots,X_n\), we can define an associated graph \(\mathcal{G}=(V,E)\), where \(V=\{X_1,X_2,\ldots,X_n\}\), and \(E\) is a set of edges.
Definition
We say that \(\mathcal{G}\) has the factorisation property (F) if the joint density of the random variables factorises in the form \[f_X(\mathbf{x}) = \prod_{c\in\mathcal{C}} \phi_c(\mathbf{x}_c),\] for some functions \(\phi_c(\cdot)\), where \(\mathcal{C}\) denotes the set of cliques associated with the graph \(\mathcal{G}\).
Definition
We say that \(\mathcal{G}\) has the global Markov property (G) if for any disjoint vertex subsets \(A\), \(B\) and \(C\) such that \(C\) separates \(A\) and \(B\) in \(\mathcal{G}\), we have \(A{\perp\hspace{-0.8ex}\perp}B|{C}\).
Definition
We say that \(\mathcal{G}\) has the local Markov property (L) if for all \(v\in V\) we have \[v{\perp\hspace{-0.8ex}\perp}V\backslash \operatorname{cl}(v)|{\operatorname{bd}(v)}.\]
Definition
We say that \(\mathcal{G}\) has the pairwise Markov property (P) if for all \(u,v\in V\) such that \(u\not\sim v\) we have \[u{\perp\hspace{-0.8ex}\perp}v|{V\backslash \{u,v\}}.\]
We have just presented four different ways that one can associate an undirected graph with the conditional independence structure of a set of random variables. If these four interpretations were fundamentally different, this would be potentially very confusing. Fortunately, it turns out that they are all closely linked.
Proposition
(F) \(\Rightarrow\) (G) \(\Rightarrow\) (L) \(\Rightarrow\) (P)
This important result tells us that if a graph satisfies the factorisation property (F), then the other properties all follow.
Proof. Let’s start with the simplest result, (G) \(\Rightarrow\) (L): This is trivial, since by definition, \(\operatorname{bd}(v)\) separates \(v\) from \(V\backslash\operatorname{cl}(v)\).
Now let us consider (L) \(\Rightarrow\) (P):
Suppose (L), so that we know \[v{\perp\hspace{-0.8ex}\perp}V\backslash\operatorname{cl}(v)|{\operatorname{bd}(v)}.\] Now using Equation 14.4 we can “copy” vertices from \(V\backslash\operatorname{cl}(v)\) to the conditioning set as desired in order to deduce that \[v{\perp\hspace{-0.8ex}\perp}V\backslash\operatorname{cl}(v)|{V\backslash\{u,v\}}.\] Finally, since \(u\) and \(v\) are not adjacent, we know that \(u\in V\backslash\operatorname{cl}(v)\), and so using Equation 14.3 we conclude that \[v{\perp\hspace{-0.8ex}\perp}u|{V\backslash\{u,v\}},\] and (P) is satisfied.
Finally, let us consider (F) \(\Rightarrow\) (G): Assume (F), and suppose that \(A\), \(B\) and \(C\) are disjoint subsets of \(V\) with \(C\) separating \(A\) and \(B\). The graph \(\mathcal{G}_{V\backslash C}\) cannot be connected, since there is no path from \(A\) to \(B\) avoiding \(C\). So let \(\tilde{A}\) be the union of the components of \(\mathcal{G}_{V\backslash C}\) containing vertices in \(A\), so that \(A\subseteq \tilde{A}\) and \(B\cap \tilde{A}=\emptyset\) due to the separation assumption. Now put \(\tilde{B}=V\backslash(\tilde{A}\cup C)\) so that \(B\subseteq \tilde{B}\), and \(V\) is the disjoint union of \(\tilde{A}\), \(\tilde{B}\) and \(C\). Again, due to the separation assumption, every clique \(c\in\mathcal{C}\) must satisfy either \(c\subseteq\tilde{A}\cup C\) or \(c\subseteq\tilde{B}\cup C\). Call the set of cliques contained in \(\tilde{A}\cup C\), \(\mathcal{C}_A\), and the rest \(\mathcal{C}_B\). Using (F), we have \[f(\mathbf{x}) = \prod_{c\in\mathcal{C}}\phi_c(\mathbf{x}_c) = \prod_{c\in\mathcal{C}_A}\phi_c(\mathbf{x}_c)\prod_{c\in\mathcal{C}_B}\phi_c(\mathbf{x}_c) = h(\mathbf{x}_{\tilde{A}\cup C})k(\mathbf{x}_{\tilde{B}\cup C}),\] and so \(\tilde{A}{\perp\hspace{-0.8ex}\perp}\tilde{B}|{C}\). Now using Equation 14.3 we conclude first that \(A{\perp\hspace{-0.8ex}\perp}\tilde{B}|{C}\), since \(A\subseteq\tilde{A}\), and then that \(A{\perp\hspace{-0.8ex}\perp}B|{C}\), since \(B\subseteq\tilde{B}\). That is, (G) is satisfied. \(\Box\)
Proposition (Hammersley-Clifford)
If \(f_X(\mathbf{x})>0,\ \forall \mathbf{x}\in\mathbb{R}^n\), then (P) \(\Rightarrow\) (F).
This important (and difficult) result is stated without proof. However, the implication is that in most practical cases (positive density), all four interpretations of a graph of conditional independence structures are equivalent, and may be considered interchangeably.
Corollary
If all densities are positive, then (F) \(\Leftrightarrow\) (G) \(\Leftrightarrow\) (L) \(\Leftrightarrow\) (P).
We now turn attention to the simplest parameterised class of graphical models: Gaussian graphical models (GGMs). These are simply multivariate normal distributions where the variables are subject to conditional independence properties. It turns out to be very straightforward to characterise GGMs in terms of the pairwise Markov property (P), but to understand why, we must first understand a little more about the MVN, and its parametrisation in terms of the precision matrix. If \(\mathbf{X}\) is a random vector with variance matrix \(\mathsf{\Sigma}\), then \(\mathsf{Q}=\mathbb{V}\mathrm{ar}[\mathbf{X}]^{-1}\) is the precision matrix for \(\mathbf{X}\). So, if \(\mathbf{X}\sim \mathcal{N}(\pmb{\mu},\mathsf{\Sigma})\), we can put \(\mathsf{Q}=\mathsf{\Sigma}^{-1}\), and then \(\mathbf{X}\sim \mathcal{N}(\pmb{\mu},\mathsf{Q}^{-1})\). Note that we can express the density of the MVN in terms of \(\pmb{\mu}\) and \(\mathsf{Q}\) as \[f(\mathbf{x}) = (2\pi)^{-p/2}|\mathsf{Q}|^{1/2}\exp\left\{-\frac{1}{2}(\mathbf{x}-\pmb{\mu})^\top\mathsf{Q}(\mathbf{x}-\pmb{\mu}) \right\}.\] Writing the density in this way allows us to partition \(\mathbf{x}\), \(\pmb{\mu}\), and \(\mathsf{Q}\), and derive the conditional distribution of the first part of \(\mathbf{x}\) given the rest.
Proposition
For \(\mathbf{X}\sim \mathcal{N}(\pmb{\mu},\mathsf{Q}^{-1})\), partition \[\mathbf{X}=\binom{\mathbf{X}_1}{\mathbf{X}_2},\ \mathbf{x}=\binom{\mathbf{x}_1}{\mathbf{x}_2},\ \pmb{\mu}=\binom{\pmb{\mu}_1}{\pmb{\mu}_2},\ \mathsf{Q} = \begin{pmatrix}\mathsf{Q}_{11}&\mathsf{Q}_{12}\\\mathsf{Q}_{21}&\mathsf{Q}_{22}\end{pmatrix}.\] Then \[(\mathbf{X}_1|\mathbf{X}_2=\mathbf{x}_2) \sim \mathcal{N}(\pmb{\mu}_{1|2},\mathsf{Q}_{11}^{-1}),\] where \[\pmb{\mu}_{1|2} = \pmb{\mu}_1 - \mathsf{Q}_{11}^{-1}\mathsf{Q}_{12}(\mathbf{x}_2-\pmb{\mu}_2).\]
Proof. \[\begin{aligned} f(\mathbf{x}_1|\mathbf{x}_2) &\propto f(\mathbf{x}) \\ &\propto \exp\left\{ -\frac{1}{2} (\mathbf{x}-\pmb{\mu})^\top\mathsf{Q}(\mathbf{x}-\pmb{\mu}) \right\} \\ &\propto \exp\Big\{ -\frac{1}{2}\big[ (\mathbf{x}_1-\pmb{\mu}_1)^\top\mathsf{Q}_{11}(\mathbf{x}_1-\pmb{\mu}_1) \\ &\qquad + (\mathbf{x}_2-\pmb{\mu}_2)^\top\mathsf{Q}_{22}(\mathbf{x}_2-\pmb{\mu}_2) \\ &\qquad + 2(\mathbf{x}_1-\pmb{\mu}_1)^\top\mathsf{Q}_{12}(\mathbf{x}_2-\pmb{\mu}_2) \big]\Big\} \\ &\propto \exp\left\{ -\frac{1}{2}\left[ \mathbf{x}_1^\top\mathsf{Q}_{11}\mathbf{x}_1-2\mathbf{x}_1^\top\mathsf{Q}_{11}\pmb{\mu}_1 +2\mathbf{x}_1^\top\mathsf{Q}_{12}(\mathbf{x}_2-\pmb{\mu}_2) \right]\right\} \\ &\propto \exp\left\{ -\frac{1}{2}\left[ \mathbf{x}_1^\top\mathsf{Q}_{11}\mathbf{x}_1-2\mathbf{x}_1^\top\mathsf{Q}_{11}\pmb{\mu}_{1|2}\right]\right\} \\ &\propto \exp\left\{ -\frac{1}{2} (\mathbf{x}_1-\pmb{\mu}_{1|2})^\top\mathsf{Q}_{11}(\mathbf{x}_1-\pmb{\mu}_{1|2}) \right\}, \end{aligned}\] which we recognise as the kernel of the \(\mathcal{N}(\pmb{\mu}_{1|2},\mathsf{Q}_{11}^{-1})\) density.
\(\Box\)
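As a quick numerical sanity check (a sketch in R, using an arbitrary randomly generated example), we can compare the precision-based conditional mean and variance above with the familiar covariance-based conditioning formulas \(\pmb{\mu}_1+\mathsf{\Sigma}_{12}\mathsf{\Sigma}_{22}^{-1}(\mathbf{x}_2-\pmb{\mu}_2)\) and \(\mathsf{\Sigma}_{11}-\mathsf{\Sigma}_{12}\mathsf{\Sigma}_{22}^{-1}\mathsf{\Sigma}_{21}\), partitioning \(\mathsf{\Sigma}\) in the same way as \(\mathsf{Q}\):

```r
set.seed(1)
p <- 4; p1 <- 2                          # dimension and partition size (arbitrary)
A <- matrix(rnorm(p*p), p, p)
Sigma <- crossprod(A) + diag(p)          # a random positive definite variance matrix
Q <- solve(Sigma)                        # the corresponding precision matrix
mu <- rnorm(p); x2 <- rnorm(p - p1)      # arbitrary mean and conditioning value
i1 <- 1:p1; i2 <- (p1 + 1):p
# precision-based conditional mean and variance
mu12 <- mu[i1] - solve(Q[i1, i1], Q[i1, i2] %*% (x2 - mu[i2]))
V12  <- solve(Q[i1, i1])
# covariance-based versions, for comparison
mu12c <- mu[i1] + Sigma[i1, i2] %*% solve(Sigma[i2, i2], x2 - mu[i2])
V12c  <- Sigma[i1, i1] - Sigma[i1, i2] %*% solve(Sigma[i2, i2], Sigma[i2, i1])
max(abs(mu12 - mu12c)); max(abs(V12 - V12c))   # both essentially zero
```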
Using this result, it is straightforward to see why zeros of \(\mathsf{Q}\) correspond to pairwise conditional independence statements.
First consider the case \(q_{12}=0\ (=q_{21})\). This is without loss of generality, since we can always re-order the variables to ensure that the zero of interest is in this position. Partition \(\mathbf{X}\) as \[\mathbf{X}=\binom{\mathbf{X}_1}{\mathbf{X}_2},\ \text{ where } \mathbf{X}_1=\binom{X_1}{X_2}\ \text{ and } \mathbf{X}_2=\begin{pmatrix}X_3\\X_4\\\vdots\\X_p\end{pmatrix}.\] Then \(\mathbf{X}_1|\mathbf{X}_2=\mathbf{x}_2\) has variance \(\mathsf{Q}_{11}^{-1}\), which must be diagonal, since \(\mathsf{Q}_{11}\) is. Now this means that \(X_1\) and \(X_2\) are conditionally uncorrelated, but for the MVN, uncorrelated and independent are equivalent. Consequently \(X_1{\perp\hspace{-0.8ex}\perp}X_2|{(X_3,\ldots,X_p)}\), and so \(q_{12}=0\) leads directly to this CI statement. In general, we have the following result.
Proposition

If \(\mathbf{X}\sim \mathcal{N}(\pmb{\mu},\mathsf{Q}^{-1})\), then for \(i\not= j\) we have \[q_{ij}=0\ (=q_{ji}) \Leftrightarrow X_i{\perp\hspace{-0.8ex}\perp}X_j|{\mathbf{X}\backslash\{X_i,X_j\}}.\]
Therefore, there is a direct correspondence between the zero structure of \(\mathsf{Q}\) and an undirected graph with the pairwise Markov property (P), where zeros in \(\mathsf{Q}\) correspond to missing edges in the graph. Further, provided that \(\mathsf{Q}\) (or equivalently, the variance matrix, \(\mathsf{\Sigma}\)) is strictly positive definite, the densities are all positive, and we can conclude that all of our properties (F), (G), (L) and (P) are satisfied by the associated graph.
We know that the MVN has the important and interesting property that \(\mathbb{V}\mathrm{ar}[\mathbf{X}_1|\mathbf{X}_2=\mathbf{x}_2]\) does not depend on the observed value of \(\mathbf{x}_2\). This allows us to consider the conditional variance matrix without reference to the observed values of the variables we are conditioning on. We define the partial (co)variance matrix for \(\mathbf{X}_1\) (given \(\mathbf{X}_2\)) to be the conditional covariance matrix \(\mathsf{Q}_{11}^{-1}\). In particular, in the case \(\mathbf{X}_1=(X_1,X_2)^\top\), we have \[\mathsf{Q}_{11}^{-1}=\begin{pmatrix}q_{11}&q_{12}\\q_{12}&q_{22}\end{pmatrix}^{-1} = \frac{1}{q_{11}q_{22}-q_{12}^2}\begin{pmatrix}q_{22}&-q_{12}\\-q_{12}&q_{11}\end{pmatrix},\] and so we define the partial covariance between \(X_1\) and \(X_2\) to be \(-q_{12}/(q_{11}q_{22}-q_{12}^2)\). The corresponding partial correlation is therefore given by \(-q_{12}/\sqrt{q_{11}q_{22}}\). In general, the partial correlation between \(X_i\) and \(X_j\), for \(i\not= j\) is given by \[\frac{-q_{ij}}{\sqrt{q_{ii}q_{jj}}}.\] Consequently, we can define the partial correlation matrix to be the matrix of partial correlations, and this is clearly given by \[\operatorname{PCorr}(\mathbf{X})=2\mathbb{I}-\mathsf{D}^{-1/2}\mathsf{Q}\mathsf{D}^{-1/2},\ \text{ where } \mathsf{D}=\operatorname{diag}\{q_{11},\ldots,q_{pp}\}.\] Note that in the typical case of positive definite \(\mathsf{Q}\), the zeros of \(\mathsf{Q}\) match those of \(\operatorname{PCorr}(\mathbf{X})\). Consequently, we can understand the CI structure of a GGM by studying the zero structure of either the precision matrix or the partial correlation matrix.
Suppose that \(\mathbf{X}=(X_1,X_2,X_3)^\top\sim \mathcal{N}(\mathbf{0},\mathsf{\Sigma})\), where \[\mathsf{\Sigma}=\begin{pmatrix}3&-2&1\\-2&4&-2\\1&-2&3\end{pmatrix}.\] Are there any conditional independence relations between the variables?
Note that as there are no zeros in the variance matrix, \(\mathsf{\Sigma}\), there are no marginal independence relations between the variables. To look for conditional independence relations, we can compute the corresponding partial correlation matrix. We could do this by hand, by first using Gaussian elimination to compute the precision matrix, but it will be easier to use R.
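For example, a minimal sketch in R (constructing \(\mathsf{\Sigma}\) directly and using the built-in solve function to invert it):

```r
Sigma <- matrix(c( 3, -2,  1,
                  -2,  4, -2,
                   1, -2,  3), ncol = 3)
Q <- solve(Sigma)            # the precision matrix
Q
D <- diag(1/sqrt(diag(Q)))   # D^{-1/2}
2*diag(3) - D %*% Q %*% D    # the partial correlation matrix
```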
[,1] [,2] [,3]
[1,] 0.50 0.25 0.00
[2,] 0.25 0.50 0.25
[3,] 0.00 0.25 0.50
[,1] [,2] [,3]
[1,] 1.0 -0.5 0.0
[2,] -0.5 1.0 -0.5
[3,] 0.0 -0.5 1.0
We now see that the partial correlations in positions (1,3) and (3,1) are zero, and this implies the pairwise Markov property, \(X_1{\perp\hspace{-0.8ex}\perp}X_3|{X_2}\). Thus, the associated undirected graph will be a chain with \(X_2\) separating \(X_1\) and \(X_3\).
A Markov random field is simply an undirected graphical model where the variables represent the value of some process at a site in space. There are many ways to specify a MRF. Often the local Markov property (L) will be exploited. For each site, a set of neighbours can be considered, such that the site is judged to be conditionally independent of all other sites given the neighbours. In this case, the site will have edges with all of its neighbours, but no others. This defines the graphical structure of the model. For example, for areal data, the neighbours may be the sites representing the areas that have a common border with the area represented by the site of interest. Relatedly, on a regular square lattice in two dimensions, the neighbourhood could be the four nearest neighbours (up, down, left, right). Once the graphical structure is established, specifying the full joint distribution of the field at all sites requires additional work.
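To illustrate the neighbourhood structure, the 4-nearest-neighbour adjacency on a small regular grid can be constructed directly from the site coordinates. A minimal sketch in R (the grid size is arbitrary):

```r
m <- 4
sites <- expand.grid(x = 1:m, y = 1:m)              # coordinates of the m^2 sites
D <- as.matrix(dist(sites, method = "manhattan"))   # city-block distances between sites
A <- 1*(D == 1)                                     # adjacency: neighbours are at distance 1
table(rowSums(A))   # corner, edge and interior sites have 2, 3 and 4 neighbours, respectively
```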
In the context of time series modelling, we used the direction of time to specify the distribution of the current time point given information about the past. In spatial statistics we do not have a natural ordering of the sites, so this kind of sequential specification of the joint distribution is not natural or convenient. Nevertheless, we would like to be able to specify the joint distribution of the field in terms of some “local” probability distributions. With this in mind, we define the local characteristic at site \(\mathbf{s}_i\in\mathcal{D}\) to be the full conditional distribution, \[ Z_i|\mathbf{Z}_{-i} = \mathbf{z}_{-i}. \] It is a very interesting and important question as to whether a set of local characteristics fully and uniquely determines a joint distribution, or even whether a given set of local characteristics is consistent with any joint distribution. We will return to this question in Chapter 15. For now, we simply observe that for a MRF satisfying the local Markov property (L) these local characteristics simplify greatly to \[ Z_i|\operatorname{bd}(Z_i), \] and that this simplification is central to the practical utility of local characteristics.
A GMRF is a Markov random field whose joint distribution is multivariate normal (or Gaussian). Put another way, it is a Gaussian graphical model (GGM) where the variables represent the values of some process at points (sites) in space. Exactly how to go about specifying such models in an appropriate way (using local characteristics) is the subject of Chapter 15.
The AR(1) process is a simple example of a 1D GMRF. We know that its joint density factorises as \[ f(\mathbf{x}) = \mathcal{N}\left(x_0;0,\frac{\sigma^2}{1-\alpha^2}\right)\prod_{t=1}^n \mathcal{N}(x_t;\alpha x_{t-1}, \sigma^2). \] It therefore has the factorisation property (F) wrt the graph with nodes \(X_t\) and cliques/edges of the form \(\{X_{t-1}, X_{t}\}\). Since it has the property (F), we know it also has properties (G), (L), and (P). Since this GMRF has property (P), we know that the precision matrix will be sparse. Here, since only adjacent nodes are connected, the precision matrix will be tridiagonal.
The local characteristic at time \(t\) is the full conditional \((X_t|\mathbf{X}_{-t}=\mathbf{x}_{-t})\), but due to the local Markov property (L), this is just \((X_t|x_{t-1},x_{t+1})\). Considering the factorisation of the joint density and the fact that any terms not involving \(x_{t}\) are just multiplicative constants, we write the density for the local characteristic as \[ f_t(x_t|x_{t-1},x_{t+1}) \propto \mathcal{N}(x_t;\alpha x_{t-1},\sigma^2)\mathcal{N}(x_{t+1};\alpha x_t, \sigma^2). \] We can simplify this algebraically, or use normal theory to get \[ f_t(x_t|x_{t-1},x_{t+1}) = \mathcal{N}\left(x_t; \frac{\alpha}{\alpha^2+1}(x_{t-1}+x_{t+1}), \frac{\sigma^2}{\alpha^2+1}\right). \] Note that the conditional variances immediately determine the diagonal elements of the precision matrix. So here, \[ q_{ii} = \frac{\alpha^2+1}{\sigma^2}, \quad i=1,2,\ldots,n-1. \] We will consider the off-diagonal elements in the next chapter.
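We can check this numerically by inverting the stationary AR(1) covariance matrix, \(\mathbb{C}\mathrm{ov}[X_s,X_t]=\sigma^2\alpha^{|s-t|}/(1-\alpha^2)\). A minimal sketch in R (the parameter values are arbitrary):

```r
alpha <- 0.7; sigma <- 1.5; n <- 5
times <- 0:n
Sigma <- sigma^2 * alpha^abs(outer(times, times, "-")) / (1 - alpha^2)  # stationary covariance
Q <- solve(Sigma)              # the precision matrix
round(Q, 6)                    # tridiagonal, as expected
(alpha^2 + 1)/sigma^2          # matches the interior diagonal elements of Q
```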
A random \(n\)-dimensional vector \(\mathbf{X}\) is said to have a Gibbs distribution if its joint density can be written in the form \[ f(\mathbf{x}) = \frac{1}{z} \exp\{-V(\mathbf{x})\}, \] for some appropriate \(V:\mathbb{R}^n\rightarrow \mathbb{R}\), known as the potential or energy function. The constant \(z\) is the normalising constant of the distribution, and is sometimes known as the evidence, or marginal likelihood, or (in physics) partition function, depending on the context.
It is clear that provided a given joint density is positive (\(f(\mathbf{x})>0\ \forall\mathbf{x}\in\mathbb{R}^n\)), then it can be regarded as a Gibbs distribution with \(V(\mathbf{x}) = -\log[f(\mathbf{x})]\). In this case we will have \(z=1\), but if the joint density is known only up to a multiplicative constant, then the same potential can be used, but the normalising constant \(z\) will need to be determined by integration or other means. If the random vector \(\mathbf{X}\) represents the value of a field at a lattice of spatial locations, then the joint distribution is considered to be a Gibbs random field.
A Gibbs random field that is also a MRF will satisfy the factorisation property (F) wrt some graph with clique set \(\mathcal{C}\). In that case we can write the potential function as the sum \[ V(\mathbf{x}) = \sum_{c\in\mathcal{C}} \psi_c(\mathbf{x}_c), \] where \[ \psi_c(\mathbf{x}_c) = -\log\phi_c(\mathbf{x}_c),\quad c\in\mathcal{C} \] are known as the clique potentials. GMRFs are Gibbs random fields with quadratic clique potentials.
For our AR(1) example, the (unnormalised) clique potentials are \[ \psi_{\{t-1,t\}}(x_{t-1},x_t) = \frac{(x_t - \alpha x_{t-1})^2}{2\sigma^2}. \]
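To illustrate (a sketch in R with arbitrary parameter values, absorbing the \(x_0\) term \(x_0^2(1-\alpha^2)/(2\sigma^2)\) into the first clique potential), we can check that the sum of the clique potentials differs from the energy \(V(\mathbf{x})=-\log f(\mathbf{x})\) only by a constant, the log normalising constant \(\log z\):

```r
alpha <- 0.6; sigma <- 1.0; n <- 8
energy <- function(x) {                             # V(x) = -log f(x), including constants
  -(dnorm(x[1], 0, sigma/sqrt(1 - alpha^2), log = TRUE) +
      sum(dnorm(x[-1], alpha*x[-(n + 1)], sigma, log = TRUE)))
}
potentials <- function(x) {                         # sum of quadratic clique potentials,
  sum((x[-1] - alpha*x[-(n + 1)])^2)/(2*sigma^2) +  # with the x_0 term absorbed into
    x[1]^2*(1 - alpha^2)/(2*sigma^2)                # the first clique potential
}
x1 <- rnorm(n + 1); x2 <- rnorm(n + 1)              # two arbitrary configurations
energy(x1) - potentials(x1)   # the same value (log z) for any configuration,
energy(x2) - potentials(x2)   # so f(x) = exp{-sum of potentials}/z
```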
For a random field defined on the infinite regular square lattice \(\mathcal{D} = \mathbb{Z}^n\) (\(n=1,2\) or \(3\)), the concept of stationarity makes sense: the joint distribution must be invariant under an arbitrary shift \(\mathbf{h}\in\mathbb{Z}^n\). For \(n=1\) we have a situation exactly equivalent to stationary time series models, and we have already seen in Chapter 5 how spectral methods can be useful for characterising allowable covariance functions. For \(n=2\) we need to extend the idea of the DTFT to 2d space. Having seen how the (continuous) Fourier transform and discrete Fourier transform generalise, it should be clear how to generalise the DTFT. For a function \(f:\mathbb{Z}^2\rightarrow\mathbb{C}\), we define its Fourier transform \(\hat{f}:[-\frac{1}{2},\frac{1}{2}]^2\rightarrow\mathbb{C}\) by \[ \hat{f}(\nu_x,\nu_y) = \sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty f(j,k)e^{-2\pi\mathrm{i}(\nu_x j + \nu_y k)}, \] and this can be inverted with \[ f(j, k) = \int_{-\frac{1}{2}}^\frac{1}{2} \mathrm{d}\nu_x \int_{-\frac{1}{2}}^\frac{1}{2} \mathrm{d}\nu_y\, \hat{f}(\nu_x, \nu_y) e^{2\pi\mathrm{i}(\nu_x j + \nu_y k)}. \] We need a convention for how indices map to positions in space. Here we have chosen to keep our coordinates consistent with the continuous analogue and use \(f(j, k)\) to mean the value of the field in column \(j\) and row \(k\), with columns moving to the right with increasing \(j\) and rows moving up with increasing \(k\). Note that this is obviously different to the usual convention for indexing rows and columns in a rectangular matrix. Note that we can adopt any convention we like as long as we understand it, interpret it correctly, and are consistent.
We can consider the lattice process induced by a weighted white noise process, and following similar arguments to those we have already seen, we find that the induced process is stationary and Gaussian with covariance function \[ C(\mathbf{h}) = C(j, k) = \mathbb{C}\mathrm{ov}[Z(l,m), Z(l+j, m+k)] \] given by \[ C(j, k) = \int_{-\frac{1}{2}}^\frac{1}{2} \mathrm{d}\nu_x \int_{-\frac{1}{2}}^\frac{1}{2} \mathrm{d}\nu_y\, S(\nu_x, \nu_y) e^{2\pi\mathrm{i}(\nu_x j + \nu_y k)}, \] where \(S\) is a (real, non-negative) spectral density function defined on the unit square. This is an inverse FT, and so the spectral density can be determined from the covariogram via \[ S(\nu_x, \nu_y) = \sum_{j=-\infty}^\infty \sum_{k=-\infty}^\infty C(j,k)e^{-2\pi\mathrm{i}(\nu_x j + \nu_y k)}. \] Any reasonable (integrable) real non-negative spectral density function induces a valid stationary covariogram on the integer lattice. In this context of lattice random fields, this variant of Bochner’s theorem is often referred to as Herglotz’s theorem.
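For example, a minimal sketch in R using a made-up separable spectral density, chosen so that the induced covariogram is easy to verify (\(C(0,0)=1\), \(C(\pm 1,0)=C(0,\pm 1)=1/2\), \(C(\pm 1,\pm 1)=1/4\), and zero otherwise):

```r
S <- function(nx, ny) (1 + cos(2*pi*nx)) * (1 + cos(2*pi*ny))  # non-negative on the unit square
Cjk <- function(j, k, N = 256) {
  # approximate the inverse FT by a midpoint rule on an N x N grid;
  # S is even in each argument, so only the cosine part contributes
  nu <- (seq_len(N) - 0.5)/N - 0.5
  mean(outer(nu, nu, function(nx, ny) S(nx, ny)*cos(2*pi*(nx*j + ny*k))))
}
round(c(Cjk(0, 0), Cjk(1, 0), Cjk(0, 1), Cjk(1, 1), Cjk(2, 0)), 4)  # 1, 0.5, 0.5, 0.25, 0
```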