MATH4341: Spatio-temporal statistics

Term 2: Spatial modelling and analysis

Computer Lab 8: Inference for lattice models

Load up RStudio via your favourite method. If you can’t use RStudio for some reason, most of what we do in these labs can probably be done using WebR, but the instructions will assume RStudio. You will also want to have the web version of the lecture notes to refer to.

Question 1

If you haven’t already done so, reproduce all of the plots and computations from Chapter 16 of the lecture notes by copying and pasting each code block in turn. Make sure you understand what each block is doing and that you can reproduce all of the plots.

Question 2

Today we will again be using the New York data, spData::nydata. We will focus on the actual variable of interest, Z, a transformed incidence rate of leukemia in the population. First, load some libraries (spData, sf, and spdep) and look at the data with head(nydata).

         AREANAME     AREAKEY        X        Y POP8 TRACTCAS  PROPCAS
1 Binghamton city 36007000100 4.069397 -67.3533 3540     3.08 0.000870
2 Binghamton city 36007000200 4.639371 -66.8619 3560     4.08 0.001146
3 Binghamton city 36007000300 5.709063 -66.9775 3739     1.09 0.000292
4 Binghamton city 36007000400 7.613831 -65.9958 2784     1.07 0.000384
5 Binghamton city 36007000500 7.315968 -67.3183 2571     3.06 0.001190
6 Binghamton city 36007000600 8.558753 -66.9344 2729     1.06 0.000388
  PCTOWNHOME PCTAGE65P        Z  AVGIDIST PEXPOSURE
1  0.3277311 0.1466102  0.14197 0.2373852  3.167099
2  0.4268293 0.2351124  0.35555 0.2087413  3.038511
3  0.3377396 0.1380048 -0.58165 0.1708548  2.838229
4  0.4616048 0.1188937 -0.29634 0.1406045  2.643366
5  0.1924370 0.1415791  0.45689 0.1577753  2.758587
6  0.3651786 0.1410773 -0.28123 0.1726033  2.848411
hist(nydata$Z, col=4, main="Transformed leukemia rate, Z")

But for a spatial analysis, we would like the full geometry.

nyd = st_read(system.file("shapes/NY8_bna_utm18.gpkg",
                          package="spData"))
Reading layer `sf_bna2_utm18' from data source 
  `/home/darren/R/x86_64-pc-linux-gnu-library/4.5/spData/shapes/NY8_bna_utm18.gpkg' 
  using driver `GPKG'
Simple feature collection with 281 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 357628 ymin: 4649538 xmax: 480360.3 ymax: 4808317
Projected CRS: UTM Zone 18, Northern Hemisphere
plot(nyd[,"Z"])

Using the geometry, we can create an adjacency graph.

nyd_nb = spdep::poly2nb(nyd)
nyd_nb
Neighbour list object:
Number of regions: 281 
Number of nonzero links: 1632 
Percentage nonzero weights: 2.066843 
Average number of links: 5.807829 
n = length(nyd_nb)
n
[1] 281
nyd_nb[1:3]
[[1]]
[1]  2 13 14 15 47 48 49 50

[[2]]
[1]  1  3 13 35 47 48

[[3]]
[1]  2  4  5 12 13 35
plot(st_geometry(nyd), border="lightgrey")
plot(nyd_nb, sf::st_geometry(nyd), pch=19, cex=0.7,
     add=TRUE, col="blue")

If you have a StadiaMaps API key, you can view the spatial context of the data using the ggmap package. If you don’t have an API key, skip the next two code blocks; they are not required for what follows. We can start with a plain map of the region.

nyd_ll = st_transform(nyd, "EPSG:4326") # long and lat
nyd_bb = st_bbox(nyd_ll)
names(nyd_bb) = c("left", "bottom", "right", "top")
nyd_map = get_map(nyd_bb, maptype="stamen_terrain", source="stadia")
ggmap(nyd_map) +
    labs(x="Long", y="Lat")

Then we can overlay the data geometry.

ggmap(nyd_map) +
    geom_sf(data = sf::st_geometry(nyd_ll), fill=NA,
            linewidth=0.5, inherit.aes=FALSE) +
    labs(x="Long", y="Lat")

So we see that the data relate to part of New York State (rather than New York City), and that the biggest city in the region covered by the data is Syracuse, NY, which also seems to be associated with the highest rates of leukemia.

Let us now investigate spatial modelling of the leukemia rate, Z.

Task A

Use Moran’s test and Geary’s test to check whether there is spatial association in Z.
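One possible approach is sketched below, assuming the nyd and nyd_nb objects built above are still in your workspace (spdep provides both tests).

```r
library(spdep)

# Spatial weights list from the adjacency graph built earlier
lw = nb2listw(nyd_nb)

# Small p-values indicate spatial association in Z
moran.test(nyd$Z, lw)
geary.test(nyd$Z, lw)
```

Recall that Moran’s I signals positive association with values above its (small, negative) expectation, whereas Geary’s C signals positive association with values below 1.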

Hopefully you detected the presence of spatial association. Let’s begin with some spatial smoothing of the data. We need a spatial prior, and the only examples we know much about are SAR and CAR. Let’s start with an equally weighted SAR model. First set up an appropriate normalised adjacency matrix.

num = sapply(nyd_nb, length)
head(num)
[1] 8 6 6 4 6 6
length(num)
[1] 281
adjW = nb2mat(nyd_nb)
isSymmetric(adjW)
[1] FALSE
adjW[1:5,1:5]
          1         2         3         4         5
1 0.0000000 0.1250000 0.0000000 0.0000000 0.0000000
2 0.1666667 0.0000000 0.1666667 0.0000000 0.0000000
3 0.0000000 0.1666667 0.0000000 0.1666667 0.1666667
4 0.0000000 0.0000000 0.2500000 0.0000000 0.2500000
5 0.0000000 0.0000000 0.1666667 0.1666667 0.0000000
summary(rowSums(adjW))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

We will use this to construct a SAR prior with a spatial dependence parameter of 0.999 and a residual standard deviation of 0.2. Note that we could make the residual variance depend on the number of neighbours, as we must for CAR, but since we don’t have to for SAR, let’s not bother.

sigma = 0.2 # residual standard deviation
rho = 0.999 # spatial dependence parameter (< 1 for diagonal dominance)
B = adjW*rho # sensible SAR weight matrix
Q = (1/(sigma^2))*((diag(n) - t(B)) %*% (diag(n) - B)) # SAR precision
Q[1:5, 1:5]
           1          2         3          4          5
1 32.8657641 -5.0841336  1.386112  0.0000000  0.6930562
2 -5.0841336 28.4893404 -7.425745  0.8992548  1.3861125
3  1.3861125 -7.4257452 29.537800 -9.5069952 -5.3795109
4  0.0000000  0.8992548 -9.506995 27.2853673 -9.0201375
5  0.6930562  1.3861125 -5.379511 -9.0201375 30.3296026
image(t(Q)[,nrow(Q):1], col=topo.colors(100))

One advantage of SAR models over CAR is that we get a symmetric positive semi-definite precision matrix for free, for any choice of weight matrix B. Before we start smoothing, we can generate a sample from this prior in order to check that everything seems to be in order.
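To see why the symmetry comes for free: for any weight matrix \(B\) whatsoever, \((I-B)^\top(I-B)\) is symmetric and positive semi-definite, being a matrix multiplied by its own transpose. A quick base-R check with a random, asymmetric B:

```r
set.seed(1)
B = matrix(runif(16, -0.2, 0.2), 4, 4)  # arbitrary, asymmetric weight matrix
Qdemo = t(diag(4) - B) %*% (diag(4) - B)
isSymmetric(Qdemo)            # TRUE, for any choice of B
min(eigen(Qdemo)$values) > 0  # TRUE here: I - B is non-singular, so Qdemo is PD
```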

Lt = chol(Q)
image(t(Lt)[,nrow(Lt):1], col=topo.colors(100))
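The sampling step below relies on a standard fact: if \(Q = L_t^\top L_t\) with \(L_t\) upper triangular (as returned by chol), then \(z = L_t^{-1}e\) with \(e \sim N(0, I)\) has variance \(L_t^{-1}L_t^{-\top} = Q^{-1}\), i.e. precision \(Q\). A small empirical check in base R:

```r
set.seed(42)
Qd = matrix(c(2, -1, -1, 2), 2, 2)  # small SPD precision matrix
Ltd = chol(Qd)                      # Qd = t(Ltd) %*% Ltd
# Each backsolve gives one draw with precision Qd
zs = t(sapply(1:10000, function(i) backsolve(Ltd, rnorm(2))))
round(solve(cov(zs)), 1)            # empirical precision, close to Qd
```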

set.seed(3)
z = backsolve(Lt, rnorm(n))
nyd$sar = z
plot(nyd[,"sar"])

This all looks fine, and the fill-in we get for the Cholesky factor doesn’t seem too bad. So now let’s spatially smooth.

Q_star = Q + diag(n) # obs precision 1 since z scores
b_star = nyd$Z # mean 0 and obs precision 1 since z scores
mu_star = solve(Q_star, b_star)
nyd$sar_s = mu_star
plot(nyd[,"sar_s"])

hist(nyd$sar_s, col=4)

This appears to show a sensible spatial smoothing, suggestive of elevated levels around the more densely populated areas. The histogram of smoothed values indicates that the overall shrinkage has been relatively modest, suggesting that most of the shrinkage has been spatially localised. We can investigate further by comparing the smoothed values against the raw values.

plot(nyd$Z, nyd$sar_s, col=4, pch=19)
abline(0,1, col=3, lwd=2)
abline(0,0, col=2, lwd=1.5)
abline(v=0, col=2, lwd=1.5)

This confirms some overall shrinkage, but that the broad pattern has been preserved.
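The smoothing above is just Gaussian conditioning written in precision (information) form: the posterior precision is \(Q^\star = Q + \Lambda\) and the posterior mean solves \(Q^\star \mu^\star = Q\mu + \Lambda y\), which with prior mean \(\mu = 0\) and \(\Lambda = I\) gives the Q_star and b_star used earlier. A small base-R check that this agrees with the more familiar covariance-form posterior mean \(\mu + \Sigma(\Sigma + \Lambda^{-1})^{-1}(y - \mu)\), where \(\Sigma = Q^{-1}\):

```r
Qd = matrix(c(3, -1, -1, 2), 2, 2)  # prior precision
Lam = diag(c(1, 4))                 # observation precision
mu = c(0.2, -0.1); y = c(1.5, -0.5)
# precision (information) form, as used for the smoothing above
mu_star = solve(Qd + Lam, Qd %*% mu + Lam %*% y)
# covariance form, for comparison
S = solve(Qd)
mu_cov = mu + S %*% solve(S + solve(Lam), y - mu)
max(abs(mu_star - mu_cov))          # effectively zero
```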

In this analysis we guessed at a residual standard deviation of 0.2, and just assumed a conditional observation precision of 1 (due to Z being normalised), but neither of these assumptions is necessarily optimal. We could attempt to learn them both from the data.

# MVN density, parameterised by mean and precision
dmvnormp = function(x, mu, Q, log=TRUE) {
    Lt = chol(Q)
    ld = -0.5*length(mu)*log(2*pi) + sum(log(diag(Lt))) -
        0.5*sum((x-mu)*(Q %*% (x-mu)))
    if (log)
        ld
    else
        exp(ld)
}

# marginal log-likelihood of the data, y, for a noisily observed GMRF
gmrfll = function(mu, Q, F, Lambda, y) {
    Q_star = Q + t(F) %*% (Lambda %*% F)
    b_star = Q %*% mu + t(F) %*% (Lambda %*% y)
    mu_star = solve(Q_star, b_star)
    dmvnormp(mu, mu, Q) + dmvnormp(y, F %*% mu, Lambda) -
        dmvnormp(mu, mu_star, Q_star)
}

Qstd = ((diag(n) - t(B)) %*% (diag(n) - B)) # precision without variance scaling

nyd_ll = function(lpar) {
    par = exp(lpar); sigma = par[1]; lambda = par[2]
    gmrfll(rep(0, n), Qstd/(sigma^2), diag(n), lambda*diag(n), nyd$Z)
}

nyd_ll(c(0.2, 1))
[1] -437.5948
opt = optim(log(c(0.2, 1)), nyd_ll, control=list(fnscale=-1))
pars = exp(opt$par)
pars
[1] 0.09336081 2.31632205

Maximum likelihood gives us some optimised parameters, which we can use to re-do the smoothing.

sigma = pars[1]; lambda = pars[2]
Q = Qstd/(sigma^2)
Q_star = Q + lambda*diag(n)
b_star = rep(lambda, n)*nyd$Z
mu_star = solve(Q_star, b_star)
nyd$sar_s2 = mu_star
plot(nyd[,"sar_s2"])

There are minor differences to the previous smoothing, but the overall picture remains similar.

Task B

Conduct a spatial smoothing using an equally weighted CAR prior rather than SAR. Use maximum likelihood to estimate the CAR conditional variance parameter, \(\kappa\) (kappa), and the observational precision, \(\lambda\) (lambda).
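As a hint, here is one common parameterisation of an equally weighted CAR prior (an assumption — match the exact form to the lecture notes): \(Z_i \mid Z_{-i} \sim N(\alpha \bar{Z}_{\partial i}, \kappa/n_i)\), i.e. the conditional mean is \(\alpha\) times the mean of the neighbouring values, giving precision \(Q = (D - \alpha A)/\kappa\), where \(A\) is the 0/1 adjacency matrix and \(D = \mathrm{diag}(n_i)\) holds the neighbour counts. A toy base-R illustration on a 4-node cycle graph:

```r
# Toy 4-node cycle graph (1-2-3-4-1); not the New York data
A = matrix(0, 4, 4)
A[cbind(c(1,2,2,3,3,4,4,1), c(2,1,3,2,4,3,1,4))] = 1
D = diag(rowSums(A))          # neighbour counts on the diagonal
alpha = 0.9                   # |alpha| < 1 gives a proper CAR
kappa = 0.04
Q_car = (D - alpha*A)/kappa
isSymmetric(Q_car)            # TRUE: symmetric by construction
all(eigen(Q_car)$values > 0)  # TRUE: a valid (positive definite) precision
```

For the task itself you can build the binary adjacency with spdep::nb2mat(nyd_nb, style="B"), then maximise over \(\kappa\) and \(\lambda\) using gmrfll and optim exactly as in the SAR analysis above.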

Question 3

Let us now investigate spatial regression for the variable of interest, Z. Let’s start with a simple spatial smoothing (with SAR residuals).

library(spatialreg)
mod = spautolm(Z ~ 1,
               listw = nb2listw(nyd_nb), data = nyd)
summary(mod)

Call: spautolm(formula = Z ~ 1, data = nyd, listw = nb2listw(nyd_nb))

Residuals:
      Min        1Q    Median        3Q       Max 
-1.659136 -0.433952 -0.065391  0.434972  4.666104 

Coefficients: 
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.206410   0.066981 -3.0816 0.002059

Lambda: 0.38776 LR test value: 22.793 p-value: 1.8042e-06 
Numerical Hessian standard error of lambda: 0.076231 

Log likelihood: -297.4879 
ML residual variance (sigma squared): 0.47256, (sigma: 0.68743)
Number of observations: 281 
Number of parameters estimated: 3 
AIC: 600.98
nyd$sr_s = fitted(mod)
plot(nyd[,"sr_s"])

Task C

Identify an appropriate spatial regression model for this data. The potentially interesting and useful covariates are PCTOWNHOME, PCTAGE65P, and PEXPOSURE. See ?nydata for (slightly) further details. When you identify a model that you are happy with, plot both the fitted values and the residuals in their spatial context. What does it all mean?
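One possible starting point (a sketch, not necessarily the final model, and assuming the nyd and nyd_nb objects from above) is to fit all three covariates and then prune:

```r
library(spatialreg)
library(spdep)

# Full model with all three candidate covariates and SAR errors
mod2 = spautolm(Z ~ PEXPOSURE + PCTOWNHOME + PCTAGE65P,
                listw = nb2listw(nyd_nb), data = nyd)
summary(mod2)

# Fitted values and residuals in their spatial context
nyd$fit = fitted(mod2)
nyd$res = residuals(mod2)
plot(nyd[,"fit"])
plot(nyd[,"res"])
```

Drop covariates that do not appear useful, refit, and compare (e.g. via AIC), before settling on a model.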

Question 4 (for keen beans)

Task D

Re-do the CAR-based spatial smoothing from Task B, but using sparse matrices and sparse matrix operations from the Matrix package. Use the estimated variance parameters that you have already obtained.
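As a starting point, here is the general shape of the sparse machinery on a tiny synthetic path graph (the Matrix package ships with R; for the task, build the CAR precision from nyd_nb and your estimated parameters, either constructing it sparsely as below or wrapping dense pieces with Matrix()). The parameter values here are placeholders, not the estimates from Task B.

```r
library(Matrix)

# Sparse 0/1 adjacency of a path graph on 6 nodes (i ~ i+1)
A = sparseMatrix(i = c(1:5, 2:6), j = c(2:6, 1:5), x = 1, dims = c(6, 6))
D = Diagonal(x = rowSums(A))       # sparse diagonal of neighbour counts
alpha = 0.9; kappa = 0.04; lambda = 1
Q = (D - alpha*A)/kappa            # sparse CAR-style precision

set.seed(1)
y = rnorm(6)                       # stand-in for the data vector
# forceSymmetric stores Q_star as a symmetric sparse matrix, so that
# solve dispatches to sparse Cholesky methods
Q_star = forceSymmetric(Q + lambda*Diagonal(6))
mu_star = solve(Q_star, lambda*y)  # sparse smoothing solve
```

The payoff from sparse Cholesky factorisation is modest at n = 281, but grows rapidly for larger lattices.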