8  State space modelling

8.1 Introduction

In Chapter 7 we introduced the DLM and showed how a model of DLM form could be used for filtering, smoothing, forecasting, etc. However, we haven’t yet discussed how to go about “building” an appropriate DLM model (in particular, specifying \(\mathsf{G}_t, \mathsf{F}_t, \mathsf{W}_t, \mathsf{V}_t\)) for a given time series. That is the topic of this chapter. We will use the dlm R package to illustrate the concepts. We will start by looking at some simple models for long-term trends, then consider seasonal effects, then the problem of combining trends and seasonal effects, before going on to think about the incorporation of ARMA components. We will focus mainly on (time invariant models for) univariate time series, but the principles are quite similar for multivariate time series. Our main running example will be (the log of) J&J earnings, considered briefly in Chapter 1.

library(astsa)
library(dlm)
y = log(jj)
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")

8.2 Polynomial trend

8.2.1 Locally constant model

In the previous chapter we saw that the choice \(\mathsf{G}=1\), \(\mathsf{F}=1\) corresponds to noisy observation of a Gaussian random walk. This random walk model for the underlying (hidden) state is known as the locally constant model in the context of DLMs, and sometimes (for reasons that will become clear) as the first order polynomial trend model. Such a model is clearly inappropriate for the J&J example, since the trend appears to be (at least locally) linear.
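
Although we will not need it for the J&J series, such a model is easily constructed in the dlm package using dlmModPoly with order 1 (the variances here are purely illustrative).

mod0 = dlmModPoly(1, dV=0.1^2, dW=0.01^2) # illustrative variances
mod0$GG # G = 1
mod0$FF # F = 1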

8.2.2 Locally linear model

We can model a time series with a locally linear trend by choosing \(\mathsf{F}=(1, 0)\), \[ \mathsf{G} = \begin{pmatrix}1&1\\0&1\end{pmatrix},\quad \mathsf{W}=\begin{pmatrix}w_1&0\\0&w_2\end{pmatrix}. \] This is known as the locally linear model, or sometimes as the second order polynomial trend model. To see why this corresponds to a linear trend, it is perhaps helpful to write the hidden state vector as \(\mathbf{X}_t=(\mu_t, \tau_t)^\top\). So then \(y_t\) is clearly just a noisy observation of \(\mu_t\) (the current level), and \(\mu_t=\mu_{t-1}+\tau_{t-1}+\omega_{1,t}\), so the current level increases by a systematic amount \(\tau_{t-1}\) (the trend), and is also corrupted by noise. On the other hand, \(\tau_t=\tau_{t-1}+\omega_{2,t}\) is just a random walk (typically with very small variance, so that the trend changes very slowly).

We can create such a model using the dlm::dlmModPoly function, by providing the order required (2 for locally linear), together with the diagonals of \(\mathsf{V}\) and \(\mathsf{W}\) (off-diagonal elements are assumed to be zero).

mod = dlmModPoly(2, dV=0.1^2, dW=c(0.01^2, 0.01^2))
mod
$FF
     [,1] [,2]
[1,]    1    0

$V
     [,1]
[1,] 0.01

$GG
     [,1] [,2]
[1,]    1    1
[2,]    0    1

$W
      [,1]  [,2]
[1,] 1e-04 0e+00
[2,] 0e+00 1e-04

$m0
[1] 0 0

$C0
      [,1]  [,2]
[1,] 1e+07 0e+00
[2,] 0e+00 1e+07

ss = dlmSmooth(y, mod)
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")
lines(ss$s[,1], col=2, lwd=2)

This captures the trend quite well, but not the strong seasonal effect. We will return to this issue shortly.

8.2.3 Higher-order polynomial models

This strategy extends straightforwardly to higher order polynomials. For example, a locally quadratic (order 3) model can be constructed using \(\mathsf{F}=(1,0,0)\) and \[ \mathsf{G} = \begin{pmatrix}1&1&0\\0&1&1\\0&0&1\end{pmatrix}. \] A diagonal \(\mathsf{W}\) is typically adopted, sometimes with some of the diagonal elements set to zero. Models beyond order 3 are rarely used in practice.
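
For example, a locally quadratic model could be constructed as follows (again, the variances are purely illustrative, with the final evolution variance set to zero).

mod3 = dlmModPoly(3, dV=0.1^2, dW=c(0.01^2, 0.01^2, 0)) # illustrative variances
mod3$GG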

8.3 Seasonal models

Many time series exhibit periodic behaviour with known period. There are two commonly used approaches to modelling such seasonal behaviour with DLMs. Both involve using a cyclic matrix, \(\mathsf{G}\), with period \(s\) (eg. \(s=12\) for monthly data), so that \(\mathsf{G}^s=\mathbb{I}\). We will look briefly at both.

8.3.1 Seasonal effects

Arguably the most straightforward way to model seasonality is to explicitly model \(s\) separate seasonal effects, and rotate and evolve them, as required. So, for a purely seasonal model, call the \(s\) seasonal effects \(\alpha_1,\alpha_2,\ldots,\alpha_s\), and make these the components of the state vector, \(\mathbf{X}_t\). Then choosing the \(1\times s\) matrix \(\mathsf{F}=(1,0,\ldots,0)\) and \(\mathsf{G}\) the \(s\times s\) cyclic permutation matrix \[ \mathsf{G} = \begin{pmatrix}\mathbf{0}^\top&1\\\mathbb{I}&\mathbf{0}\end{pmatrix} \] gives the basic evolution structure. For example, in the case \(s=4\) (quarterly data), we have \[ \mathsf{G} = \begin{pmatrix} 0&0&0&1\\ 1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\end{pmatrix}, \] and so \[ \mathsf{G}\begin{pmatrix}\alpha_4\\\alpha_3\\\alpha_2\\\alpha_1\end{pmatrix} = \begin{pmatrix}\alpha_1\\\alpha_4\\\alpha_3\\\alpha_2\end{pmatrix}, \] and the seasonal effects get rotated through the state vector each time, repeating after \(s\) time points. If we want the seasonal effects to be able to evolve gradually over time, we can allow this by choosing (say) \[ \mathsf{W} = \operatorname{diag}\{w,0,\ldots,0\}, \] for some \(w>0\).
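
It is easy to check this rotation numerically. The following short base R sketch (independent of the dlm package) confirms the behaviour for \(s=4\), using stand-in values for the seasonal effects.

s = 4
G = rbind(c(rep(0, s-1), 1), cbind(diag(s-1), 0)) # cyclic permutation matrix
alpha = c(4, 3, 2, 1) # stand-in values for (alpha_4, alpha_3, alpha_2, alpha_1)
G %*% alpha # rotates to (alpha_1, alpha_4, alpha_3, alpha_2)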

This structure would be fine for a purely seasonal model, but in practice a seasonal component is often used as part of a larger model that also models the overall mean of the process. In this case there is an identifiability issue, exactly analogous to the problem that arises in linear models with a categorical covariate. Adding a constant value to each of the seasonal effects and subtracting the same value from the overall mean leads to exactly the same model. So we need to constrain the effects to maintain identifiability of the overall mean. This is typically done by ensuring that the effects sum to zero. There are many ways that such a constraint could be encoded with a DLM, but one common approach is to notice that in this case there are only \(s-1\) degrees of freedom for the effects (eg. the final effect is just minus the sum of the others), and so we can model the effects with an \(s-1\) dimensional state vector. We can then update this with the \((s-1)\times(s-1)\) evolution matrix \[ \mathsf{G} = \begin{pmatrix}-\mathbf{1}^\top&-1\\\mathbb{I}&\mathbf{0}\end{pmatrix}. \] We will illustrate for \(s=4\). Suppose that our hidden state is \(\mathbf{X}_t=(\alpha_4,\alpha_3,\alpha_2)^\top\). We know that we can reconstruct the final effect, \(\alpha_1=-(\alpha_4+\alpha_3+\alpha_2)\), so this vector captures all of our seasonal effects. But now \[ \begin{pmatrix}-1&-1&-1\\1&0&0\\0&1&0\end{pmatrix}\begin{pmatrix}\alpha_4\\\alpha_3\\\alpha_2\end{pmatrix} = \begin{pmatrix}\alpha_1\\\alpha_4\\\alpha_3\end{pmatrix}, \] and again, the final effect (now \(\alpha_2\)) can be reconstructed from the rest. This approach also rotates through our \(s\) seasonal effects, but now constrains the sum of the effects to be fixed at zero. This is how the function dlm::dlmModSeas constructs a seasonal model.
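
We can check that dlmModSeas builds exactly this structure for quarterly data (the evolution variance below is purely illustrative).

seas = dlmModSeas(4, dV=0, dW=c(0.02^2, 0, 0)) # illustrative variance
seas$GG # the 3x3 constrained evolution matrix described above
seas$FF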

8.3.2 Fourier components

An alternative approach to modelling seasonal effects, which can be more parsimonious, is to use Fourier components. Start by assuming that we want to model our seasonal effects with a single sine wave of period \(s\). Putting \(\omega=2\pi/s\), we can use the rotation matrix \[ \mathsf{G} = \begin{pmatrix}\cos\omega & \sin\omega\\-\sin\omega&\cos\omega\end{pmatrix} \] and the observation matrix \(\mathsf{F}=(1,0)\). Applying \(\mathsf{G}\) repeatedly to a 2-vector \(\mathbf{X}_t\) will give an oscillation of period \(s\), with phase and amplitude determined by the components of \(\mathbf{X}_t\). This gives a very parsimonious way of modelling seasonal effects, but may be too simple. We can add in other Fourier components by defining \[ \mathsf{H}_k = \begin{pmatrix}\cos k\omega & \sin k\omega\\-\sin k\omega&\cos k\omega\end{pmatrix}, \quad k=1,2,\ldots, \] (so \(\mathsf{H}_k=\mathsf{G}^k\)) and setting \(\mathsf{F}=(1,0,1,0,\ldots)\), \[ \mathsf{G} = \operatorname{blockdiag}\{\mathsf{H}_1,\mathsf{H}_2,\ldots\}. \] In principle we can go up to \(k=s/2\). Then we will have \(s\) degrees of freedom, and we will essentially be just representing our seasonal effects by their discrete Fourier transform. Typically we will choose \(k\) somewhat smaller than this, omitting higher frequencies in order to obtain a smoother, more parsimonious representation. We can use a diagonal \(\mathsf{W}\) matrix in order to allow the phase and amplitude of each harmonic to slowly evolve. This is how the dlm::dlmModTrig function constructs a seasonal model.
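
Again, it can be instructive to inspect the model that dlmModTrig constructs; here, a period 12 model with two harmonics and purely illustrative variances.

trig = dlmModTrig(s=12, q=2, dV=0, dW=rep(0.01^2, 4)) # illustrative variances
trig$GG # block diagonal, containing the rotation matrices H_1 and H_2
trig$FF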

8.4 Model superposition

We can combine DLM models to make more complex DLMs in various ways. One useful approach is typically known as model superposition. Suppose that we have two DLM models with common observation dimension, \(m\), and state dimensions \(p_1,\ p_2\), respectively. If model \(i\) is \(\{\mathsf{G}_{it}, \mathsf{F}_{it}, \mathsf{W}_{it}, \mathsf{V}_{it}\}\), then the “sum” of the two models can be defined by \[ \Big\{ \operatorname{blockdiag}\{ \mathsf{G}_{1t}, \mathsf{G}_{2t} \}, (\mathsf{F}_{1t},\mathsf{F}_{2t}), \operatorname{blockdiag}\{ \mathsf{W}_{1t}, \mathsf{W}_{2t} \}, \mathsf{V}_{1t}+\mathsf{V}_{2t} \Big\}, \] having observation dimension \(m\) and state dimension \(p_1+p_2\). It is clear that this “sum” operation can be applied to more than two models with observation dimension \(m\), and that the “sum” operation is associative, so DLM models of this form comprise a semigroup with respect to this operation. Combining models in this way makes it convenient to build models that include both overall trends and seasonal effects. This sum operation can be used to combine DLM models in the dlm package by using the + operator.

For our running example, we can fit a model including a locally linear trend and a quarterly seasonal effect as follows.

mod = dlmModPoly(2, dV=0.1^2, dW=c(0.01^2, 0.01^2)) +
       dlmModSeas(4, dV=0, dW=c(0.02^2, 0, 0))
mod
$FF
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    1    0    0

$V
     [,1]
[1,] 0.01

$GG
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    1    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0   -1   -1   -1
[4,]    0    0    1    0    0
[5,]    0    0    0    1    0

$W
      [,1]  [,2]  [,3] [,4] [,5]
[1,] 1e-04 0e+00 0e+00    0    0
[2,] 0e+00 1e-04 0e+00    0    0
[3,] 0e+00 0e+00 4e-04    0    0
[4,] 0e+00 0e+00 0e+00    0    0
[5,] 0e+00 0e+00 0e+00    0    0

$m0
[1] 0 0 0 0 0

$C0
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] 1e+07 0e+00 0e+00 0e+00 0e+00
[2,] 0e+00 1e+07 0e+00 0e+00 0e+00
[3,] 0e+00 0e+00 1e+07 0e+00 0e+00
[4,] 0e+00 0e+00 0e+00 1e+07 0e+00
[5,] 0e+00 0e+00 0e+00 0e+00 1e+07

ss = dlmSmooth(y, mod) # smoothed states
so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start=start(y), freq=frequency(y))
tsplot(y, col=4, lwd=2, main="Log of J&J earnings")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2, lwd=1.2)

Once we are happy that the model is a good fit to the data, we can use it for forecasting.

fit = dlmFilter(y, mod)
fore = dlmForecast(fit, 16) # forecast 16 time points
pred = ts(c(tail(y, 1), fore$f), 
  start=end(y), frequency=frequency(y))
upper = ts(c(tail(y, 1), fore$f + 2*sqrt(unlist(fore$Q))), 
  start=end(y), frequency=frequency(y))
lower = ts(c(tail(y, 1), fore$f - 2*sqrt(unlist(fore$Q))), 
  start=end(y), frequency=frequency(y))
all = ts(c(y, upper[-1]), start=start(y),
  frequency = frequency(y))
tsplot(all, ylab="log(earnings)",
  main="Forecasts with 2SD intervals")
lines(y, col=4, lwd=1.5)
lines(pred, col=2, lwd=2)
lines(upper, col=2)
lines(lower, col=2)

8.4.1 Example: monthly births

For an additional example, we will look at monthly live birth data for the US.

y = birth
tsplot(y, col=4, lwd=1.5, main="Monthly births")

This seems to have a random walk nature, but with a strong seasonal component. We will assume that we do not want to use a full seasonal model, and instead use a seasonal model based on Fourier components with two harmonics. We may be unsure as to what variances to assume, and so want to estimate those using maximum likelihood. We could fit this using the dlm package as follows.

buildMod = function(lpar)
  dlmModPoly(1, dV=exp(lpar[1]), dW=exp(lpar[2])) +
    dlmModTrig(12, 2, dV=0, dW=rep(exp(lpar[3]), 4))
opt = dlmMLE(y, parm=log(c(100, 1, 1)),
        build=buildMod)
opt
$par
[1]  4.482990  1.925763 -3.228793

$value
[1] 1116.91

$counts
function gradient 
      14       14 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)
ss = dlmSmooth(y, mod)
so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start(y), frequency = frequency(y))
tsplot(y, col=4, lwd=1.5, main="Monthly births")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2, lwd=1.2)

The optimised parameters seem to correspond to a good fit to the data.
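
To interpret the fitted parameters it can be helpful to transform them back to the variance scale; from the output above this gives roughly 88.5, 6.9 and 0.04 for the observation, level and seasonal evolution variances, respectively.

exp(opt$par) # observation, level and seasonal variances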

8.5 ARMA models in state space form

We have seen in previous chapters how ARMA models define a useful class of discrete time Gaussian processes that can effectively capture the kinds of auto-correlation structure that we often see in time series. They can be useful building blocks in DLM models. To use them in this context, we need to know how to represent them in state space form.

8.5.1 AR models

In fact, we have already seen a popular way to represent an AR\((p)\) model as a VAR(1), and this is exactly what we need for state space modelling. So, to build a DLM representing noisy observation of a hidden AR\((p)\) model we specify \(\mathsf{F}=(1,0,\ldots,0)\), \[ \mathsf{G} = \begin{pmatrix} \phi_1&\cdots&\phi_{p-1}&\phi_p\\ 1&\cdots&0&0\\ \vdots&\ddots&\ddots&\vdots\\ 0&\cdots&1&0 \end{pmatrix}, \] \(\mathsf{W}=\operatorname{diag}(\sigma^2,0,\ldots,0)\), \(\mathsf{V}=(v)\). As a reminder of why this works, we see that if we define \(\mathbf{X}_t = (X_t,X_{t-1},\ldots,X_{t-p+1})^\top\), where \(X_t\) is our AR\((p)\) process, then \[ \mathsf{G} \mathbf{X}_{t-1} + \begin{pmatrix}\varepsilon_t\\0\\\vdots\\0\end{pmatrix} = \mathbf{X}_t, \] and so the VAR(1) update preserves the stationary distribution of our AR\((p)\) model. However, this VAR(1) representation of an AR\((p)\) model is by no means unique. In fact, it turns out that replacing the above \(\mathsf{G}\) with \(\mathsf{G}^\top\) gives a VAR(1) model which also has the required AR\((p)\) model as the marginal distribution of the first component. This is not so intuitive, but generalises more easily to the ARMA case, so it is worth understanding. For this case we instead define the state vector to be \[ \mathbf{X}^\star_t = \begin{pmatrix} X_t\\ \phi_2X_{t-1}+\phi_3X_{t-2}+\ldots+\phi_pX_{t-p+1}\\ \vdots\\ \phi_{p-1}X_{t-1}+\phi_pX_{t-2}\\ \phi_pX_{t-1} \end{pmatrix}, \] where \(X_t\) is our AR\((p)\) process of interest, and note that only the first component of this state vector has the distribution we require. But with this definition we have \[ \mathsf{G}^\top\mathbf{X}^\star_{t-1} + \begin{pmatrix}\varepsilon_t\\0\\\vdots\\0\end{pmatrix} = \mathbf{X}^\star_t, \] and so this VAR(1) update preserves the distribution of this state vector. Consequently, the first component of the VAR(1) model defined this way will be marginally our AR\((p)\) model of interest. This representation is commonly encountered in practice, since the same approach works for ARMA models, where it is necessary to carry information about “old” errors in the state vector.
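
The dlm package can construct this second representation of a noisily observed AR process via dlmModARMA (the AR coefficients and variances below are purely illustrative).

ar2 = dlmModARMA(ar=c(0.8, -0.2), sigma2=0.5, dV=1) # illustrative values
ar2$GG # evolution matrix of the transposed form described above
ar2$FF
ar2$W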

8.5.2 ARMA models

Representing an arbitrary ARMA model as a VAR(1) is also possible, and again the representation is not unique. However, the second approach that we used for AR\((p)\) models can be adapted very easily to the ARMA case. Consider first an ARMA\((p, p-1)\) model, so that we have one fewer MA coefficient than AR coefficients. Using the \(p\times p\) auto-regressive matrix \[ \mathsf{G} = \begin{pmatrix} \phi_1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{p-1} & 0 &\ddots & 1 \\ \phi_p & 0 & \cdots & 0 \end{pmatrix}, \] together with the error vector \[ \pmb{\varepsilon}_t = \begin{pmatrix}1\\\theta_1\\\vdots\\\theta_{p-1}\end{pmatrix}\varepsilon_t, \] leads to a hidden state vector with first component marginally the ARMA\((p,p-1)\) process of interest. Note that the error covariance matrix is \[ \mathsf{W} = \begin{pmatrix}1\\\theta_1\\\vdots\\\theta_{p-1}\end{pmatrix} \begin{pmatrix}1\\\theta_1\\\vdots\\\theta_{p-1}\end{pmatrix}^\top\sigma^2. \] To understand why this works, define our state vector to be \[ \mathbf{X}_t = \begin{pmatrix} X_t\\ \phi_2X_{t-1}+\phi_3X_{t-2}+\ldots+\phi_pX_{t-p+1}+\theta_1\varepsilon_{t}+\cdots+\theta_{p-1}\varepsilon_{t-p+2}\\ \vdots\\ \phi_{p-1}X_{t-1}+\phi_pX_{t-2}+\theta_{p-2}\varepsilon_t+\theta_{p-1}\varepsilon_{t-1}\\ \phi_pX_{t-1}+\theta_{p-1}\varepsilon_t \end{pmatrix}. \] We can then directly verify that \[ \mathsf{G}\mathbf{X}_{t-1} + \pmb{\varepsilon}_t = \mathbf{X}_t, \] and so this VAR(1) update preserves the distribution of our state vector. Consequently, the first component is marginally ARMA, and so using \(\mathsf{F}=(1,0,\cdots,0)\) will pick out this component for incorporation into the observation model.

We have considered the case of ARMA\((p,p-1)\), but note that any ARMA\((p,q)\) model can be represented as an ARMA\((p^\star,p^\star-1)\) model by choosing \(p^\star=\max\{p,q+1\}\) and padding coefficient vectors with zeros as required. This is exactly the approach used by the function dlm::dlmModARMA.
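
For instance, an ARMA(1,2) component has \(p^\star=\max\{1,3\}=3\), so dlmModARMA builds a three-dimensional state, padding the AR coefficient vector with zeros (the coefficients below are purely illustrative).

arma = dlmModARMA(ar=0.7, ma=c(0.4, 0.2), sigma2=1, dV=0) # illustrative values
arma$GG # 3x3, with AR coefficients (0.7, 0, 0) in the first column
arma$W # rank one, built from (1, 0.4, 0.2)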

8.5.3 Example: SOI

Let us return to the SOI example considered at the end of Chapter 7. We can model this using a locally constant model combined with a seasonal effect. Further, we can investigate any residual correlation structure by including an AR(2) component. We can fit this as follows.

y = soi
buildMod = function(param)
  dlmModPoly(1, dV=exp(param[1]), dW=exp(param[2])) +
    dlmModARMA(ar=c(param[3], param[4]), 
     sigma2=exp(param[5]), dV=0) +
    dlmModTrig(12, 2, dV=0, dW=rep(exp(param[6]), 4))
opt = dlmMLE(y, parm=c(log(0.1^2), log(0.01^2), 
  0.2, 0.1, log(0.1^2), log(0.01^2)), build=buildMod)
opt
$par
[1] -3.100868e+00 -9.242014e+00  8.792923e-01 -7.119263e-06 -4.572246e+00
[6] -1.010190e+01

$value
[1] -310.9818

$counts
function gradient 
      51       51 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

mod = buildMod(opt$par)
ss = dlmSmooth(y, mod)
so = ss$s[-1,] %*% t(mod$FF) # smoothed observations
so = ts(so, start(y), freq=frequency(y))
tsplot(y, col=4, lwd=1.5, main="SOI")
lines(ss$s[,1], col=3, lwd=1.5)
lines(so, col=2)

This looks to be a good fit, and suggests a gradual downward trend, but inspection of the optimised parameters (in particular param[5], the log of \(\sigma^2\) in the ARMA component) indicates that the AR(2) component is not necessary.
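
As a quick check (using the parameterisation of buildMod above), we can transform the relevant parameters back to a more interpretable scale.

exp(opt$par[c(1, 2, 5, 6)]) # observation, level, ARMA and seasonal variances
opt$par[3:4] # AR coefficients (not log-transformed)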