7  Dynamic linear models (DLMs)

7.1 Introduction

The set up here is similar to the previous chapter. We have a “hidden” Markov chain model, Xt, and a conditionally independent observation process Yt leading to an observed time series y1,,yn. Again, we want to use the observations to learn about this hidden process. But this time, instead of a finite state Markov chain, the hidden process is a linear Gaussian model. Then, analytic tractability requires that the observation process is also linear and Gaussian. In that case, we can use standard properties of the multivariate normal distribution to do filtering, smoothing, etc., using straightforward numerical linear algebra.

For a hidden state of dimension p and an observation process of dimension m, the model can be written in conditional form as Xt|Xt1N(GtXt1,Wt),Yt|XtN(FtXt,Vt),t=1,2,,n, for (possibly time dependent) matrices Gt,Ft,Wt,Vt of appropriate dimensions, all (currently) assumed known. Note that these matrices are often chosen to be independent of time, in which case we drop the time subscript and write them G,F,W,V. The model is initialised with X0N(m0,C0), for prior parameters m0 (a p-vector), C0 (a p×p matrix), also assumed known. We can also write the model in state-space form as Xt=GtXt1+ωωt,ωωtN(0,Wt),Yt=FtXt+ννt,ννtN(0,Vt), from which it is possibly more clear that in the case of time independent G, W, the hidden process evolves as a VAR(1) model. The observation process is also quite flexible, allowing noisy observation of some linear transformation of the hidden state. Note that for a univariate observation process (m=1), corresponding to a univariate time series, Ft will be row vector representing a linear functional of the (vector) hidden state.

7.2 Filtering

The filtering problem is the computation of (Xt|Y1:t=y1:t), for t=1,2,,n. Since everything in the problem is linear and Gaussian, the filtering distributions will be too, so we can write (Xt|Y1:t=y1:t)N(mt,Ct), for some mt, Ct to be determined. The forward recursions we use to compute these are known as the Kalman filter.

Since we know m0, C0, we assume that we know mt1, Ct1 for some t>0, and proceed to compute mt, Ct. Just as for HMMs, this is conceptually simpler to break down into two steps.

Predict step

Since Xt=GtXt1+ωωt and Xt is conditionally independent of Y1:(t1) given Xt1, we will have E[Xt|Y1:(t1)=y1:(t1)]=E{E[Xt|Xt1]|Y1:(t1)=y1:(t1)}=E{GtXt1|Y1:(t1)=y1:(t1)}=GtE{Xt1|Y1:(t1)=y1:(t1)}=Gtmt1. Similarly, Var[Xt|Y1:(t1)=y1:(t1)]=Var{E[Xt|Xt1]|Y1:(t1)=y1:(t1)} +E{Var[Xt|Xt1]|Y1:(t1)=y1:(t1)}=Var{GtXt1|Y1:(t1)=y1:(t1)}+E{Wt|Y1:(t1)=y1:(t1)}=GtCt1Gt+Wt. We can therefore write (Xt|Y1:(t1)=y1:(t1))N(m~t,C~t), where we define m~t=Gtmt1,C~t=GtCt1Gt+Wt.

Update step

Using Yt=FtXt+ννt and some basic linear normal properties, we deduce (XtYt|Y1:(t1)=y1:(t1))N([m~tFtm~t],[C~tC~tFtFtC~tFtC~tFt+Vt]). If we define ft=Ftm~t and Qt=FtC~tFt+Vt this simplifies to (XtYt|Y1:(t1)=y1:(t1))N([m~tft],[C~tC~tFtFtC~tQt]). We can now use the multivariate normal conditioning formula to obtain mt=m~t+C~tFtQt1[ytft]Ct=C~tC~tFtQt1FtC~t. The weight that is applied to the discrepancy between the new observation and its forecast is often referred to as the (optimal) Kalman gain, and denoted Kt. So if we define Kt=C~tFtQt1 we get mt=m~t+Kt[ytft]Ct=C~tKtQtKt.

7.2.1 The Kalman filter

Let us now summarise the steps of the Kalman filter. Our input at time t is mt1, Ct1 and yt. We then compute:

  • m~t=Gtmt1,C~t=GtCt1Gt+Wt
  • ft=Ftm~t,Qt=FtC~tFt+Vt
  • Kt=C~tFtQt1
  • mt=m~t+Kt[ytft],Ct=C~tKtQtKt.

Note that the only inversion involves the m×m matrix Qt (there are no p×p inversions). So, in the case of a univariate time series, no matrix inversions are required.

Also note that it is easy to handle missing data within the Kalman filter by carrying out the predict step and skipping the update step. If the observation at time t is missing, simply compute m~t and C~t as usual, but then skip this update step, since there is no observation to condition on, returning m~t and C~t as mt and Ct.

7.2.2 Example

To keep things simple, we will just implement the Kalman filter for time-invariant G,W,F,V.

kFilter = function(G, W, F, V)
  function(mC, y) {
    m = G %*% mC$m
    C = (G %*% mC$C %*% t(G)) + W
    f = F %*% m
    Q = (F %*% C %*% t(F)) + V
    K = t(solve(Q, F %*% C))
    m = m + (K %*% (y - f))
    C = C - (K %*% Q %*% t(K))
    list(m=m, C=C)
  }

So, if we provide G,W,F,V to the kFilter function, we will get back a function for carrying out one step of the Kalman filter. As a simple illustration, suppose that we are interested in the long-term trend in the Southern Oscillation Index (SOI).

library(astsa)
tsplot(soi, col=4)

We suppose that a simple random walk model is appropriate for the (hidden) long term trend (so G=1, and note that the implied VAR(1) model is not stationary), and that the observations are simply noisy observations of that hidden trend (so F=1). To complete the model, we suppose that the monthly change in the long term trend has standard deviation 0.01, and that the noise variance associated with the observations has standard deviation 0.5. We can now create the filter and apply it, using the Reduce function and initialising with m0=0 and C0=100.

advance = kFilter(1, 0.01^2, 1, 0.5^2)
Reduce(advance, soi, list(m=0, C=100))
$m
            [,1]
[1,] -0.03453493

$C
           [,1]
[1,] 0.00495025

This gives us mn and Cn, which might be what we want if we are simply interested in forecasting the future (see next section). However, more likely we want to keep the full set of filtered states and plot them over the data.

fs = Reduce(advance, soi, list(m=0, C=100), acc=TRUE)
fsm = sapply(fs, function(s) s$m)
fsTs = ts(fsm[-1], start=start(soi), freq=frequency(soi))
tsplot(soi, col=4)
lines(fsTs, col=2, lwd=2)

We will look at more interesting DLM models in the next chapter.

7.3 Forecasting

Making forecasts from a DLM is straightforward, since it corresponds to the computation of predictions, and we have already examined this problem in the context of the predict step of a Kalman filter. Suppose that we have a time series of length n, and have run a Kalman filter over the whole data set, culminating in the computation of mn and Cn. We can now define our k-step ahead forecast distributions as (Xn+k|Y1:n=y1:n)N(m^n(k),Cn(k))(Yn+k|Y1:n=y1:n)N(f^n(k),Qn(k)), for moments m^n(k), Cn(k), f^n(k), Qn(k) to be determined. Starting from m^n(0)=mn and Cn(0)=Cn, we can recursively compute subsequent forecasts for the hidden states via m^n(k)=Gn+km^n(k1),Cn(k)=Gn+kCn(k1)Gn+k+Wn+k. Once we have a forecast for the hidden state, we can compute the moments of the forecast distribution via f^n(k)=Fn+km^n(k),Qn(k)=Fn+kCn(k)Fn+k+Vn+k.

7.4 Marginal likelihood

For many applications (including parameter estimation) it is useful to know the marginal likelihood of the data. We can factorise this as L(y1:n)=t=1nπ(yt|y1:(t1)), but as part of the Kalman filter we compute the component distributions as π(yt|y1:(t1))=N(yt;ft,Qt), so we can compute the marginal likelihood as part of the Kalman filter at little extra cost. In practice we compute the log-likelihood (y1:t)=t=1nlogN(yt;ft,Qt), We might use a function (such as mvtnorm::dmvnorm) to evaluate the log density of a multivariate normal, but if not, we can write it explicitly in the form (y1:t)=mn2log2π12t=1n[log|Qt|+(ytft)Qt1(ytft)]. Note that this is very straightforward in the case of a univariate time series (scalar Qt), but more generally, there is a good way to evaluate this using the Cholesky decomposition of Qt. The details are not important for this course.

7.4.1 Example

We can easily modify our Kalman filter function to compute the marginal likelihood, and apply it to our example problem as follows.

kFilterML = function(G, W, F, V)
  function(mCl, y) {
    m = G %*% mCl$m
    C = (G %*% mCl$C %*% t(G)) + W
    f = F %*% m
    Q = (F %*% C %*% t(F)) + V
    ll = mvtnorm::dmvnorm(y, f, Q, log=TRUE)
    K = t(solve(Q, F %*% C))
    m = m + (K %*% (y - f))
    C = C - (K %*% Q %*% t(K))
    list(m=m, C=C, ll=mCl$ll+ll)
  }

advance = kFilterML(1, 0.01^2, 1, 0.5^2)
Reduce(advance, soi, list(m=0, C=100, ll=0))
$m
            [,1]
[1,] -0.03453493

$C
           [,1]
[1,] 0.00495025

$ll
[1] -237.2907

7.5 Smoothing

For the smoothing problem, we want to calculate the marginal at each time given all of the data, Xt|Y1:n=y1:n,t=1,2,,n. We will use a backward recursion to achieve this, known in this context as the Rauch-Tung-Striebel (RTS) smoother. Again, since everything is linear and Gaussian, the marginals will be, so we have (Xt|Y1:n=y1:n)N(st,St), for st, St to be determined. We know from the forward filter that sn=mn and Sn=Cn, so we assume that we know st+1, St+1 for some t<n and that we want to know st, St.

Since Xt+1=Gt+1Xt+ωωt+1, we deduce that (XtXt+1|Y1:t=y1:t)N([mtm~t+1],[CtCtGt+1Gt+1CtC~t+1]). We can use normal conditioning formula to deduce that (Xt|Xt+1,Y1:t=y1:t)N(mt+CtGt+1C~t+11[Xt+1m~t+1],CtCtGt+1C~t+11Gt+1Ct). If we define the smoothing gain Lt=CtGt+1C~t+11, then this becomes (Xt|Xt+1,Y1:t=y1:t)N(mt+Lt[Xt+1m~t+1],CtLtC~t+1Lt). But now, since Y(t+1):n is conditionally independent of Xt given Xt+1, this distribution is also the distribution of (Xt|Xt+1,Y1:n=y1:n). So now we can compute (Xt|Y1:n=y1:n) by marginalising out Xt+1. First, st=E[Xt|Y1:n=y1:n]=E[E[Xt|Xt+1,Y1:n=y1:n]|Y1:n=y1:n]=E[mt+Lt[Xt+1m~t+1]|Y1:n=y1:n]=mt+Lt[st+1m~t+1] Now, St=Var[Xt|Y1:n=y1:n]=E{Var[Xt|Xt+1,Y1:n=y1:n]|Y1:n=y1:n}+Var{E[Xt|Xt+1,Y1:n=y1:n]|Y1:n=y1:n}=E{CtLtC~t+1Lt|Y1:n=y1:n}+Var{mt+Lt[Xt+1m~t+1]|Y1:n=y1:n}=CtLtC~t+1Lt+LtSt+1Lt=Ct+Lt[St+1C~t+1]Lt

7.5.1 The RTS smoother

To summarise, we receive as input st+1 and St+1, and compute Lt=CtGt+1C~t+11st=mt+Lt[st+1m~t+1]St=Ct+Lt[St+1C~t+1]Lt, re-using computations from the forward Kalman filter as appropriate. Note that the RTS smoother does involve the inversion of a p×p matrix.

7.6 Sampling

If we want to understand the joint distribution of the hidden states given the data, generating samples from it will be useful. So, we want to generate exact samples from (X1:n|Y1:n=y1:n). Just as for HMMs, it is convenient to do this backwards, starting from the final hidden state. We assume that we have run a Kalman filter forward, but now do sampling on the backward pass. This strategy is known in this context as forward filtering, backward sampling (FFBS). We know that the distribution of the final hidden state is (Xn|Y1:n=y1:n)N(mn,Cn), so we can sample from this to obtain a realisation of xn. So now assume that we are given xt+1 for some t<n, and want to simulate an appropriate xt. From the smoothing analysis, we know that (Xt|Xt+1=xt+1,Y1:n=y1:n)N(mt+Lt[xt+1m~t+1],CtLtC~t+1Lt). But since Xt is conditionally independent of all future Xs, s>t+1, given Xt+1, this is also the distribution of (Xt|X(t+1):n=x(t+1):n,Y1:n=y1:n) and so we just sample from this, and so on, in a backward pass.

7.7 Parameter estimation

Just as for HMMs, there are a wide variety of Bayesian and likelihood-based approaches to parameter estimation. Here, again, we will keep things simple and illustrate basic numerical optimisation of the marginal log-likelihood of the data.

7.7.1 Example

For our SOI running example, suppose that we are happy with the model structure, but uncertain about the variance parameters V and W. We can do maximum likelihood estimation for those two parameters by first creating a function to evaluate the likelihood for a given parameter combination.

mll = function(wv) {
  advance = kFilterML(1, wv[1], 1, wv[2])
  Reduce(advance, soi, list(m=0, C=100, ll=0))$ll
}

mll(c(0.01^2, 0.5^2))
[1] -237.2907

We can see that it reproduces the same log-likelihood as we got before when we evaluate it at our previous guess for V and W. But can we do better than this? Let’s see what optim can do.

optim(c(0.01^2, 0.5^2), mll, control=list(fnscale=-1))
$par
[1] 0.05696905 0.03029240

$value
[1] -144.0333

$counts
function gradient 
      91       NA 

$convergence
[1] 0

$message
NULL

So optim has found parameters with a much higher likelihood, corresponding to a much better fitting model. Note that these optimised parameters apportion the noise more equally between the hidden states and the observation process, and so lead to smoothing the hidden state much less. This may correspond to a better fit, but perhaps not to the fit that we really want. In the context of Bayesian inference, a strong prior on V and W could help to regularise the problem.

7.8 R package for DLMs

We have seen that it is quite simple to implement the Kalman filter and related computations such as the marginal log-likelihood of the data. However, a very naive implementation such as we have given may not be the most efficient, nor the most numerically stable. Many different approaches to implementing the Kalman filter have been developed over the years, which are mathematically equivalent to the basic Kalman filter recursions, but are more efficient and/or more numerically stable than a naive approach. The dlm R package is an implementation based on the singular value decomposition (SVD), which is significantly more numerically stable than a naive implementation. We will use this package for the study of more interesting DLMs in Chapter 8. More information about this package can be obtained with help(package="dlm") and vignette("dlm", package="dlm"). It is described more fully in Petris, Petrone, and Campagnoli (). We can fit our simple DLM to the SOI data as follows.

library(dlm)
mod = dlm(FF=1, GG=1, V=0.5^2, W=0.01^2, m0=0, C0=100)
fs = dlmFilter(soi, mod)
tsplot(soi, col=4, main="Filtered states")
lines(fs$m, col=2, lwd=2)

We can also compute the smoothed states.

ss = dlmSmooth(soi, mod)
tsplot(soi, col=4, main="Smoothed states")
lines(ss$s, col=2, lwd=2)

These are very smooth. We can also optimise the parameters, similar to the approach we have already seen. But note that the dlmMLE optimiser requires that the vector of parameters being optimised is unconstrained, so we use exp and log to map back and forth between a constrained and unconstrained space.

buildMod = function(lwv)
  dlm(FF=1, GG=1, V=exp(lwv[2]), W=exp(lwv[1]),
    m0=0, C0=100)
  
opt = dlmMLE(soi, parm=c(log(0.5^2), log(0.01^2)),
  build=buildMod)
opt
$par
[1] -2.865240 -3.496717

$value
[1] -272.2459

$counts
function gradient 
      35       35 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
exp(opt$par)
[1] 0.05696943 0.03029668

After mapping back to the constrained space, we see that we get the same optimised values as before (the log-likelihood is different, but that is just due to dropping of unimportant constant terms). We can check the smoothed states for this optimised model as follows.

mod = buildMod(opt$par)
ss = dlmSmooth(soi, mod)
tsplot(soi, col=4, main="Smoothed states")
lines(ss$s, col=2, lwd=1.5)

This confirms that the smoothed states for the optimised model are not very smooth. This suggests that our simple random walk model for the data is perhaps not very good.