Gibbs Sampling

The DLM is simple to specify, and the distribution of the latent state can be learned online using the exact filtering recursions of the Kalman filter. It remains to determine the values of the static parameters, \(V\) and \(W\); a Gibbs sampler can be used for this purpose. Consider a general DLM given by the following equations:

\[\begin{aligned} y_t &= F_t x_t + v_t, &v_t &\sim \mathcal{N}(0, V), \\ x_t &= G_t x_{t-1} + w_t, &w_t &\sim \mathcal{N}(0, W), \\ x_0 &\sim \mathcal{N}(m_0, C_0). \end{aligned}\]
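To make the model concrete before deriving the samplers, here is a minimal simulation sketch in Python. It assumes the simplest scalar case, a local level model with \(F_t = 1\) and \(G_t = 1\); the parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Local level DLM: F_t = 1, G_t = 1, scalar state.
# V_true, W_true, m0 and C0 are illustrative values.
T = 500
V_true, W_true = 3.0, 0.5
m0, C0 = 0.0, 1.0

x = np.empty(T + 1)                                    # x[0] holds x_0
y = np.empty(T)
x[0] = rng.normal(m0, np.sqrt(C0))                     # x_0 ~ N(m0, C0)
for t in range(T):
    x[t + 1] = x[t] + rng.normal(0, np.sqrt(W_true))   # x_t = G x_{t-1} + w_t
    y[t] = x[t + 1] + rng.normal(0, np.sqrt(V_true))   # y_t = F x_t + v_t
```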

The unknown parameters of the model are the system noise matrix, \(W\), and the observation noise matrix, \(V\). The observation noise matrix is assumed to be diagonal, meaning the measurement noise of each observed process is independent. Assuming the parameters of the initial state (\(m_0\) and \(C_0\)) are known, the unknown parameters can be written as \(\theta = (V, W)\). The joint distribution of all the random variables in the DLM can be written as:

\[p(\textbf{X}_{0:T}, \textbf{Y}_{1:T}, \theta) = p(\theta) p(\textbf{x}_0) \prod_{t=1}^T p(\textbf{y}_t| \textbf{x}_t, \theta) p(\textbf{x}_t | \textbf{x}_{t-1}, \theta)\]

This factorisation of the joint distribution makes it clear how a Gibbs sampling approach can be used: each unknown is sampled in turn from its full conditional distribution.
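Continuing the local level sketch above, the factorisation translates directly into a log density. This is a sketch for the scalar case only, with \(F = G = 1\) assumed:

```python
import numpy as np
from scipy.stats import norm

def log_joint(y, x, V, W, m0=0.0, C0=1.0):
    """log p(x_{0:T}, y_{1:T} | theta), up to the prior p(theta),
    following the factorisation above (scalar local level case)."""
    lp = norm.logpdf(x[0], m0, np.sqrt(C0))             # p(x_0)
    lp += norm.logpdf(x[1:], x[:-1], np.sqrt(W)).sum()  # prod_t p(x_t | x_{t-1})
    lp += norm.logpdf(y, x[1:], np.sqrt(V)).sum()       # prod_t p(y_t | x_t)
    return lp
```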

Observation Noise Matrix: d-Inverse Gamma Prior

First consider the observation variance matrix. Since this matrix is diagonal, we only need to learn the variances on its diagonal. The following steps are simplified by considering only one time series in the observation vector. The prior distribution of the observation variance, \(V\), is the Inverse Gamma distribution:

\[p(V) = \textrm{InverseGamma}(\alpha, \beta)\]

The likelihood of \(y_t\) is Gaussian, with mean \(F_t \textbf{x}_t\) and variance \(V\). The Inverse Gamma distribution is conjugate to the Normal distribution with known mean and unknown variance, so the full conditional distribution of the observation variance (dropping factors that do not involve \(V\)) is:

\[\begin{align*} p(V | y_{1:T}, \textbf{x}_{0:T}) &\propto p(V) \prod_{t=1}^T p(y_t | V, \textbf{x}_t) \\ &\propto V^{-\alpha-1}\exp\left( -\frac{\beta}{V} \right) V^{-T/2} \exp \left\{ -\frac{1}{2V} \sum_{t=1}^T(y_t - F_t \textbf{x}_t)^2 \right\} \\ &= V^{-(\alpha + T/2) - 1} \exp \left\{ -\frac{1}{V}\left(\beta + \frac{1}{2}\sum_{t=1}^T(y_t - F_t \textbf{x}_t)^2\right) \right\} \end{align*}\]

We recognise the full conditional distribution of the observation variance as the Inverse Gamma distribution:

\[p(V | y_{1:T}, \textbf{x}_{0:T}) = \textrm{InverseGamma}\left(\alpha + \frac{T}{2}, \beta + \frac{1}{2}\sum_{t=1}^T(y_t - F_t\textbf{x}_t)^2\right).\]
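This conjugate update is straightforward to implement. The sketch below continues the scalar local level example, using scipy's invgamma distribution (parameterised by shape a and scale); the prior hyperparameters \(\alpha\) and \(\beta\) are illustrative:

```python
import numpy as np
from scipy.stats import invgamma

def sample_V(y, x, alpha=1.0, beta=1.0):
    """Draw V from its Inverse Gamma full conditional (scalar case, F = 1).
    alpha and beta are illustrative prior hyperparameters."""
    T = len(y)
    sse = np.sum((y - x[1:]) ** 2)   # sum_t (y_t - F x_t)^2
    return invgamma.rvs(a=alpha + T / 2, scale=beta + 0.5 * sse)
```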

System Noise Matrix: Inverse Wishart Prior

For the system noise matrix, the Inverse Wishart distribution can be used as the conjugate prior, since the likelihood of \(\textbf{x}_t\) is Multivariate Normal with known mean \(G_t\textbf{x}_{t-1}\) and unknown covariance \(W\). The prior on \(W\) is written as:

\[W \sim \mathcal{W}^{-1}(\mathbf{\Psi}, \nu)\]

The PDF of the Inverse Wishart distribution is given as follows:

\[p(W) = \frac{\left|{\mathbf\Psi}\right|^{\frac{\nu}{2}}}{2^{\frac{\nu p}{2}}\Gamma_p(\frac{\nu}{2})} \left|W\right|^{-\frac{\nu+p+1}{2}}\exp\left\{-\frac{1}{2}\operatorname{tr}({\mathbf\Psi}W^{-1})\right\},\] where \(p\) is the dimension of the matrix \(W\); in this case \(p = 2\). The posterior distribution of \(W\) given the values of the state, \(x_{0:T}\), can be written as (keeping only factors involving \(W\)):

\[\begin{align*}p(W|x_{0:T}) &\propto p(W)\prod_{i=1}^T p(x_i|x_{i-1}, W) \\ &\propto \left|W\right|^{-\frac{\nu+p+1}{2}}\exp\left\{-\frac{1}{2}\operatorname{tr}({\mathbf\Psi}W^{-1})\right\} \\ &\quad\times\left|{W}\right|^{-\frac{T}{2}}\exp\left\{ -\frac{1}{2}\sum_{i=1}^T(x_i - Gx_{i-1})^TW^{-1}(x_i-Gx_{i-1})\right\} \\ &= \left|W\right|^{-\frac{\nu + T + p + 1}{2}}\exp\left\{ -\frac{1}{2}\operatorname{tr}(\mathbf{\Psi}W^{-1}) - \frac{1}{2}\operatorname{tr}\left(\sum_{i=1}^T(x_i - Gx_{i-1})(x_i-Gx_{i-1})^TW^{-1}\right)\right\} \end{align*}\]

The final line uses standard results from multivariate statistics and the following rule for matrix traces: since \(z^T A z\) is a scalar,

\[z^T A z = \operatorname{tr}(z^T A z) = \operatorname{tr}(A z z^T) = \operatorname{tr}(z z^T A).\]

This can be further simplified to:

\[p(W|x_{0:T}) \propto \left|W\right|^{-\frac{\nu + T + p + 1}{2}}\exp\left\{ -\frac{1}{2}\operatorname{tr}\left(\left(\mathbf{\Psi} + \sum_{i=1}^T(x_i - Gx_{i-1})(x_i-Gx_{i-1})^T\right) W^{-1}\right) \right\}.\]

This final simplification uses another couple of facts about the matrix trace, namely its linearity:

\[\operatorname{tr}(A) + \operatorname{tr}(B) = \operatorname{tr}(A + B), \qquad \sum_i \operatorname{tr}(A_i B) = \operatorname{tr}\left(\left(\sum_i A_i\right) B\right).\]

We recognise the posterior distribution for the system noise covariance as the Inverse Wishart distribution:

\[p(W|x_{0:T}) = \mathcal{W}^{-1}\left(\nu + T, \mathbf{\Psi} + \sum_{i=1}^T(x_i - Gx_{i-1})(x_i-Gx_{i-1})^T\right).\]
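The corresponding draw for \(W\) uses scipy's invwishart distribution. The sketch below assumes the sampled states have been stacked into an array x of shape (T+1, p) and that G is a fixed \(p \times p\) matrix:

```python
import numpy as np
from scipy.stats import invwishart

def sample_W(x, G, Psi, nu):
    """Draw W from its Inverse Wishart full conditional.
    x has shape (T+1, p); Psi is a p x p prior scale matrix, nu > p - 1."""
    resid = x[1:] - x[:-1] @ G.T       # rows are (x_i - G x_{i-1})^T
    S = resid.T @ resid                # sum of outer products
    return invwishart.rvs(df=nu + len(resid), scale=Psi + S)
```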

Now, in order to perform Gibbs sampling, we alternate between drawing values of \(V\) and \(W\) from these full conditional distributions and drawing values of the state \(x_0,\dots,x_T\) using forward-filtering backward-sampling (FFBS).
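Putting the pieces together, a sketch of the full sampler might look like the following, reusing the sampling steps above. Here ffbs is a hypothetical helper standing in for a forward-filtering backward-sampling implementation, which is not shown:

```python
def gibbs(y, n_iters, V, W, F, G, m0, C0, alpha, beta, Psi, nu):
    """Alternate between a joint state draw and conjugate parameter draws.
    `ffbs` is a hypothetical FFBS routine returning a draw of x_{0:T}."""
    draws = []
    for _ in range(n_iters):
        x = ffbs(y, F, G, V, W, m0, C0)   # draw x_{0:T} | V, W, y_{1:T}
        V = sample_V(y, x, alpha, beta)   # draw V | x_{0:T}, y_{1:T}
        W = sample_W(x, G, Psi, nu)       # draw W | x_{0:T}
        draws.append((V, W))
    return draws
```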