Probabilistic Programming in Scala

Jonny Law

2018-12-15 Sat 00:00


1 Introduction to Bayesian Inference

1.1 Bayesian Statistics

  • Bayesian statistics is concerned with expressing uncertainty using probability
  • Provides a framework for subjective beliefs and updating - reflecting how people reason in real life

1.2 Bayes Theorem

  • Bayes theorem allows us to determine the probability of a hypothesis being true by collecting data related to the hypothesis

\(p(H|y) = \frac{p(y|H)p(H)}{p(y)}, \qquad p(y) = \int p(y|H)p(H)\,\mathrm{d}H\)

1.3 Finding the integral

  • The denominator of Bayes theorem is often intractable
  • Sampling based inference methods can be used to approximate the posterior distribution

2 Probabilistic Programming

2.1 What is Probabilistic Programming?

  • Efficiently expressing Bayesian statistical models and performing inference
  • Provide a consistent, flexible modelling language for specifying prior beliefs and likelihoods
  • Abstract away the inference algorithms from the user

2.2 Examples of PPLs

2.3 Why?

  • Small data
  • Transparent, interpretable models
  • Incorporate expert judgment required in many areas of business

3 Statistical Computation

3.1 Conjugacy

  • The prior distribution is in the same family as the posterior distribution, so the posterior can be derived analytically
  • Only applicable for some models - a conjugate Beta-Binomial example is shown below
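
As a concrete example (a standard conjugate pair, not taken from the talk), a Beta prior combined with a binomial likelihood gives a Beta posterior in closed form:

\begin{align*} p &\sim \textrm{Beta}(a, b), \\ y \mid p &\sim \textrm{Binomial}(n, p), \\ p \mid y &\sim \textrm{Beta}(a + y, b + n - y). \end{align*}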

3.2 Gibbs Sampling

  • Markov chain Monte Carlo (MCMC) method which exploits conditional probability to derive conditionally conjugate distributions
  • Sample from each conditional conjugate distribution in turn to create a Markov chain representing draws from the full posterior distribution
  • Restricts the form of the prior distribution to a conjugate family - a minimal sketch for a Normal model follows this list
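
A minimal sketch of a Gibbs sampler, assuming observations \(y_i \sim \mathcal{N}(\mu, 1/\tau)\) with hypothetical conjugate priors \(\mu \sim \mathcal{N}(0, 10)\) and \(\tau \sim \textrm{Gamma}(2, 2)\), written against Breeze distributions (not code from the talk):

import breeze.stats.distributions._

case class State(mu: Double, tau: Double)

// mu | tau, y is Gaussian: combine the prior precision (1/10) with n * tau
def sampleMu(ys: Vector[Double], tau: Double): Double = {
  val prec = 0.1 + ys.size * tau
  val mean = tau * ys.sum / prec
  Gaussian(mean, math.sqrt(1.0 / prec)).draw()
}

// tau | mu, y is Gamma; Breeze's Gamma is parameterised by shape and scale
def sampleTau(ys: Vector[Double], mu: Double): Double = {
  val shape = 2.0 + ys.size / 2.0
  val rate = 2.0 + 0.5 * ys.map(y => (y - mu) * (y - mu)).sum
  Gamma(shape, 1.0 / rate).draw()
}

// one sweep of the sampler; iterating this function gives the Markov chain
def gibbsStep(ys: Vector[Double])(s: State): State = {
  val mu = sampleMu(ys, s.tau)
  State(mu, sampleTau(ys, mu))
}

Iterating with, for example, Iterator.iterate(State(0.0, 1.0))(gibbsStep(ys)) (where ys is the observed data) produces draws from the joint posterior after discarding burn-in.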

3.3 Metropolis-Hastings

  • General MCMC method with no restriction on prior distributions
  • New parameters are proposed from some proposal distribution, \(\psi^\star \sim p(\psi^\star|\psi)\) and accepted according to the Metropolis-Hastings ratio \(A = \frac{p(\psi^\star)\pi(Y|\psi^\star)q(\psi|\psi^\star)}{p(\psi)\pi(Y|\psi)q(\psi^\star|\psi)}\)
  • \(p(\psi)\) is the prior distribution, \(\pi(Y|\psi)\) is the likelihood, \(q(\psi|\psi^\star)\) is the probability of moving from \(\psi^\star\) to \(\psi\)

3.3.1 Metropolis-Hastings

def mhStep[P](posterior: P => Double,   // un-normalised log posterior
              proposal: P => Dist[P]) = { p: P =>
  for {
    ps <- proposal(p)                   // propose a new value
    // log of the Metropolis-Hastings acceptance ratio A
    a = posterior(ps) - proposal(p).logPdf(ps) - 
      posterior(p) + proposal(ps).logPdf(p)
    u <- Dist.uniform(0, 1)
    next = if (log(u) < a) ps else p    // accept, otherwise keep the current value
  } yield next
}

3.3.2 Metropolis-Hastings

  • Metropolis-Hastings is simple to implement and guaranteed to converge if it's left long enough - but no one wants to wait forever
  • The optimal acceptance ratio is 0.234 - most moves are rejected
  • A random-walk proposal centered at the previous parameter is often a default choice, \(\psi^\star \mid \psi \sim \mathcal{N}(\psi, \Sigma)\) - a sketch using a symmetric Gaussian proposal follows this list
  • The cost of an independent sample from the stationary distribution is \(\mathcal{O}(d^2)\) for a \(d\) dimensional parameter space
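
A hedged sketch of the random-walk variant using Breeze distributions (illustrative, not the talk's code); logPost is an assumed un-normalised log posterior, and because the Gaussian proposal is symmetric the \(q\) terms in the acceptance ratio cancel:

import breeze.linalg._
import breeze.stats.distributions._

def rwStep(logPost: DenseVector[Double] => Double, sigma: Double)(
    psi: DenseVector[Double]): Rand[DenseVector[Double]] =
  for {
    // symmetric Gaussian random walk centred at the current value
    z <- MultivariateGaussian(
      DenseVector.zeros[Double](psi.length),
      DenseMatrix.eye[Double](psi.length) * (sigma * sigma))
    proposed = psi + z
    // symmetric proposal: only the difference in log posteriors remains
    a = logPost(proposed) - logPost(psi)
    u <- Uniform(0, 1)
  } yield if (math.log(u) < a) proposed else psi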

3.4 Hamiltonian Monte Carlo

  • Can we use gradient information from the un-normalised log posterior?
  • Improved proposal based on Hamilton's Equations:

    \begin{align} \frac{\mathrm{d}p}{\mathrm{d}t} &= -\frac{\partial \mathcal{H}}{\partial q}, \\ \frac{\mathrm{d}q}{\mathrm{d}t} &= +\frac{\partial\mathcal{H}}{\partial p} \end{align}
  • \(\boldsymbol{p}\) is the momentum, equal to \(m\dot{\boldsymbol{q}}\)
  • \(\boldsymbol{q}\) is the particle position

3.4.1 Hamiltonian Monte Carlo

  • The static parameters correspond to the position in Hamilton's equations, the momentum is an auxiliary parameter
  • The joint density of the parameters and momentum can be written as: \(p(\psi, \phi) \propto \exp \left\{ \log p(\psi|y) - \frac{1}{2}\phi^T\phi \right\}\)
  • A symplectic discretisation of Hamilton's equations, called a leapfrog step, is used which approximately conserves the energy; the discretisation error is corrected by a Metropolis acceptance step

3.4.2 The Leapfrog step

\begin{align*} \phi_{t+\varepsilon/2} &= \phi_{t} + \frac{\varepsilon}{2} \nabla_\psi\log p(\psi_{t}|y), \\ \psi_{t+\varepsilon} &= \psi_{t} + \varepsilon \phi_{t+\varepsilon/2}, \\ \phi_{t+\varepsilon} &= \phi_{t+\varepsilon/2} + \frac{\varepsilon}{2} \nabla_\psi\log p(\psi_{t+\varepsilon}|y). \end{align*}

3.4.3 Hamiltonian Monte Carlo

  • The leapfrog has a tuning parameter, the step size \(\varepsilon\)
  • Only continuous distributions can be used since the un-normalised log-posterior must be differentiable
  • Non conjugate prior distributions can be used, like Metropolis-Hastings
  • HMC is more computationally efficient, requiring \(\mathcal{O}(d^{5/4})\) for an independent sample from the posterior distribution of a \(d\) dimensional parameter space; the optimal acceptance rate is 0.65
  • Calculating derivatives is tedious and error-prone

3.4.4 HMC algorithm in Scala

def step(psi: DenseVector[Double]): Rand[DenseVector[Double]] = {
  for {
    phi <- priorPhi
    (propPsi, propPhi) = leapfrogs(eps, gradient, l, psi, phi)
    a = logAcceptance(propPsi, propPhi, psi, phi, ll, priorPhi)
    u <- Uniform(0, 1)
    next = if (log(u) < a) {
      propPsi
    } else {
      psi
    }
  } yield next
}

3.4.5 The Leapfrog step

def leapfrog(
  eps: Double,
  gradient: DenseVector[Double] => DenseVector[Double])(
  psi: DenseVector[Double],
  phi: DenseVector[Double]) = {
  val p1 = phi + eps * 0.5 * gradient(psi)
  val t1 = psi + eps * p1
  val p2 = p1 + eps * 0.5 * gradient(t1)
  (t1, p2)
}

3.4.6 Multiple leapfrog steps

def leapfrogs(
  eps: Double,
  gradient: DenseVector[Double] => DenseVector[Double],
  l: Int,
  psi: DenseVector[Double],
  phi: DenseVector[Double]): (DenseVector[Double], DenseVector[Double]) = {
    if (l == 0) {
      (psi, phi)
    } else {
      val (t, p) = leapfrog(eps, gradient)(psi, phi)
      leapfrogs(eps, gradient, l - 1, t, p)
    }
  }

3.5 Tuning Hamiltonian Monte Carlo

  • The step size \(\varepsilon\) and the number of leapfrog steps \(l\) are tuning parameters which can be determined with pilot runs aiming for the optimal acceptance rate of 0.65 - a crude step size adjustment is sketched after this list
  • Dual averaging and the NUTS algorithm can be used to determine an appropriate step size and number of steps automatically
  • eHMC is another algorithm which automatically selects the number of leapfrog steps
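
A deliberately crude sketch of pilot-run tuning (a hypothetical rule of thumb, not dual averaging): nudge the step size towards the target acceptance rate between pilot runs.

// increase eps when accepting too often, decrease it when accepting too rarely
def tuneEps(eps: Double, acceptRate: Double, target: Double = 0.65): Double =
  if (acceptRate > target) eps * 1.1 else eps * 0.9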

4 Functional Programming

4.1 Good things

  • Pure Functions
  • Function Composition
  • Immutable Data Structures
  • Static Types with type inference
  • Predictable, correct programs

4.2 Higher Order Functions

  • Let's apply a function to a list
val xs = Array(1,2,3,4,5)
var i = 0
while (i < xs.size) {
  xs(i) = xs(i) + 1
  i += 1
}
xs
xs: Array[Int] = Array(1, 2, 3, 4, 5)
i: Int = 0
res16: Array[Int] = Array(2, 3, 4, 5, 6)

4.2.1 Map

  • map creates a copy of the collection with the updated values
xs map (_ + 1)
res18: Array[Int] = Array(3, 4, 5, 6, 7)
def map[A, B](fa: List[A])(f: A => B): List[B]

4.2.2 Reduction

  • foldLeft applies a binary operation across a collection, accumulating a result from an initial value
xs.foldLeft(0)(_ + _)
res20: Int = 20
def foldLeft[A, B](fa: List[A])(z: B)(f: (B, A) => B): B

4.2.3 flatMap

  • flatMap applies a function which returns a collection to each element, then flattens the result (sometimes called bind)
xs flatMap (x => List(x, x + 1, x + 2))
res22: Array[Int] = Array(2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6, 7, 6, 7, 8)
def flatMap[A, B](fa: List[A])(f: A => List[B]): List[B]

4.3 Polymorphism

  • Sometimes static types are associated with verbosity
  • Type inference and ad-hoc Polymorphism can help
  • This function will add together all elements in a list which have a numeric type
import Numeric.Implicits._

def sum[A: Numeric](xs: List[A]): A =
  xs.foldLeft(implicitly[Numeric[A]].zero)(_ + _)

4.4 Typeclasses

  • A typeclass describes behaviour which can be implemented for many different types, separately from the definitions of those types
trait Numeric[A] {
 def compare(x: A, y: A): Int
 def fromInt(x: Int): A
 def minus(x: A, y: A): A
 def negate(x: A): A
 def plus(x: A, y: A): A
 def times(x: A, y: A): A
 def toDouble(x: A): Double
 def toFloat(x: A): Float
 def toInt(x: A): Int
 def toLong(x: A): Long 
}

4.5 Typeclasses

  • Concrete members of a typeclass can be provided using implicit definitions
implicit def numericInt = new Numeric[Int] { ... }
  • Type safety is retained and we don't have to write functions twice - a complete example with a small typeclass is sketched below
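
A complete example with a small, hypothetical typeclass, showing the full pattern of trait, implicit instance and generic function:

trait Show[A] {
  def show(a: A): String
}

implicit val showInt: Show[Int] = new Show[Int] {
  def show(a: Int): String = s"Int($a)"
}

// the instance is resolved from implicit scope at compile time
def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

describe(42)
res0: String = Int(42)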

5 Category Theory

5.1 What is a Category?

  • A category \(\mathcal{C}\) consists of objects, \(\textrm{obj}(\mathcal{C})\), and arrows or morphisms between objects, \(\textrm{hom}(\mathcal{C})\)
  • Morphisms compose: for \(f: X \rightarrow Y\) and \(g: Y \rightarrow Z\), the composite \(h = g \circ f: X \rightarrow Z\) is also in \(\textrm{hom}(\mathcal{C})\)
  • Objects must have identity morphisms, written \(\textrm{id}_X: X \rightarrow X\)

5.2 Functors

  • A functor is a mapping between categories which preserves structure
  • \(\mathcal{C}\) and \(\mathcal{D}\) are categories, then a functor \(F:\mathcal{C} \rightarrow \mathcal{D}\):
    • Associates \(X \in \mathcal{C}\) to an object \(F(X) \in \mathcal{D}\)
    • And each morphism, \(f:X \rightarrow Y\) in \(\mathcal{C}\) to a morphism in \(\mathcal{D}\), \(F(f): F(X) \rightarrow F(Y)\).
  • Satisfying
    • \(F(\textrm{id}_X) = \textrm{id}_{F(X)}\) for each \(X \in \mathcal{C}\)
    • \(F(g \circ f) = F(g) \circ F(f)\) for all composable morphisms \(f, g\) in \(\mathcal{C}\) - a Scala encoding is sketched below
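
A sketch of how a functor is usually encoded as a Scala typeclass (illustrative); the two laws above become properties that can be checked with property-based tests:

trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

implicit val listFunctor: Functor[List] = new Functor[List] {
  def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
}

// laws: map(fa)(identity) == fa
//       map(fa)(g compose f) == map(map(fa)(f))(g)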

5.3 Natural Transformation

  • A functor is a morphism between two categories; a natural transformation is a morphism between functors
  • If \(F, G: \mathcal{C} \rightarrow \mathcal{D}\) are functors, then a natural transformation \(\alpha: F \Rightarrow G\) is a family of morphisms such that:
    • \(\forall X \in \mathcal{C}\), \(\alpha_X: F(X) \rightarrow G(X)\) is a morphism in \(\mathcal{D}\)
    • for each morphism \(f: X \rightarrow Y\) in \(\mathcal{C}\), \(\alpha_Y \circ F(f) = G(f) \circ \alpha_X\).

5.3.1 Natural Transformation

natural_transformation.png

Figure 1: Commutative diagram for the second condition of a natural transformation

5.4 Monads

  • A monad is an endofunctor, \(T: \mathcal{C} \rightarrow \mathcal{C}\), with two natural transformations (a Scala sketch follows these laws)
    • \(\eta: \textrm{Id}_{\mathcal{C}} \rightarrow T\)
    • \(\mu: T \circ T \rightarrow T\)
  • Such that \(\mu \circ T \mu = \mu \circ \mu T\)
  • and \(\mu \circ T\eta = \mu \circ \eta T = \textrm{Id}_T\)
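
A sketch of the corresponding Scala typeclass (illustrative encoding): pure plays the role of \(\eta\), and flatMap combines the functor action with \(\mu\):

trait Monad[F[_]] {
  def pure[A](a: A): F[A]                          // eta
  def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]  // map followed by mu
  // map is recovered from pure and flatMap, so every monad is a functor
  def map[A, B](fa: F[A])(f: A => B): F[B] = flatMap(fa)(a => pure(f(a)))
}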

5.4.1 Monads

Table 1: Commutative diagrams for the monad laws
monad_laws.png monad_law_2.png

5.5 Why?

  • Types and functions form a category, called Hask; every functor is hence an endofunctor, \(F: \texttt{Hask} \rightarrow \texttt{Hask}\)
  • Principled abstractions for functional programming
  • Testing mathematical laws instead of individual functions
  • Verifying the correctness of programs

5.5.1 Distribution is a Monad

  • A monad provides a context for an effect
  • Can preserve immutability by encapsulating random draws in a monad Rand[A] representing a distribution over the type A
  • The unit is the Dirac distribution
  • flatMap combines a distribution with a conditional distribution, representing a joint distribution with the intermediate value marginalised out
def coinFlip(n: Int): Rand[Int] = Beta(3, 3).flatMap(p => Binomial(n, p))

5.5.2 Syntactic Sugar

  • A for comprehension is syntactic sugar for chains of flatMap and map - the desugared form is shown after the example
def coinFlip(n: Int): Rand[Int] = for {
  p <- Beta(3, 3)
  y <- Binomial(n, p)
} yield y
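
The compiler rewrites the for comprehension above into the equivalent flatMap/map chain:

def coinFlipDesugared(n: Int): Rand[Int] =
  Beta(3, 3).flatMap(p => Binomial(n, p).map(y => y))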

6 Automatic Differentiation

6.1 What?

  • Calculate the exact derivative of a function at a point
  • Not symbolic differentiation
  • Not numeric differentiation

6.2 Forward Mode AD

  • Consider the function \(f(x) = x^2 + 2x + 5\) with derivative \(f^\prime(x) = 2x + 2\)
  • We wish to calculate the derivative of \(f\) at a specific value of \(x\); suppose \(x = 5\), then \(f(5) = 40\) and \(f^\prime(5) = 12\)

6.3 Dual Numbers

  • To perform forward mode AD, replace \(x = 5\) with the dual number \(x^\prime = 5 + \varepsilon\), where \(\varepsilon^2 = 0\), and calculate \(f(x^\prime)\):

    \begin{align*} f(5 + \varepsilon) &= (5 + \varepsilon)^2 + 2(5 + \varepsilon) + 5 \\ &= 25 + 10\varepsilon + \varepsilon^2 + 10 + 2\varepsilon + 5 \\ &= 40 + 12\varepsilon \end{align*}
  • The number of computations depends on the dimension of the input space, i.e. the dimension of the parameters

6.4 Implementation

  • Create a new class representing a dual number:
case class Dual(real: Double, eps: Double)
  • Define an instance of the Numeric[Dual] typeclass - a full sketch follows the operator definitions

6.4.1 Dual Operators

def plus(x: Dual, y: Dual) =
  Dual(x.real + y.real, x.eps + y.eps)
def minus(x: Dual, y: Dual): Dual =
  Dual(x.real - y.real, x.eps - y.eps)
def times(x: Dual, y: Dual) =
  Dual(x.real * y.real, x.eps * y.real + y.eps * x.real)
def div(x: Dual, y: Dual) =
  Dual(x.real / y.real, 
  (x.eps * y.real - x.real * y.eps) / (y.real * y.real))
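
A sketch of wiring these operators into an instance of the standard library typeclass; Fractional[Dual] extends Numeric[Dual] and additionally needs div (written against Scala 2.12 - Scala 2.13 also requires parseString):

implicit val dualFractional: Fractional[Dual] = new Fractional[Dual] {
  def plus(x: Dual, y: Dual): Dual = Dual(x.real + y.real, x.eps + y.eps)
  def minus(x: Dual, y: Dual): Dual = Dual(x.real - y.real, x.eps - y.eps)
  def times(x: Dual, y: Dual): Dual =
    Dual(x.real * y.real, x.eps * y.real + y.eps * x.real)
  def div(x: Dual, y: Dual): Dual =
    Dual(x.real / y.real,
      (x.eps * y.real - x.real * y.eps) / (y.real * y.real))
  def negate(x: Dual): Dual = Dual(-x.real, -x.eps)
  def fromInt(x: Int): Dual = Dual(x.toDouble, 0.0) // constants have zero dual part
  def toInt(x: Dual): Int = x.real.toInt
  def toLong(x: Dual): Long = x.real.toLong
  def toFloat(x: Dual): Float = x.real.toFloat
  def toDouble(x: Dual): Double = x.real
  def compare(x: Dual, y: Dual): Int = x.real.compare(y.real)
}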

6.4.2 Dual Usage

// assumes implicit conversions lifting Double to Dual and the Dual arithmetic
// above, so that the mixed Double/Dual expressions below typecheck
def logPos(ys: Vector[Double])(mu: Dual) = 
  (-0.5 * 0.125 * mu * mu) - 0.125 * ys.
    map(_ - mu).
    map(x => x * x).
    reduce(_ + _)

6.4.3 Extension to Gradients

  • Define a new Dual class
case class DualV(real: Double, dual: Vector[Double])
  • The dual parts of different variables can be confused; each variable in a computation must carry a dual vector of the same dimension, seeded with a one in its own position and zeros elsewhere

6.5 Reverse Mode AD

  • Reverse mode AD scales in the dimension of the output space
  • Faster than forward mode for \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) when \(n > m\)

7 Putting it all together

7.1 Building our own PPL

  • Embedded DSL for model building using Monads
  • Fast generic inference schemes for sampling from posterior distributions
  • Automatic differentiation for gradient based samplers

7.2 Linear Regression

\begin{equation} y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma). \end{equation}

7.2.1 Linear Regression

val model = for {
  b0 <- Normal(0.0, 5.0).param
  b1 <- Normal(0.0, 5.0).param
  b2 <- Normal(0.0, 5.0).param
  sigma <- Gamma(2.0, 2.0).param
  _ <- Predictor.fromDoubleVector { xs =>
    {
      val mean = b0 + b1 * xs.head + b2 * xs(1)
      Normal(mean, sigma)
    }
  }
  .fit(x zip y)
} yield Map("b0" -> b0, "b1" -> b1, "b2" -> b2, "sigma" -> sigma) 

model.sample(EHMC(10), 5000, 10000 * 20, 20)

7.2.2 Linear Regression

ehmc_lm.png

Figure 2: Diagnostic plots for simulated data from the linear regression model

7.3 Stochastic Volatility

  • A heteroskedastic time series with latent log-volatility \(\alpha_t\)
\begin{align} y_t &= \sigma_t\exp\left(\frac{\alpha_t}{2}\right), &\sigma_t &\sim \mathcal{N}(0, 1), \\ \textrm{d}\alpha_t &= \phi(\mu - \alpha_t)\textrm{d}t + \sigma \textrm{d}W_t. \end{align}

7.3.1 Stochastic Volatility

sv_ou_sims.png

Figure 3: Simulation from a stochastic volatility model with Ornstein-Uhlenbeck latent-state

7.3.2 Stochastic Volatility

val prior = for {
  phi1 <- Beta(5.0, 2.0).param
  phi = 2 * phi1 - 1
  mu <- Normal(0.0, 2.0).param
  sigma <- LogNormal(2.0, 2.0).param
  x0 <- Normal(mu, sigma * sigma / (1 - phi * phi)).param
  t0 = 0.0
} yield (t0, phi, mu, sigma, x0)

def ouStep(phi: Real, mu: Real, sigma: Real, x0: Real, dt: Double) = {
  val mean = mu + (-1.0 * phi * dt).exp * (x0 - mu)
  val variance = sigma.pow(2) * (1 - (-2 * phi * dt).exp) / (2*phi)
  Normal(mean, variance.pow(0.5))
}

7.3.3 Stochastic Volatility

def step(st: RandomVariable[(Double, Real, Real, Real, Real)],
         y: (Double, Double)) = for {
    (t, phi, mu, sigma, x0) <- st
    dt = y._1 - t
    x1 <- ouStep(phi, mu, sigma, x0, dt).param
    _ <- Normal(0.0, (x1 * 0.5).exp).fit(y._2)
  } yield (t + dt, phi, mu, sigma, x1)


val fullModel = ys.foldLeft(prior)(step)

7.4 Mixture Model

  • Data is assumed to come from one of \(k\) distributions
\begin{align*} y_i &\sim \mathcal{N}(\mu_k, \sigma), \\ k &\sim \mathcal{F}(\theta). \end{align*}
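
Marginalising over the component indicator gives the mixture density fitted in the code below, with weights \(\theta_j\) summing to one:

\begin{equation*} p(y_i \mid \theta, \mu, \sigma) = \sum_{j=1}^{k} \theta_j \, \mathcal{N}(y_i \mid \mu_j, \sigma). \end{equation*}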

7.4.1 Mixture Model

mixture_model.png

Figure 4: 500 Simulations from a mixture model with \(\theta = \{0.3, 0.2, 0.5\}\) and \(\mu = \{-2.0, 1.0, 3.0\}\), \(\sigma = 0.5\)

7.4.2 Mixture Model

val model = for {
  theta1 <- Beta(2.0, 5.0).param
  theta2 <- Beta(2.0, 5.0).param
  alphas = Seq(theta1.log, theta2.log, Real.zero)
  thetas = softmax(alphas)
  mu1 <- Normal(0.0, 5.0).param
  mu2 <- Normal(0.0, 5.0).param
  mu3 <- Normal(0.0, 5.0).param
  mus = Seq(mu1, mu2, mu3)
  sigma <- Gamma(2.0, 10.0).param
  components: Map[Continuous, Real] = mus.zip(thetas).map {
    case (m: Real, t: Real) => (Normal(m, sigma) -> t) }.toMap
  _ <- Mixture(components).fit(ys)
} yield Map("theta1" -> thetas.head, "theta2" -> thetas(1),
              "theta3" -> thetas(2),
                "mu1" -> mu1, "mu2" -> mu2, "mu3" -> mu3, "sigma" -> sigma)

7.4.3 Mixture Model

mixture_model_posterior.png

Figure 5: Posterior Densities for the parameters in the mixture model

8 Conclusion

  • Probabilistic programming languages can be used to quickly prototype new models without implementing and testing new inference schemes
  • Embedded PPLs can be deployed in production code without rewriting - reducing bugs and enabling faster inference in industrial settings