Probabilistic Programming in Scala

Jonny Law

2018-12-15 Sat 00:00


1 Introduction to Bayesian Inference

1.1 Bayesian Statistics

  • Bayesian statistics is concerned with expressing uncertainty using probability
  • Provides a framework for subjective beliefs and updating - reflecting how people reason in real life

1.2 Bayes Theorem

  • Bayes theorem allows us to determine the probability of a hypothesis being true by collecting data related to the hypothesis

\(p(H|y) = \frac{p(y|H)p(H)}{p(y)}, \qquad p(y) = \int p(y|H)p(H)\,\mathrm{d}H\)

1.3 Finding the integral

  • The denominator of Bayes theorem is often intractable
  • Sampling based inference methods can be used to approximate the posterior distribution

2 Probabilistic Programming

2.1 What is Probabilistic Programming?

  • Efficiently expressing Bayesian statistical models and performing inference
  • Provide a consistent, flexible modelling language for specifying prior beliefs and likelihoods
  • Abstract away the inference algorithms from the user

2.2 Examples of PPLs

2.3 Why?

  • Small data
  • Transparent, interpretable models
  • Incorporate expert judgment required in many areas of business

3 Statistical Computation

3.1 Conjugacy

  • The prior distribution is in the same family as the posterior distribution, so the posterior can be derived analytically
  • Only applicable for some models - a conjugate Beta-Binomial example is shown below
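
As a concrete example (a standard conjugate pair, not taken from the talk), a Beta prior combined with a binomial likelihood gives a Beta posterior in closed form:

\begin{align*} p &\sim \textrm{Beta}(a, b), \\ y \mid p &\sim \textrm{Binomial}(n, p), \\ p \mid y &\sim \textrm{Beta}(a + y, b + n - y). \end{align*}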

3.2 Gibbs Sampling

  • Markov chain Monte Carlo (MCMC) method which exploits conditional probability to derive conditionally conjugate distributions
  • Sample from each conditional conjugate distribution in turn to create a Markov chain representing draws from the full posterior distribution
  • Restricts the form of the prior distribution to a conjugate family - a minimal sketch for a Normal model follows this list
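
A minimal sketch of a Gibbs sampler, assuming observations \(y_i \sim \mathcal{N}(\mu, 1/\tau)\) with hypothetical conjugate priors \(\mu \sim \mathcal{N}(0, 10)\) and \(\tau \sim \textrm{Gamma}(2, 2)\), written against Breeze distributions (not code from the talk):

import breeze.stats.distributions._

case class State(mu: Double, tau: Double)

// mu | tau, y is Gaussian: combine the prior precision (1/10) with n * tau
def sampleMu(ys: Vector[Double], tau: Double): Double = {
  val prec = 0.1 + ys.size * tau
  val mean = tau * ys.sum / prec
  Gaussian(mean, math.sqrt(1.0 / prec)).draw()
}

// tau | mu, y is Gamma; Breeze's Gamma is parameterised by shape and scale
def sampleTau(ys: Vector[Double], mu: Double): Double = {
  val shape = 2.0 + ys.size / 2.0
  val rate = 2.0 + 0.5 * ys.map(y => (y - mu) * (y - mu)).sum
  Gamma(shape, 1.0 / rate).draw()
}

// one sweep of the sampler; iterating this function gives the Markov chain
def gibbsStep(ys: Vector[Double])(s: State): State = {
  val mu = sampleMu(ys, s.tau)
  State(mu, sampleTau(ys, mu))
}

Iterating with, for example, Iterator.iterate(State(0.0, 1.0))(gibbsStep(ys)) (where ys is the observed data) produces draws from the joint posterior after discarding burn-in.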

3.3 Metropolis-Hastings

  • General MCMC method with no restriction on prior distributions
  • New parameters are proposed from some proposal distribution, \(\psi^\star \sim p(\psi^\star|\psi)\) and accepted according to the Metropolis-Hastings ratio \(A = \frac{p(\psi^\star)\pi(Y|\psi^\star)q(\psi|\psi^\star)}{p(\psi)\pi(Y|\psi)q(\psi^\star|\psi)}\)
  • \(p(\psi)\) is the prior distribution, \(\pi(Y|\psi)\) is the likelihood, \(q(\psi|\psi^\star)\) is the probability of moving from \(\psi^\star\) to \(\psi\)

3.3.1 Metropolis-Hastings

def mhStep[P](posterior: P => Double,   // un-normalised log posterior
              proposal: P => Dist[P]) = { p: P =>
  for {
    ps <- proposal(p)                   // propose a new value
    // log of the Metropolis-Hastings acceptance ratio A
    a = posterior(ps) - proposal(p).logPdf(ps) - 
      posterior(p) + proposal(ps).logPdf(p)
    u <- Dist.uniform(0, 1)
    next = if (log(u) < a) ps else p    // accept, otherwise keep the current value
  } yield next
}

3.3.2 Metropolis-Hastings

  • Metropolis-Hastings is simple to implement and guaranteed to converge if it's left long enough - but no one wants to wait forever
  • The optimal acceptance ratio is 0.234 - most moves are rejected
  • A random-walk proposal centered at the previous parameter is often a default choice, \(\psi^\star \mid \psi \sim \mathcal{N}(\psi, \Sigma)\) - a sketch using a symmetric Gaussian proposal follows this list
  • The cost of an independent sample from the stationary distribution is \(\mathcal{O}(d^2)\) for a \(d\) dimensional parameter space
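
A hedged sketch of the random-walk variant using Breeze distributions (illustrative, not the talk's code); logPost is an assumed un-normalised log posterior, and because the Gaussian proposal is symmetric the \(q\) terms in the acceptance ratio cancel:

import breeze.linalg._
import breeze.stats.distributions._

def rwStep(logPost: DenseVector[Double] => Double, sigma: Double)(
    psi: DenseVector[Double]): Rand[DenseVector[Double]] =
  for {
    // symmetric Gaussian random walk centred at the current value
    z <- MultivariateGaussian(
      DenseVector.zeros[Double](psi.length),
      DenseMatrix.eye[Double](psi.length) * (sigma * sigma))
    proposed = psi + z
    // symmetric proposal: only the difference in log posteriors remains
    a = logPost(proposed) - logPost(psi)
    u <- Uniform(0, 1)
  } yield if (math.log(u) < a) proposed else psi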

3.4 Hamiltonian Monte Carlo

  • Can we use gradient information from the un-normalised log posterior?
  • Improved proposal based on Hamilton's Equations:

    \begin{align} \frac{\mathrm{d}p}{\mathrm{d}t} &= -\frac{\partial \mathcal{H}}{\partial q}, \\ \frac{\mathrm{d}q}{\mathrm{d}t} &= +\frac{\partial\mathcal{H}}{\partial p} \end{align}
  • \(\boldsymbol{p}\) is the momentum, equal to \(m\dot{\boldsymbol{q}}\)
  • \(\boldsymbol{q}\) is the particle position

3.4.1 Hamiltonian Monte Carlo

  • The static parameters correspond to the position in Hamilton's equations, the momentum is an auxiliary parameter
  • The joint density of the parameters and momentum can be written as: \(p(\psi, \phi) \propto \exp \left\{ \log p(\psi|y) - \frac{1}{2}\phi^T\phi \right\}\)
  • A symplectic discretisation of Hamilton's equations, called a leapfrog step, is used which approximately conserves the energy; the discretisation error is corrected by a Metropolis acceptance step

3.4.2 The Leapfrog step

\begin{align*} \phi_{t+\varepsilon/2} &= \phi_{t} + \frac{\varepsilon}{2} \nabla_\psi\log p(\psi_{t}|y), \\ \psi_{t+\varepsilon} &= \psi_{t} + \varepsilon \phi_{t+\varepsilon/2}, \\ \phi_{t+\varepsilon} &= \phi_{t+\varepsilon/2} + \frac{\varepsilon}{2} \nabla_\psi\log p(\psi_{t+\varepsilon}|y). \end{align*}

3.4.3 Hamiltonian Monte Carlo

  • The leapfrog has a tuning parameter, the step size \(\varepsilon\)
  • Only continuous distributions can be used since the un-normalised log-posterior must be differentiable
  • Non conjugate prior distributions can be used, like Metropolis-Hastings
  • HMC is more computationally efficient, requiring \(\mathcal{O}(d^{5/4})\) for an independent sample from the posterior distribution of a \(d\) dimensional parameter space; the optimal acceptance rate is 0.65
  • Calculating derivatives is tedious and error-prone

3.4.4 HMC algorithm in Scala

def step(psi: DenseVector[Double]): Rand[DenseVector[Double]] = {
  for {
    phi <- priorPhi
    (propPsi, propPhi) = leapfrogs(eps, gradient, l, psi, phi)
    a = logAcceptance(propPsi, propPhi, psi, phi, ll, priorPhi)
    u <- Uniform(0, 1)
    next = if (log(u) < a) {
      propPsi
    } else {
      psi
    }
  } yield next
}

3.4.5 The Leapfrog step

def leapfrog(
  eps: Double,
  gradient: DenseVector[Double] => DenseVector[Double])(
  psi: DenseVector[Double],
  phi: DenseVector[Double]) = {
  val p1 = phi + eps * 0.5 * gradient(psi)
  val t1 = psi + eps * p1
  val p2 = p1 + eps * 0.5 * gradient(t1)
  (t1, p2)
}

3.4.6 Multiple leapfrog steps

def leapfrogs(
  eps: Double,
  gradient: DenseVector[Double] => DenseVector[Double],
  l: Int,
  psi: DenseVector[Double],
  phi: DenseVector[Double]): (DenseVector[Double], DenseVector[Double]) = {
    if (l == 0) {
      (psi, phi)
    } else {
      val (t, p) = leapfrog(eps, gradient)(psi, phi)
      leapfrogs(eps, gradient, l - 1, t, p)
    }
  }

3.5 Tuning Hamiltonian Monte Carlo

  • The step size \(\varepsilon\) and the number of leapfrog steps \(l\) are tuning parameters which can be determined with pilot runs aiming for the optimal acceptance rate of 0.65 - a crude step size adjustment is sketched after this list
  • Dual averaging and the NUTS algorithm can be used to determine an appropriate step size and number of steps automatically
  • eHMC is another algorithm which automatically selects the number of leapfrog steps
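
A deliberately crude sketch of pilot-run tuning (a hypothetical rule of thumb, not dual averaging): nudge the step size towards the target acceptance rate between pilot runs.

// increase eps when accepting too often, decrease it when accepting too rarely
def tuneEps(eps: Double, acceptRate: Double, target: Double = 0.65): Double =
  if (acceptRate > target) eps * 1.1 else eps * 0.9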

4 Functional Programming

4.1 Good things

  • Pure Functions
  • Function Composition
  • Immutable Data Structures
  • Static Types with type inference
  • Predictable, correct programs

4.2 Higher Order Functions

  • Let's apply a function to a list
val xs = Array(1,2,3,4,5)
var i = 0
while (i < xs.size) {
  xs(i) = xs(i) + 1
  i += 1
}
xs
xs: Array[Int] = Array(1, 2, 3, 4, 5)
i: Int = 0
res16: Array[Int] = Array(2, 3, 4, 5, 6)

4.2.1 Map

  • map creates a copy of the collection with the updated values
xs map (_ + 1)
res18: Array[Int] = Array(3, 4, 5, 6, 7)
def map[A, B](fa: List[A])(f: A => B): List[B]

4.2.2 Reduction

  • foldLeft applies a binary operation across a collection, accumulating a result from an initial value
xs.foldLeft(0)(_ + _)
res20: Int = 20
def foldLeft[A, B](fa: List[A])(z: B)(f: (B, A) => B): B

4.2.3 flatMap

  • flatMap applies a function which returns a collection to each element, then flattens the result (sometimes called bind)
xs flatMap (x => List(x, x + 1, x + 2))
res22: Array[Int] = Array(2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6, 7, 6, 7, 8)
def flatMap[A, B](fa: List[A])(f: A => List[B]): List[B]

4.3 Polymorphism

  • Sometimes static types are associated with verbosity
  • Type inference and ad-hoc Polymorphism can help
  • This function will add together all elements in a list which have a numeric type
import Numeric.Implicits._

def sum[A: Numeric](xs: List[A]): A =
  xs.foldLeft(implicitly[Numeric[A]].zero)(_ + _)

4.4 Typeclasses

  • A typeclass describes behaviour which can be implemented for many different types, separately from the definitions of those types
trait Numeric[A] {
 def compare(x: A, y: A): Int
 def fromInt(x: Int): A
 def minus(x: A, y: A): A
 def negate(x: A): A
 def plus(x: A, y: A): A
 def times(x: A, y: A): A
 def toDouble(x: A): Double
 def toFloat(x: A): Float
 def toInt(x: A): Int
 def toLong(x: A): Long 
}

4.5 Typeclasses

  • Concrete members of a typeclass can be provided using implicit definitions
implicit def numericInt = new Numeric[Int] { ... }
  • Type safety is retained and we don't have to write functions twice - a complete example with a small typeclass is sketched below
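
A complete example with a small, hypothetical typeclass, showing the full pattern of trait, implicit instance and generic function:

trait Show[A] {
  def show(a: A): String
}

implicit val showInt: Show[Int] = new Show[Int] {
  def show(a: Int): String = s"Int($a)"
}

// the instance is resolved from implicit scope at compile time
def describe[A](a: A)(implicit s: Show[A]): String = s.show(a)

describe(42)
res0: String = Int(42)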

5 Category Theory

5.1 What is a Category?

  • A category \(\mathcal{C}\) consists of objects, \(\textrm{obj}(\mathcal{C})\), and arrows or morphisms between objects, \(\textrm{hom}(\mathcal{C})\)
  • Morphisms compose: for \(f: X \rightarrow Y\) and \(g: Y \rightarrow Z\), the composite \(h = g \circ f: X \rightarrow Z\) is also in \(\textrm{hom}(\mathcal{C})\)
  • Objects must have identity morphisms, written \(\textrm{id}_X: X \rightarrow X\)

5.2 Functors

  • A functor is a mapping between categories which preserves structure
  • \(\mathcal{C}\) and \(\mathcal{D}\) are categories, then a functor \(F:\mathcal{C} \rightarrow \mathcal{D}\):
    • Associates \(X \in \mathcal{C}\) to an object \(F(X) \in \mathcal{D}\)
    • And each morphism, \(f:X \rightarrow Y\) in \(\mathcal{C}\) to a morphism in \(\mathcal{D}\), \(F(f): F(X) \rightarrow F(Y)\).
  • Satisfying
    • \(F(\textrm{id}_X) = \textrm{id}_{F(X)}\) for each \(X \in \mathcal{C}\)
    • \(F(g \circ f) = F(g) \circ F(f)\) for all composable morphisms \(f, g\) in \(\mathcal{C}\) - a Scala encoding is sketched below
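
A sketch of how a functor is usually encoded as a Scala typeclass (illustrative); the two laws above become properties that can be checked with property-based tests:

trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

implicit val listFunctor: Functor[List] = new Functor[List] {
  def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
}

// laws: map(fa)(identity) == fa
//       map(fa)(g compose f) == map(map(fa)(f))(g)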

5.3 Natural Transformation

  • A functor is a morphism between two categories; a natural transformation is a morphism between functors
  • If \(F, G: \mathcal{C} \rightarrow \mathcal{D}\) are functors, then a natural transformation \(\alpha: F \Rightarrow G\) is a family of morphisms such that:
    • \(\forall X \in \mathcal{C}\), \(\alpha_X: F(X) \rightarrow G(X)\) is a morphism in \(\mathcal{D}\)
    • for each morphism \(f: X \rightarrow Y\) in \(\mathcal{C}\), \(\alpha_Y \circ F(f) = G(f) \circ \alpha_X\).

5.3.1 Natural Transformation

natural_transformation.png

Figure 1: Commutative diagram for the second condition of a natural transformation

5.4 Monads

  • A monad is an endofunctor, \(T: \mathcal{C} \rightarrow \mathcal{C}\), with two natural transformations (a Scala sketch follows these laws)
    • \(\eta: \textrm{Id}_{\mathcal{C}} \rightarrow T\)
    • \(\mu: T \circ T \rightarrow T\)
  • Such that \(\mu \circ T \mu = \mu \circ \mu T\)
  • and \(\mu \circ T\eta = \mu \circ \eta T = \textrm{Id}_T\)
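
A sketch of the corresponding Scala typeclass (illustrative encoding): pure plays the role of \(\eta\), and flatMap combines the functor action with \(\mu\):

trait Monad[F[_]] {
  def pure[A](a: A): F[A]                          // eta
  def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]  // map followed by mu
  // map is recovered from pure and flatMap, so every monad is a functor
  def map[A, B](fa: F[A])(f: A => B): F[B] = flatMap(fa)(a => pure(f(a)))
}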

5.4.1 Monads

Table 1: Commutative diagrams for the monad laws
monad_laws.png monad_law_2.png

5.5 Why?

  • Types and functions form a category, called Hask; every functor is hence an endofunctor, \(F: \texttt{Hask} \rightarrow \texttt{Hask}\)
  • Principled abstractions for functional programming
  • Testing mathematical laws instead of individual functions
  • Verifying the correctness of programs

5.5.1 Distribution is a Monad

  • A monad provides a context for an effect
  • Can preserve immutability by encapsulating random draws in a monad Rand[A] representing a distribution over the type A
  • The unit is the Dirac distribution
  • flatMap combines a distribution with a conditional distribution, representing a joint distribution with the intermediate value marginalised out
def coinFlip(n: Int): Rand[Int] = Beta(3, 3).flatMap(p => Binomial(n, p))

5.5.2 Syntactic Sugar

  • A for comprehension is syntactic sugar for chains of flatMap and map - the desugared form is shown after the example
def coinFlip(n: Int): Rand[Int] = for {
  p <- Beta(3, 3)
  y <- Binomial(n, p)
} yield y
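
The compiler rewrites the for comprehension above into the equivalent flatMap/map chain:

def coinFlipDesugared(n: Int): Rand[Int] =
  Beta(3, 3).flatMap(p => Binomial(n, p).map(y => y))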

6 Automatic Differentiation

6.1 What?

  • Calculate the exact derivative of a function at a point
  • Not symbolic differentiation
  • Not numeric differentiation

6.2 Forward Mode AD

  • Consider the function \(f(x) = x^2 + 2x + 5\) with derivative \(f^\prime(x) = 2x + 2\)
  • We wish to calculate the derivative of \(f\) at a specific value of \(x\); suppose \(x = 5\), then \(f(5) = 40\) and \(f^\prime(5) = 12\)

6.3 Dual Numbers

  • To perform forward mode AD, replace \(x = 5\) with the dual number \(x^\prime = 5 + \varepsilon\), where \(\varepsilon^2 = 0\), and calculate \(f(x^\prime)\):

    \begin{align*} f(5 + \varepsilon) &= (5 + \varepsilon)^2 + 2(5 + \varepsilon) + 5 \\ &= 25 + 10\varepsilon + \varepsilon^2 + 10 + 2\varepsilon + 5 \\ &= 40 + 12\varepsilon \end{align*}
  • The number of computations depends on the dimension of the input space, i.e. the dimension of the parameters

6.4 Implementation

  • Create a new class representing a dual number:
case class Dual(real: Double, eps: Double)
  • Define an instance of the Numeric[Dual] typeclass - a full sketch follows the operator definitions

6.4.1 Dual Operators

def plus(x: Dual, y: Dual) =
  Dual(x.real + y.real, x.eps + y.eps)
def minus(x: Dual, y: Dual): Dual =
  Dual(x.real - y.real, x.eps - y.eps)
def times(x: Dual, y: Dual) =
  Dual(x.real * y.real, x.eps * y.real + y.eps * x.real)
def div(x: Dual, y: Dual) =
  Dual(x.real / y.real, 
  (x.eps * y.real - x.real * y.eps) / (y.real * y.real))
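
A sketch of wiring these operators into an instance of the standard library typeclass; Fractional[Dual] extends Numeric[Dual] and additionally needs div (written against Scala 2.12 - Scala 2.13 also requires parseString):

implicit val dualFractional: Fractional[Dual] = new Fractional[Dual] {
  def plus(x: Dual, y: Dual): Dual = Dual(x.real + y.real, x.eps + y.eps)
  def minus(x: Dual, y: Dual): Dual = Dual(x.real - y.real, x.eps - y.eps)
  def times(x: Dual, y: Dual): Dual =
    Dual(x.real * y.real, x.eps * y.real + y.eps * x.real)
  def div(x: Dual, y: Dual): Dual =
    Dual(x.real / y.real,
      (x.eps * y.real - x.real * y.eps) / (y.real * y.real))
  def negate(x: Dual): Dual = Dual(-x.real, -x.eps)
  def fromInt(x: Int): Dual = Dual(x.toDouble, 0.0) // constants have zero dual part
  def toInt(x: Dual): Int = x.real.toInt
  def toLong(x: Dual): Long = x.real.toLong
  def toFloat(x: Dual): Float = x.real.toFloat
  def toDouble(x: Dual): Double = x.real
  def compare(x: Dual, y: Dual): Int = x.real.compare(y.real)
}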

6.4.2 Dual Usage

// assumes implicit conversions lifting Double to Dual and the Dual arithmetic
// above, so that the mixed Double/Dual expressions below typecheck
def logPos(ys: Vector[Double])(mu: Dual) = 
  (-0.5 * 0.125 * mu * mu) - 0.125 * ys.
    map(_ - mu).
    map(x => x * x).
    reduce(_ + _)

6.4.3 Extension to Gradients

  • Define a new Dual class
case class DualV(real: Double, dual: Vector[Double])
  • The dual parts of different variables can be confused; each variable in a computation must carry a dual vector of the same dimension, seeded with a one in its own position and zeros elsewhere

6.5 Reverse Mode AD

  • Reverse mode AD scales in the dimension of the output space
  • Faster than forward mode for \(f: \mathbb{R}^n \rightarrow \mathbb{R}^m\) when \(n > m\)

7 Putting it all together

7.1 Building our own PPL

  • Embedded DSL for model building using Monads
  • Fast generic inference schemes for sampling from posterior distributions
  • Automatic differentiation for gradient based samplers

7.2 Linear Regression

\begin{equation} y_i = \beta^T x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma). \end{equation}

7.2.1 Linear Regression

val model = for {
  b0 <- Normal(0.0, 5.0).param
  b1 <- Normal(0.0, 5.0).param
  b2 <- Normal(0.0, 5.0).param
  sigma <- Gamma(2.0, 2.0).param
  _ <- Predictor.fromDoubleVector { xs =>
    {
      val mean = b0 + b1 * xs.head + b2 * xs(1)
      Normal(mean, sigma)
    }
  }
  .fit(x zip y)
} yield Map("b0" -> b0, "b1" -> b1, "b2" -> b2, "sigma" -> sigma) 

model.sample(EHMC(10), 5000, 10000 * 20, 20)

7.2.2 Linear Regression

ehmc_lm.png

Figure 2: Diagnostic plots for simulated data from the linear regression model

7.3 Stochastic Volatility

  • A heteroskedastic time series with latent log-volatility \(\alpha_t\)
\begin{align} y_t &= \sigma_t\exp\left(\frac{\alpha_t}{2}\right), &\sigma_t &\sim \mathcal{N}(0, 1), \\ \textrm{d}\alpha_t &= \phi(\mu - \alpha_t)\textrm{d}t + \sigma \textrm{d}W_t. \end{align}

7.3.1 Stochastic Volatility

sv_ou_sims.png

Figure 3: Simulation from a stochastic volatility model with Ornstein-Uhlenbeck latent-state

7.3.2 Stochastic Volatility

val prior = for {
  phi1 <- Beta(5.0, 2.0).param
  phi = 2 * phi1 - 1
  mu <- Normal(0.0, 2.0).param
  sigma <- LogNormal(2.0, 2.0).param
  x0 <- Normal(mu, sigma * sigma / (1 - phi * phi)).param
  t0 = 0.0
} yield (t0, phi, mu, sigma, x0)

def ouStep(phi: Real, mu: Real, sigma: Real, x0: Real, dt: Double) = {
  val mean = mu + (-1.0 * phi * dt).exp * (x0 - mu)
  val variance = sigma.pow(2) * (1 - (-2 * phi * dt).exp) / (2*phi)
  Normal(mean, variance.pow(0.5))
}

7.3.3 Stochastic Volatility

def step(st: RandomVariable[(Double, Real, Real, Real, Real)],
         y: (Double, Double)) = for {
    (t, phi, mu, sigma, x0) <- st
    dt = y._1 - t
    x1 <- ouStep(phi, mu, sigma, x0, dt).param
    _ <- Normal(0.0, (x1 * 0.5).exp).fit(y._2)
  } yield (t + dt, phi, mu, sigma, x1)


val fullModel = ys.foldLeft(prior)(step)

7.4 Mixture Model

  • Data is assumed to come from one of \(k\) distributions
\begin{align*} y_i &\sim \mathcal{N}(\mu_k, \sigma), \\ k &\sim \mathcal{F}(\theta). \end{align*}
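
Marginalising over the component indicator gives the mixture density fitted in the code below, with weights \(\theta_j\) summing to one:

\begin{equation*} p(y_i \mid \theta, \mu, \sigma) = \sum_{j=1}^{k} \theta_j \, \mathcal{N}(y_i \mid \mu_j, \sigma). \end{equation*}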

7.4.1 Mixture Model

mixture_model.png

Figure 4: 500 Simulations from a mixture model with \(\theta = \{0.3, 0.2, 0.5\}\) and \(\mu = \{-2.0, 1.0, 3.0\}\), \(\sigma = 0.5\)

7.4.2 Mixture Model

val model = for {
  theta1 <- Beta(2.0, 5.0).param
  theta2 <- Beta(2.0, 5.0).param
  alphas = Seq(theta1.log, theta2.log, Real.zero)
  thetas = softmax(alphas)
  mu1 <- Normal(0.0, 5.0).param
  mu2 <- Normal(0.0, 5.0).param
  mu3 <- Normal(0.0, 5.0).param
  mus = Seq(mu1, mu2, mu3)
  sigma <- Gamma(2.0, 10.0).param
  components: Map[Continuous, Real] = mus.zip(thetas).map {
    case (m: Real, t: Real) => (Normal(m, sigma) -> t) }.toMap
  _ <- Mixture(components).fit(ys)
} yield Map("theta1" -> thetas.head, "theta2" -> thetas(1),
              "theta3" -> thetas(2),
                "mu1" -> mu1, "mu2" -> mu2, "mu3" -> mu3, "sigma" -> sigma)

7.4.3 Mixture Model

mixture_model_posterior.png

Figure 5: Posterior Densities for the parameters in the mixture model

8 Conclusion

  • Probabilistic programming languages can be used to quickly prototype new models without implementing and testing new inference schemes
  • Embedded PPLs can be deployed in production code without rewriting - reducing bugs and enabling faster inference in industrial settings