Analysis of pulse interval data

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from a Jupyter notebook. You can download the notebook here. You can download the data set here.

In [1]:
# Our powertools
import pandas as pd
import numpy as np
import scipy.special

import pystan

import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.palettes
import bokeh.plotting
bokeh.io.output_notebook()
Loading BokehJS ...

The data set and our goals

The data come from Yihan Lin's experiments in Michael Elowitz's lab. Yihan measured the inter-pulse times for Msn2 nuclear localization in yeast. The data set consists of a list of interpulse times. Our goal in analyzing these data is two-fold:

  1. Parameter estimation: Given a model for the probability distribution of waiting times, infer the parameters describing the distribution. E.g., if the waiting times $t$ are exponentially distributed with probability density function $f(t) = \tau^{-1}\,\mathrm{e}^{-t/\tau}$, then $\tau$ is the parameter we wish to find. We also want to know the error bar on $\tau$.
  2. Model selection: Given multiple models describing the probability distribution, quantify the relative goodness of each.

We will take a Bayesian approach to doing this.

The first thing we need to do is load in the data set.

In [2]:
# Read in data file (making sure the times are floats)
t = np.loadtxt('msn2_pulse_intervals.csv', comments='#').astype(float)

Exploratory data analysis

When presented with a new data set, it is a good idea to look at it. In the spirit of Barbara McClintock and countless hours looking at maize, we can "get a feeling for" our data just by looking.

We will start by looking at the ECDF.

In [3]:
def ecdf(x):
    """Return sorted data and the corresponding ECDF values."""
    return np.sort(x), np.arange(1, len(x)+1) / len(x)

# Plot cumulative histogram
p = bokeh.plotting.figure(plot_width=400, plot_height=250, x_axis_label='t (min)', y_axis_label='ECDF')
p.circle(*ecdf(t))
bokeh.io.show(p)

We see that about 20% of the interpulse times are above 30 minutes. We might want to look for power law behavior. Note that I used an ECDF to investigate the distribution of interpulse times. You should in general avoid histograms because they suffer from binning bias.

Complementary cumulative distribution functions are often even more useful because they allow for plotting on a logarithmic scale on both the $x$- and $y$-axes. This allows us to quickly determine if the waiting times are exponentially distributed, power law distributed, etc.

For example, if the waiting times are exponentially distributed, the complementary cumulative distribution is $\mathrm{e}^{-t/\tau}$, which would appear as a line on a semilog-$y$ plot. Similarly, if the waiting times are power law distributed, the complementary cumulative distribution scales like $t^{-a}$, which appears as a straight line on a log-log plot.
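To see why these appear as straight lines, take the logarithm of the complementary CDF in each case:

\begin{align} \ln\left(1 - F(t)\right) = -\frac{t}{\tau} \quad \text{(Exponential)}, \\[1em] \ln\left(1 - F(t)\right) = -a \ln t + \text{constant} \quad \text{(power law)}. \end{align}

The first is linear in $t$, hence a straight line on a semilog-$y$ plot; the second is linear in $\ln t$, hence a straight line on a log-log plot.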

In [4]:
# Compute the complementary ECDF (ECCDF)
t, y = ecdf(t)
y_eccdf = 1 - y

p1 = bokeh.plotting.figure(plot_width=400, plot_height=250,
                           x_axis_label='t (min)', y_axis_label='ECCDF',
                           y_axis_type='log')
p1.circle(t, y_eccdf)

p2 = bokeh.plotting.figure(plot_width=400, plot_height=250,
                           x_axis_label='t (min)', y_axis_label='ECCDF',
                           x_axis_type='log', y_axis_type='log')
p2.circle(t, y_eccdf)

bokeh.io.show(bokeh.layouts.gridplot([p1, p2], ncols=2))

The log-log plot (right) tells us immediately that the waiting time is likely not power law distributed. On the semilog-$y$ plot (left), we see that for intermediate waiting times, the complementary cumulative distribution function is approximately linear, implying that it is exponentially distributed. However, at short times, the distribution deviates from this behavior.

Proposed models

Now that we have a basic picture of our data, let's start laying out our analysis. We'll start by specifying which models we will consider for our data. We will consider three models.

  • Model 1: The interpulse waiting time is exponentially distributed. This means that pulsing is a Poisson process. This is the simplest model and has a single parameter, $\tau$, the characteristic waiting time between pulses. The probability density function for the interpulse waiting time is

    \begin{align} f(t\mid \tau, M_1) = \frac{1}{\tau}\,\mathrm{e}^{-t/\tau}. \end{align}

    The cumulative distribution function is

    \begin{align} F(t\mid \tau, M_1) = 1 - \mathrm{e}^{-t/\tau}. \end{align}

  • Model 2: The interpulse waiting time is Weibull distributed. This means that $(t/\tau)^\beta$ is exponentially distributed with mean of unity. In other words, there is an aging process here. If $\beta < 1$, the probability of a pulse coming in the next instant drops as time since the last pulse grows. This might explain the long tail. This model has two parameters, $\beta$ and $\tau$. The probability density function for the interpulse waiting time is

    \begin{align} f(t\mid \beta, \tau, M_2) = \frac{\beta}{\tau} \left(\frac{t}{\tau}\right)^{\beta - 1}\mathrm{e}^{-(t/\tau)^\beta}. \end{align}

    The cumulative distribution function is

    \begin{align} F(t\mid \beta, \tau, M_2) = 1 - \mathrm{e}^{-(t/\tau)^\beta}. \end{align}

  • Model 3: There are two types of burst, both of which have waiting times that are exponentially distributed. There is some probability $p$ of being in pulse state 1. This model, which is an example of a mixture model, has three parameters, the waiting times of the respective pulse types, $\tau_1$, and $\tau_2$, and $p$. The probability density function for the interpulse waiting time is

    \begin{align} f(t\mid p, \tau_1, \tau_2, M_3) = \frac{p}{\tau_1}\,\mathrm{e}^{-t/\tau_1} + \frac{1-p}{\tau_2}\,\mathrm{e}^{-t/\tau_2}. \end{align}

    The cumulative distribution function is

    \begin{align} F(t\mid p, \tau_1, \tau_2, M_3) = 1 - p\mathrm{e}^{-t/\tau_1} - (1-p)\mathrm{e}^{-t/\tau_2}. \end{align}

Note that in all of these models, we are considering the time interval between any two pulses to be independent of all others. We may consider more sophisticated time series models, including using hidden Markov models, but we choose these three simple models for the present analysis.
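For later reference, here is a minimal sketch of the three cumulative distribution functions written as NumPy functions. The function names are my own, chosen just for illustration; we will compute the same expressions inline when we compare the models below.

def cdf_exponential(t, tau):
    """Model 1: CDF of exponentially distributed interpulse times."""
    return 1 - np.exp(-t / tau)


def cdf_weibull(t, beta, tau):
    """Model 2: CDF of Weibull distributed interpulse times."""
    return 1 - np.exp(-(t / tau)**beta)


def cdf_mixture(t, p, tau_1, tau_2):
    """Model 3: CDF of a mixture of two exponentials."""
    return 1 - p * np.exp(-t / tau_1) - (1 - p) * np.exp(-t / tau_2)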

Now that we have our models in place, we will start with our first problem, which is using the data to estimate the parameters of the respective models.

Parameter estimation

The problem of parameter estimation can be stated as follows: Given a data set $D$ and a model $M_i$ describing the data that is parametrized with a set of parameters $\boldsymbol{\theta}_i$, what is the probability distribution of the parameters? I.e., we want to compute the probability $g(\boldsymbol{\theta}_i\mid D, M_i)$. If $g(\boldsymbol{\theta}_i\mid D, M_i)$ is sharply peaked, we have good confidence that our parameters are those for which $g(\boldsymbol{\theta}_i\mid D, M_i)$ is maximal.

Bayes's Theorem and parameter estimation

We seek an expression for $g(\boldsymbol{\theta}_i\mid D, M_i)$ in terms of things we know. To do this, we employ Bayes's theorem, which says,

\begin{align} g(\boldsymbol{\theta}_i\mid D, M_i) = \frac{f(D\mid \boldsymbol{\theta}_i, M_i)\,g(\boldsymbol{\theta}_i\mid M_i)}{f(D\mid M_i)}. \end{align}

Let us look at each term individually.

  • $f(D\mid \boldsymbol{\theta}_i,M_i)$ is the likelihood. It is the probability of observing the measured data given a set of parameters in model $M_i$.
  • $g(\boldsymbol{\theta}_i\mid M_i)$ is the prior probability. It is the probability of a set of parameters in model $M_i$ being the true values without ever observing the data from our experiment.
  • $f(D\mid M_i)$ is called the evidence. It is the probability of observing the experimental data given that model $M_i$ is the model describing the data, but without knowing anything about the parameter values.
  • $g(\boldsymbol{\theta}_i\mid D, M_i)$ is called the posterior probability, which is what we seek to estimate.

In our case, $D = \{t_1, t_2, \ldots t_n\}$, the set of measured times in minutes between peaks of pulses of localization.

For our first parameter estimation, we will consider model 1, which says that the waiting time $t$ is exponentially distributed.

Parameter estimation for Model 1: Exponentially distributed interpulse times

The likelihood

Our model says that the data should be exponentially distributed. As a reminder, the probability distribution for a single interpulse waiting time is thus

\begin{align} f(t\mid \tau, M_1) = \frac{\mathrm{e}^{-t / \tau}}{\tau}. \end{align}

To connect to our more general notation from the last section, here we have $\boldsymbol{\theta}_1 = \tau$, since the exponential distribution has a single parameter. Now, we assume that each of the interpulse waiting times is independent of all others (which we implicitly already did by sorting the data). Therefore, the likelihood is

\begin{align} f(D\mid \tau, M_1) &= \prod_{t\in D}\frac{\mathrm{e}^{-t / \tau}}{\tau} = \frac{1}{\tau^n}\mathrm{e}^{-n\bar{t}/\tau}, \end{align}

where

\begin{align} \bar{t} \equiv \frac{1}{n}\sum_{t \in D} t \end{align}

is the mean interpulse time and $n$ is the total number of pulses. Another, more compact way of writing the likelihood is

\begin{align} t_i \mid \tau \sim \text{Expon}(\tau) \;\forall i. \end{align}
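As a quick numerical sanity check (a minimal sketch, assuming the array t loaded above and an arbitrary trial value of $\tau$), the sum of the individual log densities matches the compact expression $-n\ln\tau - n\bar{t}/\tau$.

# Arbitrary trial value of tau (minutes) for the check
tau_trial = 20.0
n = len(t)
t_bar = np.mean(t)

# Term-by-term log likelihood: sum of log[(1/tau) exp(-t_i/tau)]
log_like_terms = np.sum(-np.log(tau_trial) - t / tau_trial)

# Compact form: -n ln(tau) - n t_bar / tau
log_like_compact = -n * np.log(tau_trial) - n * t_bar / tau_trial

print(np.isclose(log_like_terms, log_like_compact))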

The prior

The prior probability distribution describes what we know about the parameters before ever doing a measurement. We know that $\tau$ should be longer than the time it takes to make a gene product, but should be shorter than the life span of a colony of yeast. So, an educated guess of $\tau$ might be about ten minutes. Since we are not very sure of this, we should make our prior uninformative, meaning that it is not sharply peaked. We may choose a Log-Normal prior, centered at $\tau = 10$ minutes with a generous variance. If a variable $x$ is Log-Normally distributed, then $\ln x$ is Normally (a.k.a. Gaussian) distributed. It has two parameters, with a probability density function of

\begin{align} g(\tau \mid \tau_\mu, \sigma) = \frac{1}{\tau\sqrt{2\pi\sigma^2}} \, \exp\left[-\frac{(\ln (\tau/\tau_\mu))^2}{2\sigma^2}\right]. \end{align}

The PDF is not usually written this way, but as

\begin{align} g(\tau \mid \mu, \sigma) = \frac{1}{\tau\sqrt{2\pi\sigma^2}} \, \exp\left[-\frac{(\ln \tau - \mu)^2}{2\sigma^2}\right]. \end{align}

I am always apprehensive about writing anything in which I take a logarithm of something with units, which is why I wrote it the first way. However, Stan, the software package we will use to compute the posterior, specifies the PDF as the latter, so we need to think about what $\mu$ means in this context. To specify a prior centered around $\tau \approx 10$ minutes, we choose $\mu = 2.3$ (assuming $\tau$ is in units of minutes). We don't expect the waiting time to be longer than several hours, so we will take $\sigma = 4$. We know for sure that $\tau > 0$ by the physical constraints of this process, and this is ensured because the support of the Log-Normal distribution is the positive real numbers.

So, our prior is

\begin{align} \tau \sim \text{LogNorm}(2.3, 4). \end{align}
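A quick check on these prior parameters (a sketch, assuming SciPy's parametrization of the Log-Normal, where the shape parameter is $\sigma$ and the scale is $\mathrm{e}^{\mu}$): the median of the prior is $\mathrm{e}^{2.3} \approx 10$ minutes, and its 95% interval spans many orders of magnitude, which is what we mean by uninformative.

import scipy.stats

# mu = ln(10) ~ 2.3 puts the prior median at about 10 minutes
print(np.log(10))

# 95% interval of the LogNorm(2.3, 4) prior: very broad, i.e., uninformative
print(scipy.stats.lognorm.ppf([0.025, 0.975], 4, scale=np.exp(2.3)))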

The evidence

Given the likelihood and prior, we know

\begin{align} g(\tau\mid D,M_1) \propto f(D\mid \tau,M_1)\,g(\tau\mid M_1), \end{align}

since the evidence, $f(D\mid M_1)$, is not a function of the parameter $\tau$. Thus, the evidence is given by the normalization condition of the posterior probability. Namely,

\begin{align} \int_0^\infty \mathrm{d}\tau\,g(\tau\mid D,M_1) = \frac{1}{f(D\mid M_1)}\int_0^\infty\mathrm{d}\tau\, f(D\mid \tau,M_1)\,g(\tau\mid M_1) = 1. \end{align}

This gives us the evidence in terms of the likelihood and prior, which we have already defined.

\begin{align} f(D\mid M_1) = \int_0^\infty\mathrm{d}\tau\, f(D\mid \tau,M_1)\,g(\tau\mid M_1). \end{align}

So, it is evident that the model is fully specified by specifying the prior and the likelihood. The statistical model is then completely and succinctly written as

\begin{align} \tau \sim \text{LogNorm}(2.3, 4) \\[1em] t_i \mid \tau \sim \text{Expon}(\tau) \; \forall i. \end{align}
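Because the prior and likelihood together define a generative model, we can simulate from it directly. Below is a minimal prior predictive sketch using NumPy rather than Stan; the variable names are my own and this is for illustration only.

# Prior predictive sketch: draw tau from LogNorm(2.3, 4), then waiting times from Expon(tau)
tau_draw = np.random.lognormal(mean=2.3, sigma=4)
t_sim = np.random.exponential(scale=tau_draw, size=len(t))

# Compare summaries of a simulated data set and the measured data
print(tau_draw, np.mean(t_sim), np.mean(t))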

Using MCMC to get the posterior

The posterior distribution is the goal of the parameter estimation problem. We have provided enough information to specify it, but we really do not know what it looks like. We don't know its moments. We don't know how heavy its tails are. So, specifying the likelihood and prior is sufficient to specify what the posterior is as a mathematical expression, but wholly insufficient to understand anything about it.

In general, we can learn about a probability distribution by sampling from it. We have seen this before in class when we talked about the Gillespie algorithm (a.k.a. stochastic simulation algorithm) as a means to sample from a distribution described by a master equation. Similarly, Markov chain Monte Carlo refers to a family of algorithms that can be used to sample from a posterior distribution, given the specification of the likelihood and prior.

We will use Stan as our engine for sampling out of the posterior. Stan consists of a probabilistic programming language that allows us to cleanly specify a model and then sample out of it using efficient Hamiltonian Monte Carlo methods. We will not get into the technical details of how the algorithm works here, but will instead specify the model and then sample out of it using PyStan.

Once a model is specified in the Stan language, it is translated to C++ and compiled. We will first instantiate a StanModel instance and then expose the code from the expon.stan file. Because the compiler will print many warnings to the screen, we use the %%capture magic function to suppress the output.

In [5]:
%%capture
sm_exp = pystan.StanModel(file='expon.stan')
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_7f72f8e3a0640d36d0a9c7f8b8bced3a NOW.

Now that we have the StanModel instance, we can expose the code.

In [6]:
sm_exp.show()
StanModel object 'anon_model_7f72f8e3a0640d36d0a9c7f8b8bced3a' coded as follows:
data {
  int<lower=0> n;
  real t[n];
}


parameters {
  real<lower=0> tau;
}


transformed parameters {
  real r;
  r = 1 / tau;
}


model {
  tau ~ lognormal(2.3, 4);
  t ~ exponential(r);
}


generated quantities {
  vector[n] log_like;
  for (i in 1:n){
    log_like[i] = exponential_lpdf(t[i] | r);
  }
}

I will not go through the details of the syntax, but in looking at the code, we see that it looks very much like our mathematical specification of the model. We just had to take care to note that Stan's Exponential distribution is parametrized by the rate $r \equiv 1/\tau$. The generated quantities block gives the log likelihood, which is not important for the present parameter estimation problem, but is used in the model comparison problem that follows.
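As a quick check of this parametrization (a sketch assuming SciPy's convention, where scipy.stats.expon takes scale = τ = 1/r):

import scipy.stats

# Stan's exponential_lpdf uses the rate r = 1/tau; SciPy's expon uses scale = tau
tau_check = 20.0
t_check = 3.0
print(scipy.stats.expon.logpdf(t_check, scale=tau_check))   # log[(1/tau) exp(-t/tau)]
print(-np.log(tau_check) - t_check / tau_check)             # same value by hand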

The model is built; now we need to feed it data. There are two pieces of data to feed in, the number of time intervals, and the time intervals themselves. Stan expects these as a dictionary.

In [7]:
data = {'n': len(t),
        't': t}

Now we are ready to fit the model. We will draw four sets of samples in parallel, with 1000 warm-up samples (samples that are thrown away to make sure the sampler converges) and 1000 samples to keep.

In [8]:
sm_exp_fit = sm_exp.sampling(data=data, iter=2000, chains=4)

That was fast! Now that we have the samples, getting summary statistics is facilitated by PyStan.

In [9]:
print(sm_exp_fit.stansummary(['tau']))
Inference for Stan model: anon_model_7f72f8e3a0640d36d0a9c7f8b8bced3a.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

      mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
tau  20.49  5.9e-3   0.22  20.05  20.33  20.48  20.63  20.93   1456    1.0

Samples were drawn using NUTS at Tue May 29 14:07:42 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

The 95% credible region (essentially the error bar) for the parameter $\tau$ can be read off from the 2.5% and 97.5% values, giving $\tau = 20.49^{+0.44}_{-0.43}$ minutes.
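The same numbers can be pulled directly from the posterior samples; a minimal sketch using np.percentile:

# 95% credible region and median for tau, computed directly from the samples
tau_samples = sm_exp_fit.extract('tau')['tau']
print(np.percentile(tau_samples, [2.5, 50, 97.5]))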

We will not get into diagnostics here, but an R-hat value of 1 is consistent with the sampler having converged (there are many more diagnostic checks you can and should do, but we will not get into them here).

Let's look at the ECDF of the samples to visualize the posterior distribution.

In [10]:
p = bokeh.plotting.figure(plot_width=400, plot_height=250, x_axis_label='τ (min)', y_axis_label='ECDF')
p.circle(*ecdf(sm_exp_fit.extract('tau')['tau']))
bokeh.io.show(p)

This looks approximately Gaussian, with a mean of 20.5 minutes.

Parameter estimation for model 2

We can similarly specify a likelihood and prior for model 2. We need only specify a prior for $\beta$. We are not really sure how the "aging" process will work, so we will take the prior for $\beta$ to be centered at $\beta = 1$ with an ample width. The model is then

\begin{align} &\beta \sim \text{LogNorm}(0, 1) \\[1em] &t_i \sim \text{Weibull}(\beta, \tau) \; \forall i. \end{align}

We will again use $\tau \sim \text{LogNormal}(2.3, 4)$. The only thing to be careful of here is that beta is the name of a distribution in Stan, so we should not use it as a variable name. We will instead call it beta_w (the subscript signifying "Weibull") in the Stan code.

Let's compile and take a look at the model specification.

In [11]:
%%capture
sm_weibull = pystan.StanModel(file='weibull.stan')
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_ba250f515ca0b02ddc4aa9cbadfaef59 NOW.
In [12]:
sm_weibull.show()
StanModel object 'anon_model_ba250f515ca0b02ddc4aa9cbadfaef59' coded as follows:
data {
  int<lower=0> n;
  real t[n];
}


parameters {
  real <lower=0> beta_w;
  real<lower=0> tau;
}


model {
  beta_w ~ lognormal(0, 1);
  tau ~ lognormal(2.3, 4);
  t ~ weibull(beta_w, tau);
}


generated quantities {
  vector[n] log_like;
  for (i in 1:n){
    log_like[i] = weibull_lpdf(t[i] | beta_w, tau);
  }
}

And now we can sample!

In [13]:
sm_weibull_fit = sm_weibull.sampling(data=data, iter=2000, chains=4)

Sampling took a bit longer for the Weibull model, perhaps due to its added complexity. Let's take a look at the results.

In [14]:
print(sm_weibull_fit.stansummary(['beta_w', 'tau']))
Inference for Stan model: anon_model_ba250f515ca0b02ddc4aa9cbadfaef59.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
beta_w   0.89  1.4e-4 7.3e-3   0.88   0.89   0.89    0.9   0.91   2629    1.0
tau     19.21  5.3e-3   0.25  18.73  19.05  19.22  19.38  19.69   2194    1.0

Samples were drawn using NUTS at Tue May 29 14:08:44 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We see that $\beta$ is a bit below unity, around 0.9, and $\tau$ is a bit smaller than what we saw with the Exponential model, around 19 minutes. The R-hats look good.

We can visualize the posterior distribution as a corner plot using the bebi103 module I wrote for my data analysis class.

In [15]:
# Store samples as a DataFrame
df = pd.DataFrame(sm_weibull_fit.extract(['beta_w', 'tau']))

bokeh.io.show(bebi103.viz.corner(df, ['beta_w', 'tau'], labels=['β', 'τ'], smooth=2))

The posterior also looks approximately Gaussian in $\beta$ and $\tau$.

Parameter estimation for model 3

We proceed similarly for the mixture model. We again need to specify the prior and likelihood. For the prior for $p$, we will assume it is Uniformly distributed on the interval $[0, 1]$, which is equivalently expressed as Beta(1, 1).

\begin{align} &\tau_1, \tau_2 \sim \text{LogNorm}(2.3, 4) \\[1em] &p \sim \text{Beta}(1, 1) \\[1em] &t_i \sim p\,\text{Expon}(\tau_1) + (1-p)\,\text{Expon}(\tau_2) \; \forall i \end{align}

Let's check out how this looks in Stan. Note in particular that we specify that tau is positive_ordered, which means that the entries in the array tau must be positive and ordered from lowest to highest. This makes the specification of the mixture model unambiguous.

In [16]:
%%capture
sm_mix = pystan.StanModel(file='mixed_exponential.stan')
INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_6b6d2eae01897fdb1d82c94ad4b5b8fc NOW.
In [17]:
sm_mix.show()
StanModel object 'anon_model_6b6d2eae01897fdb1d82c94ad4b5b8fc' coded as follows:
data {
  int<lower=0> n;
  real t[n];
}


parameters {
  positive_ordered[2] tau;
  real<lower=0, upper=1> p;
}


transformed parameters {
  vector[2] r;
  r[1] = 1 / tau[1];
  r[2] = 1 / tau[2];
}

model {
  tau ~ lognormal(2.3, 4);
  p ~ beta(1, 1);

  for (i in 1:n)
    target += log_mix(p, 
                      exponential_lpdf(t[i] | r[1]),
                      exponential_lpdf(t[i] | r[2]));
}


generated quantities {
  vector[n] log_like;
  for (i in 1:n){
    log_like[i] = log_mix(p, 
                          exponential_lpdf(t[i] | r[1]),
                          exponential_lpdf(t[i] | r[2]));
  }
}
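In the model block above, log_mix(p, a, b) computes $\ln\left(p\,\mathrm{e}^{a} + (1-p)\,\mathrm{e}^{b}\right)$ for log densities $a$ and $b$ in a numerically stable way. Here is a minimal NumPy sketch of the same pointwise mixture log density (for illustration only; the function name is my own):

def log_like_mix(t, p, tau_1, tau_2):
    """Pointwise log PDF of the two-exponential mixture, mirroring Stan's log_mix."""
    # Log of p * (1/tau_1) exp(-t/tau_1) and of (1-p) * (1/tau_2) exp(-t/tau_2)
    lp1 = np.log(p) - np.log(tau_1) - t / tau_1
    lp2 = np.log(1 - p) - np.log(tau_2) - t / tau_2

    # Numerically stable log(exp(lp1) + exp(lp2))
    return np.logaddexp(lp1, lp2)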

And now we can fire up the sampler! We will take more samples for this one because it takes longer to converge.

In [18]:
sm_mix_fit = sm_mix.sampling(data=data, iter=4000, chains=4)

And, let's take a look.

In [19]:
print(sm_mix_fit.stansummary(['p', 'tau']))
Inference for Stan model: anon_model_6b6d2eae01897fdb1d82c94ad4b5b8fc.
4 chains, each with iter=4000; warmup=2000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=8000.

         mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
p        0.65  3.5e-4   0.02   0.61   0.64   0.65   0.66   0.68   2858    1.0
tau[0]  10.22  5.0e-3   0.29   9.65  10.01  10.22  10.41  10.79   3431    1.0
tau[1]  39.43    0.02   1.33  37.01  38.52  39.37  40.28  42.23   2964    1.0

Samples were drawn using NUTS at Tue May 29 14:11:08 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

And we can look at a corner plot.

In [20]:
# Store samples as a DataFrame
df = pd.DataFrame(sm_mix_fit.extract('p'))
tau_vals = sm_mix_fit.extract('tau')['tau']
df['tau_1'] = tau_vals[:, 0]
df['tau_2'] = tau_vals[:, 1]

bokeh.io.show(bebi103.viz.corner(df, ['p', 'tau_1', 'tau_2'], labels=['p', 'τ1', 'τ2'], smooth=2))

These results suggest that there are two pulse states, one with a roughly 10 minute waiting time and one with a roughly 40 minute waiting time, with the shorter-waiting-time state occurring about 65% of the time ($p \approx 0.65$). Let's plot the result and see how this looks.

Model comparison

Now that we have performed parameter estimation for each of the three models, we turn to assessing how the respective models perform. As a first look, we can plot the ECCDF of the data along with the complementary CDF predicted by each model, using the median parameter values from the MCMC samples.

In [21]:
# Pull out median parameter values
tau_exp = np.median(sm_exp_fit.extract('tau')['tau'])
tau_weibull = np.median(sm_weibull_fit.extract('tau')['tau'])
beta = np.median(sm_weibull_fit.extract('beta_w')['beta_w'])
p_mix = np.median(sm_mix_fit.extract('p')['p'])
tau_1, tau_2 = np.median(sm_mix_fit.extract('tau')['tau'], axis=0)

# Make theoretical CDFs
t_smooth = np.linspace(0, 250, 200)
cdf_exp = 1 - np.exp(-t_smooth/tau_exp)
cdf_weibull = 1 - np.exp(-(t_smooth/tau_weibull)**beta)
cdf_mix = 1.0 - p_mix * np.exp(-t_smooth / tau_1) - (1 - p_mix) * np.exp(-t_smooth / tau_2)

# Plot
colors = bokeh.palettes.brewer['Set2'][4]
x, y = ecdf(t)
y = 1 - y
p = bokeh.plotting.figure(plot_width=400,
                          plot_height=250,
                          y_axis_type='log',
                          x_axis_label='t (min)',
                          y_axis_label='ECCDF',
                          y_range=[1e-4, 2])
p.circle(x, y)
p.line(t_smooth, 1-cdf_exp, line_width=2, color=colors[1], legend='Expon')
p.line(t_smooth, 1-cdf_weibull, line_width=2, color=colors[0], legend='Weibull')
p.line(t_smooth, 1-cdf_mix, line_width=2, color=colors[3], legend='mixture')
bokeh.io.show(p)

By eye, the mixture model seems to perform best. The die-off at the tail may be due to sparse data (but I doubt it). Though the "eye test" suggests the mixture model is best, we should not trust our eyes alone. Instead, we should perform a model comparison analysis.

For model comparison, we need a metric to describe the relative goodness of a model.

The Watanabe-Akaike Information Criterion

A good model is a predictive model. If we were to acquire more data under identical conditions, the parameters we derived from the posterior should be able to accurately predict what those new data would look like. It makes sense to assess a model on how well it can predict new data. Furthermore, the connection between predictive capabilities and the Bayes factor is clear if you think about what must be true of a predictive model. First, it must describe the data we have actually acquired well. The goodness-of-fit term in the Bayes factor covers this. Second, it must describe new data well. If the parameters are such that it describes the data already collected very well, but cannot predict, the model is not good. This usually happens when the model has many parameters tailor-made for the data (such as fitting data with a higher-order polynomial). This is captured in the Occam factor. So, models with good Bayes factors are often predictive.

With that in mind, I will introduce a good metric for comparing models, the Watanabe-Akaike Information Criterion, also known as the Widely Applicable Information Criterion (WAIC). I will discuss it intuitively and not provide much rigor. For detailed descriptions of what follows, I recommend reading chapter 7 of Gelman, et al., Bayesian Data Analysis, 3rd Ed. and chapter 6 of McElreath, Statistical Rethinking.

In what follows, for notational convenience, I will drop the explicit dependence on $M_i$ and define the parameter set pertinent to a given model as $\theta$. We define the predictive density of a single data point $x \in D$ as

\begin{align} \text{single point predictive density} = \int\mathrm{d}\theta\,\,f(x\mid\theta)\,g(\theta \mid D). \end{align}

This is the likelihood for observing data point $x$, averaged over the posterior probability distribution of parameter values $\theta$. We are therefore taking into account posterior information and using the likelihood to assess goodness-of-fit. We can take the product of each of the single point predictive densities in the data set and take the logarithm to get the log pointwise predictive density, or lppd,

\begin{align} \text{lppd} &= \ln\left( \prod_{x\in D} \int\mathrm{d}\theta\,\,f(x\mid\theta)\,g(\theta \mid D)\right) \\[1em] &= \sum_{x\in D} \ln\left(\int\mathrm{d}\theta\,\,f(x\mid\theta)\,g(\theta \mid D)\right). \end{align}

This gives a metric of how well the model manages to predict the observed data. Put succinctly, the lppd is the sum of the logarithm of the average likelihood of each observation in a data set.

The lppd is biased toward complicated models, so we add a correction. We compute the effective number of parameters, $p_\mathrm{WAIC}$ as

\begin{align} p_\mathrm{WAIC} = \sum_{x\in D} \text{variance}(\ln f(x\mid \theta)), \end{align}

where the variance is computed over the posterior. Written out, this is

\begin{align} \text{variance}(\ln f(x\mid \theta)) &= \int\mathrm{d}\theta\, g(\theta \mid D)\,(\ln f(x\mid \theta))^2 \\[1em] &\;\;\;\;- \left(\int\mathrm{d}\theta\, g(\theta \mid D)\,\ln f(x\mid \theta)\right)^2. \end{align}

This parameter, $p_\mathrm{WAIC}$, can be thought of as the number of unconstrained parameters in a model. Parameters that are influenced only by the prior contribute little to $p_\mathrm{WAIC}$, while those that are determined mostly by the data contribute more.

The WAIC is then

\begin{align} \text{WAIC} = -2(\text{lppd} - p_\mathrm{WAIC}). \end{align}

The factor of $-2$ is there for historical reasons to enable comparisons to the Akaike Information Criterion (AIC) and the Deviance Information Criterion (DIC). These two information criteria are also widely used, but have assumptions about Gaussianity, and in the case of the AIC, also flat priors. The WAIC is a better choice.

Computing the WAIC is difficult, unless, of course, you managed to get MCMC samples! Given a set of $S$ MCMC samples of the parameters $\theta$ (where $\theta^{(s)}$ is the $s$th sample), the lppd may be calculated as

\begin{align} \text{lppd} = \sum_{x\in D}\ln \left(\frac{1}{S}\sum_{s=1}^S f(x\mid \theta^{(s)})\right). \end{align}

Similarly we can compute $p_\mathrm{WAIC}$ from samples.

\begin{align} p_\mathrm{WAIC} = \sum_{x\in D}\frac{1}{S-1}\sum_{s=1}^S\left(\ln f(x\mid \theta^{(s)}) - q(x)\right)^2, \end{align}

where

\begin{align} q(x) = \frac{1}{S}\sum_{s=1}^S\ln f(x\mid\theta^{(s)}). \end{align}

For an intuitive description of the WAIC, you may think of it as an estimate of the negative log likelihood of new data. That is, it is an estimate of how badly the model would perform with new data. So, the lower the WAIC, the better the model.

The Akaike weights

The value of a WAIC by itself does not tell us anything. Only comparison of two or more WAICs makes sense. Recalling that the WAIC is a measure of a log likelihood, if we exponentiate it, we get something proportional to a probability. If we have two models, $M_i$ and $M_j$, the Akaike weight of model $i$ is

\begin{align} w_i = \frac{\exp\left[-\frac{1}{2}\,\text{WAIC}_i\right]}{\exp\left[-\frac{1}{2}\,\text{WAIC}_i\right] + \exp\left[-\frac{1}{2}\,\text{WAIC}_j\right]}. \end{align}

This weight may be interpreted as an estimate of the probability that $M_i$ will make the best predictions of new data. (This interpretation is common, but not entirely agreed upon.) We can generalize this to multiple models.

\begin{align} w_i = \frac{\exp\left[-\frac{1}{2}\,\text{WAIC}_i\right]}{\sum_j \exp\left[-\frac{1}{2}\,\text{WAIC}_j\right]}. \end{align}

Comparison of WAICs

Let us write a function to compute the WAIC from a given set of samples. Note that to compute it, we need the log likelihood of every data point for every sample. This is why we included the log likelihood in the generated quantities block of our Stan model specifications; Stan does not otherwise store the pointwise log likelihoods, since they are unnecessary for sampling. Let's start by extracting the log likelihoods from the fits.

In [22]:
log_likes = {'exp': sm_exp_fit.extract('log_like')['log_like'],
             'weibull': sm_weibull_fit.extract('log_like')['log_like'],
             'mix': sm_mix_fit.extract('log_like')['log_like']}

Now, we define a function to compute the WAIC from a set of samples, and then a function to compute the Akaike weights from a set of WAICs.

In [24]:
def waic(log_like, return_lppd_p_waic=False):
    """Compute the WAIC from an array of log-likelihoods (samples × data points)."""
    # lppd: for each data point, the log of the likelihood averaged over posterior
    # samples, computed stably as a logsumexp with weights 1/S, then summed over data points
    lppd = np.sum(scipy.special.logsumexp(log_like, axis=0, b=1/log_like.shape[0]))

    # p_waic: sum over data points of the variance (across samples) of the log-likelihood
    p_waic = np.sum(np.var(log_like, axis=0, ddof=1))

    if return_lppd_p_waic:
        return -2*(lppd - p_waic), lppd, p_waic
    else:
        return -2*(lppd - p_waic)


def akaike_weights(model_waics):
    """Compute Akaike weights from a dictionary of WAICs."""
    # Work in log space to avoid underflow: w_i = exp(-WAIC_i/2) / sum_j exp(-WAIC_j/2)
    log_denom = scipy.special.logsumexp(np.array([-val/2 for _, val in model_waics.items()]))
    return {key: np.exp(-val/2 - log_denom) for key, val in model_waics.items()}

Now, let's compute the WAICs.

In [25]:
waics = {key: waic(val) for key, val in log_likes.items()}

# Take a look
waics
Out[25]:
{'exp': 66064.60197304442,
 'weibull': 65860.51886685542,
 'mix': 65160.918351401655}

The mixture model has the smallest WAIC, in agreement with what we would expect from eyeballing the plot of the ECCDFs. Let's compute the Akaike weights.

In [26]:
akaike_weights(waics)
Out[26]:
{'exp': 5.855916387360625e-197, 'weibull': 1.2124903781034876e-152, 'mix': 1.0}

The mixture model absolutely crushes the other two models. When we have lots of data, goodness of fit matters a lot!