Stochastic simulation of genetic circuits

(c) 2018 Justin Bois. With the exception of pasted graphics, where the source is noted, this work is licensed under a Creative Commons Attribution License CC-BY 4.0. All code contained herein is licensed under an MIT license.

This document was prepared at Caltech with financial support from the Donna and Benjamin M. Rosen Bioengineering Center.

This tutorial was generated from an Jupyter notebook. You can download the notebook here.

In [1]:
import multiprocessing
import tqdm

import numpy as np
import scipy.stats as st
import numba

# Plotting modules
import bokeh.io
import bokeh.plotting

bokeh.io.output_notebook()

# Line profiler (can install with )
%load_ext line_profiler
Loading BokehJS ...

Sampling out of master equations

For some stories, we know the probability distribution. For example, if the story is that we are doing n independent coin flips, each with probability p of landing heads, the number h of heads is Binomially distributed. We could prove that this is the case, or look it up in a book. But sometimes, it is too difficult (or impossible) to prove what the distribution is, so we can numerically compute its properties by sampling out of the distribution. Sampling involved using a random number generator to simulate the story that generates the distribution. So, if you know the story and you have a computer handy, you can sample to be able to get a plot of your distribution, though you will not get the analytical form.

Let's demonstrate this with the Binomial distribution. We will take n = 25 and p = 0.25 and compute P(h | n, p), the probability of getting h heads in n flips, each with probability p of landing heads. We will draw 10, 30, 100, and 300 samples and plot them versus the expected Binomial distribution.

In [2]:
def simulate_coinflips(n, p, n_samples):
    """
    Simulate n_samples sets of n coin flips with prob. p of heads.
    
    The easier way to do this is `np.random.binom(n, p, n_samples)`,
    but I show this "simulation" for instructive purposes.
    """
    n_heads = np.empty(n_samples, dtype=int)
    for i in range(n_samples):
        n_heads[i] = np.sum(np.random.random(size=n) < p)
    return n_heads

n_samples = (30, 100, 1000, 10000)
n = 25
p = 0.25
h_plot = np.arange(26)
theor_dist = st.binom.pmf(h_plot, n, p)

plots = []
for n_samp in n_samples:
    plot = bokeh.plotting.figure(plot_height=200,
                                 plot_width=300,
                                 x_axis_label='h',
                                 y_axis_label='P(h)',
                                 title=f'{n_samp} samples')
    h = simulate_coinflips(n, p, n_samp)
    plot.circle(h_plot, theor_dist)
    plot.circle(np.arange(h.max()+1), np.bincount(h)/n_samp, color='orange')
    plots.append(plot)
    
bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

As we can see, if we sample out of the probability distribution, we can approximately calculate the actual distribution. If we sample enough, the approximation is very good.

We will use the same strategy for solving master equations. We will find a way to sample out of the distribution that is governed by the master equation. This technique was pioneered by Dan Gillespie in the last 70s. For that reason, these sampling techniques are often called Gillespie simulations. The algorithm is sometimes referred to as a stochastic simulation algorithm, or SSA.

Here, we will explore how this algorithm works by looking at simple production of a protein.

The dynamical equations

For simple protein production, we have the following reactions.

\begin{align} \text{DNA} \rightarrow \text{mRNA} \rightarrow \text{protein} \end{align}

Macroscale equations

As we've seen before, the deterministic dynamics, which describe mean concentrations over a large population of cells, are described by the ODEs

\begin{align} \frac{\mathrm{d}m}{\mathrm{d}t} &= \beta_m - \gamma_m m, \\[1em] \frac{\mathrm{d}p}{\mathrm{d}t} &= \beta_p m - \gamma_p p. \end{align}

The same equations should hold if m and p represent the mean numbers of cells; we would just have to appropriately rescale the constants. Assuming the m and p are now numbers (so we are not free to pick their units), we can nondimensionalize using $\gamma_m$ to nondimensionalize time. This leads to redefinition of parameters and variables

\begin{align} &\beta_m/\gamma_m \to \beta_m, \\[1em] &\beta_p/\gamma_m \to \beta_p, \\[1em] &\gamma_m t \to t. \end{align}

The dimensionless equations are

\begin{align} \frac{\mathrm{d}m}{\mathrm{d}t} &= \beta_m - m, \\[1em] \frac{\mathrm{d}p}{\mathrm{d}t} &= \beta_p m - \gamma p, \end{align}

with $\gamma = \gamma_p/\gamma_m$.

The Master equation

We can write a master equation for these dynamics. In this case, each state is defined by an mRNA copy number m and a protein copy number p. So, we will write a master equation for P(m, p, t).

\begin{align} \frac{\mathrm{d}P(m,p,t)}{\mathrm{d}t} &= \beta_m P(m-1,p,t) + (m+1)P(m+1,p,t) - \beta_m P(m,p,t) - mP(m,p,t) \nonumber \\[1em] &+\beta_p mP(m,p-1,t) + \gamma_p (p+1)P(m,p+1,t) - \beta_p mP(m,p,t) - \gamma p P(m,p,t). \end{align}

As in lecture, we have implicitly defined P(m, p, t) = 0 if m < 0 or p < 0. This is the master equation we will sample from using the stochastic simulation algorithm (SSA) or Gillespie algorithm.

The Gillespie algorithm

The transition probabilities are also called propensities in the context of stochastic simulation. The propensity for a given transition, say indexed $i$, is denoted as $a_i$. The equivalence to notation we introduced for master equations is that if transition $i$ results in the change of state from $n'$ to $n$, then $a_i = W(n\mid n')$.

To cast this problem for a Gillespie simulation, we can write each change of state (moving either the copy number of mRNA or protein up or down by 1 in this case) and their respective propensities.

\begin{align} \begin{array}{ll} \text{reaction, }r_i & \text{propensity, } a_i \\ m \rightarrow m+1,\;\;\;\; & \beta_m \\[0.3em] m \rightarrow m-1, \;\;\;\; & m\\[0.3em] p \rightarrow p+1, \;\;\;\; & \beta_p m \\[0.3em] p \rightarrow p-1, \;\;\;\; & \gamma p. \end{array} \end{align}

We won't carefully prove that the Gillespie algorithm samples from the probability distribution governed by the master equation, but will state the principles behind it. The basic idea is that events (such as those outlined above) are rare, discrete, separate events. I.e., each event is an arrival of a Poisson process. The Gillespie algorithm starts with some state, $(m_0,p_0)$. Then a state change, any state change, will happen in some time $\Delta t$ that has a certain probability distribution (which we will show is Exponential momentarily). The probability that the state change that happens is reaction $j$ is proportional to $a_j$. That is to say, state changes with high propensities are more likely to occur. Thus, choosing which of the $n$ state changes happens in $\Delta t$ is a matter of drawing an integer $j$ in $[1,n]$ where the probability of drawing $j$ is

\begin{align} \frac{a_j}{\sum_i a_i}. \end{align}

Now, how do we determine how long the state change took? The probability density function describing that a given state change $i$ takes place in time $t$ is

\begin{align} P(t\mid a_i) = a_i \mathrm{e}^{-a_i t}, \end{align}

since the time it takes for arrival of a Poisson process is Exponentially distributed. The probability that it has not arrived in time $\Delta t$ is the probability that the arrival time is greater than $\Delta t$, given by the complementary cumulative distribution function for the Exponential distribution.

\begin{align} P(t > \Delta t\mid a_i) = \int_{\Delta t}^\infty \mathrm{d}t\,P(t\mid a_i) = \mathrm{e}^{-a_i \Delta t}. \end{align}

Now, say we have $n$ processes that arrive in time $t_1, t_2, \ldots$. The probability that none of them arrive before $\Delta t$ is

\begin{align} P(t_1 > \Delta t, t_2 > \Delta t, \ldots) &= P(t_1 > \Delta t) P(t_2 > \Delta t) \cdots = \prod_i \mathrm{e}^{-a_i \Delta t} = \mathrm{exp}\left[-\Delta t \sum_i a_i\right]. \end{align}

This is the same as the probability of a single Poisson process with $a = \sum_i a_i$ not arriving before $\Delta t$. So, the probability that it does arrive in $\Delta t$ is Exponentially distributed with mean $\left(\sum_i a_i\right)^{-1}$ (since the mean of an Exponential distribution is $1/a$).

So, we know how to choose a state change and we also know how long it takes. The Gillespie algorithm then proceeds as follows.

  1. Choose an initial condition, e.g. $m = p = 0$.
  2. Calculate the propensity for each of the enumerated state changes. The propensities may be functions of $m$ and $p$, so they need to be recalculated for every $m$ and $p$ we encounter.
  3. Choose how much time the reaction will take by drawing out of an exponential distribution with a mean equal to $\left(\sum_i a_i\right.)^{-1}$. This means that a change arises from a Poisson process.
  4. Choose what state change will happen by drawing a sample over the discrete distribution where $P_i = \left.a_i\middle/\left(\sum_i a_i\right)\right.$. In other words, the probability that a state change will be chosen is proportional to its propensity.
  5. Increment time by the time step you chose in step 3.
  6. Update the states according to the state change you choose in step 4.
  7. If $t$ is less than your pre-determined stopping time, go to step 2. Else stop.

Gillespie proved that this algorithm samples the probability distribution described by the master equation in his seminal papers in 1976 and 1977. I recommend reading D. T. Gillespie, J. Phys. Chem., 81, 2340-2361, 1977. You can also read a concise discussion of how the algorithm samples the master equation in section 4.2 of Del Vecchio and Murray.

Coding up the Gillespie simulation

To code up the Gillespie simulation, we first make an an array that gives the changes in the counts of $m$ and $p$ for each of the four reactions. This is a way of encoding the updates in the particle counts that we get from choosing the respective state changes.

In [3]:
# Column 0 is change in m, column 1 is change in p
simple_update = np.array([[1, 0],   # Make mRNA transcript
                          [-1, 0],  # Degrade mRNA
                          [0, 1],   # Make protein
                          [0, -1]], # Degrade protein
                         dtype=np.int)

Next, we make a function that returns an array of propensities for each of the four reactions.

In [4]:
def simple_propensity(population, beta_m, beta_p, gamma):
    """
    Returns an array of propensities given a set of parameters
    and an array of populations.
    """
    # Unpack population
    m, p = population
    
    return np.array([beta_m,     # Make mRNA transcript
                     m,          # Degrade mRNA
                     beta_p * m, # Make protein
                     gamma * p]  # Degrade protein
                   )

Making a draw

Finally, we write a general function that draws a choice of reaction and the time interval for that reaction. This is the heart of the Gillespie algorithm, so we will take some time to discuss speed. First, to get the time interval, we sample a random number from an exponential distribution with mean $\left(\sum_i a_i\right)^{-1}$. This is easily done using the np.random.exponential() function.

Next, we have to select which reaction will take place. This amounts to drawing a sample over the discrete distribution where $P_i = a_i\left(\sum_i a_i\right)^{-1}$, or the probability of each reaction is proportional to its propensity. This can be done using scipy.stats.rv_discrete, which allows specification of an arbitrary discrete distribution. We will write a function to do this.

In [5]:
def sample_discrete_scipy(probs):
    """
    Randomly sample an index with probability given by probs.
    """
    return st.rv_discrete(values=(range(len(probs)), probs)).rvs()

This is a nice one-liner, but is it fast? I suspect (actually, I know, since I already tried it) that there is significant overhead in setting up the scipy.stats random variable object to sample from each time. Remember, we can't just do this once because the array probs changes with each step in the SSA because the propensities change. We will therefore write a less elegant, but maybe faster way of doing it.

Another way to sample the distribution is to generate a uniformly distributed random number $q$, with $0 < q < 1$ and return the value $j$ such that

\begin{align} \sum_{i=0}^{j-1} p_i < q < \sum_{i=0}^{j}p_i. \end{align}

We'll code this up.

In [6]:
def sample_discrete(probs):
    """
    Randomly sample an index with probability given by probs.
    """
    # Generate random number
    q = np.random.rand()
    
    # Find index
    i = 0
    p_sum = 0.0
    while p_sum < q:
        p_sum += probs[i]
        i += 1
    return i - 1

Now let's compare the speeds using the %timeit magic function.

In [7]:
# Make dummy probs
probs = np.array([0.1, 0.3, 0.4, 0.05, 0.15])

print('Result from scipy.stats:')
%timeit sample_discrete_scipy(probs)

print('\nResult from hand-coded method:')
%timeit sample_discrete(probs)
Result from scipy.stats:
921 µs ± 395 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Result from hand-coded method:
1.64 µs ± 100 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Wow! The less concise method is over 400 times faster! So, we'll ditch using scipy.stats, and use our hand-built sampler instead.

Now we can write a function to do our draws.

In [8]:
# Function to draw time interval and choice of reaction
def gillespie_draw(propensity_func, population, args=()):
    """
    Draws a reaction and the time it took to do that reaction.
    
    Parameters
    ----------
    propensity_func : function
        Function with call signature propensity_func(population, *args)
        used for computing propensities. This function must return
        an array of propensities.
    population : ndarray
        Current population of particles
    args : tuple, default ()
        Arguments to be passed to `propensity_func`.
        
    Returns
    -------
    rxn : int
        Index of reaction that occured.
    time : float
        Time it took for the reaction to occur.
    """
    # Compute propensities
    props = propensity_func(population, *args)
    
    # Sum of propensities
    props_sum = props.sum()
    
    # Compute time
    time = np.random.exponential(1.0 / props_sum)
    
    # Compute discrete probabilities of each reaction
    rxn_probs = props / props_sum
    
    # Draw reaction from this distribution
    rxn = sample_discrete(rxn_probs)
    
    return rxn, time

SSA time stepping

Now we are ready to write our main SSA loop. We will only keep the counts at pre-specified time points. This saves on RAM, and we really only care about the values at given time points anyhow.

Note that this function is generic. All we need to specify our system are:

  • A function to compute the propensities
  • How the updates for a given reaction are made

Additionally, we specify necessary parameters, an initial condition, and the time points at which we want to store our samples. So, providing the propensity function and update are analogous to providing the time derivatives when using scipy.integrate.odeint().

In [9]:
def gillespie_ssa(propensity_func, update, population_0, time_points, args=()):
    """
    Uses the Gillespie stochastic simulation algorithm to sample
    from proability distribution of particle counts over time.
    
    Parameters
    ----------
    propensity_func : function
        Function of the form f(params, population) that takes the current
        population of particle counts and return an array of propensities
        for each reaction.
    update : ndarray, shape (num_reactions, num_chemical_species)
        Entry i, j gives the change in particle counts of species j
        for chemical reaction i.
    population_0 : array_like, shape (num_chemical_species)
        Array of initial populations of all chemical species.
    time_points : array_like, shape (num_time_points,)
        Array of points in time for which to sample the probability
        distribution.
    args : tuple, default ()
        The set of parameters to be passed to propensity_func.        

    Returns
    -------
    sample : ndarray, shape (num_time_points, num_chemical_species)
        Entry i, j is the count of chemical species j at time
        time_points[i].
    """

    # Initialize output
    pop_out = np.empty((len(time_points), update.shape[1]), dtype=np.int)

    # Initialize and perform simulation
    i_time = 1
    i = 0
    t = time_points[0]
    population = population_0.copy()
    pop_out[0,:] = population
    while i < len(time_points):
        while t < time_points[i_time]:
            # draw the event and time step
            event, dt = gillespie_draw(propensity_func, population, args)
                
            # Update the population
            population_previous = population.copy()
            population += update[event,:]
                
            # Increment time
            t += dt

        # Update the index
        i = np.searchsorted(time_points > t, True)
        
        # Update the population
        pop_out[i_time:min(i,len(time_points))] = population_previous
        
        # Increment index
        i_time = i
                           
    return pop_out

Running and parsing results

We can now run a set of SSA simulations and plot the results. We will run 100 trajectories and store them. We will use $\beta_p = \beta_m = 10$ and $\gamma = 0.4$. We will use the nifty package tqdm to give a progress bar so we know how long it is taking.

In [10]:
# Specify parameters for calculation
args = np.array([10, 10, 0.4])
time_points = np.linspace(0, 50, 101)
population_0 = np.array([0, 0], dtype=int)
n_simulations = 100

# Seed random number generator for reproducibility
np.random.seed(42)

# Initialize output array
pops = np.empty((n_simulations, len(time_points), 2), dtype=int)

# Run the calculations
for i in tqdm.tqdm_notebook(range(n_simulations)):
    pops[i,:,:] = gillespie_ssa(simple_propensity, simple_update,
                                population_0, time_points, args=args)

We now have our samples, so we can plot the trajectories.

In [11]:
# Set up plots
plots = [bokeh.plotting.figure(plot_width=300,
                               plot_height=200,
                               x_axis_label='dimensionless time',
                               y_axis_label='number of mRNAs'),
         bokeh.plotting.figure(plot_width=300,
                               plot_height=200,
                               x_axis_label='dimensionless time',
                               y_axis_label='number of proteins')]

# Plot trajectories and mean
for i in [0, 1]:
    for x in pops[:,:,i]:
        plots[i].line(time_points, x, line_width=0.3, alpha=0.2, line_join='bevel')
    plots[i].line(time_points, pops[:,:,i].mean(axis=0), line_width=6, color='tomato',
                  line_join='bevel')


# Link axes
plots[0].x_range = plots[1].x_range

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

We can also compute the steady state properties by considering the end of the simulation. The last 50 time points are at steady state, so we will average over them.

In [12]:
print('mRNA mean copy number =', pops[:,-50:,0].mean())
print('protein mean copy number =', pops[:,-50:,1].mean())
print('\nmRNA variance =', pops[:,-50:,0].std()**2)
print('protein variance =', pops[:,-50:,1].std()**2)
print('\nmRNA noise =', pops[:,-50:,0].std() / pops[:,-50:,0].mean())
print('protein noise =', pops[:,-50:,1].std() / pops[:,-50:,1].mean())
mRNA mean copy number = 10.0748
protein mean copy number = 251.6994

mRNA variance = 10.26200496
protein variance = 2145.33703964

mRNA noise = 0.317965262817717
protein noise = 0.18402023679851773

Finally, we can compute the steady state probability distributions. To plot them, we plot the empirical cumulative distribution function (ECDF) from the sampling. The theoretical distribution for mRNA is Poisson, which is overlayed in red.

In [13]:
def ecdf(data):
    """Give x and y values for plotting ECDFs"""
    return np.sort(data), np.arange(1, len(data)+1) / len(data)


# mRNA ECDF
p_m = bokeh.plotting.figure(plot_width=300,
                            plot_height=200,
                            x_axis_label='mRNA copy number',
                            y_axis_label='ECDF')
p_m.circle(*ecdf(pops[:,-50:,0].flatten()))

# Theoretial mRNA CDF (Poisson)
p_m.circle(np.arange(25), st.poisson.cdf(np.arange(25), args[0]), color='tomato')

# protein ECDF
p_p = bokeh.plotting.figure(plot_width=300,
                            plot_height=200,
                            x_axis_label='protein copy number',
                            y_axis_label='ECDF')
p_p.circle(*ecdf(pops[:,-50:,1].flatten()))

bokeh.io.show(bokeh.layouts.gridplot([p_m, p_p], ncols=2))

As we expect, the mRNA copy number matches the Poisson distribution. We also managed to get a distribution for the protein copy number that we could plot. So, you now have the basic tools for doing Gillespie simulations. You just need to code the propensity function and the update, and you're on your way! (Note, though, that this only works for Gillespie simulations where the states are defined by particle counts, which should suffice for this course.)

Everything else in this tutorial from here is instruction on how to increase speed.

Increasing speed

That calculation took a while. We would naturally like to speed up the calculation. One approach is to carefully evaluate the algorithm and try to find speed improvements. Gibson and Bruck developed the next reaction method, which can result in significant speed-up for complicated sets of reactions. Instead of wading into the algorithmic details, we will instead investigate how we can speed up our implementation of the direct Gillespie algorithm.

Let's do some profiling to see what took so long. We will use the magic function %lprun to profile runs of SSAs. The output has line wrapping, so it is kind of hard to read in a Jupyter notebook, unfortunately.

In [14]:
%lprun -T lp_results.txt -f gillespie_ssa gillespie_ssa(\
        simple_propensity, simple_update, population_0, time_points, args)
*** Profile printout saved to text file 'lp_results.txt'. 
Timer unit: 1e-06 s

Total time: 0.354165 s
File: <ipython-input-9-45b4c0a8000b>
Function: gillespie_ssa at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def gillespie_ssa(propensity_func, update, population_0, time_points, args=()):
     2                                               """
     3                                               Uses the Gillespie stochastic simulation algorithm to sample
     4                                               from proability distribution of particle counts over time.
     5                                               
     6                                               Parameters
     7                                               ----------
     8                                               propensity_func : function
     9                                                   Function of the form f(params, population) that takes the current
    10                                                   population of particle counts and return an array of propensities
    11                                                   for each reaction.
    12                                               update : ndarray, shape (num_reactions, num_chemical_species)
    13                                                   Entry i, j gives the change in particle counts of species j
    14                                                   for chemical reaction i.
    15                                               population_0 : array_like, shape (num_chemical_species)
    16                                                   Array of initial populations of all chemical species.
    17                                               time_points : array_like, shape (num_time_points,)
    18                                                   Array of points in time for which to sample the probability
    19                                                   distribution.
    20                                               args : tuple, default ()
    21                                                   The set of parameters to be passed to propensity_func.        
    22                                           
    23                                               Returns
    24                                               -------
    25                                               sample : ndarray, shape (num_time_points, num_chemical_species)
    26                                                   Entry i, j is the count of chemical species j at time
    27                                                   time_points[i].
    28                                               """
    29                                           
    30                                               # Initialize output
    31         1         25.0     25.0      0.0      pop_out = np.empty((len(time_points), update.shape[1]), dtype=np.int)
    32                                           
    33                                               # Initialize and perform simulation
    34         1          6.0      6.0      0.0      i_time = 1
    35         1          1.0      1.0      0.0      i = 0
    36         1          4.0      4.0      0.0      t = time_points[0]
    37         1          8.0      8.0      0.0      population = population_0.copy()
    38         1          5.0      5.0      0.0      pop_out[0,:] = population
    39       101        134.0      1.3      0.0      while i < len(time_points):
    40      9329      10495.0      1.1      3.0          while t < time_points[i_time]:
    41                                                       # draw the event and time step
    42      9229     275577.0     29.9     77.8              event, dt = gillespie_draw(propensity_func, population, args)
    43                                                           
    44                                                       # Update the population
    45      9229      22106.0      2.4      6.2              population_previous = population.copy()
    46      9229      34343.0      3.7      9.7              population += update[event,:]
    47                                                           
    48                                                       # Increment time
    49      9229       8850.0      1.0      2.5              t += dt
    50                                           
    51                                                   # Update the index
    52       100       1854.0     18.5      0.5          i = np.searchsorted(time_points > t, True)
    53                                                   
    54                                                   # Update the population
    55       100        675.0      6.8      0.2          pop_out[i_time:min(i,len(time_points))] = population_previous
    56                                                   
    57                                                   # Increment index
    58       100         82.0      0.8      0.0          i_time = i
    59                                                                      
    60         1          0.0      0.0      0.0      return pop_out

We see that 80% of the time is spent doing draws. Nothing else is really worth looking at. Let's see how we can improve our draw speed.

In [15]:
%lprun -T lp_results.txt -f gillespie_draw \
        [gillespie_draw(simple_propensity, population_0, args) \
            for _ in range(10000)]
*** Profile printout saved to text file 'lp_results.txt'. 
Timer unit: 1e-06 s

Total time: 0.374578 s
File: <ipython-input-8-0c4c83f72706>
Function: gillespie_draw at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     2                                           def gillespie_draw(propensity_func, population, args=()):
     3                                               """
     4                                               Draws a reaction and the time it took to do that reaction.
     5                                               
     6                                               Parameters
     7                                               ----------
     8                                               propensity_func : function
     9                                                   Function with call signature propensity_func(population, *args)
    10                                                   used for computing propensities. This function must return
    11                                                   an array of propensities.
    12                                               population : ndarray
    13                                                   Current population of particles
    14                                               args : tuple, default ()
    15                                                   Arguments to be passed to `propensity_func`.
    16                                                   
    17                                               Returns
    18                                               -------
    19                                               rxn : int
    20                                                   Index of reaction that occured.
    21                                               time : float
    22                                                   Time it took for the reaction to occur.
    23                                               """
    24                                               # Compute propensities
    25     10000     106778.0     10.7     28.5      props = propensity_func(population, *args)
    26                                               
    27                                               # Sum of propensities
    28     10000      80255.0      8.0     21.4      props_sum = props.sum()
    29                                               
    30                                               # Compute time
    31     10000      68626.0      6.9     18.3      time = np.random.exponential(1.0 / props_sum)
    32                                               
    33                                               # Compute discrete probabilities of each reaction
    34     10000      46593.0      4.7     12.4      rxn_probs = props / props_sum
    35                                               
    36                                               # Draw reaction from this distribution
    37     10000      64033.0      6.4     17.1      rxn = sample_discrete(rxn_probs)
    38                                               
    39     10000       8293.0      0.8      2.2      return rxn, time

The propensity function it taking the most time. We will focus on improving that.

Speed boost by JIT compilation with Numba

Numba is a package that does LLVM optimized just-in-time compilation of Python code. The speed-ups can be substantial. We will use numba to compile the parts of the code that we can. For many functions, we just need to decorate the function with

@numba.jit()

and that is it. If possible, we can use the nopython=True keyword argument to get more speed because the compiler can assume that all of the code is JIT-able.

We can speed that up by sacrificing some readability and directly accessing the arrays instead of setting up views, as we have been doing.

In [16]:
@numba.jit(nopython=True)
def simple_propensity_numba(population, beta_m, beta_p, gamma):
    """
    Returns an array of propensities given a set of parameters
    and an array of populations.
    """
    # Unpack population
    m, p = population
    
    return np.array([beta_m, 
                     m, 
                     beta_p * m, 
                     gamma * p])


# Check speeds
print('Old propensity function:')
%timeit simple_propensity(population_0, *args)

print('\nNumba\'d propensity function:')
%timeit simple_propensity_numba(population_0, *args)
Old propensity function:
3.9 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numba'd propensity function:
2.24 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So, we got about a factor of two or so speed up. We'll take it!

Now, we also saw that the sums and division of arrays are slow. Let's optimize the sum operation using numba.

In [17]:
@numba.jit(nopython=True)
def sum_numba(ar):
    return ar.sum()

# Make dummy array for testing
ar = np.array([0.3, 0.4, 0.3, 0.2, 0.15])

print('\nNumPy sum:')
%timeit ar.sum()

print('\nNumba sum:')
%timeit sum_numba(ar)
NumPy sum:
1.97 µs ± 81.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Numba sum:
420 ns ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So, we get a five-fold speed boost with numba. Though I should note that this speed boost in the sum is due to numba optimizing sums of a certain size. For sums over large numbers of entries, numba's performance will not exceed NumPy's by much at all.

Finally, we will speed up the sampling of the discrete distribution. We will do this in two ways. First, we notice that the division operation on the propensities took a fair amount of time when we profiled the code. We do not need it; we can instead sample from an unnormalized discrete distribution. Secondly, we can use numba to accelerate the while loop in the sampling.

In [18]:
@numba.jit(nopython=True)
def sample_discrete_numba(probs, probs_sum):
    q = np.random.rand() * probs_sum

    i = 0
    p_sum = 0.0
    while p_sum < q:
        p_sum += probs[i]
        i += 1
    return i - 1

# Make dummy unnormalized probs
probs = np.array([0.1, 0.3, 0.4, 0.05, 0.15, 0.6])
probs_sum = sum_numba(probs)

print('Result from hand-coded method:')
%timeit sample_discrete(probs)

print("\nResults from numba'd version:")
%timeit sample_discrete_numba(probs, probs_sum)
Result from hand-coded method:
1.61 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Results from numba'd version:
423 ns ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

We get a speed-up of about a factor of three. Let's now make a new gillespie_draw() function that is faster. The fast propensity function is just an argument to the fast draw-er.

In [19]:
def gillespie_draw_fast(propensity_func, population, args):
    """
    Draws a reaction and the time it took to do that reaction.
    """
    # Compute propensities
    props = propensity_func(population, *args)
    
    # Sum of propensities
    props_sum = sum_numba(props)
    
    # Compute time
    time = np.random.exponential(1 / props_sum)

    # Draw reaction given propensities
    rxn = sample_discrete_numba(props, props_sum)
    
    return rxn, time

print('Old Gillespie draw:')
%timeit gillespie_draw(simple_propensity, population_0, args)

print('\nFast Gillespie draw:')
%timeit gillespie_draw_fast(simple_propensity_numba, population_0, args)
Old Gillespie draw:
16.9 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Fast Gillespie draw:
6.08 µs ± 593 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So, our optimization got us a speed boost of about 3-fold. Let's adjust our SSA function to include the fast draws.

In [20]:
def gillespie_ssa_fast(propensity_func, update, population_0, 
                       time_points, args=()):
    """
    Uses the Gillespie stochastic simulation algorithm to sample
    from proability distribution of particle counts over time.
    
    Parameters
    ----------
    propensity_func : function
        Function of the form f(params, population) that takes the current
        population of particle counts and return an array of propensities
        for each reaction.
    update : ndarray, shape (num_reactions, num_chemical_species)
        Entry i, j gives the change in particle counts of species j
        for chemical reaction i.
    population_0 : array_like, shape (num_chemical_species)
        Array of initial populations of all chemical species.
    time_points : array_like, shape (num_time_points,)
        Array of points in time for which to sample the probability
        distribution.
    args : tuple, default ()
        The set of parameters to be passed to propensity_func. 
        
    Returns
    -------
    rxn : int
        Index of reaction that occured.
    time : float
        Time it took for the reaction to occur.        
    Returns
    -------
    sample : ndarray, shape (num_time_points, num_chemical_species)
        Entry i, j is the count of chemical species j at time
        time_points[i].
    """

    # Initialize output
    pop_out = np.empty((len(time_points), update.shape[1]), dtype=np.int)

    # Initialize and perform simulation
    i_time = 1
    i = 0
    t = time_points[0]
    population = population_0.copy()
    pop_out[0,:] = population
    while i < len(time_points):
        while t < time_points[i_time]:
            # draw the event and time step
            event, dt = gillespie_draw_fast(propensity_func, population, args)
                
            # Update the population
            population_previous = population.copy()
            population += update[event,:]
                
            # Increment time
            t += dt

        # Update the index
        i = np.searchsorted(time_points > t, True)
        
        # Update the population
        pop_out[i_time:min(i,len(time_points))] = population_previous
        
        # Increment index
        i_time = i
                           
    return pop_out

Finally, we'll test the speed of the two SSAs.

In [21]:
print('Gillespie SSA:')
%timeit gillespie_ssa(simple_propensity, simple_update, \
                      population_0, time_points, args)

print('\nFast Gillespie SSA:')
%timeit gillespie_ssa_fast(simple_propensity_numba, simple_update,\
                           population_0, time_points, args)
Gillespie SSA:
210 ms ± 6.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Fast Gillespie SSA:
104 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, we are now faster by a factor of two with not too much work. This is still a general solver that you can use with any propensity function and update.

We have constructed a generic tool for doing Gillespie simulations. Specifically, we pass a propensity function into the algorithm. Passing a function as an argument precludes use of numba. If we instead insist that our propensity function be encoded in a global function prop_func(), then we can fully JIT compile the entire simulation. Note, though, that numba has some problems with tuples passed as arguments because it will sometimes infer Numpy array data types. Therefore, the arguments to the propensity function should be passed in as a Numpy array.

In [22]:
@numba.jit(nopython=True)
def prop_func(population, args):
    # Unpack input arrays
    m, p = population
    beta_m, beta_p, gamma = args
    
    # Construct propensities
    return np.array([beta_m,     # Make mRNA transcript
                     m,          # Degrade mRNA
                     beta_p * m, # Make protein
                     gamma * p]  # Degrade protein
                   )

@numba.jit(nopython=True)
def gillespie_draw_numba(population, args):
    """
    Draws a reaction and the time it took to do that reaction.
    
    Assumes that there is a globally scoped function
    `prop_func` that is Numba'd with nopython=True.
    """
    # Compute propensities
    props = prop_func(population, args)

    # Sum of propensities
    props_sum = np.sum(props)
    
    # Compute time
    time = np.random.exponential(1 / props_sum)
    
    # Draw reaction given propensities
    rxn = sample_discrete_numba(props, props_sum)

    return rxn, time


@numba.jit(nopython=True)
def gillespie_ssa_numba(update, population_0, time_points, args):
    """
    Uses the Gillespie stochastic simulation algorithm to sample
    from proability distribution of particle counts over time.
    
    Parameters
    ----------
    update : ndarray, shape (num_reactions, num_chemical_species)
        Entry i, j gives the change in particle counts of species j
        for chemical reaction i.
    population_0 : array_like, shape (num_chemical_species)
        Array of initial populations of all chemical species.
    time_points : array_like, shape (num_time_points,)
        Array of points in time for which to sample the probability
        distribution.
    args : tuple, default ()
        The set of parameters to be passed to propensity_func.
        
    Returns
    -------
    sample : ndarray, shape (num_time_points, num_chemical_species)
        Entry i, j is the count of chemical species j at time
        time_points[i].
        
    Notes
    -----
    .. Assumes that there is a globally scoped function
       `propensity_func` that is Numba'd with nopython=True.
    """
    # Initialize output
    pop_out = np.empty((len(time_points), update.shape[1]), dtype=np.int64)

    # Initialize and perform simulation
    i_time = 1
    i = 0
    t = time_points[0]
    population = population_0.copy()
    pop_out[0,:] = population
    while i < len(time_points):
        while t < time_points[i_time]:
            # draw the event and time step
            event, dt = gillespie_draw_numba(population, args)
                
            # Update the population
            population_previous = population.copy()
            population += update[event,:]
                
            # Increment time
            t += dt

        # Update the index (Have to be careful about types for Numba)
        i = np.searchsorted((time_points > t).astype(np.int64), 1)

        # Update the population
        for j in np.arange(i_time, min(i, len(time_points))):
            pop_out[j,:] = population_previous
        
        # Increment index
        i_time = i
                           
    return pop_out

Now let's test the speed of all three of our functions.

In [23]:
print('Gillespie SSA:')
%timeit gillespie_ssa(simple_propensity, simple_update, \
                      population_0, time_points, args)

print('\nFast Gillespie SSA:')
%timeit gillespie_ssa_fast(simple_propensity_numba, simple_update,\
                            population_0, time_points, args)

print('\nTotally numba\'d Gillespie SSA:')
%timeit gillespie_ssa_numba(simple_update, population_0, time_points, np.array(args))
Gillespie SSA:
240 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Fast Gillespie SSA:
109 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Totally numba'd Gillespie SSA:
6.5 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

We got an extra factor of ten by totally JIT compiling everything. This is a factor of 20 faster than the un-JITted version.

Parallel Gillespie simulations

Sampling by the Gillespie algorithm is trivially parallelizable. We can use the multiprocessing module to parallelize the computation. Syntactically, we need to specify a function that takes a single argument. Below, we set up a parallel calculation of Gillespie simulations for our specific example.

In [24]:
def gillespie_fn(args):
    return gillespie_ssa_numba(*args)
    
def gillespie_parallel(fn, update, population_0, time_points, args,
                       n_simulations, n_threads):
    """
    Convenience function to do parallel Gillespie simulations for simple
    gene expression.
    """
    input_args = (update, population_0, time_points, args)

    with multiprocessing.Pool(n_threads) as p:
        populations = p.map(fn, [input_args]*n_simulations)
        
    return np.array(populations)

We are paying some overhead in setting up the parallelization. Let's time it to see how we do with two threads (my little laptop can only do a maximum of four threads).

In [25]:
n_simulations = 1000

print('\nnumba\'d Gillespie SSA:')
%timeit [gillespie_ssa_numba(simple_update, population_0, time_points, args) \
            for _ in range(n_simulations)]

print('\nParallel numba\'d Gillespie SSA:')
%timeit gillespie_parallel(gillespie_fn, simple_update, population_0, time_points,\
                           args, n_simulations, 2)
numba'd Gillespie SSA:
6.88 s ± 120 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Parallel numba'd Gillespie SSA:
3.94 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We get a bit less than a factor of two speed boost from two cores. This brings our total speed boost from the optimization to a factor of about 40. We will do even better with the parallelization for longer calculations, since the overhead is less of the total calculation time.

More samples

Now that we can sample quickly, let's take many more samples. We will take 10,000 samples, instead of 100.

In [26]:
n_simulations = 10000
pops = gillespie_parallel(gillespie_fn, simple_update, population_0,
                          time_points, args, n_simulations, 2)

# Print out statistics
print('mRNA mean copy number =', pops[:,-50:,0].mean())
print('protein mean copy number =', pops[:,-50:,1].mean())
print('mRNA variance =', pops[:,-50:,0].std()**2)
print('protein variance =', pops[:,-50:,1].std()**2)
print('mRNA noise =', pops[:,-50:,0].std() / pops[:,-50:,0].mean())
print('protein noise =', pops[:,-50:,1].std() / pops[:,-50:,1].mean())
mRNA mean copy number = 10.014776
protein mean copy number = 250.256866
mRNA variance = 10.037897669823998
protein variance = 2043.1479498580434
mRNA noise = 0.31635896213979775
protein noise = 0.18061919675692076

Our mRNA histogram and Fano factor are much closer to the theoretical values. We can also plot the time traces. Instead of plotting all 10,000 of them, we will only plot 100 randomly sampled ones so the plot is not too cluttered.

In [27]:
# Set up plots
plots = [bokeh.plotting.figure(plot_width=300,
                               plot_height=200,
                               x_axis_label='dimensionless time',
                               y_axis_label='number of mRNAs'),
         bokeh.plotting.figure(plot_width=300,
                               plot_height=200,
                               x_axis_label='dimensionless time',
                               y_axis_label='number of proteins')]

# Plot trajectories and mean
for i in [0, 1]:
    for x in pops[::100,:,i]:
        plots[i].line(time_points, x, line_width=0.3, alpha=0.2, line_join='bevel')
    plots[i].line(time_points, pops[:,:,i].mean(axis=0), line_width=6, color='tomato',
                  line_join='bevel')


# Link axes
plots[0].x_range = plots[1].x_range

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

The averages are predictably much smoother with the much higher sample size.