Saturday 9 December 2023

Contributing code to Stan

This post is about me trying to contribute some code to the Stan codebase. My colleagues and I often use the Dirichlet-Multinomial distribution (e.g. in this preprint), which is not a native distribution in the Stan language (v2.33, December 9, 2023). It is not hard to define this as a user-defined distribution, but it would still be convenient to have it as a native function; it helps avoid bugs, repetitive code, and having to #include files. This is going to be my first pull request, so mistakes will be made.
I based all this on tutorials from the Stan Math website (specifically the sections on "adding new functions" and "adding new distributions"), and the stanc webpage. The first link explains how to add functions to the Stan Math library, and the second explains how to make such functions available in the Stan language.

Some updates about the submission process:

2023-12-09: After trying three times, I managed to pass all automated tests for the Stan Math library (see Steps 2 and 7 below). Now I'm waiting for code reviews from a Stan developer.

2023-12-23: Last week a Stan developer reviewed my code, made improvements and requested some changes. I discuss the improvements below (Step 7), as they are really instructive. My PR was then approved and merged with the "develop" branch. I then submitted PRs for the documentation and the stan-to-c++ compiler. This went through without any issues. This means that the Dirichlet-multinomial distribution should be available in Stan version 2.34.

2024-01-15: I posted this blog post on the Stan forum and received some useful feedback. I've added some notes below.



Step 1: Coding the distribution as a user-defined function in Stan

A good place to start is to write a working Stan model that we can use for testing and benchmarking. The Dirichlet-Multinomial distribution is a Multinomial distribution in which the probability vector (simplex) is sampled from a Dirichlet distribution. We can actually use this fact to generate random samples in Stan. Let's start there, because it is easy.

To define a custom RNG in Stan, you write a function in the functions block with name distribution_name_rng, where distribution_name is the name of your distribution. In our case, that's dirichlet_multinomial. The RNG function typically uses native Stan RNGs to generate random samples, which are then transformed, or rejected until they satisfy some condition, or whatever you need. In this case, we first sample a probability vector from the Dirichlet distribution using dirichlet_rng(alpha). Then, we use this probability vector as input for the Multinomial distribution to get an integer array with multinomial_rng(dirichlet_rng(alpha), N). Here N is the "population size" parameter of the multinomial distribution (also referred to as the "number of trials"). Using some Python code, we can use the Stan model to sample random Dirichlet-Multinomial deviates \(X\) and plot \(X / N\) on a \(K\)-simplex. Here we generate \(40\,000\) samples using \(N = 1000\) and \(K = 3\).



Next, we define a probability mass function (PMF), which, like everything in Stan, has to be defined on the log scale. The PMF of the Dirichlet-Multinomial distribution is \[ p(x | \alpha) = \frac{N B(\alpha_0, N)}{\prod_{k : x_k > 0} x_k B(\alpha_k, x_k)}\] where \(\alpha_0 = \sum_{k=1}^K \alpha_k \), \( N = \sum_{k=1}^K x_k \) and \(K\) is the number of categories. The function \(B\) is the Beta function. We can code this in Stan almost verbatim using the lbeta function, which computes \(\log \circ B\), and replacing products with sums.

This implementation contains a small bug, as it does not work for \(x_1 = x_2 = \cdots = x_K = 0\), but we'll fix that later.
To do a simple test, I colored randomly sampled points (using \(\alpha = (2.4, 4.0, 6.0)'\) and \(N = 1000\)), scaled to fit in a 3-simplex, in three ways. First I used the scipy.stats implementation of the Dirichlet-Multinomial PMF, then I used the dirichlet_multinomial_lpmf function defined in the functions block (and exponentiated the values). Finally, I used a Gaussian KDE (from scipy.stats) to compute an empirical density of the points in the sample (this is how I colored points in the simplex plot above). The three methods of coloring give very similar results.

Step 2: Adding the distribution to the Stan math library

Stan is built on top of the Stan math library. All native functions are defined here. We first have to clone the Stan math library from github with git clone git@github.com:stan-dev/math.git. Documentation for the Stan math library can be found here. It shows you how to write programs that use the math library, giving you access to automatic differentiation, Stan functions, etc. The distributions can be found in the folder math/stan/math/prim/prob. Here we want to add the files dirichlet_multinomial_rng.hpp for the random number generator, dirichlet_multinomial_lpmf.hpp for the probability mass function and dirichlet_multinomial_log.hpp which is only there for legacy reasons. In the same folder, we also find the Dirichlet distribution, the Multinomial distribution and the Beta-Binomial distribution. If we somehow combine these three, we should get our Dirichlet-Multinomial distribution, so these examples are very useful. Let's start with the RNG, which is easiest.
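Below is a minimal sketch of what dirichlet_multinomial_rng.hpp looks like. This is my paraphrase for the blog post, not the exact code in the PR, and the doxygen docstring is reduced to a stub:

#include <stan/math/prim/err.hpp>
#include <stan/math/prim/fun/to_ref.hpp>
#include <stan/math/prim/prob/dirichlet_rng.hpp>
#include <stan/math/prim/prob/multinomial_rng.hpp>
#include <vector>

namespace stan {
namespace math {

/**
 * Return a draw from a Dirichlet-Multinomial distribution with
 * prior size (intensity) parameter alpha and number of trials N.
 */
template <class RNG>
inline std::vector<int> dirichlet_multinomial_rng(
    const Eigen::Matrix<double, Eigen::Dynamic, 1>& alpha, int N, RNG& rng) {
  static constexpr const char* function = "dirichlet_multinomial_rng";
  const auto& alpha_ref = to_ref(alpha);
  check_positive_finite(function, "prior size parameter", alpha_ref);
  check_nonnegative(function, "number of trials variables", N);

  // the multinomial RNG does not accept N = 0, so handle that case separately
  if (N == 0) {
    return std::vector<int>(alpha.size(), 0);
  }
  // sample a simplex from the Dirichlet distribution, then use it as the
  // probability vector of the Multinomial distribution
  return multinomial_rng(dirichlet_rng(alpha_ref, rng), N, rng);
}

}  // namespace math
}  // namespace stan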

We'll talk a bit about the docstring later, so let's ignore that for now. The actual logic of the code is really simple: we make use of the Dirichlet and Multinomial RNGs. Therefore, we have to #include their header files. The difficulty is understanding some C++ tricks, error checking, and working with the Eigen library. The first observation is that Stan math functions are defined in the namespace stan::math. Next, we see that the parameter alpha has a somewhat complicated type

const Eigen::Matrix<double, Eigen::Dynamic, 1>& alpha

This means that alpha is a \(K \times 1\) matrix (a vector) with double entries. \(K\) is not determined at compile time. The variable function is used for generating error messages and contains the name of the C++ function. The alpha_ref variable is the result of the to_ref function call. What this does is "evaluate" Eigen expressions. Eigen expressions can be "lazy" in the sense that they are not immediately evaluated. In this case, alpha could be the result of some complicated vector computation. Here, we need the actual values of alpha. If alpha is already evaluated, then to_ref does nothing.
Next, we validate the input. The variable alpha has to be positive and not infinite. There is also a check_positive test, which does allow positive infinity. I guess \(+\infty\) could make sense if only one of the \(\alpha_k\) is infinite, but if multiple \(\alpha_k\) are infinite, the distribution is no longer well defined. So it's better to avoid infinities. The integer N has to be non-negative, and all integers are finite.
As mentioned, the logic is identical to the user-defined Stan function. There is one exception: the case where \(N = 0\). If we look at the RNG for the Multinomial distribution (v4.7.0 (September 5, 2023)), we find that \(N\) has to be positive:

check_positive(function, "number of trials variables", N);

and using this RNG with \(N=0\) indeed results in an exception:

Exception: multinomial_rng: number of trials variables is 0,
but must be positive!

On the other hand, the multinomial_lpmf function works fine when \(N = \sum_{k=1}^K n_k = 0\), and returns \(0 = \log(1)\). So I don't think the check_positive check is correct. I might file an issue on GitHub in the future. For now, we have to make sure that the case \(N = 0\) is handled separately by our RNG: we just return the zero vector.

Next, we are going to write the code for the probability mass function. The distribution is discrete, so we have to define a function that ends with _lpmf instead of _lpdf. If you read the example for "adding new distributions" on the Stan Math page, you'll find that a lot of effort is put in computing the gradients of the function, and also making sure the function is properly vectorized. The binomial example that I based my implementation on did not do any of these things. It relies on automatic differentiation and is currently not vectorized. In fact, none of the multivariate discrete distributions are currently vectorized. Therefore I did not try to implement vectorization for the Dirichlet-Multinomial either. In the non-vectorized case, it also does not make much sense to manually compute derivatives. The log-PMF is just a sum of log-Beta functions. I'll leave vectorization and gradients for a future PR.

Alas, as it turns out, the Stan team did not let me get away with this so easily. See section "Revisions" in "Step 7" below. Thankfully, I got a lot of help. Also, the version of the code with analytic derivatives is somewhat faster than my lazy version shown here.
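For reference, here is a sketch of my naive, pre-revision dirichlet_multinomial_lpmf.hpp (reconstructed for this post, docstring omitted); the much nicer version that was eventually merged is discussed under "Revisions" in Step 7:

#include <stan/math/prim/meta.hpp>
#include <stan/math/prim/err.hpp>
#include <stan/math/prim/fun/lbeta.hpp>
#include <stan/math/prim/fun/to_ref.hpp>
#include <cmath>
#include <vector>

namespace stan {
namespace math {

template <bool propto, typename T_prior_size,
          require_eigen_col_vector_t<T_prior_size>* = nullptr>
return_type_t<T_prior_size> dirichlet_multinomial_lpmf(
    const std::vector<int>& ns, const T_prior_size& alpha) {
  static constexpr const char* function = "dirichlet_multinomial_lpmf";
  const auto& alpha_ref = to_ref(alpha);
  check_positive_finite(function, "prior size parameter", alpha_ref);
  check_nonnegative(function, "number of trials variables", ns);
  check_consistent_sizes(function, "number of trials variables", ns,
                         "prior size parameter", alpha_ref);

  return_type_t<T_prior_size> lp(0.0);
  // if alpha is constant data and we only need the log PMF up to a constant,
  // there is nothing left to compute
  if (!include_summand<propto, T_prior_size>::value) {
    return lp;
  }
  // terms of the log PMF that involve alpha; also accumulate N = sum(ns)
  int sum_ns = 0;
  for (size_t i = 0; i < ns.size(); ++i) {
    if (ns[i] > 0) {
      lp -= lbeta(alpha_ref.coeff(i), ns[i]);
    }
    sum_ns += ns[i];
  }
  // the all-zero outcome has probability one, so the log PMF is zero
  if (sum_ns == 0) {
    return lp;
  }
  value_type_t<T_prior_size> alpha_sum = alpha_ref.sum();
  lp += lbeta(alpha_sum, sum_ns);

  // constant terms that only depend on the integer data ns
  if (!propto) {
    lp += std::log(sum_ns);
    for (size_t i = 0; i < ns.size(); ++i) {
      if (ns[i] > 0) {
        lp -= std::log(ns[i]);
      }
    }
  }
  return lp;
}

}  // namespace math
}  // namespace stan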

Again, the logic of the actual computations is pretty simple and very similar to our user-defined Stan function. We also perform similar domain checks as in the RNG function. In addition, we check that ns and alpha have compatible sizes. Then we define lp, the log-probability. The type of lp has to be the same as the base type of alpha, and this can be achieved with return_type_t. If you had multiple parameters where some have base type double and others Stan's autodiff type var, then passing all parameter types to return_type_t would result in the appropriate return type var. The template parameter T_prior_size is the type of the "prior size" (or intensity) parameter alpha. There are some restrictions on what alpha is allowed to be, and the template parameter require_eigen_col_vector_t<T_prior_size>* = nullptr ensures that alpha has the correct Eigen type. I would like to understand this better, but that will have to wait.
Next, let's look at the propto template parameter. This is used for dropping constant terms in the computation of the log-PMF. In a Stan sampling statement, we don't have to compute any constant terms of the log-probability or log-density, as constant terms do not matter in MCMC. As \(n\) is an integer vector and therefore always constant "data", we don't have to compute log(sum_ns) or log(ns[i]). Not computing these logs can save time and energy. In some cases, we do want to know the absolute log-PMF, for instance in the generated quantities block, when we want to use log-probabilities for model comparison (WAIC, LOO-IC, etc.). It is also possible that alpha is data (meaning that the base type is double). In that case, there is nothing to compute if we're only interested in the log-PMF up to constants. Hence, we return 0.0. To test whether "summands" depending on the "prior size" parameter have to be included in the log-PMF when propto == true, we can use include_summand<propto, T_prior_size>::value. Next, we compute the terms of the log-PMF that include alpha, and in the same loop, we compute the sum of the ns. If sum_ns is zero (\(N = 0\)), then lp is still zero and we can simply return this value.
Then we compute \(\alpha_0 = \sum_{k=1}^K \alpha_k\) using Eigen's sum method. Note that the type is value_type_t<T_prior_size>, which is the base type of alpha. The difference between return_type_t and value_type_t is that the former takes multiple arguments and returns the most "general" base type that fits all of them. The arguments of return_type_t don't have to be containers.
If propto is false, we still have to add the terms that only depend on ns.

To give the user access to the functions we just defined, we have to add #include directives in the file prob.hpp in folder stan/math/prim. This file contains a list of include directives for all distributions in alphabetical order. So we just add our new files in the right place.

Let's now write a C++ program that imports the Stan library to test our new function. Instructions on how to use the Stan math library in your project can be found here. In this example, I will simply sample a random vector from the Dirichlet-Multinomial distribution with fixed intensity. I then compute the log-probability of that sample. I put some effort into generating "real" random numbers using std::random_device, and I wrote a function to print std::vector. It helps to read a bit about the Eigen library, which has some methods for formatting matrices and vectors.
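The program, hello_dirmult.cpp, looks roughly like this. It is a simplified sketch with a plain print_vector helper; the real thing uses Eigen's formatting options for nicer output:

#include <stan/math.hpp>
#include <boost/random/additive_combine.hpp> // boost::ecuyer1988
#include <iostream>
#include <random>
#include <vector>

// print a std::vector<int> as (x1, x2, ..., xK)
void print_vector(const std::vector<int>& xs) {
  std::cout << "(";
  for (size_t i = 0; i < xs.size(); ++i) {
    if (i > 0) { std::cout << ", "; }
    std::cout << xs[i];
  }
  std::cout << ")";
}

int main() {
  // seed a Boost RNG (the engine type Stan's RNGs expect) with a "real" random number
  std::random_device rd;
  boost::ecuyer1988 rng(rd());

  Eigen::VectorXd alpha(3);
  alpha << 10.1, 3.5, 7.2;
  int N = 100;

  // sample from the Dirichlet-Multinomial distribution...
  std::vector<int> x = stan::math::dirichlet_multinomial_rng(alpha, N, rng);
  // ...and compute the log-probability of that sample (propto = false)
  double lp = stan::math::dirichlet_multinomial_lpmf<false>(x, alpha);

  std::cout << "random sample from dirichlet_multinomial = ";
  print_vector(x);
  std::cout << "\nlog dirichlet_multinomial = " << lp << std::endl;
  return 0;
}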

This program prints something like:

random sample from dirichlet_multinomial(100, (10.1,  3.5,  7.2)) = (30, 8, 62)
log dirichlet_multinomial((30, 8, 62) | (10.1,  3.5,  7.2)) = -8.60311

To compile a program that makes use of the Stan math library, Stan provides a useful tool called "standalone". Compilation of the above code is done as follows

make -f path_to_stan_math/make/standalone hello_dirmult

Building the Stan math library can be done in a similar way

make -j4 -f path_to_stan_math/make/standalone math-libs

The -j4 flag determines the number of CPU cores used for the task (or threads). I have an old core i5. So I'll use 4.

If we look at the makefile in the Stan math directory, we see that there is a target called doxygen. This is for building the documentation with doxygen using docstrings in the code. In our case, we have used some LaTeX equations (using delimiters \f$ ... \f$ and \f[ ... \f]), that are rendered by doxygen. Let's check that this is rendered correctly, and that we didn't make any TeX errors. We can build the documentation with make doxygen. We then find the HTML pages in doc/api/html and view them in a browser. Here's a screenshot
And here's one for the RNG documentation
Building the docs revealed several mistakes (and I left at least one in the screenshots above), so doing this was definitely helpful.

Step 3: Writing unit tests

This step may sound a bit boring, but I found out that it is quite important. First of all, the Stan team will not accept your code without any unit tests. Second, testing your code helps. I managed to catch two errors in my (very small) piece of code, both because of writing tests. Stan makes use of the GoogleTest framework. There are lots of examples of tests for distributions in the folder math/test/unit/math/prim/prob/. I took multinomial_test.cpp as a template, which I modified. One of the tests checks that an exception is thrown if one of the intensity parameters is infinite. For the multinomial distribution, the function checks that the parameters form a simplex. Hence, it also checks if the values are finite, as they are in the interval \((0,1)\). In our case, the intensities only have to be positive, and so initially I changed check_simplex into check_positive. However, infinity is also positive, and therefore this check does not raise an error if one of the parameters is infinite. Therefore I changed the check to check_positive_finite (provided in math/stan/math/prim/err.hpp) and this time my test passed. The other error was produced by a case in which all counts are zero. Our definition of the PMF would lead to \(0 / 0\), which gives a NaN. However, if the number of trials (or population size) is zero, then it makes sense to define the probability of the event \((0,0,\dots,0)'\) as \(1\), since this is the only possible event. I fixed this by checking for this condition in dirichlet_multinomial_lpmf. I will not show you all the unit tests I wrote/modified, but just to give an idea, this is the first one, in which I compare the value of the Dirichlet-Multinomial PMF with a known value (calculated with scipy.stats in Python).
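A sketch of that first test (my reconstruction, not necessarily verbatim what I submitted):

#include <stan/math/prim.hpp>
#include <gtest/gtest.h>
#include <vector>

TEST(ProbDistributionsDirichletMultinomial, DirichletMultinomial) {
  using stan::math::dirichlet_multinomial_lpmf;

  Eigen::Matrix<double, Eigen::Dynamic, 1> theta(3);
  theta << 2.0, 3.0, 5.0;
  std::vector<int> ns = {1, 2, 3};

  // reference value computed with scipy.stats in Python
  EXPECT_FLOAT_EQ(-2.477938, dirichlet_multinomial_lpmf(ns, theta));
}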

What's happening here is that TEST is a macro defined in gtest.h. The identifiers ProbDistributionsDirichletMultinomial and DirichletMultinomial are used to name the test function. The test function's body is defined in curly brackets. We then create a real vector \(\theta = (2.0, 3.0, 5.0)'\) and an integer vector \(n = (1,2,3)'\) and test if \[ \log(p(n | \theta)) = -2.477938 \] where \(p\) is the PMF of the Dirichlet-Multinomial distribution.
You can run all or a subset of tests with the runTests.py python script. For instance

python3 runTests.py test/unit/math/prim/prob/dirichlet_multinomial_test

runs only the new tests. Warning: running all tests takes a lot of time.

Step 4: Adding the distribution to the Stan documentation

Before we expose our new distribution to the user (i.e. make it a native function in the Stan programming language), we have to write some documentation for the Stan functions reference. We can clone the documentation with git clone git@github.com:stan-dev/docs.git. The documentation is written in R markdown. This is the part that I added to multivariate_discrete_distributions.Rmd in the folder docs/src/functions-reference

This is largely based on the multinomial example. Note that all functions have an HTML-commented function signature. The README.md says this about these comments:
The Stan Functions Reference contains HTML comments which describe the function signature for all functions. The script extract_function_sigs.py is used to scrape these signatures into a plain text file.
This feature exists for Stan-specific syntax highlighting. Syntax highlighters (see this link for a list) have to know which variable names correspond to native functions, so that they can be colored distinctly. To build the documentation, use the python script build.py as follows

python3 build.py 2 34

The numbers 2 and 34 correspond to the major and minor version numbers. The R markdown file contains strings such as `r since("2.34")` which correspond to the version number when a function became available. It was not clear to me what version I should use in my new documentation. At the time of writing, the version was 2.33, and I reasoned that this new distribution would become available in the next version: 2.34.
The build script should work if you have RStudio installed, which I don't (I have rstudio-server, in case you're wondering how I can survive). In such a case, you'll have to install both pandoc and pandoc-citeproc with sudo apt install (on Ubuntu 22.04, that is). Make sure the R packages bookdown, dplyr, ggplot2, kableExtra are installed with install.packages(). However, I also found that I needed to install the Ubuntu packages libfontconfig1-dev, libcurl4-openssl-dev, libxml2-dev with sudo apt install. Here's a screenshot of the new page in the functions reference pdf.


Step 5: Adding the distribution to the stanc3 compiler

In terms of lines of code, this step is really simple. The stanc3 compiler is written in the OCaml language, which might be a bit unfamiliar (it certainly is for me). Also, the OCaml ecosystem has a bit of a steep learning curve. We start by cloning stanc3 using git clone git@github.com:stan-dev/stanc3.git. The file that contains all native functions and distributions can then be found in stanc3/src/middle/Stan_math_signatures.ml. We just have to add lines for dirichlet_multinomial_rng and dirichlet_multinomial, and we can basically copy the code from the signatures of multinomial_rng and multinomial.

Instructions on how to install all dependencies and stanc can be found here. The stanc repository contains a script that does basically everything for you. However, if you want to install OCaml yourself, check this page. You will need the dune package. I started working on this a while ago, and when I picked up the work recently, I found that I had to update my OCaml version. This can be done with opam switch create [version number] (with the right version number).

Naturally, the stanc compiler has its own tests, which include a couple of Stan programs to test function signatures. The Stan program stanc3/test/integration/good/function-signatures/distributions/rngs.stan contains function calls for all RNGs. We can add our new RNG to this file by adding the line

  ns = dirichlet_multinomial_rng(alpha, 20);

The variables ns and alpha are already defined or declared. I then added a folder stanc3/test/integration/good/function-signatures/distributions/multivariate/discrete/dirichlet_multinomial and created a file dirichlet_multinomial_lpmf.stan in this folder. I basically copied the content of the test file for the Multinomial distribution (located at stanc3/test/integration/good/function-signatures/distributions/multivariate/discrete/multinomial) and changed multinomial to dirichlet_multinomial. The fact that the multinomial distribution takes simplices and the Dirichlet Multinomial positive vectors does not seem to matter for these tests: it's just about function signatures. This is the test file that I added:

Finally, use dune runtest to check changes in the code, and then run dune promote to accept these changes. Otherwise the integration tests (see Step 7 below) will fail. More instructions on exposing new functions can be found here.

Step 6: Compiling a model with cmdstanpy

To test if everything works as expected, I will now write a Stan model that can be used to estimate intensity parameters of the Dirichlet-Multinomial distribution, given some data that was sampled from this distribution with known parameters. To make life easy (i.e. no messing around with cmdstan, json files etc.), I want to do this in Python using the cmdstanpy interface. Let's first write our Stan model.

This is pretty straightforward. We have \(K\) categories, \(N\) observations \(X\) and for sampling pseudo-observations, we also pass a population size parameter \(M\) to the model. In the parameters block, we define \(\alpha \in \mathbb{R}_+^K\), which is the intensity, and in the model block we give \(\alpha\) a weakly-informative log-normal prior. We have to write the "sampling statements" for \(X\) in a for loop, as the Dirichlet-Multinomial distribution does not allow vectorization at this point. In the generated quantities block, we sample "fake data" \(Y\) and compute the likelihood of the data \(X\), again in a for loop. This is not the most exciting model, but it's good enough for this test.

Important note: The "linking" solution presented in the following section works, but there is a better "official" solution, which I explain a little bit later.
So far, we have only used the Stan Math library in C++ directly. In order to get cmdstan to work with (1) our extended Stan Math library, and (2) the updated stanc compiler, we have to link all these components together. First, we have to point stan to the right math library. If you clone stan from github (with git clone git@github.com:stan-dev/stan.git) you will find a folder stan/lib/stan_math. This folder is empty and should contain the contents of the Stan Math library. It is possible to replace the folder stan_math with a link to the math folder. First cd to the stan/lib folder and remove the empty stan_math folder. Then create a link with

ln -s path/to/math stan_math

Then clone cmdstan with git clone git@github.com:stan-dev/cmdstan.git. The cmdstan folder contains an empty stan folder. Replace this with a link to the modified stan folder.

This approach of creating links worked for me, but there is an easier way. It is possible to clone cmdstan from github "recursively", i.e. also clone the content of the stan folder, and the content of the lib/stan_math folder at once. Download and installation instructions for cmdstan can be found here. As indicated in the "Installing from GitHub repo" section, we can clone the stan and math repositories with make stan-update from the cmdstan folder. Or, instead of just cloning the cmdstan repository, you can clone recursively with git clone https://github.com/stan-dev/cmdstan.git --recursive. This cloning is quite slow, because you don't only get the latest version, but also the entire history of Stan development. To speed things up, you could do a "shallow" clone with the additional flags --depth 1 --shallow-submodules. We can then build cmdstan with make build. This will also download a pre-built stanc compiler. This is not quite what we want, though, because we have to link to our own version of the math library, and our modified stanc compiler.
To tell cmdstan that it should use our modified stanc version, we have to create a file called local in the folder cmdstan/make. In fact, there is a file local.example already present in this folder which contains some other (commented-out) options. We can copy this file with cp local.example local and then open it in a text editor and add the line STANC3=path/to/stanc3. I have stanc3 and cmdstan in the same folder, so for me this line is STANC3=../stanc3

It turns out that a similar trick can be used to use a custom math repository, and so we don't have to create symbolic links. In the cmdstan makefile, you'll find these lines:
-include make/local     # user-defined variables

STAN ?= stan/
MATH ?= $(STAN)lib/stan_math/
The ?= assignment in makefiles only works when the variable on the left-hand-side hasn't already been defined. So if we define MATH=path/to/math in our make/local file (which is included in the main makefile), then we override the MATH ?= stan/lib/stan_math/ assignment. We could do the same for the STAN variable.


To compile and run the above Stan model with cmdstanpy, we have to make sure cmdstanpy can find the right cmdstan installation. This can simply be done using cmdstanpy.set_cmdstan_path("path/to/cmdstan") where "path/to/cmdstan" is the location of the modified cmdstan installation.

The python script makes a plot of the ground truth parameter vector alpha (black dots), together with gray violins representing the posterior distribution. Judging from this plot, everything seems to be working.
It also helps to look at the diagnostics (with print(sam.diagnose()) in python). They look reassuring:

Processing csv files: /tmp/tmp_vwn59j0/dirmultqq672ujg/dirmult-20231126104727_1.csv, ...

Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.

Checking sampler transitions for divergences.
No divergent transitions found.

Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory.

Effective sample size satisfactory.

Split R-hat values satisfactory all parameters.

Processing complete, no problems detected.

Step 7: Creating a pull request (PR) on GitHub

As Stan has multiple components, we have to submit multiple PRs. First, we have to submit our code to the Stan Math library. Then we have to provide new documentation for the functions reference, and then we have to submit our modifications to stanc3.

Stan Math PR

I will start with creating a PR for the math library. All PRs must correspond to an issue. So typically you first "request" a feature and then submit a PR to add it. In this case, there already exists an issue with a feature request for the Dirichlet-Multinomial distribution (from 2014), so we'll use that.
There's a lot of information to digest before you make a PR. A place to start is the "contributing" page. A much more detailed (and essential) "Developer process overview" can be found here, although these instructions are sometimes specific for the stan repository (not math). Then there is also this Introduction to Stan for New Developers. It is also a good idea to have a look at the pull request template (which is math specific).

On GitHub, fork the Stan math repository. Then clone the fork to your local machine. Make sure you're on the develop branch with git branch, and create a new branch with git checkout -b feature/issue-NN-short-description (this creates and switches to the new branch) where NN is the issue number and "short-description" is... a short description of the feature. In Steps 1-6, I had been working on just a clone of the math repository. So at this point I copied all new files and applied changes to the header file. The Stan team asks that you only commit well formatted code (more about this later), so be careful with what you commit to the feature branch. When you're ready, add files with git add path/to/file and commit changes with git commit -a -m "explain what changed". The -a option makes sure that all tracked files are added before the commit. To push the new branch to the remote fork, use git push -u origin feature/issue-NN-short-description

There are some requirements not covered in Steps 1-6 above. To make sure your code will pass Stan's automated tests, the following should succeed locally:

make test-headers
make test-math-dependencies
runTests.py test/unit
make doxygen
make cpplint

For me, cpplint failed as lines were longer than 80 characters, and some lines ended in whitespace. In order to run cpplint, I had to install python2 (as Ubuntu 22.04 only has python3) with sudo apt install python2. On the web, there are some pages explaining how to make python2 the default python version. I think that's a really bad idea. However, you can tell make to use python2 with

export PYTHON=python2
make cpplint

The make test-math-dependencies rule did not work for me: the target just points to ./runChecks.py, whose hash-bang did not work on my system. So I just ran the script directly with

python3 runChecks.py

This ran without producing any output, which means no errors were found.

My first PR failed very quickly. It turned out that passing cpplint was not sufficient to ensure proper formatting. To make sure that your code is formatted correctly, you have to use clang-format.
Installing clang-format on Ubuntu requires you to first add the PPA for LLVM. Instructions are found here. On Ubuntu 22.04, you would need the line "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy main". First get the signature with "wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -" (see the section "Install (stable branch)"). Then add the PPA with sudo add-apt-repository followed by the PPA string. It should also be possible to use the GUI (software and updates). On the coding style page, instructions are given to install clang-format-5.0. For me, this did not work, but installing clang-format (without the 5.0) did. In the hooks folder, a script is provided to install a pre-commit git "hook". This makes sure that clang-format is run on all cpp and hpp files that were changed.

My second PR failed as well, this time at the "verify changes" stage.
I got the following message

remote: Not Found
fatal: repository 'https://github.com/chvandorp/stan_math/math/' not found
error: pathspec 'forkedOrigin/feature/issue-54-dirichlet-multinomial' 
	did not match any file(s) known to git
script returned exit code 1

This makes some sense: I had renamed the repository from math to stan_math to give it a somewhat more descriptive name, but the testing pipeline apparently expected the fork to be called math. The solution was simple: just change the name of the fork back on GitHub. The PR is updated automatically after a name change; all you have to do is change the remote URL in your local .git/config file. In the meantime, the Stan team has fixed this "bug" in their testing pipeline: it should now be possible to choose a different name for your fork.

After attempt 2, I also received a request to add an additional unit test. This test had to be added to test/unit/math/mix/prob/ using the expect_ad function. I was able to use the Dirichlet distribution as an example. expect_ad compares the AD result with a value based on the finite-difference method. These tests also pass if an exception is thrown; in that case expect_ad just checks that all AD types lead to the same exception. With these changes I tried submitting the PR again.
This time all checks passed.
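For reference, the mix test looks roughly like this (a sketch modeled on the Dirichlet test, not the exact file that was added):

#include <test/unit/math/test_ad.hpp>
#include <vector>

TEST(mathMixScalFun, dirichletMultinomial) {
  // test the lpmf as a function of alpha only; ns is constant integer data
  auto f = [](const auto& alpha) {
    std::vector<int> ns{1, 2, 3};
    return stan::math::dirichlet_multinomial_lpmf(ns, alpha);
  };
  Eigen::VectorXd alpha(3);
  alpha << 2.0, 3.5, 5.0;
  // compares gradients (and higher-order derivatives) of all autodiff types
  // against finite differences of the double version
  stan::test::expect_ad(f, alpha);
}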

Revisions

My dirichlet_multinomial_lpmf function is written in an ugly imperative style, while the Eigen and Stan math libraries provide methods to vectorize a lot of these operations. A Stan developer rewrote my function in a much nicer, more functional style, and also added analytic derivatives.

I definitely do not understand all details of this revised function, but let's go through it anyway. The first line that's different defines a variable alpha_val as follows

const auto& alpha_val = as_value_array_or_scalar(alpha_ref);
As I understand it, the point of this is that if we want to compute gradients ourselves, then we want to temporarily "turn off" automatic differentiation. Hence, we have to use the "values" of Stan autodiff types instead of the full types in our computations. This is what as_value_array_or_scalar does. Next, we see that Stan math provides an overloaded function sum that can take sums of a lot of things, so there's no need for for loops or std::accumulate. The integer-valued ns are then cast to double values using Eigen's cast<double> method (I'm not sure if the template disambiguator is required here).

Next, we are going to compute the log-probability lp. First, we add \(\log(N) - \sum_{k : x_k > 0} \log(x_k)\) to lp, but only if we are interested in constants. Notice that we can do an elementwise comparison ns_array > 0 on Eigen arrays, and use the select method to compute an elementwise ternary operation. One important thing to realize here is that the computations are "lazy". This means that although the expression log(ns_array) occurs in the C++ code, the log of the elements of ns_array that are zero is never evaluated. Similarly, we add \(\log(B(\alpha_0, N)) - \sum_{k : x_k > 0} \log(B(\alpha_k, x_k))\) to lp.
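As a small standalone illustration of this select idiom (not code from the PR):

#include <Eigen/Dense>
#include <iostream>

int main() {
  Eigen::ArrayXd ns(4);
  ns << 0.0, 1.0, 2.0, 0.0;
  // elementwise ternary: log(ns[i]) where ns[i] > 0, and 0.0 otherwise.
  // because the expression is lazy, log() is never evaluated at the zero entries.
  Eigen::ArrayXd terms = (ns > 0).select(ns.log(), 0.0);
  std::cout << terms.transpose() << std::endl;  // prints: 0 0 0.693147 0
  return 0;
}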

Now we will compute the gradient. For this, we use an object of the class partials_propagator, created with the function make_partials_propagator. This function takes a variable number of arguments, one for each of the parameters or random variables that might require a gradient. In this case, that is only alpha, because ns is always constant. To put the "manually calculated" gradient in the right "slot", we use the partials<0> function, where the template argument 0 is used because alpha is our only non-constant parameter. The gradient of the distribution is given by \[ \frac{\partial\log p(x \mid \alpha)}{\partial\alpha_i} = \frac{\partial\log B(\alpha_0, N)}{\partial \alpha_i} - \frac{\partial \log B(\alpha_i, x_i)}{\partial \alpha_i} \] where the second term is only present when \(x_i > 0\). Then we write \( \log B(a, b) = \log \Gamma(a) + \log \Gamma(b) - \log \Gamma(a+b) \) and consider two cases, \(x_i = 0\) (case I) and \(x_i > 0\) (case II). In case I, we get \[ \frac{\partial\log p(x \mid \alpha)}{\partial\alpha_i} = \psi(\alpha_0) - \psi(\alpha_0 + N) \] where \(\psi(a) = \Gamma'(a) / \Gamma(a) \) denotes the digamma function. In case II, we get \[ \frac{\partial\log p(x \mid \alpha)}{\partial\alpha_i} = \psi(\alpha_0) - \psi(\alpha_0 + N) - \psi(\alpha_i) + \psi(\alpha_i + x_i) \] Notice that we can use the latter expression for both cases, because \(\psi(\alpha_i + x_i) - \psi(\alpha_i) = 0\) when \(x_i = 0\). However, evaluating the digamma function is not free, and therefore we use the select method. The build method of the partials_propagator class then combines gradients and values in the right way and returns the appropriate value.

The other PRs

The remaining PRs (documentation and stanc) are quite uneventful. I did not create separate issues for the doc and stanc PRs, but referred to the accepted PR on stan-dev/math. For user-facing changes to stanc3, you must provide a PR which documents these changes. So first submit the PR for the doc, and then create a PR for stanc3.

Sunday 3 January 2021

Spin your own autodiff in C++

In recent years, I have made a lot of good use of automatic differentiation (AD, autodiff), for instance using Stan or PyTorch. However, it always appeared a bit magical to me, until I decided to just implement a simple forward-mode "autograd" method in C++ myself. This is a stripped-down version of my "exercise", with plenty of room for enhancement.

Dual numbers

To make forward-mode autodiff work, we replace all floating point numbers with so-called dual numbers \(z = x + y N\), where \(N\) is a nilpotent element, i.e., \(N^2 = 0\). The space of dual numbers \(\mathbb{D}\) is a real algebra with addition and multiplication defined as follows. Let \(z_1 = x_1 + y_1 N\) and \(z_2 = x_2 + y_2 N\), then \[ z_1 + z_2 = (x_1 + x_2) + (y_1 + y_2) N \] and \[ z_1 \cdot z_2 = x_1 x_2 + (x_1 y_2 + x_2 y_1) N \] Notice how this product encodes the product rule for derivatives. All elements \(z \in \mathbb{D} \) with \(z = x + yN\) and \(x \neq 0 \) have a multiplicative inverse \(z^{-1} = x^{-1} - yx^{-2} N \). In C++ we can define the dual numbers using a class and overload the arithmetic operators as follows. In the header file dual.hpp we declare a class Dual with members x and y. In the source file dual.cpp we define all methods using the rules for dual arithmetic described above.
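The files dual.hpp and dual.cpp are not reproduced verbatim here, but a minimal, header-only sketch of the Dual class looks like this:

// dual.hpp -- minimal sketch of a dual number class
#ifndef DUAL_HPP
#define DUAL_HPP

class Dual {
public:
  double x; // real component
  double y; // infinitesimal component (the coefficient of N)
  Dual(double x = 0.0, double y = 0.0) : x(x), y(y) {}
};

// z1 + z2 = (x1 + x2) + (y1 + y2) N
inline Dual operator+(const Dual& a, const Dual& b) {
  return Dual(a.x + b.x, a.y + b.y);
}
inline Dual operator-(const Dual& a, const Dual& b) {
  return Dual(a.x - b.x, a.y - b.y);
}
// z1 * z2 = x1 x2 + (x1 y2 + x2 y1) N, because N^2 = 0
inline Dual operator*(const Dual& a, const Dual& b) {
  return Dual(a.x * b.x, a.x * b.y + a.y * b.x);
}
// division: multiply with the inverse 1/(x + yN) = 1/x - y/x^2 N
inline Dual operator/(const Dual& a, const Dual& b) {
  return Dual(a.x / b.x, (a.y * b.x - a.x * b.y) / (b.x * b.x));
}

#endif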

Function evaluation

The magic of dual numbers comes from the way we can extend differentiable functions on \(\mathbb{R}\) to the dual numbers. For any differentiable function \(f : \mathbb{R} \rightarrow \mathbb{R} \), we can extend \(f\) to a function \(f : \mathbb{D} \rightarrow \mathbb{D} \) by defining \[ f(x + yN) = f(x) + y f'(x)N \] where \(f'\) is the derivative of \(f\). To see why this is a natural choice, we take the Taylor expansion of \(f\) around \(x\) and see that \[ f(x + yN) = f(x) + f'(x) yN + \tfrac12 f''(x) (yN)^2 + \dots \] However, as \(N^2 = 0\), all the terms of order \(y^2\) and higher vanish. This allows us to overload a number of standard mathematical functions such that they work on both double and Dual variables. In the files dualmath.hpp and dualmath.cpp we define some functions like \(\sin(x)\) and \(\cos(x)\). Notice that the functions declared in dualmath.hpp are declared for double values in the <cmath> header, which is included in the file dualmath.cpp.
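A sketch of these overloads (again a header-only reconstruction of dualmath.hpp/dualmath.cpp):

// dualmath.hpp -- overloads of some <cmath> functions for Dual arguments
#ifndef DUALMATH_HPP
#define DUALMATH_HPP

#include <cmath>
#include "dual.hpp"

// the general rule: f(x + yN) = f(x) + y f'(x) N
inline Dual sin(const Dual& z) { return Dual(std::sin(z.x), z.y * std::cos(z.x)); }
inline Dual cos(const Dual& z) { return Dual(std::cos(z.x), -z.y * std::sin(z.x)); }
inline Dual exp(const Dual& z) { return Dual(std::exp(z.x), z.y * std::exp(z.x)); }
inline Dual log(const Dual& z) { return Dual(std::log(z.x), z.y / z.x); }

// tan is defined in terms of the overloads above and dual division
inline Dual tan(const Dual& z) { return sin(z) / cos(z); }

// integer powers need no manual derivative: repeated dual multiplication suffices
inline Dual pow(const Dual& z, int n) {
  Dual w(1.0, 0.0);
  for (int i = 0; i < n; ++i) { w = w * z; }
  return w;
}

#endif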

Templated functions

Of course, in the functions defined in dualmath.cpp we had to "manually" calculate the derivative to evaluate the functions on \(\mathbb{D}\) (with the exception of the pow function). However, we went to the trouble of defining the dual numbers in C++ in order to "automatically" calculate derivatives of functions. To see how this works, suppose that we define a templated C++ function fun<T>(T x); then we can use either fun<double> or fun<Dual>. The derivative of fun can then be evaluated at a real number \(x\) using a dual number \(z = x + yN\) with non-zero \(y\). In the following example, we "automatically" compute the derivative of \(\tan(x)\).
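Here is a sketch of this example program (ad_example1.cpp, my reconstruction built on top of the Dual class and dualmath overloads above):

// ad_example1.cpp -- compute the derivative of tan(x) with dual numbers
#include <iostream>
#include <cmath>
#include "dual.hpp"
#include "dualmath.hpp"

// a templated function that works for T = double and T = Dual alike
template <class T>
T fun(T x) { return tan(x); }

int main() {
  for ( double x = 0.0; x < 1.5; x += 0.01 ) {
    Dual z(x, 1.0);          // seed the infinitesimal component with 1
    Dual w = fun<Dual>(z);   // w.y now contains tan'(x)
    double exact = 1.0 / (std::cos(x) * std::cos(x)); // analytic derivative for comparison
    std::cout << x << " " << w.y << " " << exact << std::endl;
  }
  return 0;
}

Compiling and running this file gives the following output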
$ g++ ad_example1.cpp dual.cpp dualmath.cpp -o ad_example1
$ ./ad_example1
0 1 1
0.01 1.0001 1.0001
0.02 1.0004 1.0004
0.03 1.0009 1.0009
0.04 1.0016 1.0016
0.05 1.0025 1.0025
0.06 1.00361 1.00361
0.07 1.00492 1.00492
0.08 1.00643 1.00643
...
which we can use to make the following graph

To see why this works, we write \(z = x + 1N\) and mimic the calculation done by the computer: \[ \tan(z) = \frac{\sin(z)}{\cos(z)} = \frac{\sin(x) + \cos(x)N}{\cos(x) - \sin(x)N} \] which is defined in dualmath.cpp. Then using the rules for division and multiplication of dual numbers defined in dual.cpp, we get \[ \frac{\sin(x) + \cos(x)N}{\cos(x) - \sin(x)N} = (\sin(x) + \cos(x)N)(1/\cos(x) + \sin(x)/\cos^2(x)N) \] \[ = \frac{\sin(x)}{\cos(x)} + (1 + \sin^2(x)/\cos^2(x))N \] Which simplifies to \[ \tan(z) = \tan(x) + 1/\cos^2(x) N \] Hence, the Dual::y component of fun(z) is equal to \(\tan'(x)\).

Create derivatives and gradients with a functor

Modern C++ makes it easy to write functional programs, and we can use this to define a functor that returns the derivative of \(f\) when given a function \(f\). For this, we make use of the <functional> header. The signature of our functor has the following form
std::function<double(double)> derivative(std::function<Dual(Dual)> f);
meaning that the functor derivative takes a function \(f : \mathbb{D} \rightarrow \mathbb{D}\) as argument, and returns a function \(f' : \mathbb{R} \rightarrow \mathbb{R} \).
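A sketch of how this functor can be implemented (my reconstruction of the relevant part of autograd.cpp):

#include <functional>
#include "dual.hpp"

std::function<double(double)> derivative(std::function<Dual(Dual)> f) {
  // capture f in a lambda and return it as a plain real-valued function
  return [f](double x) -> double {
    Dual z(x, 1.0); // seed the infinitesimal component
    return f(z).y;  // the y component of f(z) equals f'(x)
  };
}

For instance, derivative([](Dual z) { return tan(z); }) returns a function that evaluates \(\tan'(x)\) at any real \(x\).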
Things become much more useful in higher dimensions. In the example below, we also show how to automatically compute the gradient of a function \( F : \mathbb{R}^n \rightarrow \mathbb{R} \) and (more efficiently) directional derivatives \(\nabla F(x) \cdot y \). The 1D derivative, gradient and directional derivatives are declared in the following header file. The definitions of the functors make use of C++ lambda expressions that capture the function given as argument. A simple example of the functors defined in autograd.cpp is given here. The result of the second example is the following:
$ g++ ad_example2.cpp dual.cpp dualmath.cpp autograd.cpp -o ad_example2
$ ./ad_example2
norm of x = 1.58114
gradient of norm(x) = [0.316228, 0.948683]
gradient of norm(x) times u = 0.948683

More advanced methods

In addition to gradients, one can compute Jacobians of vector fields. Also, instead of dual numbers, one can define so-called "hyper-dual" numbers to compute second-order derivatives and Hessians.

Sunday 21 June 2020

A C++ worker pool using thread and functional

To celebrate the fifth anniversary of the TBZ533 blog, I decided to revisit the first post. That post was about a class that could be inherited to provide worker pool functionality, implemented with the pthread library and function pointers. Using a bit of modern C++, the implementation of a simple worker pool becomes much more elegant.

In the header file workerpool.hpp, we declare the WorkerPool class. The argument of the constructor is the number of workers. If the number of workers is zero, all tasks are executed serially (which can be convenient for debugging). A WorkerPool cannot be copied, so we delete the copy constructor and the copy assignment operator. A Task is defined as a function that takes no arguments and returns nothing. A Task can be passed to the worker pool using the push_back() method (whose name is inspired by std::list::push_back()). The final public method is the sync() method, which waits for all tasks to be completed.

The private methods and members of WorkerPool are a vector of worker threads, a list of (unfinished) tasks, the worker routine, and some synchronization objects. The boolean abort_flag is used to get the workers to exit their worker routines. The count integer represents the number of unfinished tasks. Tasks are guarded by the task_mut mutex, and count by the count_mut mutex. The task_cv condition variable is used to signal that a new Task is added to the list of tasks, while the count_cv is used to signal that another task has been completed.
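A sketch of this header (my reconstruction of workerpool.hpp):

#ifndef WORKERPOOL_HPP
#define WORKERPOOL_HPP

#include <thread>
#include <mutex>
#include <condition_variable>
#include <functional>
#include <vector>
#include <list>

class WorkerPool {
public:
  explicit WorkerPool(size_t num_workers);
  ~WorkerPool();
  // a WorkerPool cannot be copied
  WorkerPool(const WorkerPool&) = delete;
  WorkerPool& operator=(const WorkerPool&) = delete;

  using Task = std::function<void()>;
  void push_back(Task task);   // add a task to the queue
  void sync();                 // wait until all tasks have finished
private:
  void worker_routine();       // executed by each worker thread
  std::vector<std::thread> workers;
  std::list<Task> tasks;       // unfinished tasks
  bool abort_flag = false;     // tells the workers to exit their routine
  int count = 0;               // number of unfinished tasks
  std::mutex task_mut, count_mut;
  std::condition_variable task_cv, count_cv;
};

#endif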



The source file contains the definitions of the methods of WorkerPool. The constructor initiates a number of worker threads and starts their worker_routine(). The destructor makes sure that all threads exit their worker routine and joins the threads. This means that the WorkerPool uses the RAII technique. The push_back() method evaluates the Task object if there are no workers, and otherwise adds the task to the tasks list after increasing the count. We signal that a task has been added in case a worker is sleeping. The sync() method only returns when count == 0. Finally, the most important part of the code is the worker_routine(). When the list tasks is empty, a worker waits until a signal is given that a new task is added, or the abort_flag is set. In the first case, the task is executed; in the second case, the worker exits the worker_routine().
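And a matching sketch of the method definitions (my reconstruction of workerpool.cpp):

#include "workerpool.hpp"

WorkerPool::WorkerPool(size_t num_workers) {
  for (size_t i = 0; i < num_workers; ++i) {
    workers.emplace_back(&WorkerPool::worker_routine, this);
  }
}

WorkerPool::~WorkerPool() {
  { // tell the workers to exit their routine...
    std::lock_guard<std::mutex> lock(task_mut);
    abort_flag = true;
  }
  task_cv.notify_all();
  for (auto& w : workers) { w.join(); } // ...and join the threads (RAII)
}

void WorkerPool::push_back(Task task) {
  if (workers.empty()) { task(); return; } // serial fallback
  {
    std::lock_guard<std::mutex> lock(count_mut);
    ++count;
  }
  {
    std::lock_guard<std::mutex> lock(task_mut);
    tasks.push_back(std::move(task));
  }
  task_cv.notify_one(); // wake up a sleeping worker
}

void WorkerPool::sync() {
  std::unique_lock<std::mutex> lock(count_mut);
  count_cv.wait(lock, [this]{ return count == 0; });
}

void WorkerPool::worker_routine() {
  while (true) {
    Task task;
    {
      std::unique_lock<std::mutex> lock(task_mut);
      task_cv.wait(lock, [this]{ return !tasks.empty() || abort_flag; });
      if (tasks.empty()) { return; } // abort_flag was set
      task = std::move(tasks.front());
      tasks.pop_front();
    }
    task(); // execute the task...
    {       // ...and signal that another task is finished
      std::lock_guard<std::mutex> lock(count_mut);
      --count;
    }
    count_cv.notify_one();
  }
}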



Example: drawing the Mandelbrot set in parallel



To give a nice example of how the WorkerPool should be used, I modified some code from Farouk Ounane for drawing the Mandelbrot set in C++. Each Task determines if a point in the complex plane is part of the Mandelbrot set or not. We use a closure (lambda expression) to define these tasks, and pass them to the WorkerPool. The result is shown above. To compile the C++ program with gcc, put the files workerpool.cpp, workerpool.hpp and mandelbrot.cpp in the same directory, and execute from the terminal
  g++ workerpool.cpp mandelbrot.cpp -o mandelbrot -pthread -std=c++11
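For illustration, here is a sketch of the task-submission pattern (my own minimal version, not Farouk Ounane's code; one task per image row rather than per point, and writing the image to disk is omitted):

#include <complex>
#include <vector>
#include "workerpool.hpp"

// count iterations until divergence (max_iter for points inside the set)
int mandelbrot_iterations(std::complex<double> c, int max_iter = 1000) {
  std::complex<double> z = 0.0;
  int n = 0;
  while (std::abs(z) <= 2.0 && n < max_iter) {
    z = z * z + c;
    ++n;
  }
  return n;
}

int main() {
  const int width = 800, height = 600;
  std::vector<int> image(width * height);
  WorkerPool pool(4);
  for (int i = 0; i < height; ++i) {
    // one task per row; the closure captures the row index i by value
    pool.push_back([i, width, height, &image]() {
      for (int j = 0; j < width; ++j) {
        std::complex<double> c(-2.5 + 3.5 * j / width, -1.25 + 2.5 * i / height);
        image[i * width + j] = mandelbrot_iterations(c);
      }
    });
  }
  pool.sync(); // wait until all rows have been computed
  // ... write the image to a file (omitted)
  return 0;
}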
  

Sunday 1 March 2020

My Matplotlib Cheatsheet (#2)

Previously, I wrote about some matplotlib methods I often use. In this post I will show two more.

Scatter plots with color for density

Scatter plots with many points can become very uninformative. A very easy solution is to color the points according to their local density. This can be done with the gaussian_kde from scipy.stats. Suppose that we want to make a scatter plot from xs and ys in subplot ax, then we just have to add two lines to color the points:
xy = numpy.vstack([xs, ys])
z = scipy.stats.gaussian_kde(xy)(xy)
ax.scatter(xs, ys, s=5, c=z)
Here's an example:



This is the code I used to make the figure, making use of inset_axes for a color bar:



Recycling bins for multiple histograms

When you want to plot multiple histograms in one plot, the width of the bins can be very different, which does not look so good. I recently found a simple method to fix this. The second value that the hist function returns specifies the boundaries of the bins used for the histogram. This value can be used directly in the next call to the hist function using the bins keyword. The following figure shows the result:



I used the following code to plot this figure:

Saturday 11 January 2020

Generalized Profiling with Stan

The probabilistic programming language Stan has built-in support for ODE models through the higher-order functions integrate_ode_rk45 and integrate_ode_bdf. Here, I will demonstrate an alternative way to implement ODE models in Stan, that has some interesting benefits. This method is called generalized profiling and has been developed by Ramsay et al. in 2007.

In a previous blog post, I used the Lotka-Volterra predator-prey model as an example for an ODE model in Stan. The model is given by the 2-dimensional system of ODEs $$ \frac{dx}{dt} = a x - b xy\,,\quad \frac{dy}{dt} = c b xy - dy $$ with initial conditions \(x(0) = x_0\) and \(y(0) = y_0\). Here, \(x\) and \(y\) represent the prey and predator "concentrations", respectively. If we choose a system size \(K\), we can generate some data \(X_i \sim \mathrm{Poisson}(K x(t_i))\) and \(Y_i \sim \mathrm{Poisson}(K y(t_i))\) at times \(0\leq t_i \leq T\). However, when we implement this model in Stan with integrate_ode_rk45, and try to estimate the parameters \(a, b, c, d, x_0, y_0\) without giving any initial guesses for these parameters, Stan will have a really hard time finding any good values. The reason for this is that for non-linear systems of ODEs the posterior density can be quite "rugged", i.e. can contain many local optima that even Stan's NUTS sampler can have a hard time navigating through. The generalized profiling method solves this problem by modifying the model in such a way that the posterior density becomes more smooth. This works roughly as follows.

Suppose that the model is defined by the system of ODEs $$ \dot{u}(t) = f(t, u(t), \theta)\,,\quad u(0) = u_0 $$ Fitting the solution of a system of ODEs through data can be hard, so why not replace the solution of the ODEs with something that can be fit through the data easily: for instance a B-spline. Obviously, this leads to the problem of over-fitting: when the number of knots equals the number of observations, the spline will fit the data perfectly. Also, we are not learning much about the parameters of the original model. We can solve both problems by penalizing. At each time point \(t\), the spline \(u\) has a total derivative \(\dot{u}(t)\), while according to the model, the total derivative should be equal to the vector field \(f(t, u(t), \theta)\). We can enforce that the spline resembles trajectories of the model by defining a penalty $$ \lambda \mathcal{S}[u] = \frac{\lambda}{2} \int_0^T \|\dot{u}(t) - f(t, u(t), \theta) \|^2 dt $$ which equals \(0\) exactly when \(u\) solves the system of ODEs. The log-likelihood of the data, given parameters \(\theta\) and spline \(u\) is then given by $$ \sum_{i=1}^n \log p(U_i | u(t_i), \theta) - \lambda \mathcal{S}[u]\,, $$ where \(p\) defines the measurement model (in our predator-prey model, \(p\) is the probability mass function of the Poisson distribution). When \(\lambda\) is close to \(0\), the spline is allowed to deviate from a true model trajectory, and fitting is easy. On the other hand, when \(\lambda\) is very large, the spline starts to resemble a true solution to the ODEs, and we will estimate \(\theta\) more precisely.

Stan implementation

In order to implement generalized profiling in Stan, we have to solve two problems. First, we have to construct a B-spline \(u\) in Stan, and compute the derivative \(\dot{u}\) of such a B-spline. Second, we have to numerically integrate the penalty \(\mathcal{S}[u]\). For the B-spline, I use an example from Milad Kharratzadeh, that uses a recursive Stan function to build a spline basis matrix in the generated quantities block. Milad gives a concise introduction to B-splines, which I will not repeat, but we do need the derivative of the spline.

Fortunately, we can define the derivative of a spline in terms of other B-splines. Define the spline basis of order \(k\) recursively $$ B_{i,k}(t) = \omega_{i,k} B_{i,k−1}(t) + (1−\omega_{i+1,k})B_{i+1,k−1}(t) $$ with coefficients \(\omega_{i, k} = (t - \tau_i) / (\tau_{i+k-1} - \tau_i)\) and knots \(\tau_1 \leq \tau_2 \leq \dots \leq \tau_m\). The base-case of the recursion is given by the order \(1\) splines $$ B_{i, 1}(t) = \left\{\begin{array}{ll} 1 & \mbox{if } \tau_i \leq t < \tau_{i+1} \\ 0 & \mbox{otherwise} \end{array}\right. $$ The derivative of a basis spline \(B\) is then given by $$ B_{i, k}'(t) = (k-1)\left(\alpha_{i, k} B_{i, k-1}(t) - \alpha_{i+1, k} B_{i+1, k-1}(t)\right) $$ where \(\alpha_{i, k} = 1/(\tau_{i+k-1} - \tau_i)\). We can prove this easily with induction.

The following Stan functions are implementations of B-splines and their derivative. The file splines.stan can be included in other Stan models with an #include "splines.stan" statement in the functions block.


Next, we need a way to compute the integral defined by \(\mathcal{S}[u]\). In general, this will not have a closed-form expression, and therefore we use numerical integration. According to Simpson's rule, we can approximate the integral $$ \int_a^b g(x) dx = \tfrac{h}{3}\left(g(a) + 4 g(\tfrac12(b+a)) + g(b)\right) + \mathcal{O}(h^5) $$ where \(h = \tfrac12 (b-a)\). Hence, we will partition the interval \([0,T]\) into \(k-1\) sub-intervals of equal length \(2h = \frac{T}{k-1}\), and we have to know the value of the spline \(u(t)\) and its derivative \(\dot{u}(t)\) at the \(2k-1\) end and midpoints of these intervals. The penalty functional \(\mathcal{S}[u]\) is then approximated with $$ \mathcal{S}[u] \approx \tfrac{h}{3} \sum_{i=1}^{k-1} \left(L(2(i-1)h) + 4 L((2i-1)h) + L(2ih)\right) $$ where \(L(t) = \tfrac12 \| \dot{u}(t) - f(t, u(t), \theta) \|^2 \). The number of "grid points" \(k\) is defined in the Stan model's data block as NumGridPts

The Stan model is organized as follows. In the functions block, we define the system of ODEs (copied directly from a previous post). Also, we define a convenient function seq to make sequences. In the data block, we import the observations and observation times, define the number of knots, grid points, the degree of the splines, the constant \(K\) and the weight of the penalty \(\lambda\). In the transformed data block, we define a knot vector and the grid for numerical integration. Then, we build the spline and spline derivative matrices BObs, BGrid, and DBGrid for the observation times and grid points. The parameters block declares the predator-prey model parameters \(a, b, c, d\), and the spline weights \(\upsilon\). In the model block, we compute the values of the spline \(u\) and spline derivative \(\dot{u}\) at the observation times (u = BObs * upsilon) and grid points (u_grid = BGrid * upsilon and du_grid = DBGrid * upsilon). These values are used for numerical integration of the penalty, and in the "sampling statement" of the Poisson-distributed observations.


Fitting the model to generated data

I will now demonstrate the Stan model with simulated data from the predator-prey model. We will use Python to generate data. First, we have to import some modules and compile the Stan model. A GitHub "gist" with all the python code in a single script is available here.
import pystan
import numpy as np
import scipy.stats as sts
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
from matplotlib.gridspec import GridSpec
from scipy.integrate import solve_ivp
import sdeint ## will be used later

## compile the Stan model
sm = pystan.StanModel(file="gen-prof-LV.stan")
Now, we define the predator-prey model and choose some parameters \(\theta = (a, b, c, d)\), and initial condition \(u_0 = (x_0, y_0)^T\). We also have to define the times we observe the system and the number of observations. Then we simulate the model using the solve_ivp method from scipy.integrate and generate noisy observations using poisson.rvs from scipy.stats.
## choose nice parameter values
a = 1.0
b = 0.4
c = 0.4
d = 0.5
theta = [a, b, c, d]

## observation times and initial conditions
NumObs = 25
tmin, tmax = 0, 50
TimesObs = np.linspace(tmin, tmax, NumObs)

## initial values
u0 = [1.0, 1.0] ## x0, y0
## the "system size" parameter K
K = 10

## define the Lotka-Volterra predator-prey model
def LV_sys(t, u):
    return [a*u[0] - b*u[0]*u[1], c*b*u[0]*u[1] - d*u[1]]

## use an ODE integrator to produce a trajectory
sol = solve_ivp(LV_sys, (tmin, tmax), u0, t_eval=TimesObs)
## generate random data (observations)
Obs = sts.poisson.rvs(sol.y.T*K)
## K determines the measurement noise
Next, we define a function run_gen_prof that makes the correct data dictionary for Stan and starts the Stan model. One of the arguments of the function is \(\lambda\), the weight of the penalty. We will run the Stan model twice: once with a small \(\lambda = 1\), and once with a large \(\lambda = 100 \). The function run_gen_prof also returns the grid points used for numerical integration of the penalty term, which we will use for plotting the spline and its derivative below.
def run_gen_prof(sm, obs, times, lam, system_size, deg=3, 
                 chains=4, chain_len=1000, thin=5):
    """
    convenient function to make a data dictionary for Stan
    and run the Stan model
    """
    n = len(times)
    ## put a knot at every observation and between two observations
    num_knots = 2*n-1
    ## number of points for numerical integration
    num_grid_pts = 3*num_knots-1
    grid_pts = np.linspace(times[0], times[n-1], num_grid_pts)
    data = {
        "NumKnots" : num_knots,
        "SplineDeg" : deg,
        "NumGridPts" : num_grid_pts,
        "NumObs" : n,
        "TimesObs" : times,
        "Obs" : obs,
        "K" : system_size,
        "Lambda" : lam,
    }
    settings = {
        "max_treedepth" : 15,
        "adapt_delta" : 0.9
    }
    sam = sm.sampling(data=data, chains=chains, iter=chain_len, 
                          control=settings, thin=thin)
    ## return the grid points for plotting
    return sam, grid_pts


## fit the model twice with different lambdas
lam_small = 1
sam_small, GridPts = run_gen_prof(sm, Obs, TimesObs, lam_small, K)
lam_large = 100
sam_large, _ = run_gen_prof(sm, Obs, TimesObs, lam_large, K)
In order to visualize the result, we define a function plot_gen_prof_fit that makes a plot of the data and fit. The function also plots the derivative of the spline \(\dot u\) and the vector field \(f(t, u(t), \theta)\), such that we can see how well the spline resembles a trajectory of the predator-prey model.
def plot_gen_prof_fit(sam, times, obs, grid_pts, system_size, n=None):
    """
    Make a figure with the data and the fitted spline.
    Also add the derivative of the spline and the vector field
    to give an indication of the deviation from the LV model
    """
    if n is None:
        n = len(times)
    chain_dict = sam.extract(permuted=True)
    fig = plt.figure(figsize=(14, 7))
    gs = GridSpec(4,1)
    ax = fig.add_subplot(gs[:2,0])
    bxs = [fig.add_subplot(gs[2,0], sharex=ax), 
           fig.add_subplot(gs[3,0], sharex=ax)]
    labels = ["Prey ($X$)", "Predator ($Y$)"]
    colors = ["tab:blue", "tab:orange"]
    pcts = [2.5, 97.5]
    ## make plots for predators and prey
    for i, color in enumerate(colors):
        ax.scatter(times[:n], obs[:n,i], color=color, edgecolors='k', 
                   zorder=3, label=labels[i])
        ## plot trajectories
        uss = chain_dict["uhat"][:,:,i].T
        mean_uhat = [system_size*np.mean(us) for us in uss]
        ax.plot(grid_pts, mean_uhat, color='k', zorder=2, 
                label='fit' if i == 0 else None)
        range_uhat = [system_size*np.percentile(us, pcts) for us in uss]
        ax.fill_between(grid_pts, *np.array(range_uhat).T, color=color, 
                        alpha=0.5, linewidth=0, zorder=1)
        ## plot simulations
        uss = chain_dict["usim"][:,:,i].T
        range_usim = [np.percentile(us, pcts) for us in uss]
        ax.fill_between(grid_pts, *np.array(range_usim).T, color=color, 
                        alpha=0.3, linewidth=0)
        ## plot derivative of the spline and the target derivative
        uss = chain_dict["duhat_real"][:,:,i].T
        mean_duhat_real = [system_size*np.mean(us) for us in uss]
        bxs[i].plot(grid_pts, mean_duhat_real, color=color, 
                    linewidth=3, label="spline")
        uss = chain_dict["duhat_target"][:,:,i].T
        mean_duhat_target = [system_size*np.mean(xs) for xs in uss]
        bxs[i].plot(grid_pts, mean_duhat_target, color='k', 
                    linestyle='--', label="LV model")
        bxs[i].legend(loc=1, ncol=2, prop={'size': 10})
    ## some labels etc...
    ax.legend(loc=1, ncol=3, prop={'size': 10})
    ax.set_ylabel("data and fit")
    for i, c in enumerate('xy'):
        bxs[i].set_ylabel(f"$\\frac{{d{c}}}{{dt}}$", 
                          rotation=0, va='center')
    ax.get_xaxis().set_visible(False)
    bxs[0].get_xaxis().set_visible(False)
    bxs[1].set_xlabel("Time ($t$)")
    fig.align_ylabels()
    return fig, (ax, *bxs)
Let us first have a look at the fit with \(\lambda = 1\).
fig, axs = plot_gen_prof_fit(sam_small, TimesObs, Obs, GridPts, K)
fig.savefig("gen-prof-fit-small-lambda.png", dpi=300, bbox_inches='tight')
This is the result:
The scaled spline \(K u\) goes through almost all the data points \(X_i\) and \(Y_i\), and the derivative \(\dot{u}\) roughly follows the vector field \(f(t, u(t), \theta)\). However, the prediction intervals (dark colored bands) are quite wide, meaning that the uncertainty in \(u\) is very large between observations.

In order to make our spline-based model more closely related to the ODE model, we have to increase \(\lambda\). Now let us look at the fit with \(\lambda = 100\).
fig, axs = plot_gen_prof_fit(sam_large, TimesObs, Obs, GridPts, K)
fig.savefig("gen-prof-fit-large-lambda.png", dpi=300, bbox_inches='tight')
The result of taking \(\lambda\) large is that the spline \(K u\) no longer goes through all the data points \(X_i, Y_i\). This is as it should be, since the data is sampled from a Poisson distribution with mean \(K u(t_i)\). In addition, the derivative of the spline \(\dot{u}\) is now almost identical to the vector field \(f(t, u(t), \theta)\).

We will now look at the parameter estimates, and the effect of choosing small and large \(\lambda\). Again, we define a simple function plot_par_est to plot the estimates, and then use this function to create the two plots.
def plot_par_est(ax, sam, real_par_vals):
    """
    plot parameter estimates and compare them with the real values
    """
    chain_dict = sam.extract(permuted=True)
    parnames = ["a", "b", "c", "d"]
    latex_parnames = [f"${x}$" for x in parnames]
    pcts = [2.5, 97.5]

    ## plot estimates and 95 percentiles
    pos = range(len(parnames))
    means = [np.mean(chain_dict[x]) for x in parnames]
    ranges = [np.percentile(chain_dict[x], pcts) for x in parnames]
    ax.scatter(pos, means, color='tab:red', label="estimate", 
               marker='D', zorder=1)
    for p, r in zip(pos, ranges):
        ax.plot([p, p], r, color='tab:red', linewidth=2, zorder=1)
    
    ax.set_xticks(pos)
    ax.set_xticklabels(latex_parnames)

    ## plot real parameter values
    ax.scatter(pos, real_par_vals, color='k', label="real value", zorder=2)

    ax.legend(loc=1, ncol=1, prop={'size': 10})

    ax.set_ylabel("parameter value")
    ax.set_xlabel("parameter name")

## compare the parameter estimates with small and large lambda
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7,3), sharey=True)
plot_par_est(ax1, sam_small, theta)
plot_par_est(ax2, sam_large, theta)

ax1.set_title(f"$\\lambda = {lam_small}$")
ax2.set_title(f"$\\lambda = {lam_large}$")

fig.savefig("gen-prof-estimates.png", dpi=300, bbox_inches='tight')
This results in the following figure.
The estimates (red diamonds) are pretty close to the real values (black dots) for both small and large \(\lambda\), but the 95% credible intervals (CrI; the red vertical lines) are much narrower when \(\lambda = 100\).
disclaimer: I am not sure that these red lines are in fact 95% CrIs, as we are not fitting the true model to the data.
The question now is: what \(\lambda\) should we choose? In this case, the larger \(\lambda\), the better, because the spline gets closer to a trajectory of the model. However, as \(u\) is a B-spline, and not an arbitrary differentiable function, the equality \(\dot{u}(t) = f(t, u(t), \theta)\) can only hold at finitely many time points \(t\), and hence the penalty functional \(\mathcal{S}[u]\) will never be exactly zero, unless the number of knots goes to infinity. In practice, one can try a number of increasing values of \(\lambda\) and different numbers of knots, and see when the estimates (and their uncertainties) stabilize, as in the sketch below.
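To make this concrete, here is a minimal sketch of such a sensitivity check, re-using the run_gen_prof function defined above. The list of \(\lambda\) values is arbitrary, and varying the number of knots would additionally require a small extension of run_gen_prof, as the number of knots is currently determined by the number of observations.
## a minimal sketch of a lambda sensitivity check (the lambda values are arbitrary)
lambdas = [1, 10, 100, 1000]
estimates = {}
for lam in lambdas:
    sam, _ = run_gen_prof(sm, Obs, TimesObs, lam, K)
    chain_dict = sam.extract(permuted=True)
    ## record the posterior mean and 95% interval for each parameter
    estimates[lam] = {
        x : (np.mean(chain_dict[x]), np.percentile(chain_dict[x], [2.5, 97.5]))
        for x in ["a", "b", "c", "d"]
    }

## print a small table of posterior means to see where the estimates stabilize
for lam, est in estimates.items():
    print(lam, {x : np.round(m, 3) for x, (m, r) in est.items()})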

Model misspecification: the predator-prey model with intrinsic noise

An interesting benefit of the generalized profiling method over ODE integrators is that it can deal with some level of misspecification of the model. For biological systems, practically all models are misspecified. One source of misspecification is intrinsic noise, i.e., the true dynamics are not deterministic but stochastic. For the Lotka-Volterra predator-prey model, intrinsic noise has interesting consequences. The oscillatory dynamics predicted by the deterministic model have a fixed period and amplitude. When we add some noise to the system, the model quickly gets out of phase with the deterministic trajectory, and the amplitude also varies over time. This makes it very difficult to fit the model to data with an ODE integrator. We will see that the generalized profiling approach still does the job.

First, we will add diagonal multiplicative noise to the predator-prey model. We will use the sdeint module to simulate stochastic differential equations (SDEs). This module is still a pre-release, but it suffices for our purpose. Our stochastic predator-prey model is given by the SDEs $$ dx_t = (a x_t - b x_t y_t)dt + \sigma x_t dW^1_t $$ $$ dy_t = (cb x_t y_t - d y_t)dt + \sigma y_t dW^2_t $$ where \(W_t\) is a 2-dimensional Wiener process, and \(\sigma\) is the volatility. We use this model to generate data as follows
## add process noise to the model
sigma = 0.1
## other parameters stay the same

NumObs = 50
tmin, tmax = 0, 100
Thin = 10
TimesObsSDE = np.linspace(tmin, tmax, Thin*(NumObs-1)+1)
TimesObs = TimesObsSDE[::Thin]

## define the system
def LV_sys_drift(u, t):
    return np.array([a*u[0] - b*u[0]*u[1], c*b*u[0]*u[1] - d*u[1]])

def LV_sys_diffusion(u, t):
    return sigma * np.diag(u)

sol = sdeint.itoint(LV_sys_drift, LV_sys_diffusion, u0, TimesObsSDE)
sol_ode = solve_ivp(LV_sys, (tmin, tmax), u0, t_eval=TimesObsSDE)

## generate random data (observations)
Obs = sts.poisson.rvs(sol[::Thin,:]*K)
We then plot the trajectory and observations, and compare them to the trajectory predicted by the deterministic model.
## make a figure of the stochastic process, 
## the data and the deterministic skeleton
pcts = [2.5, 97.5] ## percentiles used later for CrIs
colors = ["tab:blue", "tab:orange"] ## color for prey and predator

fig, axs = plt.subplots(2, 1, figsize=(14, 7), sharex=True)

for i, color in enumerate(colors):
    axs[i].scatter(TimesObs, Obs[:,i], color=color, 
                   edgecolors='k', zorder=2, label='observations')
    axs[i].plot(TimesObsSDE, sol[:,i]*K, color=color,
                linewidth=3, zorder=1, label='stochastic process')
    axs[i].plot(TimesObsSDE, sol_ode.y[i]*K, color='k', alpha=0.75,
                label='deterministic skeleton')
    axs[i].legend(loc=0, ncol=3, prop={'size': 10})
    
axs[-1].set_xlabel("Time ($t$)")
axs[0].set_ylabel("$x$ SDE (blue)\n$x$ ODE (black)")
axs[1].set_ylabel("$y$ SDE (orange)\n$y$ ODE (black)")

fig.savefig("stochastic-LV-sim.png", dpi=300, bbox_inches='tight')
As mentioned, the stochastic trajectory gets out of phase with the deterministic model (black curves) and also the amplitude varies over time.

We now fit the same Stan model as before to the stochastic predator-prey data. We again use the run_gen_prof function to start the Stan program, and plot figures with the plot_gen_prof_fit and plot_par_est functions.
## choose a lambda, and fit the model...
lam = 10
sam, GridPts = run_gen_prof(sm, Obs, TimesObs, lam, K)

## plot the fit
fig, axs = plot_gen_prof_fit(sam, TimesObs, Obs, GridPts, K)
fig.savefig("gen-prof-fit-sde.png", dpi=300, bbox_inches='tight')
## plot estimates
fig, ax = plt.subplots(1, 1, figsize=(4,3))
plot_par_est(ax, sam, theta)
fig.savefig("gen-prof-estimates-sde.png", dpi=300, bbox_inches='tight')
The data and model fit are shown in the following figure. The generalized profiling fit stays close to the data, even though the stochastic path gets out of phase with the deterministic trajectory.
Also, the parameter estimates are still very close to the real values, as shown in the following figure.

For this fit, I used \(\lambda = 10\). Again, it is not clear what value of \(\lambda\) we should choose. In this case, a bigger \(\lambda\) is not necessarily better, as a finite \(\lambda\) allows the spline to deviate from the deterministic model, which is exactly what we need. In a follow-up paper, Hooker et al. fit the model to a subset of the data and then compare model predictions with the left-out data points (i.e. forward cross-validation). This is repeated for different values of \(\lambda\) to determine the penalty weight yielding the best predictive performance. I experimented with this (a rough sketch is given below), but will leave a proper treatment for a future blog post.
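Here is a rough sketch of the idea, just for illustration; the size of the held-out set, the use of posterior means as point estimates, and the Poisson log-likelihood score (the name forward_cv_score and the variable NumTest are mine) are all ad hoc choices.
## hold out the last NumTest observations and score them with the Poisson log-likelihood
NumTest = 10
TimesTrain, TimesTest = TimesObs[:-NumTest], TimesObs[-NumTest:]
ObsTrain, ObsTest = Obs[:-NumTest], Obs[-NumTest:]

def forward_cv_score(lam):
    """fit on the training window, then predict the held-out observations"""
    sam, _ = run_gen_prof(sm, ObsTrain, TimesTrain, lam, K)
    chain_dict = sam.extract(permuted=True)
    ## point estimates of the parameters and of the state at the end of the training window
    a_, b_, c_, d_ = [np.mean(chain_dict[x]) for x in ["a", "b", "c", "d"]]
    u_end = np.mean(chain_dict["uhat"][:,-1,:], axis=0)
    ## integrate the ODE forward from the end of the training window
    def LV(t, u):
        return [a_*u[0] - b_*u[0]*u[1], c_*b_*u[0]*u[1] - d_*u[1]]
    pred = solve_ivp(LV, (TimesTrain[-1], TimesTest[-1]), u_end, t_eval=TimesTest)
    ## Poisson log-likelihood of the held-out observations
    return np.sum(sts.poisson.logpmf(ObsTest, pred.y.T * K))

## compare the predictive performance for a few penalty weights
scores = {lam : forward_cv_score(lam) for lam in [1, 10, 100]}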

Theoretical considerations and further developments

The motivation for the penalty functional \(\mathcal{S}[u]\) may appear to be ad hoc, but in fact the justification for such a functional is given by some very interesting mathematics and physics. The theory of small random perturbations of dynamical systems developed by Freidlin and Wentzell tells us that the probability that a stochastic trajectory is "close" to the path \(u\) is roughly \(\exp(-\sigma^{-2} \mathcal{S}[u])\). Here the volatility \(\sigma\) should be very small and the noise is additive. Hence, in a way, the penalty (or action) functional \(\mathcal{S}[u]\) is proportional to the negative log-likelihood of the path \(u\).

There is no reason to choose B-splines for \(u\) other than convenience. In the so-called tracking approach of Clairon and Brunel (2017), the path \(u(t)\) is instead defined in terms of a \(2n\)-dimensional system of ODEs (where \(n\) is the dimension of the original system \(\dot{u} = f(t, u, \theta)\)). This system is derived as follows.

First, the discrete observations \(U_i\) at times \(0 \leq t_i \leq T\) are replaced by a continuous approximation \(U(t)\) with \(0\leq t\leq T\) (e.g. using a B-spline), and it is assumed that the measurement error is normally distributed with precision \(\kappa\). Hence, the log-likelihood of the smoothed data \(U(t)\) given trajectory \(u(t)\) is (up to a constant) equal to $$ -\frac{\kappa}{2}\int_0^T \| u(t) - U(t)\|^2 dt\,. $$ The penalty of a path \( u(t) \) is again given by $$ \frac{\lambda}{2}\int_0^T \| \dot{u}(t) - f(t, u(t), \theta)\|^2 dt\,. $$ Given a parameter \(\theta\), we now have to find a path \(u(t)\) with some initial condition \(u(0) = u_0\) that minimizes the action functional $$ \mathcal{S}[u] = \int_0^T \left(\frac{\kappa}{2} \| u(t) - U(t) \|^2 + \frac{\lambda}{2} \| \dot{u}(t) - f(t, u(t), \theta) \|^2 \right) dt\,. $$ This is a variational problem, and it can be solved in terms of Hamilton's equations. First, we have to transform the Lagrangian (the integrand of the action functional) $$ \mathcal{L}(t, u, \dot{u}) = \frac{\kappa}{2} \| u(t) - U(t) \|^2 + \frac{\lambda}{2} \| \dot{u}(t) - f(t, u(t), \theta) \|^2 $$ into a Hamiltonian \(\mathcal{H}\). For this, we define the conjugate momentum $$ p = \frac{\partial \mathcal{L}}{\partial \dot{u}} = \lambda (\dot{u} - f(t, u, \theta))\,. $$ The Hamiltonian is then equal to $$ \mathcal{H}(t, u, p) = \langle \dot{u}, p \rangle - \mathcal{L}(t, u, \dot{u}) = \langle \lambda^{-1} p + f(t, u, \theta) , p \rangle - \frac{\kappa}{2} \|u-U\|^2 - \frac{\lambda^{-1}}{2} \|p\|^2 $$ which simplifies to $$ \mathcal{H}(t, u, p) = \frac{\lambda^{-1}}{2} \|p\|^2 + \langle f(t, u, \theta), p\rangle - \frac{\kappa}{2} \|u - U\|^2\,. $$ The path \(u\) that minimizes the action functional then has to satisfy Hamilton's equations $$ \dot{u} = \frac{\partial \mathcal{H}}{\partial p} = f(t, u, \theta) + \lambda^{-1} p $$ and $$ \dot{p} = -\frac{\partial \mathcal{H}}{\partial u} = \kappa (u-U) - \left(\frac{\partial f}{\partial u}\right)^T p $$ together with the boundary conditions \(u(0) = u_0\) and \(p(T) = 0\). More details and a much more general derivation can be found in the paper by Clairon and Brunel. In a future blog post I want to apply the tracking method to the predator-prey example; a rough numerical sketch of the boundary value problem is given below.
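Just to make the structure of this boundary value problem concrete, here is a rough sketch (under my own choices for \(\kappa\), \(\lambda\) and the data smoothing; this is not code from the Clairon and Brunel paper) of how one could solve Hamilton's equations for the predator-prey system with scipy's solve_bvp. Estimating \(\theta\) would require wrapping this in an outer optimization loop.
## a rough sketch of the tracking BVP for the LV model;
## kappa, lam and the spline smoothing of the data are ad hoc choices
from scipy.integrate import solve_bvp
from scipy.interpolate import CubicSpline

kappa, lam = 1.0, 10.0
## continuous approximation U(t) of the (scaled) observations
U = CubicSpline(TimesObs, Obs / K)

def f(u):
    """LV vector field"""
    return np.array([a*u[0] - b*u[0]*u[1], c*b*u[0]*u[1] - d*u[1]])

def Jf(u):
    """Jacobian of the LV vector field"""
    return np.array([[a - b*u[1], -b*u[0]],
                     [c*b*u[1], c*b*u[0] - d]])

def hamilton_rhs(t, z):
    """z = (u1, u2, p1, p2); Hamilton's equations of the tracking problem"""
    dz = np.zeros_like(z)
    for i, ti in enumerate(t):
        u, p = z[:2,i], z[2:,i]
        dz[:2,i] = f(u) + p / lam
        dz[2:,i] = kappa * (u - U(ti)) - Jf(u).T @ p
    return dz

def bc(za, zb):
    """boundary conditions u(0) = u0 and p(T) = 0"""
    return np.concatenate([za[:2] - u0, zb[2:]])

## initial guess: the smoothed data for u, zero for the momentum p
tgrid = np.linspace(TimesObs[0], TimesObs[-1], 200)
z_init = np.vstack([U(tgrid).T, np.zeros((2, len(tgrid)))])
bvp_sol = solve_bvp(hamilton_rhs, bc, tgrid, z_init)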

Friday 15 November 2019

My Matplotlib Cheatsheet (#1)

I find myself Googling the same matplotlib/pyplot queries over and over again. To make life easy, I started collecting some of them in a series of blog posts.
In this first post, I'll look at subplots and annotations.

Turn off axes


We can remove axes, ticks, borders, etc. with ax.axis('off'). This can be used to remove unnecessary subplots created with the subplots function from pyplot, for instance, when we only want to use the upper triangle of the grid. Here's an example:
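A minimal sketch to illustrate the idea (the 3-by-3 grid and the random scatter data are arbitrary):
import numpy as np
import matplotlib.pyplot as plt

## only use the upper triangle of a 3x3 grid of subplots
fig, axs = plt.subplots(3, 3, figsize=(7, 7))

for i in range(3):
    for j in range(3):
        if i > j:
            ## remove the unused subplots below the diagonal
            axs[i,j].axis('off')
        else:
            axs[i,j].scatter(np.random.randn(100), np.random.randn(100), s=5)

fig.savefig("upper-triangle-grid.png", dpi=300, bbox_inches='tight')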


The result is:


Share axes between subplots after plotting


An alternative way of making a ragged array of subplots makes use of gridspec. The problem here is that it is a bit more difficult to share x and y axes. Of course add_subplot has keyword arguments sharex and sharey, but then we have to distinguish between the first and subsequent subplots. A better solution is the get_shared_x_axes() method. Here's an example:
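A minimal sketch of this approach (the ragged GridSpec layout and the sine curves are arbitrary):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

## a ragged grid of subplots made with GridSpec
fig = plt.figure(figsize=(7, 5))
gs = GridSpec(2, 2)
ax1 = fig.add_subplot(gs[0,:])  ## top row spans both columns
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])

ts = np.linspace(0, 10, 100)
ax1.plot(ts, np.sin(ts))
ax2.plot(ts, np.cos(ts))
ax3.plot(ts, np.sin(2*ts))

## share the x-axis of all three subplots after plotting
ax1.get_shared_x_axes().join(ax1, ax2, ax3)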


The result is:

Hybrid (or blended) transforms for annotations


When you annotate a point in a plot, the location of the text is often relative to the data in one coordinate, but relative to the axis (e.g. in the middle) in the other. I used to do this with inverse transforms, but it turns out that there is a better way: the blended_transform_factory function from matplotlib.transforms.
Suppose that I want to annotate three points in three subplots. The arrows should point to these three points, and I want the text to be located above the point, but the text in the 3 subplots has to be vertically aligned.



Notice that the y-axes are not shared between the subplots! To accomplish the alignment, we have to use the annotate method with a custom transform.
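A minimal sketch of this (the three example points and the text height of 0.9 in axis coordinates are arbitrary choices):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.transforms import blended_transform_factory

fig, axs = plt.subplots(1, 3, figsize=(9, 3))

## three points of interest, one per subplot
points = [(2.0, 0.5), (5.0, 3.0), (7.0, -1.0)]
ts = np.linspace(0, 10, 100)

for ax, (px, py) in zip(axs, points):
    ax.plot(ts, py*np.sin(ts))
    ax.scatter([px], [py], color='k', zorder=3)
    ## blend the transforms: x in data coordinates, y in axis coordinates
    trans = blended_transform_factory(ax.transData, ax.transAxes)
    ax.annotate("point of interest", xy=(px, py), xycoords='data',
                xytext=(px, 0.9), textcoords=trans, ha='center',
                arrowprops=dict(arrowstyle='->'))

fig.savefig("blended-transform-annotations.png", dpi=300, bbox_inches='tight')
Because the y coordinate of the text is given in axis coordinates, the labels end up at the same relative height in all three panels, even though the y-axes are not shared.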