Science should be reproducible. I like to think of an experiment as a recipe: you follow the steps described in the recipe, and you get results that are similar to the original ones (that amazing taste when your mom made it).
Now, if you were making a cake and ended up with a pot roast, then that’s an issue of replicability. I won’t go into that problem in this post, but rather focus on the problem of reproducibility: making sure that we have all the tools to facilitate replicability. To follow the culinary analogy, when I came to the US all my recipes called for grams and milliliters, and not for cups or tablespoons: my recipes were still replicable, but I didn’t have a way to reproduce them because I needed to convert between the two unit systems.
When we run computational analyses, we end up with the same problem: somebody might share their code and data with us, we try to run it, and it doesn’t work. “But it works on my computer!” is usually the reply, once the author is confronted with a complaint. We have the recipe, but we don’t have a valid system to follow the recipe. To solve this problem, scientists have started to use containers. The basic idea is to create an image containing the system that we use to run our analyses, so that it can be shared with our colleagues and collaborators, along with the data and the code. For a nice overview I recommend this presentation by Greg Kurtzer. There are different ways to create containers, but I will focus on the latest iteration that was specifically designed for the needs of scientific computing: Singularity.
Installing Singularity is fairly straightforward, although if you use Windows you are out of luck, because at the moment it’s not supported natively. You can find instructions on how to install Singularity on Linux and on Mac. If you’re using NeuroDebian, it’s as easy as apt-get install singularity-container. When installing it on a Mac, the commands described in the previous link will have you install brew (“The missing package manager for macOS”) and vagrant, which will allow you to install an image of Ubuntu on your Mac, so that you can use Singularity within that image. It sounds a bit convoluted, but in practice it’s very easy if you are familiar with the terminal and the command line environment. Make sure to install the latest version.
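For example, on NeuroDebian the whole installation is one command:

```
sudo apt-get install singularity-container
```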
Setting up our Singularity image
Once we have Singularity on our system, we are ready to create our first image. We could just create an image, run it, and then install the software that we need interactively. That can be a good way to try things out, but it’s not reproducible. A better way is to create a definition file, from which we can bootstrap our image. Thus, whenever you want to create a reproducible image, you should follow two steps:
- Write a Singularity definition file with the software you need
- Create a Singularity image, and bootstrap it using the definition file
1. Writing a Singularity definition file
A definition file is a script that tells Singularity what the base image should be, and what packages to install in the system. Check the user-guide for more details. What follows is the definition file I created for my project, an MEG experiment for which I need MNE-Python and standard scientific Python packages.
```
# Singularity definition file for hauntedhouse project
# Matteo Visconti di Oleggio Castello
# May 2017

bootstrap: docker
from: neurodebian:jessie

%environment
    PATH="/usr/local/anaconda/bin:$PATH"

%post
    # install debian packages
    apt-get update
    apt-get install -y eatmydata
    eatmydata apt-get install -y wget bzip2 \
      ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
      git git-annex
    apt-get clean

    # install anaconda
    if [ ! -d /usr/local/anaconda ]; then
        wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh \
            -O ~/anaconda.sh && \
        bash ~/anaconda.sh -b -p /usr/local/anaconda && \
        rm ~/anaconda.sh
    fi

    # set anaconda path
    export PATH="/usr/local/anaconda/bin:$PATH"

    # install specific versions of packages for reproducibility
    conda install \
      pip=9.0.1 \
      numpy=1.11.3 scipy=0.18.1 scikit-learn=0.18.1 scikit-image=0.12.3 \
      pandas=0.19.2 seaborn=0.7.1 jupyter ipython=5.3.0 joblib=0.9.4 \
      pyqt=4
    conda clean --tarballs
    pip install mne==0.13 duecredit

    # make /data and /scripts so we can mount it to access external resources
    if [ ! -d /data ]; then mkdir /data; fi
    if [ ! -d /scripts ]; then mkdir /scripts; fi

%runscript
    exec /bin/bash
```
Let’s go through the definition file. The first part describes where we want to pull the base image from. In this case I’m using an image that is available from Docker Hub; in particular, I want to use the awesome NeuroDebian as a base image. Note that I specify the release I want, to make sure that in the future it will pull the correct version, and not the latest one.
```
bootstrap: docker
from: neurodebian:jessie
```
After specifying the base image, we enter the %post section, where we specify the commands that we want to run inside the container, for example to install basic packages or to set up environment variables.

First, we set up environment variables and install some required Debian packages:
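```
%environment
    PATH="/usr/local/anaconda/bin:$PATH"

%post
    # install debian packages
    apt-get update
    apt-get install -y eatmydata
    eatmydata apt-get install -y wget bzip2 \
      ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
      git git-annex
    apt-get clean
```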
Then, it’s time to install Python. I decided to use anaconda because it makes it very easy to have a system with all the Python scientific packages. If you worry about size, you could install miniconda and then manually install all the packages you need.
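These are the corresponding lines from the definition file:

```
    # install anaconda
    if [ ! -d /usr/local/anaconda ]; then
        wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh \
            -O ~/anaconda.sh && \
        bash ~/anaconda.sh -b -p /usr/local/anaconda && \
        rm ~/anaconda.sh
    fi

    # set anaconda path
    export PATH="/usr/local/anaconda/bin:$PATH"
```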
Then, we install the Python packages we need for our analyses. Note that for the majority of them I specify exactly what version I want. This is very important for reproducibility: new versions might introduce new features, break the API, etc., and we want to make sure that in 20 years our analyses can still be reproduced on our quantum computers.
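These are the corresponding lines:

```
    # install specific versions of packages for reproducibility
    conda install \
      pip=9.0.1 \
      numpy=1.11.3 scipy=0.18.1 scikit-learn=0.18.1 scikit-image=0.12.3 \
      pandas=0.19.2 seaborn=0.7.1 jupyter ipython=5.3.0 joblib=0.9.4 \
      pyqt=4
    conda clean --tarballs
    pip install mne==0.13 duecredit
```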
Finally, we create two directories inside the container: /data and /scripts. When we run the container, we will mount each of these to point to where we store the data and the scripts for our project. I prefer to keep the two separate, but feel free to change this to fit your needs.
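In the definition file, this is simply:

```
    # make /data and /scripts so we can mount it to access external resources
    if [ ! -d /data ]; then mkdir /data; fi
    if [ ! -d /scripts ]; then mkdir /scripts; fi
```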
Besides the %post section, one can use the %runscript section to specify what command the container will run once it starts. In this case, we simply call /bin/bash to get a shell:
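```
%runscript
    exec /bin/bash
```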
2. Bootstrapping the image
Once we have created our definition file hauntedhouse.def, we need to create an actual image, and then tell Singularity to use the definition file to bootstrap our system. Note that all these commands require root access, so you could create the images on your laptop and then transfer them to your High Performance Cluster (HPC).
First, we create the image with a size of 4 GB. For this definition file 4 GB is more than enough, but you might have to play with the size if you are installing a lot of packages or software.
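With Singularity 2.3, the command is along these lines (the image name hauntedhouse.img is just an example):

```
sudo singularity create --size 4096 hauntedhouse.img
```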
Then, we use the hauntedhouse.def definition file to bootstrap it. This can take a while depending on how many packages you’re installing.
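Again with Singularity 2.3, something like:

```
sudo singularity bootstrap hauntedhouse.img hauntedhouse.def
```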
That’s it! We have created our Singularity image, and we can test it with the following command, with which we enter the container and run echo "Hello world":
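```
# a quick test, assuming the image created above
singularity exec hauntedhouse.img echo "Hello world"
```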
Using our new container
Now we’re one step closer to using the container for our analyses, but we still need to make sure that we’re not inheriting anything from our parent system (like environment variables) and that we’re mounting the correct directories. By default, Singularity mounts the home directory of the current user and inherits environment variables: while this is great in certain situations, we want to avoid it if we want to make a container that can be shared with other researchers, relying as little as possible on the current system. Thus, we need to start our image as follows (h/t to Yaroslav Halchenko for showing me this trick):
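A minimal sketch of such an invocation (assuming the image name from above):

```
singularity run -e -c hauntedhouse.img
```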
The two important bits here are the -e flag, which cleans up the environment variables, and the -c flag, which avoids mounting the host’s /tmp inside the container.
Putting it all together
You can create a wrapper to run the Singularity image with those flags; this is what I use for my analyses:
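A sketch of such a wrapper (the script name and host paths are placeholders to adapt to your setup):

```
#!/bin/bash
# run_hauntedhouse: start an interactive shell inside the container
# (placeholder paths; point them to where you keep your data and scripts)
singularity run -e -c \
    -B /path/to/data:/data \
    -B /path/to/scripts:/scripts \
    hauntedhouse.img
```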
Note that with the -B flag I’m telling Singularity to mount the correct directories to /data and /scripts, so that I can access both from inside the container. This wrapper starts a bash shell inside the container, and I use it to start a Jupyter notebook to run interactive analyses. When I want to scale the analyses and submit everything to an HPC, I use this wrapper called exec_hauntedhouse to run pipelines or scripts:
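Again, a sketch with placeholder paths:

```
#!/bin/bash
# exec_hauntedhouse: run an arbitrary command inside the container
# (placeholder paths; adapt them to your setup)
singularity exec -e -c \
    -B /path/to/data:/data \
    -B /path/to/scripts:/scripts \
    hauntedhouse.img "$@"
```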
In this way I can run any command from inside the container; if you’re running a specific script, make sure to refer to its path inside the container. For example, this is how I would run it:
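```
# hypothetical example: the script lives in the mounted /scripts directory
exec_hauntedhouse python /scripts/run_analysis.py
```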
I started using Singularity in the past week and I’m in love with it. It’s very easy to create containers to run and scale analyses, and it makes my work much more reproducible. I don’t have to worry about installing the right packages across different computers, and I can use different HPCs without problems. Now it has become incredibly easy to share your system with Singularity and Singularity Hub, your code with GitHub and the Open Science Framework, and your data with DataLad and OpenfMRI. I don’t see any reason why we shouldn’t do it. It might seem like more work, but it’s a good investment in the long run, for both you and science.
Update 2017-06-04: modified scripts to reflect changes to Singularity 2.3