Using Singularity to make analyses reproducible

by Matteo Visconti di Oleggio Castello
Fri 26 May 2017

Science should be reproducible. I like to think of an experiment as a recipe: you follow the steps described in the recipe, and you get results that are similar to the original ones (that amazing taste when your mom made it). Now, if you were making a cake and ended up with a pot roast, then that's an issue of replicability. I won't go into that problem in this post, but rather focus on the problem of reproducibility: making sure that we have all the tools to facilitate replicability. To follow the culinary analogy, when I came to the US all my recipes called for grams and milliliters, and not for cups or tablespoons: my recipes were still replicable, but I didn't have a way to reproduce them because I needed to convert between the two unit systems.

When we run computational analyses, we end up with the same problem: somebody might share their code and data with us, we try to run it, and it doesn't work. "But it works on my computer!" is usually the reply, once the author is confronted with a complaint. We have the recipe, but we don't have a valid system to follow the recipe. To solve this problem, scientists have started to use containers. The basic idea is to create an image containing the system that we use to run our analyses, so that it can be shared with our colleagues and collaborators, along with the data and the code. For a nice overview I recommend this presentation by Greg Kurtzer. There are different ways to create containers, but I will focus on the latest iteration that was specifically designed for the needs of scientific computing: Singularity.

Installing Singularity

Installing Singularity is fairly straightforward, although if you use Windows you are out of luck, because at the moment it's not supported natively. You can find instructions on how to install Singularity on Linux and on Mac. If you're using NeuroDebian, it's as easy as apt-get install singularity-container. On a Mac, the commands described in the Mac instructions will have you install brew ("The missing package manager for macOS") and vagrant, which let you set up an Ubuntu virtual machine on your Mac, so that you can use Singularity within that machine. It sounds a bit convoluted, but in practice it's very easy if you are familiar with the terminal and the command line environment. Make sure to install the latest release.
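
Before moving on, it's worth checking that the singularity command is actually on your PATH. Here is a minimal sketch of such a check; check_singularity is just a name I made up for this example, and the version string it prints will depend on your installation.

```shell
# Report whether the singularity command is available on this machine.
# check_singularity is a hypothetical helper name, not part of Singularity.
check_singularity() {
    if command -v singularity >/dev/null 2>&1; then
        echo "found singularity $(singularity --version)"
    else
        echo "singularity not found in PATH"
        return 1
    fi
}

check_singularity || echo "install Singularity before continuing"
```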

Setting up our Singularity image

Once we have Singularity on our system, we are ready to create our first image. We could just create an image, run it, and then install the software that we need interactively. That can be a good way to test things out, but it's not reproducible. A better way is to create a definition file, from which we can bootstrap our image. Thus, whenever you want to create a reproducible image, you should follow two steps:

  1. Write a Singularity definition file with the software you need
  2. Create a Singularity image, and bootstrap it using the definition file

1. Writing a Singularity definition file

A definition file is a script that tells Singularity what the base image should be, and what packages to install in the system. Check the user-guide for more details. What follows is the definition file I created for my project, an MEG experiment for which I need MNE-Python and standard scientific Python packages.

# Singularity definition file for hauntedhouse project
# Matteo Visconti di Oleggio Castello
# May 2017

bootstrap: docker
from: neurodebian:jessie

%environment
    PATH="/usr/local/anaconda/bin:$PATH"
%post
    # install debian packages
    apt-get update
    apt-get install -y eatmydata
    eatmydata apt-get install -y wget bzip2 \
      ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
      git git-annex
    apt-get clean

    # install anaconda
    if [ ! -d /usr/local/anaconda ]; then
         wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh \
            -O ~/anaconda.sh && \
         bash ~/anaconda.sh -b -p /usr/local/anaconda && \
         rm ~/anaconda.sh
    fi
    # set anaconda path
    export PATH="/usr/local/anaconda/bin:$PATH"

    # install specific versions of packages for reproducibility
    conda install \
        pip=9.0.1 \
        numpy=1.11.3 scipy=0.18.1 scikit-learn=0.18.1 scikit-image=0.12.3 \
        pandas=0.19.2 seaborn=0.7.1 jupyter ipython=5.3.0 joblib=0.9.4 \
        pyqt=4
    conda clean --tarballs
    pip install mne==0.13 duecredit

    # make /data and /scripts so we can mount it to access external resources
    if [ ! -d /data ]; then mkdir /data; fi
    if [ ! -d /scripts ]; then mkdir /scripts; fi

%runscript
    exec /bin/bash

Let's go through the definition file. The first part describes where we want to pull the base image from. In this case I'm using an image that is available from Docker Hub; in particular, I want to use the awesome NeuroDebian as a base image. Note that I specify the release I want, to make sure that in the future it will pull the same release and not the latest one.

bootstrap: docker
from: neurodebian:jessie

After specifying the base image, we define two sections. The %environment section sets environment variables that are available when the container runs. The %post section lists the commands that we want to run inside the container while the image is being built, for example to install basic packages.

First, we add anaconda to the PATH in %environment, and then install some required Debian packages in %post:

%environment
    PATH="/usr/local/anaconda/bin:$PATH"
%post
    # install debian packages
    apt-get update
    apt-get install -y eatmydata
    eatmydata apt-get install -y wget bzip2 \
      ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 \
      git git-annex
    apt-get clean

Then, it's time to install Python. I decided to use anaconda because it makes it very easy to have a system with all the Python scientific packages. If you worry about size, you could install miniconda and then manually install all the packages you need.

    # install anaconda
    if [ ! -d /usr/local/anaconda ]; then
         wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh \
            -O ~/anaconda.sh && \
         bash ~/anaconda.sh -b -p /usr/local/anaconda && \
         rm ~/anaconda.sh
    fi
    # set anaconda path
    export PATH="/usr/local/anaconda/bin:$PATH"

Then, we install the Python packages we need for our analyses. Note that for the majority of them I specify exactly what version I want. This is very important for reproducibility: new versions might introduce new features, break the API, etc., and we want to make sure that in 20 years our analyses can still be reproduced on our quantum computers.

    # install specific versions of packages for reproducibility
    conda install \
        pip=9.0.1 \
        numpy=1.11.3 scipy=0.18.1 scikit-learn=0.18.1 scikit-image=0.12.3 \
        pandas=0.19.2 seaborn=0.7.1 jupyter ipython=5.3.0 joblib=0.9.4 \
        pyqt=4
    conda clean --tarballs
    pip install mne==0.13 duecredit
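
A quick way to spot the packages that are not pinned is to scan the list of specs for entries without a version. The helper below is only a sketch (unpinned_specs is a hypothetical name, not part of conda or pip); it relies on the fact that conda pins with name=version and pip with name==version, so a pinned spec always contains an equals sign.

```shell
# Print every package spec that lacks an explicit version.
# unpinned_specs is a made-up helper for illustration only.
unpinned_specs() {
    for spec in "$@"; do
        case "$spec" in
            *=*) ;;                      # has a version, nothing to report
            *) printf '%s\n' "$spec" ;;  # no version given
        esac
    done
}

# The specs from the definition file above (conda and pip together).
unpinned_specs pip=9.0.1 numpy=1.11.3 scipy=0.18.1 scikit-learn=0.18.1 \
    scikit-image=0.12.3 pandas=0.19.2 seaborn=0.7.1 jupyter ipython=5.3.0 \
    joblib=0.9.4 pyqt=4 mne==0.13 duecredit
```

With the specs from my definition file this flags jupyter and duecredit, the two packages installed without an explicit version. Note that the check only looks for an equals sign, so a partial pin such as pyqt=4 passes.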

Finally, we create two directories inside the container: /data and /scripts. When we run the container, we will mount each of these to point to where we store the data and the scripts for our project. I prefer to keep the two separate, but you should change this to fit your needs.

    # make /data and /scripts so we can mount it to access external resources
    if [ ! -d /data ]; then mkdir /data; fi
    if [ ! -d /scripts ]; then mkdir /scripts; fi

After the %post section, one can specify under the %runscript section what command the container will execute when it is started with singularity run. In this case, we simply call /bin/bash to get a shell.

%runscript
    exec /bin/bash

2. Bootstrapping the image

Once we have created our definition file hauntedhouse.def, we need to create an actual image, and then tell Singularity to use the definition file to bootstrap our system. Note that all these commands need root access, so you can create the images on your laptop and then transfer them to your High Performance Computing (HPC) cluster.

First, we create the image with a size of 4 GB. For this definition file 4 GB is more than enough, but you might have to play with the size if you are installing a lot of packages or software.

$ sudo singularity create --size 4096 hauntedhouse.img

Creating a new image with a maximum size of 4096MiB...
Executing image create helper
Formatting image with ext3 file system
Done.

Then, we use the hauntedhouse.def definition file to bootstrap it. This can take a while depending on how many packages you're installing.

$ sudo singularity bootstrap hauntedhouse.img hauntedhouse.def

Bootstrap initialization
Checking bootstrap definition
Executing Prebootstrap module
Executing Bootstrap 'docker' module
From: neurodebian:jessie
library/neurodebian:jessie
Downloading layer: sha256:226a6875826cc83a2267f32a1a9f78e632d328aded82bfbbc164464890b78e12

...

Running setup.py bdist_wheel for duecredit: finished with status 'done'
Stored in directory: /root/.cache/pip/wheels/19/92/83/8d495c12e60a1e6ac33ebf7e693089f572740d04ca927dbdb7
Successfully built mne duecredit
Installing collected packages: mne, citeproc-py, duecredit
Successfully installed citeproc-py-0.3.0 duecredit-0.6.1 mne-0.13
+ [ ! -d /data ]
+ mkdir /data
+ [ ! -d /scripts ]
+ mkdir /scripts
Done.

That's it! We have created our Singularity image, and we can test it with the following command, with which we enter the container and run echo "Hello world":

$ singularity exec hauntedhouse.img echo "Hello world"
Hello world
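
The two build commands above can be collected in a small script, so that rebuilding the image is a single step. This is only a sketch: IMG, DEF, SIZE_MB and DRY_RUN are variable names I chose for the example, and by default the script only prints the commands instead of running them.

```shell
#!/bin/bash
# Sketch: the create and bootstrap steps from this post in one script.
# IMG, DEF, SIZE_MB and DRY_RUN are hypothetical names chosen for this example.
IMG="${IMG:-hauntedhouse.img}"
DEF="${DEF:-hauntedhouse.def}"
SIZE_MB="${SIZE_MB:-4096}"

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the empty image first, then bootstrap it from the definition file.
build_image() {
    run sudo singularity create --size "$SIZE_MB" "$IMG" &&
    run sudo singularity bootstrap "$IMG" "$DEF"
}

build_image
```

Setting DRY_RUN=0 in the environment makes the script actually run the two commands, which needs root access just like running them by hand.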

Using our new container

Now we're one step closer to using the container for our analyses, but we still need to make sure that we're not inheriting anything from our parent system (like environment variables) and we're mounting the correct directories. By default Singularity mounts the home directory of the current user, and inherits environment variables: while this is great in certain situations, we want to avoid it if we want to make a container that can be shared with other researchers; we want to rely as little as possible on the current system. Thus, we need to start our image as follows (h/t to Yaroslav Halchenko for showing me this trick):

$ singularity exec -e -c hauntedhouse.img echo "Hello world"
Hello world

The two important bits here are the flag -e (--cleanenv), which cleans up the environment variables before entering the container, and the flag -c (--contain), which keeps the host's /home and /tmp from being mounted inside the container.
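
One way to convince yourself that the environment is really clean is to set a marker variable on the host and check whether it is visible inside. The sketch below does that for any command prefix; leaks_host_env and HOST_MARKER are names made up for this example. So that it runs even without an image, the demonstration uses a plain shell and env -i as stand-ins; the Singularity calls are shown as comments, and what they return is what I would expect from the flags described above, not something the sketch verifies.

```shell
# Succeeds (exit status 0) if a variable set on the host is visible to the
# command prefix passed as arguments. HOST_MARKER is a made-up variable name.
leaks_host_env() {
    HOST_MARKER=from_host "$@" /bin/sh -c 'test -n "${HOST_MARKER:-}"'
}

# Stand-ins that run without Singularity:
if leaks_host_env; then echo "plain shell: host variable visible"; fi
if ! leaks_host_env env -i; then echo "env -i: host variable hidden"; fi

# With the image from this post one would run (not executed here):
#   leaks_host_env singularity exec hauntedhouse.img        # expected: visible
#   leaks_host_env singularity exec -e -c hauntedhouse.img  # expected: hidden
```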

Putting it all together

You can create a wrapper to run the Singularity image with those flags; this is what I use for my analyses:

#!/bin/bash
singularity run \
    -c \
    -e \
    -B /data/annex/hauntedhouse:/data \
    -B /home/mvdoc/exp/hauntedhouse_mne/:/scripts \
    "$(dirname "$0")/hauntedhouse.img"

Note that with the -B flag I'm telling Singularity to mount the correct directories to /data and /scripts, so that I can access both from inside the container. This wrapper starts a bash shell inside the container, and I use it to start a Jupyter notebook to run interactive analyses. When I want to scale the analyses and submit everything to an HPC cluster, I use this wrapper called exec_hauntedhouse to run pipelines or scripts:

#!/bin/bash
singularity exec \
    -e \
    -c \
    -B /data/annex/hauntedhouse:/data \
    -B /home/mvdoc/exp/hauntedhouse_mne/:/scripts \
    "$(dirname "$0")/hauntedhouse.img" "$@"

In this way I can run any command from inside the container; if you're pointing to a specific script, make sure to point to the path inside the container. For example, this is how I would run it:

$ ./exec_hauntedhouse cat /scripts/test.py
print("Hello!")

$ ./exec_hauntedhouse python /scripts/test.py
Hello!
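
The wrappers above hard-code the paths on my machine. If you want to share a wrapper along with the image, one option is to read the host paths from environment variables and check that they exist before starting the container. This is a sketch under that assumption, not what I actually use: HH_DATA, HH_SCRIPTS, HH_IMG and HH_DRY_RUN are hypothetical names, and the defaults are simply my own paths from above.

```shell
#!/bin/bash
# Sketch of a wrapper whose host paths come from environment variables.
# HH_DATA, HH_SCRIPTS, HH_IMG and HH_DRY_RUN are hypothetical names.
HH_DATA="${HH_DATA:-/data/annex/hauntedhouse}"
HH_SCRIPTS="${HH_SCRIPTS:-/home/mvdoc/exp/hauntedhouse_mne}"
HH_IMG="${HH_IMG:-$(dirname "$0")/hauntedhouse.img}"

exec_hauntedhouse() {
    # Refuse to start if a directory to be mounted is missing on the host.
    for dir in "$HH_DATA" "$HH_SCRIPTS"; do
        if [ ! -d "$dir" ]; then
            echo "missing host directory: $dir" >&2
            return 1
        fi
    done
    # Prepend the singularity call to the command given by the user.
    set -- singularity exec -e -c \
        -B "$HH_DATA:/data" -B "$HH_SCRIPTS:/scripts" "$HH_IMG" "$@"
    if [ "${HH_DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # only show the command that would be run
    else
        "$@"
    fi
}

# As the last line of a real wrapper script one would add:
#   exec_hauntedhouse "$@"
```

A collaborator would then only need to export HH_DATA and HH_SCRIPTS to point to their own copies of the data and the code.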

Conclusion

I started using Singularity in the past week and I'm in love with it. It's very easy to create containers to run and scale analyses, and it makes my work much more reproducible. I don't have to worry about installing the right packages across different computers, and I can use different HPC clusters without problems. It has become incredibly easy to share your system with Singularity and Singularity Hub, your code with GitHub and the Open Science Framework, and your data with DataLad and OpenfMRI. I don't see any reason why we shouldn't do it. It might seem like more work, but it's a good investment in the long run, for both you and science.

Acknowledgements

Thanks to the folks at Singularity for making such a great piece of software, and to Yaroslav Halchenko for an endless bag of tips and tricks.

Update 2017-06-04: modified scripts to reflect changes to Singularity 2.3

If you have feedback on this post, please let me know! Send me a tweet or an email at mvdoc.grobfuscatemyemail!@dartmouth.edu.


