Chapter 4 Reproducibility

Reproducibility is an important skill in data science. If you want to share your analysis, you should describe the software you used in such a way that other people can easily install the same software. There are two common ways to do this: software environments and containerization. In a software environment, software is installed in isolated folders. In containerization, software is packaged in an isolated image that runs on top of the host's kernel, which makes it lighter than a full virtual machine. Both methods work on Linux.

4.1 Pixi based environments

If you want to install software consistently on a system where you do not have admin rights, pixi can help. It determines 1) which programs can be installed, 2) which other programs must also be installed to get everything working, and 3) how these requirements differ between operating systems and versions.

Pixi is installed as follows:

# Install the binary
$ curl -fsSL https://pixi.sh/install.sh | bash
# Reset the shell
$ source ~/.bashrc
# Confirm installation.
$ pixi --version 

The documentation for pixi can be found at https://pixi.sh/latest/basic_usage/.
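
When the commands in this chapter are run from a script rather than typed by hand, it can help to check first that pixi is actually on the PATH. The following is a minimal sketch, assuming a POSIX shell. The helper name have_cmd is made up for illustration and is not part of pixi.

```shell
# have_cmd is a hypothetical helper, not part of pixi.
# It succeeds when the named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pixi; then
    # pixi is installed: print its version, as in the install steps above
    pixi --version
else
    echo "pixi not found; re-run the installer or open a new shell" >&2
fi
```

If the check fails right after installation, the most likely cause is that the shell has not yet picked up the updated PATH, which is what the source ~/.bashrc step is for.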

Because putting all programs in a single database would make it too large, the developers group the software into channels. Channels can be general, such as conda-forge, or domain specific, such as bioconda. Most bioinformatics software is found in the bioconda channel.

4.1.1 Use case: installing qiime

The example I will give here is installing qiime, a software package for analysing amplicon sequencing experiments. To find out whether this software is available, go to https://prefix.dev/ and type qiime in the search box:

Searching for the qiime package on prefix.dev.

Click on the first hit and you will see this:

The main information for the qiime software shows the version.

Here you see that v1.9.1 is available. This means that you can install qiime!

Pixi has two modes of installation: global and local. A global installation places the program in the ~/.pixi directory in your home directory, which makes it available from anywhere, just like ls or echo. A local installation installs the program for a specific pixi project. This is most useful when you combine programs with scripts and want to share them. For instance, if you write a Python program that you want to share with someone, you need to give that person both the Python code and the exact Python interpreter you used.

4.1.2 Global installation

For global installation, you would use the following command:

$ pixi global install qiime -c bioconda

Here, -c bioconda indicates that qiime is located in the bioconda channel.
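
A globally installed program only behaves like ls or echo if the directory that holds it is on the PATH. The sketch below checks this, assuming the default layout in which pixi exposes global programs in ~/.pixi/bin. The helper name path_contains is hypothetical.

```shell
# path_contains is a hypothetical helper, not part of pixi.
# It succeeds when the directory given as $1 is an entry in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumption: global programs are exposed in ~/.pixi/bin (the default).
if path_contains "$HOME/.pixi/bin"; then
    echo "global pixi programs are on the PATH"
else
    echo "add $HOME/.pixi/bin to the PATH, for example in ~/.bashrc" >&2
fi
```

To see which programs have been installed globally, pixi global list can be used.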

4.1.3 Local installation

If you are writing an analysis that uses qiime and you want to share your bash analysis scripts, you need to specify which version of qiime you used. You can do this as follows:

# Navigate to where you want your analysis
$ mkdir qiime-analysis
$ cd qiime-analysis
$ pixi init
 Created ~/qiime-analysis/pixi.toml

Then, to tell pixi to also look in the bioconda channel for programs, type:

$ pixi project channel add bioconda
 Added bioconda (https://conda.anaconda.org/bioconda/)

After this, you can type

$ pixi add qiime

to install the latest version of qiime. You can now use this program (inside qiime-analysis) by typing pixi run qiime.
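
The project is recorded in pixi.toml, and pixi also writes a pixi.lock file with the exact versions it resolved. Sharing both files together with your scripts is what lets someone else rebuild the same environment. The following is a minimal sketch of a check that a collaborator could run on a copy of the project. The helper name project_is_shareable is hypothetical.

```shell
# project_is_shareable is a hypothetical helper, not a pixi command.
# It succeeds when the directory in $1 holds both files that pixi
# needs to rebuild the environment.
project_is_shareable() {
    [ -f "$1/pixi.toml" ] && [ -f "$1/pixi.lock" ]
}

# Typical use inside a copy of the project:
#   project_is_shareable . && pixi install && pixi run qiime
```

Here pixi install recreates the environment from the two files, after which pixi run qiime works as it did on the original machine.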

4.2 Singularity containers

With Singularity containers, you can run software packaged for one Linux distribution on any other system that supports the container format. This is especially useful if the software you want to run is not available in any conda channel.
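
As a rough illustration, running a program from a container image comes down to calling the exec subcommand with the image file and the command. The sketch below wraps this in a small function. The variable CONTAINER_RUNTIME, the function run_in_container and the image name my_tool.sif are all hypothetical and only serve the example. On systems that ship Apptainer, the renamed continuation of Singularity, the variable can be set to apptainer.

```shell
# CONTAINER_RUNTIME is a hypothetical variable used only in this sketch.
# It defaults to singularity; set it to apptainer where that is installed.
CONTAINER_RUNTIME="${CONTAINER_RUNTIME:-singularity}"

# run_in_container IMAGE COMMAND [ARGS...]
# Runs COMMAND inside the container image via "<runtime> exec".
run_in_container() {
    image="$1"
    shift
    "$CONTAINER_RUNTIME" exec "$image" "$@"
}

# Example (my_tool.sif is a placeholder image name):
#   run_in_container my_tool.sif my_tool --help
```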