
The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

Using containers within the Research Environment

You can work with containers in the RE using Singularity. A limited number of container repositories can be accessed; these require the use of a proxy. For security reasons, pushing images out of the environment is not allowed.

This page highlights some best practices for working with containers within the Research Environment.

Licensing considerations

Please note if you choose a self-install route you will be solely and fully responsible for acquiring any licences required for the use of and access to the relevant software package. GEL expect all software to be correctly licensed by the researcher where the self-installation route is employed. In no event shall GEL be liable to you or any third parties for any claim, damages or other liability, whether such liability arises in contract, tort (including negligence), breach of statutory duty, misrepresentation, restitution and on an indemnity basis or otherwise, arising from, out of or in connection with software self-installed by the researcher or the use or other dealings by the researcher in the software.

Any links to third party software available on this User Guide are provided “as is” without warranty of any kind, either expressed or implied, and such software is to be used at your own risk. No advice or information, whether oral or written, obtained by you from us or from this User Guide shall create any warranty in relation to the software.

Loading Singularity on the HPC

To use Singularity on the HPC, load the Singularity module: module load tools/singularity/3.8.3
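
For example, a quick way to load the module and confirm it is available (singularity --version simply prints the installed version):

Load and verify Singularity
module load tools/singularity/3.8.3
singularity --version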

Caching (Singularity)

Whenever you create an image with Singularity on the HPC, the files are automatically cached. The cached files are located in /home/<username>/.singularity/. However, if you are pulling and creating an image from a compute node in an interactive session, the cache may be written there instead, which can potentially flood the compute node's memory. You can redirect the cache by setting the environment variable SINGULARITY_CACHEDIR.

For example, we recommend exporting the environment variable in your .bashrc, as follows: export SINGULARITY_CACHEDIR="/re_gecip/my_GECIP_/username/singularity_cache/".
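
A minimal .bashrc snippet might look like this (the path is the placeholder used above; replace it with your own GECIP or Discovery Forum folder):

Example cache redirect in .bashrc
# Redirect the Singularity cache away from /home (adjust the path to your own folder)
export SINGULARITY_CACHEDIR="/re_gecip/my_GECIP_/username/singularity_cache/"
mkdir -p "$SINGULARITY_CACHEDIR"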

To view your current cache, use the command singularity cache list; to view all the individual blobs that have been pulled, use singularity cache list --all.

To clean up your cache, use the command: singularity cache clean
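
Put together, a typical cache inspection and clean-up looks like this:

Example cache inspection and clean-up
# Summary of cached images, then every individual blob that has been pulled
singularity cache list
singularity cache list --all

# Remove cached layers and images to free up space
singularity cache clean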

List of available repositories

There are various container repositories available which have been whitelisted for the HPC. To ensure the correct use and security of our system, the default URLs are blocked by our firewall. Instead, these repositories can be accessed by Singularity using URLs that are routed via our Artifactory. The Artifactory URLs are as follows:

  • Docker: docker-remote.artifactory.aws.gel.ac
  • Quay.io: docker-quay-io.artifactory.aws.gel.ac

The URLs used inside singularity commands should be updated as shown in the following example:

Example URL adjustment
# Outside the Research Environment:
singularity pull bcftools_v1.13.sif docker://quay.io/biocontainers/bcftools:1.13--h3a49de5_0

# On our HPC:
singularity pull bcftools_v1.13.sif docker://docker-quay-io.artifactory.aws.gel.ac/biocontainers/bcftools:1.13--h3a49de5_0

Please refer to the documentation for the container for details on how to run it.

Example quay.io: bcftools

In this example we will use bcftools 1.13, which is available at https://quay.io/repository/biocontainers/bcftools?tab=info.

First load Singularity, then pull the container and build a Singularity image so that you do not need to pull the container every time. Then run a basic command, mount the /gel_data_resources/ folder, and run a simple bcftools view command on a VCF from our aggV2 dataset.

Some containers can be sizeable, so we recommend pulling and/or creating images in an interactive session. The bcftools container in this example is ~234 MB, but images can easily exceed a gigabyte depending on the complexity of the software. Please also note the caching section above.

Running bcftools via containers
module load tools/singularity/3.8.3

singularity pull bcftools_v1.13.sif docker://docker-quay-io.artifactory.aws.gel.ac/biocontainers/bcftools:1.13--h3a49de5_0

singularity exec bcftools_v1.13.sif bcftools --version

singularity exec --bind /nas/weka.gel.zone/pgen_int_data_resources:/nas/weka.gel.zone/pgen_int_data_resources --bind /gel_data_resources:/gel_data_resources \
bcftools_v1.13.sif \
bcftools view -h /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr18_53719754_56316484.vcf.gz | head

Mounting drives and environment variables

In the above example we use the --bind argument to mount the /gel_data_resources folder into the container. By default a container does not have the same drives mounted as the host, so these mounts need to be added manually. An added complication of our file system is that the convenient top-level paths are links to their actual locations: for instance, the actual path of our /genomes/ folder is /nas/weka.gel.zone/pgen_genomes/. On a day-to-day basis you will not notice this, but for containers it is something to be aware of: you first need to --bind the full (actual) path, and then add another --bind for the convenient path. As we understand that this can be rather frustrating, we provide a list of useful bind pairs below to ensure a path of least resistance.

Binds of interest
MOUNT_GENOMES='--bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes --bind /genomes:/genomes'

MOUNT_GEL_DATA_RESOURCES='--bind /nas/weka.gel.zone/pgen_int_data_resources:/nas/weka.gel.zone/pgen_int_data_resources --bind /gel_data_resources:/gel_data_resources'

MOUNT_PUBLIC_DATA_RESOURCES='--bind /nas/weka.gel.zone/pgen_public_data_resources:/nas/weka.gel.zone/pgen_public_data_resources --bind /public_data_resources:/public_data_resources'

MOUNT_SCRATCH='--bind /nas/weka.gel.zone/re_scratch:/nas/weka.gel.zone/re_scratch --bind /re_scratch:/re_scratch'

MOUNT_re_gecip='--bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip --bind /re_gecip:/re_gecip'

MOUNT_DISCOVERY_FORUM='--bind /nas/weka.gel.zone/discovery_forum:/nas/weka.gel.zone/discovery_forum --bind /discovery_forum:/discovery_forum'
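
These are ordinary shell variable assignments, so they need to be defined in the shell or job script from which you run singularity. One possible approach (a sketch only) is to keep the ones you need in a small helper script and source it first; the script name here is hypothetical:

Example: sourcing the binds
# my_singularity_binds.sh -- hypothetical helper script holding the binds you use most
MOUNT_GEL_DATA_RESOURCES='--bind /nas/weka.gel.zone/pgen_int_data_resources:/nas/weka.gel.zone/pgen_int_data_resources --bind /gel_data_resources:/gel_data_resources'
MOUNT_re_gecip='--bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip --bind /re_gecip:/re_gecip'

# In your interactive session or job script:
source my_singularity_binds.sh
singularity exec $MOUNT_GEL_DATA_RESOURCES $MOUNT_re_gecip bcftools_v1.13.sif bcftools --version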

Below is an example where we use two of these variables to save the header of an aggV2 VCF into a .txt file. The example assumes that you have also run the initial bcftools example shown above and therefore have the bcftools Singularity image already built. Please note that you should change the file path to your own folder, and check whether you need the GECIP or the Discovery Forum example.

Example combined mounts
# GECIP example
singularity exec $MOUNT_GEL_DATA_RESOURCES $MOUNT_re_gecip bcftools_v1.13.sif \
bcftools view -h /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr18_53719754_56316484.vcf.gz > /re_gecip/<YOUR_FILE_PATH>/sing_cont_bcftools_header_test.txt

# Discovery Forum example
singularity exec $MOUNT_GEL_DATA_RESOURCES $MOUNT_DISCOVERY_FORUM bcftools_v1.13.sif \
bcftools view -h /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr18_53719754_56316484.vcf.gz > /discovery_forum/<YOUR_FILE_PATH>/sing_cont_bcftools_header_test.txt

Working with containers within a workflow

There are two ways to go about this: either pull the container directly within a task of the workflow, or create an image beforehand and let the workflow call upon that image. You can also add some of the --bind examples from above to the SINGULARITY_MOUNTS variable, as in the sketch below.
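
For example, assuming the MOUNT_* variables from the binds list above have been defined in the submit script, the SINGULARITY_MOUNTS variable used in the Cromwell configuration below could be built from them instead:

Example: building SINGULARITY_MOUNTS from the binds above
# Sketch only: requires the MOUNT_* definitions from the binds-of-interest block to be in scope
SINGULARITY_MOUNTS="$MOUNT_GEL_DATA_RESOURCES $MOUNT_PUBLIC_DATA_RESOURCES $MOUNT_re_gecip"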

Caching within Cromwell
submit-docker = """
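# submit-docker script used by Cromwell to launch each task inside a container.
# The ${...} placeholders (docker, cwd, job_name, memory_mb, etc.) are filled in by Cromwell.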
module load tools/singularity/3.8.3
SINGULARITY_MOUNTS='--bind /nas/weka.gel.zone/re_scratch:/nas/weka.gel.zone/re_scratch \
                    --bind /nas/weka.gel.zone/pgen_genomes:/nas/weka.gel.zone/pgen_genomes \
                    --bind /nas/weka.gel.zone/re_gecip:/nas/weka.gel.zone/re_gecip \
                    --bind /nas/weka.gel.zone/discovery_forum:/nas/weka.gel.zone/discovery_forum'

if [ -z "$SINGULARITY_CACHEDIR" ];
then
    CACHE_DIR=$HOME/.singularity/cache
else
    CACHE_DIR=$SINGULARITY_CACHEDIR
fi

mkdir -p $CACHE_DIR
LOCK_FILE=$CACHE_DIR/singularity_pull_flock

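# Pull and convert the image once under an exclusive lock, so that concurrent
# tasks do not race to populate the same Singularity cache.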
flock --exclusive --timeout 900 $LOCK_FILE \
singularity exec docker://${docker} \
echo "Sucessfully pulled ${docker}"

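# Submit the actual task to LSF; the task script runs inside the container with
# the working directory and the shared mounts bound in.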
bsub \
-q ${lsf_queue} \
-P ${lsf_project} \
-J ${job_name} \
-cwd ${cwd} \
-o ${out} \
-e ${err} \
-n ${cpu} \
-R 'rusage[mem=${memory_mb}] span[hosts=1]' \
-M ${memory_mb} \
singularity exec --containall $SINGULARITY_MOUNTS --bind ${cwd}:${docker_cwd} docker://${docker} ${job_shell} ${docker_script}
"""