About the HPC
Genomics England provides a High Performance Computing Cluster (HPC) called Helix. You should use this to run all production-worthy scripts and workflows.
About high performance computing
High performance computing (HPC) is the use of a collection of computers to process large volumes of data. HPC also refers to the cluster of computers itself. An HPC is a collection of computer cores organised into nodes.
HPCs are usually accessed from the command line via ssh. When you do so, you land on a login node. These nodes have limited computing power, but they serve as a platform for scripting and job submission.
A job is a task that you submit to a queue, where it waits until the scheduler can run it. For large jobs, i.e. jobs that process a lot of data, it is important to specify the number of cores and the amount of memory the job needs. You can read more on how to do that, and on how to run job arrays (sets of jobs that are identical except for their input), in our HPC guidelines.
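As a rough illustration of requesting cores and memory and of submitting a job array, the sketch below builds two LSF bsub command lines and prints them without submitting anything. The script names (my_analysis.sh, process_sample.sh), the job name, and the resource values are hypothetical, and the queue is left as a placeholder. The memory unit depends on how the cluster is configured, so treat the HPC guidelines as the authoritative source for real values.

```shell
#!/bin/sh
# Dry run: build the bsub command lines and print them instead of submitting.
# All values below are illustrative assumptions, not Helix defaults.
QUEUE="<queue_name>"   # placeholder: replace with a queue you can access
CORES=4                # number of job slots (cores) to request
MEM=8000               # memory to reserve; the unit depends on cluster configuration

# Single job requesting cores and memory, with all slots on one host.
single_job="bsub -q $QUEUE -n $CORES -R \"rusage[mem=$MEM] span[hosts=1]\" -o job.%J.out ./my_analysis.sh"

# Job array with 10 elements; LSF sets LSB_JOBINDEX (1..10) for each element,
# which the hypothetical script could use to pick its input.
array_job="bsub -q $QUEUE -J \"samples[1-10]\" -o samples.%J.%I.out ./process_sample.sh"

printf '%s\n' "$single_job" "$array_job"
```

In the output file names, %J expands to the job ID and %I to the array index, so each array element writes to its own file.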
Genomics England Helix setup
For most of the nodes, Helix uses the IBM Load Sharing Facility (LSF) platform to schedule jobs, by organising each job requirement and managing the job queue in an efficient and fair way.
Each node has a fixed number of 'job slots'. A job typically consumes one slot (parallel jobs can consume more), and the standard policy is one job slot per core. This caps the number of jobs that can run concurrently on the cluster, which keeps dispatch fast. In our infrastructure, each compute node can accommodate 34 concurrent batch jobs.
For interactive jobs, we allow more job slots per node with the assumption that interactive workloads are not resource intensive. Interactive sessions should be used mostly as a way to submit jobs to the cluster, rather than for running day-to-day activities directly.
This is our main production grid. All workloads are expected to be submitted to this grid targeting the right queue.
|Cluster name|Number of physical compute nodes|Operating systems|Queueing system and schedulers|Number of CPU cores|Total number of job slots|Available queues|
|---|---|---|---|---|---|---|
|helix|54|CentOS 7.6.1810|LSF|34 cores/node = 1,836|1,836|inter|
||1 GPU (2x V100) node|CentOS 7.6.1810|LSF||||

The GPU node is accessed using the reservation ID gpu1 via LSF ("brsvs" shows current reservations; you need to be added to a reservation to access it).
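A minimal sketch of how a job might be directed at the GPU reservation, assuming you have already been added to it. brsvs lists reservations and the bsub -U option submits a job into one. The script name gpu_job.sh is hypothetical, and the submission command is printed rather than run. Whether a queue also needs to be specified is not stated here, so the sketch omits it.

```shell
#!/bin/sh
RESERVATION_ID="gpu1"   # reservation ID named above

# List current reservations only if the LSF client tools are available.
if command -v brsvs >/dev/null 2>&1; then
    brsvs
else
    echo "brsvs not found: LSF commands are not on PATH in this shell"
fi

# Dry run: print the submission command instead of running it.
gpu_cmd="bsub -U $RESERVATION_ID -o gpu_job.%J.out ./gpu_job.sh"
echo "$gpu_cmd"
```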
View cluster information
To view cluster information (LSF version, cluster name, master host) and check that your environment is correctly set up, run the lsid command.
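The check can be sketched as follows, on the assumption that a correctly set up environment is one where the LSF client commands are on your PATH. lsid is the standard LSF command that reports the version, cluster name and master host; the lsf_status variable is only there for illustration.

```shell
#!/bin/sh
# Check that the LSF client commands are on PATH before querying the cluster.
if command -v lsid >/dev/null 2>&1; then
    lsf_status="available"
    lsid    # prints the LSF version, cluster name and master host
else
    lsf_status="missing"
    echo "lsid not found: the LSF environment is not set up in this shell"
fi
echo "LSF client tools: $lsf_status"
```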
Resources in LSF
LSF tracks resource availability and usage. Jobs can request specific resources by naming them in their resource requirements.
All hosts have static numeric resources, e.g.:
maxmem total physical memory
ncpus number of CPUs
maxtmp maximum available space in /tmp
cpuf CPU factor (relative performance)
All hosts also have dynamic numeric resources, e.g.:
mem available memory
tmp available space in /tmp
ut CPU utilisation
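To see these values for the hosts in the cluster, the standard LSF commands lshosts (static resources) and lsload (dynamic load indices) can be used. The sketch below runs each one only if it is found on your PATH; the checked counter is illustrative.

```shell
#!/bin/sh
# lshosts reports static resources per host (for example ncpus, maxmem, cpuf);
# lsload reports dynamic load indices per host (for example ut, tmp, mem).
checked=0
for tool in lshosts lsload; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool"
    else
        echo "$tool not found: LSF commands are not on PATH in this shell"
    fi
    checked=$((checked + 1))
done
echo "Commands checked: $checked"
```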
Additionally, each host can have boolean resources that describe its operating system (OS) and architecture (ARCH). This makes it easy to target the correct platform. Examples of boolean resources:
ub1604 host is running Ubuntu 16.04.
dsk host has local disk /scratch with 2TB space
Sections of a resource requirement string (-R option)
Select: a logical expression built from a set of resource names
Order: the order string is used for host sorting and selection
Usage: the usage string specifies resource reservations for jobs
Span: the span string specifies the locality of a parallel job
Same: the same string specifies that all processes of a parallel job must run on hosts with the same resource
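The sketch below shows how the five sections might be combined into a single -R string. It composes the string piece by piece and prints the resulting bsub command without submitting it. In LSF syntax the usage section is written rusage[...]. The thresholds, slot counts and script name are illustrative assumptions, and dsk is the boolean resource from the example above.

```shell
#!/bin/sh
# Dry run: compose a resource requirement string section by section.
select_str="select[mem>8000 && dsk]"   # only hosts with enough available memory and a local /scratch disk
order_str="order[ut]"                  # sort candidate hosts by CPU utilisation, lowest first
usage_str="rusage[mem=8000]"           # reserve memory for the job
span_str="span[ptile=2]"               # place two slots of the parallel job on each host
same_str="same[type]"                  # all hosts used by the job must be of the same host type

res_req="$select_str $order_str $usage_str $span_str $same_str"

# Hypothetical script name; the command is printed, not executed.
echo "bsub -n 4 -R \"$res_req\" ./my_analysis.sh"
```

Keeping each section in its own variable makes it easier to adjust one requirement, such as the memory reservation, without retyping the whole string.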
Helix was released in 2020, replacing its predecessor Pegasus.