Skip to content

The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

Small Variant workflow

pipeline version: v2.0.8

Overview

The Small Variant workflow (previously the Gene Variant workflow) finds variants within a list of genes. It outputs these into tsv files containing variants, variant consequences and the participants with alternative alleles at these loci.

Given a list of participant VCFs and a list of query genes, the workflow aggregates the VCFs into a single multi-sample VCF file keeping only variants within query genes, and annotates the variants. The default annotations include functional consequences, allele frequencies, disease associations, and pubmed citations. All included annotations are listed in the Processes section. You can use the workflow to answer common queries like “Which participants have rare, missense variants in my gene(s) of interest?”

The workflow is written in Nextflow DSL1 and uses containerised R (and the RLabKey API), Python 3, bcftools, and VEP

Note

In the RE (HPC), if no participant VCF list is provided, the workflow performs a LabKey query to retrieve a list of standard Illumina VCF file paths for all germline rare disease and cancer participants. The default selection drops Dragen-called genomes, in order to get the largest sample called by a single tool. You can use alternate VCF files (e.g. platypus, Dragen) by supplying a VCF list. In CloudRE (CloudOS), a participant VCF list is required. In both the RE and CloudRE, the minimum sample size (for each build included) is four participants.

Design

Figure 1 shows the connected processes of the workflow as a directed acyclic graph (DAG).

Figure 1. Small Variant DAG

Usage

RE (HPC)

Warning

Before running the workflow in the RE (HPC), you must configure your R LabKey API access.

Using the submission script

Navigate to your gecip or discovery forum folder, create a directory for your analysis, e.g. my_analysis. Copy the submission script only, not the entire workflow directory.

mkdir my_analysis
cd my_analysis
cp /pgen_int_data_resources/workflows/rdp_small_variant/main/submit.sh .

The submission script simply loads the required Nextflow and singularity modules and uses the Nextflow command line tool to run the workflow (see Nextflow documentation). Replace <project_code> with your LSF Project Code.

#!/bin/bash

#BSUB -P <project_code>
#BSUB -q inter
#BSUB -J smlv
#BSUB -o logs/%J_smlv.stdout
#BSUB -e logs/%J_smlv.stderr

LSF_JOB_ID=${LSB_JOBID:-default}
export NXF_LOG_FILE="logs/${LSF_JOB_ID}_smlv.log"

module purge
module load singularity/4.1.1 nextflow/22.10.5

mkdir -p logs

small_variant='/gel_data_resources/workflows/rdp_small_variant/main'

nextflow run "${small_variant}"/main.nf \
    --project_code "<project_code>" \
    --data_release "main-programme_v18_2023-12-21" \
    --gene_input "${small_variant}"/input/gene_list.txt \
    --sample_input "${small_variant}"/input/sample_list.txt \
    --use_sample_input false \
    --outdir "results" \
    -profile cluster \
    -ansi-log false \
    -resume

The highlighted lines are required for the workflow to run. You can edit the input to each parameter, but do not remove them.

The example --gene-input includes three genes (BRCA1, TNF, ENSG00000227518); the example --sample_input is a four column tab-delimited file with a single entry participant_id platekey genome_build file_path. This entry in the default sample file is a reminder of the column content and file structure. If you use your own sample file, it should have no header. By default, the --sample_input file is not used because --use_sample_input is set to false.

To run the workflow on the example genes in the default LabKey query sample, run submit.sh with the above values.

bsub < submit.sh

The above will perform the example LabKey query to select participant VCFs, and write the results to a results/ subdirectory. The output directory can be changed with --outdir. All workflow processes are run with containers and container images are cached in a singularity/ subdirectory of your working directory. -profile is used to select a particular configuration; use -profile cluster for workflow runs on HPC. -resume only runs processes affected by parameter changes on re-runs of the workflow.

To change any other workflow parameters, you can edit your copy of the submission script. For example, to use your own sample file, use --sample_input <sample_file> and set --use_sample_input true. See below for a detailed description of workflow Processes and querying and setting workflow Parameters.

CloudRE (CloudOS)

Open the CloudOS application from the CloudRE WorkSpace desktop and log in with your credentials.

  1. select the Pipeline & Tools icon from the side menu
  2. select WORKSPACE TOOLS tab
  3. type "small-variant" into the search field and select the gel-small-variant tile

    From the next screen, use front-end menu to select inputs and generate the Nextflow CLI command.

    Warning

    In CloudRE (CloudOS), you must supply a VCF list. See Input section below for the format of the sample input file.

  4. give the analysis a name ("gene_set_2" in the screen shot)

  5. select Branch main
  6. select Nextflow Profile cloud
  7. under Data & Parameters type/select the required inputs
    • gene_input - select using the file explorer menus to the right
    • sample_input - select using the file explorer menus to the right
    • use_sample_input - set this value to true

Processes

  1. Get coordinates for query genes (fetch_coords) Retrieves genomic coordinates for each query gene.

  2. Get sample VCF list (fetch_samples) If no sample list is provided, by default the workflow will perform a LabKey query, selecting all germline genomes for cancer and rare disease. Alternatively a sample list with header () can be provided with parameter --sample_input.

    NB. A list of query samples is currently required on CloudOS, as there is no access to LabKey.

  3. Chunk participant VCF list (chunk_vcfs) Chunks the participant file for each build. The number of rows in each chunk is 1000, or half the number of rows in the original participant file if that is less than 1000.

  4. Merge VCFs by chunk, selecting query genes regions (first_round_merge) Merges single-sample VCF in chunks of 1000 (default) files.

  5. Merge and norm to multi-sample VCF (second_round_merge) After merging the multi-sample VCFs produced by the first round merge, bcftools norm is used to "(l)eft-align and normalize indels, check if REF alleles match the reference, split multi-allelic sites into multiple rows; recover multi-allelics from multiple rows."

  6. Compute and fill VCF INFO tags (fill_tags_query) INFO tags computed and added with bcftools plugin +fill-tags:

    • AN Total number of alleles in called genotypes
    • AC Allele count in genotypes
    • AC_Hom Allele counts in homozygous genotypes
    • AC_Het Allele counts in heterozygous genotypes
    • AC_Hemi Allele counts in hemizygous genotypes
    • NS Number of samples with data
    • MAF Minor Allele frequency
  7. VEP-annotate variants (annotate) VEP annotations added as well as custom ClinVar and plugin LOFTEE fields (these are included in the Nextflow configuration files conf/cluster.config and conf/cloud.config and added to the workflow with the appropriate -profile). See Running VEP for a detailed description of options and annotations.

    • sift b predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids (both prediction term and score)
    • ccds adds the Consensus Coding Sequence (CCDS) transcript identifer (where available) to the output
    • uniprot adds best match accessions for translated protein products from three UniProt-related databases (SWISSPROT, TREMBL and UniParc)
    • hgvs adds Human Genome Variation Society (HGVS) nomenclature based on Ensembl stable identifiers. Both coding and protein sequence names are added where appropriate
    • symbol adds the gene symbol (e.g. HGNC) (where available)
    • numbers adds affected exon and intron numbering
    • domains adds names of overlapping protein domains
    • regulatory looks for overlaps with regulatory regions
    • canonical adds a flag indicating if the transcript is the canonical transcript for the gene
    • protein adds the Ensembl protein identifier to the output where appropriate
    • biotype adds the biotype of the transcript or regulatory feature
    • tsl adds the transcript support level for this transcript
    • appris adds the APPRIS isoform annotation for this transcript
    • gene_phenotype indicates if the overlapped gene is associated with a phenotype, disease or trait
    • af adds the global allele frequency (AF) from 1000 Genomes Phase 3 data for any known co-located variant
    • af_1kg adds allele frequency from continental populations (AFR,AMR,EAS,EUR,SAS) of 1000 Genomes Phase 3
    • af_esp includes allele frequency from NHLBI-ESP populations
    • af_gnomad includes allele frequency from Genome Aggregation Database (gnomAD) exome populations
    • max_af reports the highest allele frequency observed in any population from 1000 genomes, ESP or gnomAD
    • pubmed reports Pubmed IDs for publications that cite existing variant
    • variant_class outputs the Sequence Ontology variant class
    • mane adds a flag indicating if the transcript is the MANE Select or MANE Plus Clinical transcript for the gene

    VEP --custom ClinVar fields: CLNDN, CLNDNINCL, CLNDISDB, CLNDISDBINCL, CLNHGVS, CLNREVSTAT, CLNSIG, CLNSIGCONF, CLNSIGINCL, CLNVC, CLNVCSO, CLNVI

    VEP --plugin LOFTEE (Loss-Of-Function Transcript Effect Estimator) fields assess low- and high-confidence, stop-gained, splice site disrupting, and frameshift variants

  8. Combine +fill-tags summary and VEP annotation (sum_and_annotate) Collects the bcftools +fill-tags computed INFO tags and VEP annotations into the main output - a single variant annotation file.

  9. Write containerised tool version software_versions.yml (dump_versions) Writes tool versions to log file (software_versions.yml)

Input

Gene input

You must provide a list of query genes. NB. This file should include 1 column, and no header. The file can include both HGNC symbols and Ensembl IDs, e.g.,

gene_list.txt
BRCA1
ENSG00000227518
TNF

Query genes in the input list that are not found in BioMart reference data are written to <build>_genes_not_found.txt.

Note

The maximum number of genes per query is 10. If you need to query more than 10 genes, please run the workflow with multiple smaller gene sets. For a large set of genes, consider using aggV2 (see aggV2 Code Book::Functional Annotation Queries).

Sample input

In the RE (HPC), the default sample (if no sample file is supplied) is derived from a LabKey query of all rare disease and cancer germline participants from the latest data release. Below is an excerpt from the default SQL query made to LabKey in the RE (HPC)

Excerpt from: fetch_coords.R
if (
    as.numeric(
        str_replace(
            unlist(
                str_split(argv$data_version, "_"))[2], "v", "")) > 9) {
    sql_add_1 <- ", delivery_version"
    sql_add_2 <- "AND g.delivery_version != 'Dragen_Pipeline2.0'"
    sql_add_3 <- "AND g_newer.delivery_version != 'Dragen_Pipeline2.0'"
}

sql_query <- stringr::str_interp("
    SELECT g.participant_id, platekey, delivery_id, delivery_date, type,
        file_path, normalised_consent_form, genome_build
    ${sql_add_1}
    FROM genome_file_paths_and_types AS g
    INNER JOIN participant AS p ON p.participant_id = g.participant_id
    WHERE type IN ('rare disease germline', 'cancer germline')
    AND g.participant_id NOT IN (${excluded_participant_ids})
    AND genome_build = '${argv$genome_build}'
    AND file_sub_type = 'Standard VCF'
    AND normalised_consent_form != 'cohort-tracerx'
    ${sql_add_2}
    AND NOT EXISTS (
        SELECT 1 FROM genome_file_paths_and_types AS g_newer
        WHERE g_newer.participant_id = g.participant_id
        AND g_newer.type = g.type
        AND g_newer.genome_build = g.genome_build
        AND g_newer.file_sub_type = g.file_sub_type
        AND g_newer.delivery_date > g.delivery_date
        ${sql_add_3})
") %>%
stringr::str_squish()

Sample input file

Optionally, you can provide a list of sample VCFs in a tab-delimited sample file. NB. This file should include 4 columns (participant_id, platekey, genome_build, file_path), and no header.

In CloudRE (CloudOS), a sample list is required as LabKey is not yet accessible from CloudOS. We recommend using the Cohort Browser - Data Science (icon panel on left) > Cohorts > Source data (tab) - select latest data from the drop down menu under Phenotypic Data Releases (currently source_data_100kv17_covid5) - select New cohort button - see the Building cohorts with Cohort Browser in CloudOS tutorial here for more detail

Sample input file considerations

  • use the latest GEL data release
  • use a minimum sample size of four particpants (for each build included)
  • use the latest delivery_date
  • use only participants called with one tool (the default selection in the RE drops Dragen-called genomes)

Output

The workflow will create the following in your working directory:

results/         # Directory containing the main output (default --outdir)
work/            # Directory containing the intermediate output
singularity/     # Directory containing container images
.nextflow.log    # Nextflow log file

For each query gene and build, the main output is an annotated list of variants,

  • <build>_<gene>_<ensemblid>_annotated_variants.tsv

as well as a single multi-sample VCF file and index,

  • <build>_<gene>_<ensemblid>_left_norm_tagged.vcf.gz
  • <build>_<gene>_<ensemblid>_left_norm_tagged.vcf.gz.tbi

Parameters

Workflow parameters defined in the configuration files (e.g. params.data_version = main-programme_v17_2023-03-30), can be overriden on the command line.

nextflow run main.nf \
--data_version main-programme_v18_2023-12-21 \
-profile cluster

Note

On the command line, prefix workflow parameters with --; prefix Nextflow options with -.

For a complete list of workflow parameters (configurable parameters have a params. prefix).

nextflow config -flat -profile <profile> main.nf | grep params

Nextflow documentation

For Nextflow command line documentation use nextflow help, and nextflow <command> -help for help on a particular command. To print the workflow configuration use nextflow config -profile <profile> - configurable parameters have a params. prefix. See Parameters for use on the command line. See also Nextflow documentation and Nextflow training.