Skip to content

Small Variant workflow

pipeline version: v3.1.2

The Small Variant workflow finds variants within a list of genes. It outputs these into tsv files containing variants, variant consequences and the participants with alternative alleles at these loci.

Germline genomes only

The Small Variant workflow is intended for use on germline genomes only.

Given a list of participant VCFs and a list of query genes, the workflow aggregates the VCFs into a single multi-sample VCF file keeping only variants within query genes, and annotates the variants. The default annotations include functional consequences, allele frequencies, disease associations, and pubmed citations. All included annotations are listed in the Design section. You can use the workflow to answer common queries like “Which participants have rare, missense variants in my gene(s) of interest?”

The workflow is written in Nextflow DSL2 and uses containerised Python 3 (and the Python API for LabKey), bcftools, and VEP

Sample selection

On HPC, if no samplesheet is provided, the workflow performs a LabKey query to retrieve all germline rare disease and cancer participants, for both Illumina V2 (GRCh37) and Illumina V4 (GRCh38). This retrieves that largest sample called by a single tool. To use Dragen (GRCh38), set parameter --dragen true (default false). On CloudOS, a samplesheet is required. The minimum sample size for each delivery version (genome build) included is four participants.

Usage

HPC

  1. copy the LSF submission script /gel_data_resources/workflows/rdp_small_variant/v3.1.2/submit.bsub to your re_gecip or discovery_forum folder
  2. create your gene and (optional) samplesheet input files
  3. update the submission script:
    • insert your LSF Project Code
    • set parameters --gene_input and --samplesheet (optional, if using a custom sample subset) to your input files
  4. run the workflow, bsub < submit.bsub
submit.bsub
#!/bin/bash

#BSUB -P <project_code>
#BSUB -q inter
#BSUB -J smlv
#BSUB -o logs/%J_smlv.stdout
#BSUB -e logs/%J_smlv.stderr

PROJECT_CODE=''
SCRATCH="/re_scratch/${USER}"
WORKFLOW_PARENT_DIR='/gel_data_resources/workflows/rdp_small_variant'
WORKFLOW="${WORKFLOW_PARENT_DIR}/v3.1.2"

# Warning message to use latest version
LATEST_VERSION=$(find $WORKFLOW_PARENT_DIR -maxdepth 1 -type d -not -name '.*' -printf '%f\n' | sort | tail -n 1)
CURRENT_VERSION=$(basename $WORKFLOW)
if [[ "$LATEST_VERSION" != "$CURRENT_VERSION" ]]; then
    echo "WARNING: Small Variant $LATEST_VERSION is available. You are using $CURRENT_VERSION."
    echo "Please use $LATEST_VERSION and refer to the RE documentation for updated features."
fi

# Error message for missing SCRATCH
if [ ! -d "${SCRATCH}" ]; then
    echo "ERROR: Directory ${SCRATCH} does not exist. Create a scratch folder in /re_scratch/re_gecip/<domain>/<user_folder>, or /re_scratch/discovery_forum/<domain>/<user_folder>, and update the line SCRATCH=\"/re_scratch/${USER}\" in your submission script to point to the new scratch folder. Please raise a Service Desk ticket if you have any trouble creating a scratch folder."
    exit 1
fi

LSF_JOB_ID=${LSB_JOBID:-default}
export NXF_LOG_FILE="logs/${LSF_JOB_ID}_smlv.log"
export SINGULARITY_TMPDIR="${SCRATCH}"

module purge
module load singularity/4.1.1 nextflow/24.04.2-with-plugins

mkdir -p logs

nextflow run "${WORKFLOW}"/main.nf \
    --project_code "${PROJECT_CODE}" \
    --labkey_project_name "main-programme/main-programme_v18_2023-12-21" \
    --gene_input "${WORKFLOW}"/input/gene_list.txt \
    --publish_all false \
    --outdir "results" \
    -work-dir "${SCRATCH}" \
    -profile hpc \
    -ansi-log false \
    -resume

Default parameters

The default --gene-input argument is a file which includes 3 genes (BRCA1, TNF, ENSG00000227518). The --samplesheet parameter is null and a LabKey query returns rare disease and cancer germline samples for both Illumina V2 (GRCh37) and Illumina V4 (GRCh38). Results are written to the results/ subdirectory (change with --outdir). All workflow processes are run with containers and images are cached in your default singularity cache, $HOME/.singularity/cache.

To change workflow parameters, edit your copy of the submission script. See Parameters for additional details on querying and setting workflow parameters. See Input files for input file formats.

Tutorial video

CloudOS

Bug

There is currently a bug in the CloudOS workflow configuration. The workflow succeeds, with all outputs written to the working directory, but they are not published to the main results page.

Open the CloudOS desktop application and log in with your credentials.

  1. create a gene query file and samplesheet
  2. select Advanced Analytics > Pipeline & Tools > My Workspace Tools
  3. type "small-variant" into the search field and select the gel-small-variant tile
  4. update front-end menu
    • Project select one of your project folders
    • Pipeline version select Branches: main
    • Nextflow Profile select cloudos
    • Parameter setup select custom
  5. in Parameters add parameters (minimum required: gene_input, samplesheet) and use menu to select inputs created in step 1
  6. in Analysis details select Resumable and click Run Analysis

Design

Figure 1 shows the connected processes of the workflow as a directed acyclic graph (DAG).

Figure 1. Small Variant DAG

  • Validate input arguments (VALIDATE_ARGS)
  • Get coordinates for query genes (FETCH_COORDS)
  • Get sample VCF list (FETCH_SAMPLES)
  • Chunk participant VCF list (CHUNK_VCFS)
  • Merge single-sample VCFs by chunk, selecting query genes regions (FIRST_ROUND_MERGE)
  • Merge and norm to multi-sample VCF (SECOND_ROUND_MERGE)
  • Compute and fill VCF INFO tags (FILL_TAGS)
  • VEP-annotate variants (VEP_ANNOTATE)
  • Combine +fill-tags computed INFO tags and VEP annotation (COMBINE_ANNOTATIONS)
  • Write containerised tool version software_versions.yml (DUMP_VERSIONS)
FILL_TAGS INFO tags

Computed using bcftools plugin +fill-tags:

  • AN Total number of alleles in called genotypes
  • AC Allele count in genotypes
  • AC_Hom Allele counts in homozygous genotypes
  • AC_Het Allele counts in heterozygous genotypes
  • AC_Hemi Allele counts in hemizygous genotypes
  • NS Number of samples with data
  • MAF Minor Allele frequency
VEP_ANNOTATE annotations

VEP annotations and custom ClinVar and plugin LOFTEE fields added (See Running VEP for a detailed description of options and annotations):

  • sift b predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids (both prediction term and score)
  • ccds adds the Consensus Coding Sequence (CCDS) transcript identifer (where available) to the output
  • uniprot adds best match accessions for translated protein products from three UniProt-related databases (SWISSPROT, TREMBL and UniParc)
  • hgvs adds Human Genome Variation Society (HGVS) nomenclature based on Ensembl stable identifiers. Both coding and protein sequence names are added where appropriate
  • symbol adds the gene symbol (e.g. HGNC) (where available)
  • numbers adds affected exon and intron numbering
  • domains adds names of overlapping protein domains
  • regulatory looks for overlaps with regulatory regions
  • canonical adds a flag indicating if the transcript is the canonical transcript for the gene
  • protein adds the Ensembl protein identifier to the output where appropriate
  • biotype adds the biotype of the transcript or regulatory feature
  • tsl adds the transcript support level for this transcript
  • appris adds the APPRIS isoform annotation for this transcript
  • gene_phenotype indicates if the overlapped gene is associated with a phenotype, disease or trait
  • af adds the global allele frequency (AF) from 1000 Genomes Phase 3 data for any known co-located variant
  • af_1kg adds allele frequency from continental populations (AFR,AMR,EAS,EUR,SAS) of 1000 Genomes Phase 3
  • af_esp includes allele frequency from NHLBI-ESP populations
  • af_gnomad includes allele frequency from Genome Aggregation Database (gnomAD) exome populations
  • max_af reports the highest allele frequency observed in any population from 1000 genomes, ESP or gnomAD
  • pubmed reports Pubmed IDs for publications that cite existing variant
  • variant_class outputs the Sequence Ontology variant class
  • mane adds a flag indicating if the transcript is the MANE Select or MANE Plus Clinical transcript for the gene

VEP --custom ClinVar fields: CLNDN, CLNDNINCL, CLNDISDB, CLNDISDBINCL, CLNHGVS, CLNREVSTAT, CLNSIG, CLNSIGCONF, CLNSIGINCL, CLNVC, CLNVCSO, CLNVI

VEP --plugin LOFTEE (Loss-Of-Function Transcript Effect Estimator) fields assess low- and high-confidence, stop-gained, splice site disrupting, and frameshift variants

Nextflow documentation

For Nextflow command line documentation use nextflow help, and nextflow <command> -help for help on a particular command. To print the workflow configuration use nextflow config -profile <profile> - configurable parameters have a params. prefix. See Parameters for use on the command line. See also Nextflow documentation and Nextflow training.