Small Variant workflow
The Small Variant workflow finds variants within a list of genes. It outputs these into tsv files containing variants, variant consequences and the participants with alternative alleles at these loci.
Germline genomes only
The Small Variant workflow is intended for use on germline genomes only.
Given a list of participant VCFs and a list of query genes, the workflow aggregates the VCFs into a single multi-sample VCF file keeping only variants within query genes, and annotates the variants. The default annotations include functional consequences, allele frequencies, disease associations, and pubmed citations. All included annotations are listed in the Design section. You can use the workflow to answer common queries like “Which participants have rare, missense variants in my gene(s) of interest?”
The workflow is written in Nextflow DSL2 and uses containerised Python 3 (and the Python API for LabKey), bcftools, and VEP
Sample selection
On HPC, if no samplesheet is provided, the workflow performs a LabKey query to retrieve all germline rare disease and cancer participants, for both Illumina V2 (GRCh37) and Illumina V4 (GRCh38). This retrieves that largest sample called by a single tool. To use Dragen (GRCh38), set parameter --dragen true
(default false
). On CloudOS, a samplesheet is required. The minimum sample size for each delivery version (genome build) included is four participants.
Usage¶
HPC¶
- copy the LSF submission script
/gel_data_resources/workflows/rdp_small_variant/v3.1.2/submit.bsub
to yourre_gecip
ordiscovery_forum
folder - create your gene and (optional) samplesheet input files
- update the submission script:
- insert your LSF Project Code
- set parameters
--gene_input
and--samplesheet
(optional, if using a custom sample subset) to your input files
- run the workflow,
bsub < submit.bsub
submit.bsub
#!/bin/bash
#BSUB -P <project_code>
#BSUB -q inter
#BSUB -J smlv
#BSUB -o logs/%J_smlv.stdout
#BSUB -e logs/%J_smlv.stderr
PROJECT_CODE=''
SCRATCH="/re_scratch/${USER}"
WORKFLOW_PARENT_DIR='/gel_data_resources/workflows/rdp_small_variant'
WORKFLOW="${WORKFLOW_PARENT_DIR}/v3.1.2"
# Warning message to use latest version
LATEST_VERSION=$(find $WORKFLOW_PARENT_DIR -maxdepth 1 -type d -not -name '.*' -printf '%f\n' | sort | tail -n 1)
CURRENT_VERSION=$(basename $WORKFLOW)
if [[ "$LATEST_VERSION" != "$CURRENT_VERSION" ]]; then
echo "WARNING: Small Variant $LATEST_VERSION is available. You are using $CURRENT_VERSION."
echo "Please use $LATEST_VERSION and refer to the RE documentation for updated features."
fi
# Error message for missing SCRATCH
if [ ! -d "${SCRATCH}" ]; then
echo "ERROR: Directory ${SCRATCH} does not exist. Create a scratch folder in /re_scratch/re_gecip/<domain>/<user_folder>, or /re_scratch/discovery_forum/<domain>/<user_folder>, and update the line SCRATCH=\"/re_scratch/${USER}\" in your submission script to point to the new scratch folder. Please raise a Service Desk ticket if you have any trouble creating a scratch folder."
exit 1
fi
LSF_JOB_ID=${LSB_JOBID:-default}
export NXF_LOG_FILE="logs/${LSF_JOB_ID}_smlv.log"
export SINGULARITY_TMPDIR="${SCRATCH}"
module purge
module load singularity/4.1.1 nextflow/24.04.2-with-plugins
mkdir -p logs
nextflow run "${WORKFLOW}"/main.nf \
--project_code "${PROJECT_CODE}" \
--labkey_project_name "main-programme/main-programme_v18_2023-12-21" \
--gene_input "${WORKFLOW}"/input/gene_list.txt \
--publish_all false \
--outdir "results" \
-work-dir "${SCRATCH}" \
-profile hpc \
-ansi-log false \
-resume
Default parameters
The default --gene-input
argument is a file which includes 3 genes (BRCA1, TNF, ENSG00000227518). The --samplesheet
parameter is null
and a LabKey query returns rare disease and cancer germline samples for both Illumina V2 (GRCh37) and Illumina V4 (GRCh38). Results are written to the results/ subdirectory (change with --outdir
). All workflow processes are run with containers and images are cached in your default singularity cache, $HOME/.singularity/cache
.
To change workflow parameters, edit your copy of the submission script. See Parameters for additional details on querying and setting workflow parameters. See Input files for input file formats.
Tutorial video
CloudOS¶
Bug
There is currently a bug in the CloudOS workflow configuration. The workflow succeeds, with all outputs written to the working directory, but they are not published to the main results page.
Open the CloudOS desktop application and log in with your credentials.
- create a gene query file and samplesheet
- see Building cohorts with Cohort Browser in CloudOS)
- see Input files for input file formats
- select Advanced Analytics > Pipeline & Tools > My Workspace Tools
- type "small-variant" into the search field and select the gel-small-variant tile
- update front-end menu
- Project select one of your project folders
- Pipeline version select
Branches: main
- Nextflow Profile select
cloudos
- Parameter setup select
custom
- in Parameters add parameters (minimum required:
gene_input
,samplesheet
) and use menu to select inputs created in step 1 - in Analysis details select
Resumable
and click Run Analysis
Design¶
Figure 1 shows the connected processes of the workflow as a directed acyclic graph (DAG).
Figure 1. Small Variant DAG
- Validate input arguments (
VALIDATE_ARGS
) - Get coordinates for query genes (
FETCH_COORDS
) - Get sample VCF list (
FETCH_SAMPLES
) - Chunk participant VCF list (
CHUNK_VCFS
) - Merge single-sample VCFs by chunk, selecting query genes regions (
FIRST_ROUND_MERGE
) - Merge and norm to multi-sample VCF (
SECOND_ROUND_MERGE
) - Compute and fill VCF INFO tags (
FILL_TAGS
) - VEP-annotate variants (
VEP_ANNOTATE
) - Combine +fill-tags computed INFO tags and VEP annotation (
COMBINE_ANNOTATIONS
) - Write containerised tool version software_versions.yml (
DUMP_VERSIONS
)
FILL_TAGS
INFO tags
Computed using bcftools
plugin +fill-tags
:
AN
Total number of alleles in called genotypesAC
Allele count in genotypesAC_Hom
Allele counts in homozygous genotypesAC_Het
Allele counts in heterozygous genotypesAC_Hemi
Allele counts in hemizygous genotypesNS
Number of samples with dataMAF
Minor Allele frequency
VEP_ANNOTATE
annotations
VEP annotations and custom ClinVar and plugin LOFTEE fields added (See Running VEP for a detailed description of options and annotations):
sift b
predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids (both prediction term and score)ccds
adds the Consensus Coding Sequence (CCDS) transcript identifer (where available) to the outputuniprot
adds best match accessions for translated protein products from three UniProt-related databases (SWISSPROT, TREMBL and UniParc)hgvs
adds Human Genome Variation Society (HGVS) nomenclature based on Ensembl stable identifiers. Both coding and protein sequence names are added where appropriatesymbol
adds the gene symbol (e.g. HGNC) (where available)numbers
adds affected exon and intron numberingdomains
adds names of overlapping protein domainsregulatory
looks for overlaps with regulatory regionscanonical
adds a flag indicating if the transcript is the canonical transcript for the geneprotein
adds the Ensembl protein identifier to the output where appropriatebiotype
adds the biotype of the transcript or regulatory featuretsl
adds the transcript support level for this transcriptappris
adds the APPRIS isoform annotation for this transcriptgene_phenotype
indicates if the overlapped gene is associated with a phenotype, disease or traitaf
adds the global allele frequency (AF) from 1000 Genomes Phase 3 data for any known co-located variantaf_1kg
adds allele frequency from continental populations (AFR,AMR,EAS,EUR,SAS) of 1000 Genomes Phase 3af_esp
includes allele frequency from NHLBI-ESP populationsaf_gnomad
includes allele frequency from Genome Aggregation Database (gnomAD) exome populationsmax_af
reports the highest allele frequency observed in any population from 1000 genomes, ESP or gnomADpubmed
reports Pubmed IDs for publications that cite existing variantvariant_class
outputs the Sequence Ontology variant classmane
adds a flag indicating if the transcript is the MANE Select or MANE Plus Clinical transcript for the gene
VEP --custom
ClinVar fields: CLNDN, CLNDNINCL, CLNDISDB, CLNDISDBINCL, CLNHGVS, CLNREVSTAT, CLNSIG, CLNSIGCONF, CLNSIGINCL, CLNVC, CLNVCSO, CLNVI
VEP --plugin
LOFTEE (Loss-Of-Function Transcript Effect Estimator) fields assess low- and high-confidence, stop-gained, splice site disrupting, and frameshift variants
Nextflow documentation¶
For Nextflow command line documentation use nextflow help
, and nextflow <command> -help
for help on a particular command. To print the workflow configuration use nextflow config -profile <profile>
- configurable parameters have a params.
prefix. See Parameters for use on the command line. See also Nextflow documentation and Nextflow training.