Structural Variant workflow¶
The Structural Variant workflow allows you to find all structural variants within a set of query genes or genomic regions.

The workflow (previously the SV/CNV workflow) extracts canonical structural variants (insertions, deletions, duplications, inversions, translocations) and copy number variants within query regions defined by a list of genes or coordinates, from Illumina Structural Variant VCF files for germline and/or somatic cohorts. The main output is a list of observed Manta-called SVs and Canvas-called CNVs within the query regions. CNVs on sex chromosomes depend on sex chromosome karyotype and should be interpreted with caution.

The workflow is written in Nextflow DSL2 and uses containerised R (with the Rlabkey API), Python 3, bcftools, and bedtools.
Note
In the RE (HPC), if no participant VCF list is provided, the workflow performs a LabKey query to retrieve a list of standard Illumina VCF file paths for all germline rare disease and cancer participants. The default selection drops Dragen-called genomes in order to obtain the largest sample called by a single tool. You can use alternative VCF files (e.g. Platypus, Dragen) by supplying a VCF list. In CloudRE (CloudOS), a participant VCF list is required.
Design¶
Figure. Relationship between workflow processes in Structural Variant 3.0
Processes¶
- check the validity of parameter arguments (`VALIDATE_ARGS`)
- get coordinates and region information (`DEFINE_REGION`)
- get a list of VCF files for each cohort (`DEFINE_COHORT`)
- chunk VCF lists (`SPLIT`)
- query chunked VCFs for Manta-called structural variants (`QUERY_SV`)
- query chunked VCFs for Canvas-called copy-number variants (`QUERY_CNV`)
- merge chunked variant outputs (`MERGE_CHUNKS`)
- add variant query region overlap information (`ADD_COLUMNS`)
- log containerised tool versions (`DUMP_VERSIONS`)
Usage¶
RE (HPC)¶
Warning
Before running the workflow in the RE (HPC), you must configure your R LabKey API access.
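How API access is configured is environment-specific, but the `Rlabkey` client used by `define_cohort.R` (shown below) can authenticate via a netrc file in your home directory. A minimal, hypothetical sketch, assuming the LabKey host used in that script and placeholder credentials:

```bash
# Hypothetical sketch only: store LabKey credentials in ~/.netrc for Rlabkey.
# Replace the login and password placeholders with your own LabKey credentials;
# consult the RE documentation for the exact setup required in your environment.
cat > ~/.netrc <<'EOF'
machine labkey-embassy.gel.zone
login your.username@example.com
password your_labkey_password
EOF

# netrc files should be readable by you only
chmod 600 ~/.netrc
```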
Using the submission script¶
Navigate to your GeCIP or Discovery Forum folder, create a directory for your analysis, e.g. `my_analysis`, and copy the submission script into it.

Copy the submission script only, not the entire workflow directory.

The submission script simply loads the required Nextflow and Singularity modules and uses the Nextflow command line tool to run the workflow (see Nextflow documentation).
- replace `<project_code>` with your LSF Project Code
- replace `<path_to_input_file>` with a path to an input file: either a one-gene-per-line text file for `--input_type 'gene'`, or a BED file of regions for `--input_type 'region'`. Create this input file in your working directory, e.g. `my_analysis/`. See the example input files here (an illustrative sketch of both formats follows this list):
    - `/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt` for `--input_type 'gene'`, or
    - `/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/regions.bed` for `--input_type 'region'`
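As an illustration of the two input formats (file contents and names here are hypothetical, not taken from the reference files above):

```bash
# Illustrative only: a one-gene-per-line file for --input_type 'gene';
# HGNC symbols and Ensembl gene IDs can be mixed.
cat > my_analysis/genes.txt <<'EOF'
BRCA2
ENSG00000139618
TP53
EOF

# Illustrative only: a headerless, four-column, tab-delimited BED file
# (chr, start, stop, name) for --input_type 'region'. Coordinates and
# chromosome naming are placeholders.
printf 'chr13\t32315000\t32400000\tregion_1\n' > my_analysis/regions.bed
```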
```bash
#!/bin/bash
#BSUB -P <project_code>
#BSUB -q inter
#BSUB -J structural_variant
#BSUB -o structural_variant_%J.stdout
#BSUB -e structural_variant_%J.stderr

module purge
module load tools/singularity/3.8.3 bio/nextflow/22.10.5 lang/Java/17.0.2

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

nextflow run "${structural_variant}"/main.nf \
    --input_file '<path_to_input_file>' \
    --input_type 'gene' \
    --sample_type 'all' \
    --genome_build 'both' \
    -profile cluster \
    -resume
```
- run the workflow on the default LabKey query sample

In the RE (HPC), if no sample file is supplied, the default sample is derived from a LabKey query based on the arguments to `--sample_type`, `--genome_build`, and `--dragen`. The default LabKey query draws data from three tables: `genome_file_paths_and_types`, `rare_disease_analysis`, and `cancer_analysis`. For reference, the full script used to define the participant list based on `--sample_type`, `--genome_build`, and `--dragen` (defaults: `'all'`, `'both'`, `true`) is included below.
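To submit the edited script to the cluster, pass it to `bsub` so that the `#BSUB` directives are read; for example, assuming you saved it as `submit_structural_variant.sh` (the file name is arbitrary):

```bash
# Submit the job to LSF and check its status.
cd my_analysis
bsub < submit_structural_variant.sh
bjobs
```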
`define_cohort.R`:

```r
#!/usr/bin/env -S Rscript --vanilla
suppressPackageStartupMessages({
library(argparser)
library(tidyverse)
library(Rlabkey)
})
labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")
p <- arg_parser("Define region.")
p <- argparser::add_argument(p, "--data-version",
help = "GEL data release path."
)
p <- add_argument(p, "--sample-type", help = "Sample type.")
p <- add_argument(p, "--sample-file", help = "Sample file.")
p <- add_argument(p, "--dragen", help = "Dragen.")
p <- add_argument(p, "--excluded-participant-ids",
help = "File with excluded participants."
)
argv <- parse_args(p)
sample_type <- stringr::str_replace(argv$sample_type, "-", "_")
# provided sample
if (!nchar(argv$sample_file) == 0) {
sample_file <- read_tsv(argv$sample_file, show_col_types = FALSE)
sample_file_list <- split(sample_file, sample_file[["cohort_type"]])
map(
names(sample_file_list),
~ write_tsv(
sample_file_list[[.x]],
str_interp("${sample_type}_${.x}_cohort.tsv"),
col_names = TRUE
)
)
quit(save = "no")
}
main_programme <- file.path("/main-programme", argv$data_version)
excluded_participant_ids <- readr::read_lines(argv$excluded_participant_ids)
dragen <- as.logical(argv$dragen)
# cancer
ca_query <- stringr::str_interp("
SELECT DISTINCT
participant_id,
tumour_sample_platekey,
germline_sample_platekey
FROM cancer_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
ca_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = ca_query
)
ca_cohort <- ca_df %>%
tidyr::gather(
key = "type",
value = "plate_key",
tumour_sample_platekey,
germline_sample_platekey,
-participant_id
) %>%
dplyr::mutate(
programme = "cancer",
type = case_when(
type == "tumour_sample_platekey" ~ "somatic",
type == "germline_sample_platekey" ~ "germline"
)
) %>%
dplyr::select(participant_id, programme, type, platekey = plate_key)
# rare disease
rd_query <- stringr::str_interp("
SELECT DISTINCT participant_id, plate_key
FROM rare_disease_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
rd_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = rd_query
)
rd_cohort <- rd_df %>%
tibble::add_column(
programme = "rare_disease",
type = "germline"
) %>%
select(participant_id, programme, type, platekey = plate_key)
# combine cancer and rare disease cohorts
cohort_pre_filt <- dplyr::bind_rows(rd_cohort, ca_cohort) %>%
dplyr::filter(!is.na(platekey) & platekey != "N/A") %>%
dplyr::distinct()
# vcf file paths
base_query <- stringr::str_interp("
SELECT DISTINCT
participant_id, platekey, type, genome_build, filename, file_path,
file_sub_type, file_type, delivery_date
FROM genome_file_paths_and_types
WHERE participant_id NOT IN (${excluded_participant_ids})
AND file_sub_type = 'Structural VCF'
") %>% stringr::str_squish()
data_release_version <- stringr::str_extract(main_programme, "v[0-9]+") %>%
stringr::str_extract(., "[0-9]+") %>%
as.numeric()
if (data_release_version < 10) {
structural_vcf_query <- base_query
} else if (!dragen) {
structural_vcf_query <- stringr::str_c(
base_query, " AND delivery_version IN ('V2', 'V4')"
)
} else {
structural_vcf_query <- stringr::str_c(
base_query, " AND delivery_version IN ('Dragen_Pipeline2.0')"
)
}
structural_vcf_paths <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = structural_vcf_query
)
if (dragen) {
aux_sv <- structural_vcf_paths %>%
dplyr::filter(grepl("SV", file_path)) %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
aux_cnv <- structural_vcf_paths %>%
dplyr::filter(grepl("cnv|CNV", file_path)) %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
structural_vcf_paths <-
dplyr::full_join(aux_sv, aux_cnv) %>%
dplyr::arrange(platekey)
svcnv_and_tiering_paths <- structural_vcf_paths %>%
dplyr::rename(
programme_type = type,
sv_vcf_filepath = file_path
)
} else if (!dragen) {
structural_vcf_paths <- structural_vcf_paths %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
sub_sv_paths <- structural_vcf_paths %>%
dplyr::rename(sv_vcf_filepath = file_path) %>%
dplyr::select(-filename, -file_sub_type, -file_type) %>%
tibble::add_column(disease_type = NA, disease_sub_type = NA) %>%
dplyr::select(
participant_id, platekey, type, disease_type, disease_sub_type,
genome_build, sv_vcf_filepath
) %>%
dplyr::filter(
(type == "rare disease germline") |
(type == "cancer germline")
)
tiering_svcnv_query <- stringr::str_interp("
SELECT DISTINCT
participant_id, tumour_sample_platekey, germline_sample_platekey,
tumour_sv_vcf, disease_type, disease_sub_type
FROM cancer_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
tiering_svcnv_paths <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = tiering_svcnv_query
)
# genomes in cancer_analysis table are GRCh38
sub_tiering_svcnv_paths <- tiering_svcnv_paths %>%
dplyr::select(-germline_sample_platekey) %>%
dplyr::rename(
sv_vcf_filepath = tumour_sv_vcf,
platekey = tumour_sample_platekey
) %>%
tibble::add_column(
genome_build = "GRCh38",
type = "cancer somatic"
) %>%
dplyr::select(
participant_id, platekey, type, disease_type, disease_sub_type,
genome_build, sv_vcf_filepath
)
svcnv_and_tiering_paths <-
dplyr::bind_rows(sub_sv_paths, sub_tiering_svcnv_paths) %>%
dplyr::rename(programme_type = type)
}
# bind to input and clean
input_paths_merge_filt <-
dplyr::left_join(
cohort_pre_filt,
svcnv_and_tiering_paths,
by = "platekey",
suffix = c("", "_drop")
) %>%
dplyr::select(-contains("_drop")) %>%
dplyr::filter(platekey != "N/A") %>%
tidyr::unite(
programme, type,
sep = " ", col = "pt_check", remove = FALSE
) %>%
dplyr::mutate(pt_check = stringr::str_replace(pt_check, "_", " ")) %>%
dplyr::filter(pt_check == programme_type) %>%
dplyr::select(-pt_check) %>%
dplyr::mutate(
cohort_type = dplyr::case_when(
programme_type == "cancer germline" ~ "germline",
programme_type == "cancer somatic" ~ "somatic",
programme_type == "rare disease germline" ~ "germline"
)
)
if (argv$sample_type %in% c("germline", "ca-germline", "rd-germline")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "germline" ~ type == argv$sample_type,
argv$sample_type == "rd-germline" ~
programme_type == "rare disease germline",
argv$sample_type == "ca-germline" ~ programme_type == "cancer germline"
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_cohort.tsv"),
col_names = TRUE
)
}
if (argv$sample_type %in% c("somatic", "ca-somatic")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "somatic" ~ type == argv$sample_type,
argv$sample_type == "ca-somatic" ~ programme_type == "cancer somatic"
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_cohort.tsv"),
col_names = TRUE
)
}
if (argv$sample_type %in% c("all", "ca-all")) {
for (cohort in c("germline", "somatic")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "all" ~ type == cohort,
argv$sample_type == "ca-all" ~
(programme == "cancer" & type == cohort)
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_${cohort}_cohort.tsv"),
col_names = TRUE
)
}
}
```
CloudRE (CloudOS)¶
Open the CloudOS application from the CloudRE WorkSpace desktop and log in with your credentials.
- select the Advanced Analytics icon from the side menu, Pipeline & Tools, and the WORKSPACE TOOLS tab
- type "structural-variant" into the search field and select the gel-structural-variant tile
From the next screen, use the front-end menu to select inputs and generate the Nextflow CLI command.
Warning
In CloudRE (CloudOS), you must supply a VCF list (see Input section below).
- give the analysis a name
- select Branch `main`
- select Nextflow Profile `cloud`
- under Data & Parameters, type/select the required inputs (note: do not put quotes (`'`) around arguments; CloudOS inserts these automatically, and will add `\` to any quotes you use)
In CloudRE (CloudOS), you are required to provide a tab-delimited sample file. For reference, see the RE (HPC) query above.
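The exact sample file format is defined by the workflow, but based on the columns handled by `define_cohort.R` above (`cohort_type`, `genome_build`, `sv_vcf_filepath`), a minimal sketch might look like this (the VCF paths are placeholders):

```bash
# Illustrative sketch only: a tab-delimited sample file with a header line.
# Column names follow those written by define_cohort.R; the paths are placeholders.
printf 'cohort_type\tgenome_build\tsv_vcf_filepath\n'                >  my_samples.tsv
printf 'germline\tGRCh38\t/path/to/participant1.SV.vcf.gz\n'         >> my_samples.tsv
printf 'somatic\tGRCh38\t/path/to/participant2.somatic.SV.vcf.gz\n'  >> my_samples.tsv
```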
Input¶
Below are the main input parameters you need to edit to define your query. See Parameters for the full list of pipeline parameters and their use on the command line.

`--input_type` `'gene'|'region'`

- Retrieve structural variants within regions defined by `'gene'`(s) or `'region'`(s). Default `'gene'`.

`--input_file`

- For `input_type = 'gene'`, a path to a one-per-line text file of query genes, which can include both HGNC symbols and Ensembl IDs. For `input_type = 'region'`, a path to a headerless four-column tab-delimited BED file (chr, start, stop, name) of query regions. Default `'/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt'`, for the default `input_type = 'gene'`.

`--sample_type` `'all'|'ca-all'|'germline'|'ca-germline'|'rd-germline'|'ca-somatic'`

- Select the structural VCF sample type. This parameter is used to define the LabKey query when no sample file is supplied. Default `'all'`.
    - `'all'`: a combination of germline and somatic, rare disease and cancer genomes
    - `'ca-all'`: a combination of germline and somatic, cancer genomes only
    - `'germline'`: germline only, rare disease and cancer genomes
    - `'ca-germline'`: germline only, cancer genomes only
    - `'rd-germline'`: germline only, rare disease genomes only
    - `'ca-somatic'`: somatic only, cancer genomes only

`--sample_file`

- Either `null`, or a path to a tab-delimited sample file with a header. Default `null`, which performs a LabKey query to collect a list of structural VCFs.

`--data_version`

- This parameter is used to define the LabKey query when no sample file is supplied. Default `'main-programme_v17_2023-03-30'`.

`--outdir`

- A path to an output directory. Default `'results'`, which writes all output to a directory called `results` in the current working directory.

`--genome_build` `'GRCh38'|'GRCh37'|'both'`

- This parameter is used to define the LabKey query when no sample file is supplied. Default `'both'`.

`--dragen`

- Boolean indicating whether query samples include Dragen-aligned VCFs. This parameter is used to define the LabKey query when no sample file is supplied. Default `false`.
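For example, to bypass the default LabKey query in the RE (HPC) and run against your own VCF list, a sketch combining the parameters above (the sample file format is discussed in the CloudRE section):

```bash
structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

# Illustrative sketch: run on a user-supplied sample file instead of the LabKey query.
nextflow run "${structural_variant}"/main.nf \
    --input_file 'my_analysis/genes.txt' \
    --input_type 'gene' \
    --sample_file 'my_samples.tsv' \
    --outdir 'results' \
    -profile cluster \
    -resume
```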
Parameters¶
Default workflow parameters defined in the configuration files, can be overridden on the command line. For example, to use coordinates to specify a region(s) of interest defined in a bed file, change input_type
to 'regionand supply a bed file to
input_file`.
nextflow run "${structural_variant}"/main.nf \
--input_file "${structural_variant}/input/regions.bed" \
--input_type 'region' \
--sample_type 'all' \
--genome_build 'both' \
-profile cluster \
-resume
Parameters and options

On the command line, prefix workflow parameters with `--`; prefix Nextflow options with `-`.

For a complete list of workflow parameters (configurable parameters have a `params.` prefix), run:
```bash
structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'
nextflow config -flat -profile cluster "${structural_variant}"/main.nf | grep params.
```
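As an alternative to long command lines, Nextflow can also read parameter overrides from a file via its `-params-file` option; a sketch, assuming the parameter names documented above:

```bash
# Illustrative sketch: collect parameter overrides in a YAML file and pass it to
# Nextflow with -params-file (parameter names as listed by the command above,
# without the params. prefix).
cat > my_params.yaml <<'EOF'
input_file: /path/to/my_analysis/regions.bed
input_type: region
sample_type: all
genome_build: both
EOF

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'
nextflow run "${structural_variant}"/main.nf \
    -params-file my_params.yaml \
    -profile cluster \
    -resume
```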
Output¶
| file | content |
|---|---|
| `SVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv` | Manta-called SVs observed in query regions |
| `CNVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv` | Canvas-called CNVs observed in query regions |
| `<build>_adjusted_input.tsv` | Query genes/regions with HGNC and Ensembl IDs, chromosome, coordinates, etc. |
| `inaccessible_vcf.tsv` | VCFs for all queries and genome builds that could not be accessed |
| `pipeline_info/` | Directory containing timestamped pipeline run information (`execution_report_*.html`, `execution_timeline_*.html`, `execution_trace_*.txt`, `pipeline_dag_*.html`) |
SV results (`SVmergedResult_*.tsv`): a tab-delimited text file with a header line and the following columns:

| column | description |
|---|---|
| SAMPLE | Participant platekey ID |
| CHROM | Chromosome |
| POS | Start position of the variant |
| ID | Variant ID |
| REF | Reference allele |
| ALT | Alternate allele |
| QUAL | Quality |
| FILTER | Filter applied to data |
| INFO/SVTYPE | Type of structural variant: duplication (DUP), deletion (DEL), inversion (INV), insertion (INS) or translocation (BND) |
| INFO/SVLEN | Variant size |
| INFO/END | End/continuation position of the variant |
| GT | Genotype (0/1 = heterozygote, 1/1 = homozygote) - germline |
| GQ | Genotype quality - germline |
| PR | Spanning paired-read support for the REF and ALT alleles, in the order listed |
| SR | Split-read support for the REF and ALT alleles, in the order listed, for reads where P(allele\|read) > 0.999 |
| OVERLAP | 'PARTIAL' or 'TOTAL' overlap of the query region by the variant |
| OVERLAP_CNT | Count of query region base pairs overlapped by the variant |
| OVERLAP_PCT | Percentage of the query region overlapped by the variant |
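As an illustration of working with this output, the merged SV table can be filtered with standard command-line tools; here, keeping PASS calls that totally overlap the query region (the input file name is a placeholder following the output naming pattern above):

```bash
# Illustrative only: keep the header plus PASS variants with TOTAL overlap of the
# query region, looking up column positions from the header line so the command
# does not depend on a fixed column order.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $col["FILTER"] == "PASS" && $col["OVERLAP"] == "TOTAL"
' results/SVmergedResult_chr13_BRCA2_ENSG00000139618_GRCh38_germline.tsv \
    > brca2_pass_total_sv.tsv
```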
CNV results (`CNVmergedResult_*.tsv`): a tab-delimited text file with a header line and the following columns (dependent on `dragen=true` or `false`, `cnv_pass_only=true` or `false`, and `germline` or `somatic`):

| column | description |
|---|---|
| SAMPLE | Participant platekey ID |
| CHROM | Chromosome |
| POS | Start position of the variant |
| ID | Variant ID |
| REF | Reference allele |
| ALT | Alternate allele |
| FILTER | Filter applied to data |
| INFO/END | End position of the variant |
| INFO/SVTYPE | CNV for variants with copy number != 2; LOH for variants with copy number = 2 and loss of heterozygosity |
| CN | Copy number |
| GT | Genotype - 'germline', 'germline' + dragen, 'somatic' + dragen |
| MCC | Major chromosome count (equal to copy number for LOH regions) - except 'germline' + dragen or 'germline' + 'GRCh37' |
| OVERLAP | 'PARTIAL' or 'TOTAL' overlap of the query region by the variant |
| OVERLAP_CNT | Count of query region base pairs overlapped by the variant |
| OVERLAP_PCT | Percentage of the query region overlapped by the variant |
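Similarly (again with a placeholder file name), the CNV table can be summarised by counting calls per sample and copy-number state:

```bash
# Illustrative only: count CNV calls per sample and copy number (CN), using the
# header line to locate the SAMPLE and CN columns.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { counts[$col["SAMPLE"] "\tCN=" $col["CN"]]++ }
    END { for (k in counts) print k "\t" counts[k] }
' results/CNVmergedResult_chr13_BRCA2_ENSG00000139618_GRCh38_germline.tsv | sort
```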
Adjusted input (`<build>_adjusted_input.tsv`): a tab-delimited text file with a header line and the following columns (first four columns only for `input_type = 'region'`, where column four values are region identifiers):

| column | description |
|---|---|
| chr | Chromosome |
| start | Start position of gene region |
| end | End position of gene region |
| gene | HGNC gene symbol |
| ensemblID | Ensembl gene ID |
| internalID | Identifier generated internally, used as a prefix for output files: chr_gene_ensemblID |
| posTag | Position tag (chromosome_start_end) aiding the identifier deduplication step |
| originalInput | Specifies whether your input was a gene symbol ("gene") or an Ensembl ID ("ensembl") |
| entryDuplicate | NA, or a comma-separated string of internalIDs considered the same entry due to their identical genomic position |
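To quickly check how your input was resolved and whether any VCFs could not be accessed, the adjusted input and `inaccessible_vcf.tsv` files can be pretty-printed (paths assume the default `--outdir 'results'` and a GRCh38 run; adjust as needed):

```bash
# Illustrative only: tabulate the adjusted input and the inaccessible-VCF list.
column -t -s$'\t' results/GRCh38_adjusted_input.tsv | head
column -t -s$'\t' results/inaccessible_vcf.tsv | head
```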
Nextflow documentation

For Nextflow command-line documentation, use `nextflow help`, and `nextflow <command> -help` for help on a particular command. See the Nextflow documentation and Nextflow training.