Structural Variant workflow¶
The Structural Variant workflow allows you to find all structural variants within a set of query genes or genomic regions.

The workflow (previously the SV/CNV workflow) extracts canonical structural variants (insertions, deletions, duplications, inversions, translocations) and copy number variants within query regions defined by a list of genes or coordinates, from Illumina Structural Variant VCF files for germline and/or somatic cohorts. The main output is a list of observed Manta-called SVs and Canvas-called CNVs within the query regions. CNVs on sex chromosomes depend on sex chromosome karyotype and should be interpreted with caution.

The workflow is written in Nextflow DSL2 and uses containerised R (with the Rlabkey API), Python 3, bcftools, and bedtools.
Note
In the RE (HPC), if no participant VCF list is provided, the workflow performs a LabKey query to retrieve a list of standard Illumina VCF file paths for all germline rare disease and cancer participants. The default selection drops Dragen-called genomes in order to obtain the largest sample called by a single tool. You can use alternative VCF files (e.g. Platypus, Dragen) by supplying a VCF list. In CloudRE (CloudOS), a participant VCF list is required.
Design¶
Figure. Relationship between workflow processes in Structural Variant 3.0
Processes¶
- check the validity of parameter arguments (`VALIDATE_ARGS`)
- get coordinates and region information (`DEFINE_REGION`)
- get a list of VCF files for each cohort (`DEFINE_COHORT`)
- chunk VCF lists (`SPLIT`)
- query chunked VCFs for Manta-called structural variants (`QUERY_SV`)
- query chunked VCFs for Canvas-called copy-number variants (`QUERY_CNV`)
- merge chunked variant outputs (`MERGE_CHUNKS`)
- add variant query region overlap information (`ADD_COLUMNS`)
- log containerised tool versions (`DUMP_VERSIONS`)
Usage¶
RE (HPC)¶
Warning
Before running the workflow in the RE (HPC), you must configure your R LabKey API access.
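How API access is configured is environment-specific, but the `Rlabkey` client used by `define_cohort.R` (shown below) can authenticate via a netrc file in your home directory. A minimal, hypothetical sketch, assuming the LabKey host used in that script and placeholder credentials:

```bash
# Hypothetical sketch only: store LabKey credentials in ~/.netrc for Rlabkey.
# Replace the login and password placeholders with your own LabKey credentials;
# consult the RE documentation for the exact setup required in your environment.
cat > ~/.netrc <<'EOF'
machine labkey-embassy.gel.zone
login your.username@example.com
password your_labkey_password
EOF

# netrc files should be readable by you only
chmod 600 ~/.netrc
```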
Using the submission script¶
Navigate to your GeCIP or Discovery Forum folder, create a directory for your analysis, e.g. `my_analysis`, and copy the submission script into it.

Copy the submission script only, not the entire workflow directory.

The submission script simply loads the required Nextflow and Singularity modules and uses the Nextflow command line tool to run the workflow (see Nextflow documentation).
- replace `<project_code>` with your LSF Project Code
- replace `<path_to_input_file>` with a path to an input file: either a one-gene-per-line text file for `--input_type 'gene'`, or a BED file of regions for `--input_type 'region'`. Create this input file in your working directory, e.g. `my_analysis/`. See the example input files here (an illustrative sketch of both formats follows this list):
    - `/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt` for `--input_type 'gene'`, or
    - `/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/regions.bed` for `--input_type 'region'`
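As an illustration of the two input formats (file contents and names here are hypothetical, not taken from the reference files above):

```bash
# Illustrative only: a one-gene-per-line file for --input_type 'gene';
# HGNC symbols and Ensembl gene IDs can be mixed.
cat > my_analysis/genes.txt <<'EOF'
BRCA2
ENSG00000139618
TP53
EOF

# Illustrative only: a headerless, four-column, tab-delimited BED file
# (chr, start, stop, name) for --input_type 'region'. Coordinates and
# chromosome naming are placeholders.
printf 'chr13\t32315000\t32400000\tregion_1\n' > my_analysis/regions.bed
```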
```bash
#!/bin/bash
#BSUB -P <project_code>
#BSUB -q inter
#BSUB -J structural_variant
#BSUB -o structural_variant_%J.stdout
#BSUB -e structural_variant_%J.stderr

module purge
module load tools/singularity/3.8.3 bio/nextflow/22.10.5 lang/Java/17.0.2

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

nextflow run "${structural_variant}"/main.nf \
    --input_file '<path_to_input_file>' \
    --input_type 'gene' \
    --sample_type 'all' \
    --genome_build 'both' \
    -profile cluster \
    -resume
```
- run the workflow on the default LabKey query sample

In the RE (HPC), if no sample file is supplied, the default sample is derived from a LabKey query based on the arguments to `--sample_type`, `--genome_build`, and `--dragen`. The default LabKey query draws data from three tables: `genome_file_paths_and_types`, `rare_disease_analysis`, and `cancer_analysis`. For reference, the full script used to define the participant list based on `--sample_type`, `--genome_build`, and `--dragen` (defaults: `'all'`, `'both'`, `true`) is included below.
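To submit the edited script to the cluster, pass it to `bsub` so that the `#BSUB` directives are read; for example, assuming you saved it as `submit_structural_variant.sh` (the file name is arbitrary):

```bash
# Submit the job to LSF and check its status.
cd my_analysis
bsub < submit_structural_variant.sh
bjobs
```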
`define_cohort.R`:

```r
#!/usr/bin/env -S Rscript --vanilla
suppressPackageStartupMessages({
library(argparser)
library(tidyverse)
library(Rlabkey)
})
labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")
p <- arg_parser("Define region.")
p <- argparser::add_argument(p, "--data-version",
help = "GEL data release path."
)
p <- add_argument(p, "--sample-type", help = "Sample type.")
p <- add_argument(p, "--sample-file", help = "Sample file.")
p <- add_argument(p, "--dragen", help = "Dragen.")
p <- add_argument(p, "--excluded-participant-ids",
help = "File with excluded participants."
)
argv <- parse_args(p)
sample_type <- stringr::str_replace(argv$sample_type, "-", "_")
# provided sample
if (!nchar(argv$sample_file) == 0) {
sample_file <- read_tsv(argv$sample_file, show_col_types = FALSE)
sample_file_list <- split(sample_file, sample_file[["cohort_type"]])
map(
names(sample_file_list),
~ write_tsv(
sample_file_list[[.x]],
str_interp("${sample_type}_${.x}_cohort.tsv"),
col_names = TRUE
)
)
quit(save = "no")
}
main_programme <- file.path("/main-programme", argv$data_version)
excluded_participant_ids <- readr::read_lines(argv$excluded_participant_ids)
dragen <- as.logical(argv$dragen)
# cancer
ca_query <- stringr::str_interp("
SELECT DISTINCT
participant_id,
tumour_sample_platekey,
germline_sample_platekey
FROM cancer_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
ca_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = ca_query
)
ca_cohort <- ca_df %>%
tidyr::gather(
key = "type",
value = "plate_key",
tumour_sample_platekey,
germline_sample_platekey,
-participant_id
) %>%
dplyr::mutate(
programme = "cancer",
type = case_when(
type == "tumour_sample_platekey" ~ "somatic",
type == "germline_sample_platekey" ~ "germline"
)
) %>%
dplyr::select(participant_id, programme, type, platekey = plate_key)
# rare disease
rd_query <- stringr::str_interp("
SELECT DISTINCT participant_id, plate_key
FROM rare_disease_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
rd_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = rd_query
)
rd_cohort <- rd_df %>%
tibble::add_column(
programme = "rare_disease",
type = "germline"
) %>%
select(participant_id, programme, type, platekey = plate_key)
# combine cancer and rare disease cohorts
cohort_pre_filt <- dplyr::bind_rows(rd_cohort, ca_cohort) %>%
dplyr::filter(!is.na(platekey) & platekey != "N/A") %>%
dplyr::distinct()
# vcf file paths
base_query <- stringr::str_interp("
SELECT DISTINCT
participant_id, platekey, type, genome_build, filename, file_path,
file_sub_type, file_type, delivery_date
FROM genome_file_paths_and_types
WHERE participant_id NOT IN (${excluded_participant_ids})
AND file_sub_type = 'Structural VCF'
") %>% stringr::str_squish()
data_release_version <- stringr::str_extract(main_programme, "v[0-9]+") %>%
stringr::str_extract(., "[0-9]+") %>%
as.numeric()
if (data_release_version < 10) {
structural_vcf_query <- base_query
} else if (!dragen) {
structural_vcf_query <- stringr::str_c(
base_query, " AND delivery_version IN ('V2', 'V4')"
)
} else {
structural_vcf_query <- stringr::str_c(
base_query, " AND delivery_version IN ('Dragen_Pipeline2.0')"
)
}
structural_vcf_paths <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = structural_vcf_query
)
if (dragen) {
aux_sv <- structural_vcf_paths %>%
dplyr::filter(grepl("SV", file_path)) %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
aux_cnv <- structural_vcf_paths %>%
dplyr::filter(grepl("cnv|CNV", file_path)) %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
structural_vcf_paths <-
dplyr::full_join(aux_sv, aux_cnv) %>%
dplyr::arrange(platekey)
svcnv_and_tiering_paths <- structural_vcf_paths %>%
dplyr::rename(
programme_type = type,
sv_vcf_filepath = file_path
)
} else if (!dragen) {
structural_vcf_paths <- structural_vcf_paths %>%
dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)
sub_sv_paths <- structural_vcf_paths %>%
dplyr::rename(sv_vcf_filepath = file_path) %>%
dplyr::select(-filename, -file_sub_type, -file_type) %>%
tibble::add_column(disease_type = NA, disease_sub_type = NA) %>%
dplyr::select(
participant_id, platekey, type, disease_type, disease_sub_type,
genome_build, sv_vcf_filepath
) %>%
dplyr::filter(
(type == "rare disease germline") |
(type == "cancer germline")
)
tiering_svcnv_query <- stringr::str_interp("
SELECT DISTINCT
participant_id, tumour_sample_platekey, germline_sample_platekey,
tumour_sv_vcf, disease_type, disease_sub_type
FROM cancer_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()
tiering_svcnv_paths <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = tiering_svcnv_query
)
# genomes in cancer_analysis table are GRCh38
sub_tiering_svcnv_paths <- tiering_svcnv_paths %>%
dplyr::select(-germline_sample_platekey) %>%
dplyr::rename(
sv_vcf_filepath = tumour_sv_vcf,
platekey = tumour_sample_platekey
) %>%
tibble::add_column(
genome_build = "GRCh38",
type = "cancer somatic"
) %>%
dplyr::select(
participant_id, platekey, type, disease_type, disease_sub_type,
genome_build, sv_vcf_filepath
)
svcnv_and_tiering_paths <-
dplyr::bind_rows(sub_sv_paths, sub_tiering_svcnv_paths) %>%
dplyr::rename(programme_type = type)
}
# bind to input and clean
input_paths_merge_filt <-
dplyr::left_join(
cohort_pre_filt,
svcnv_and_tiering_paths,
by = "platekey",
suffix = c("", "_drop")
) %>%
dplyr::select(-contains("_drop")) %>%
dplyr::filter(platekey != "N/A") %>%
tidyr::unite(
programme, type,
sep = " ", col = "pt_check", remove = FALSE
) %>%
dplyr::mutate(pt_check = stringr::str_replace(pt_check, "_", " ")) %>%
dplyr::filter(pt_check == programme_type) %>%
dplyr::select(-pt_check) %>%
dplyr::mutate(
cohort_type = dplyr::case_when(
programme_type == "cancer germline" ~ "germline",
programme_type == "cancer somatic" ~ "somatic",
programme_type == "rare disease germline" ~ "germline"
)
)
if (argv$sample_type %in% c("germline", "ca-germline", "rd-germline")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "germline" ~ type == argv$sample_type,
argv$sample_type == "rd-germline" ~
programme_type == "rare disease germline",
argv$sample_type == "ca-germline" ~ programme_type == "cancer germline"
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_cohort.tsv"),
col_names = TRUE
)
}
if (argv$sample_type %in% c("somatic", "ca-somatic")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "somatic" ~ type == argv$sample_type,
argv$sample_type == "ca-somatic" ~ programme_type == "cancer somatic"
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_cohort.tsv"),
col_names = TRUE
)
}
if (argv$sample_type %in% c("all", "ca-all")) {
for (cohort in c("germline", "somatic")) {
input_paths_merge_filt %>%
dplyr::filter(
dplyr::case_when(
argv$sample_type == "all" ~ type == cohort,
argv$sample_type == "ca-all" ~
(programme == "cancer" & type == cohort)
)
) %>%
dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
tidyr::drop_na() %>%
readr::write_tsv(
stringr::str_interp("${sample_type}_${cohort}_cohort.tsv"),
col_names = TRUE
)
}
}
```
CloudRE (CloudOS)¶
Open the CloudOS application from the CloudRE WorkSpace desktop and log in with your credentials.
- select the Advanced Analytics icon from the side menu, Pipeline & Tools, and the WORKSPACE TOOLS tab
- type "structural-variant" into the search field and select the gel-structural-variant tile
From the next screen, use the front-end menu to select inputs and generate the Nextflow CLI command.
Warning
In CloudRE (CloudOS), you must supply a VCF list (see Input section below).
- give the analysis a name
- select Branch `main`
- select Nextflow Profile `cloud`
- under Data & Parameters, type/select the required inputs (note: do not put quotes (`'`) around arguments; CloudOS inserts these automatically, and will add `\` to any quotes you use)
In CloudRE (CloudOS), you are required to provide a tab-delimited sample file. For reference, see the RE (HPC) query above.
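The exact sample file format is defined by the workflow, but based on the columns handled by `define_cohort.R` above (`cohort_type`, `genome_build`, `sv_vcf_filepath`), a minimal sketch might look like this (the VCF paths are placeholders):

```bash
# Illustrative sketch only: a tab-delimited sample file with a header line.
# Column names follow those written by define_cohort.R; the paths are placeholders.
printf 'cohort_type\tgenome_build\tsv_vcf_filepath\n'                >  my_samples.tsv
printf 'germline\tGRCh38\t/path/to/participant1.SV.vcf.gz\n'         >> my_samples.tsv
printf 'somatic\tGRCh38\t/path/to/participant2.somatic.SV.vcf.gz\n'  >> my_samples.tsv
```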
Input¶
Below are the main input parameters you need to edit to define your query. See Parameters for the full list of pipeline parameters and their use on the command line.

`--input_type` `'gene'|'region'`

- Retrieve structural variants within regions defined by `'gene'`(s) or `'region'`(s). Default `'gene'`.

`--input_file`

- For `input_type = 'gene'`, a path to a one-per-line text file of query genes, which can include both HGNC symbols and Ensembl IDs. For `input_type = 'region'`, a path to a headerless four-column tab-delimited BED file (chr, start, stop, name) of query regions. Default `'/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt'`, for the default `input_type = 'gene'`.

`--sample_type` `'all'|'ca-all'|'germline'|'ca-germline'|'rd-germline'|'ca-somatic'`

- Select the structural VCF sample type. This parameter is used to define the LabKey query when no sample file is supplied. Default `'all'`.
    - `'all'`: a combination of germline and somatic, rare disease and cancer genomes
    - `'ca-all'`: a combination of germline and somatic, cancer genomes only
    - `'germline'`: germline only, rare disease and cancer genomes
    - `'ca-germline'`: germline only, cancer genomes only
    - `'rd-germline'`: germline only, rare disease genomes only
    - `'ca-somatic'`: somatic only, cancer genomes only

`--sample_file`

- Either `null`, or a path to a tab-delimited sample file with a header. Default `null`, which performs a LabKey query to collect a list of structural VCFs.

`--data_version`

- This parameter is used to define the LabKey query when no sample file is supplied. Default `'main-programme_v17_2023-03-30'`.

`--outdir`

- A path to an output directory. Default `'results'`, which writes all output to a directory called `results` in the current working directory.

`--genome_build` `'GRCh38'|'GRCh37'|'both'`

- This parameter is used to define the LabKey query when no sample file is supplied. Default `'both'`.

`--dragen`

- Boolean indicating whether query samples include Dragen-aligned VCFs. This parameter is used to define the LabKey query when no sample file is supplied. Default `false`.
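For example, to bypass the default LabKey query in the RE (HPC) and run against your own VCF list, a sketch combining the parameters above (the sample file format is discussed in the CloudRE section):

```bash
structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

# Illustrative sketch: run on a user-supplied sample file instead of the LabKey query.
nextflow run "${structural_variant}"/main.nf \
    --input_file 'my_analysis/genes.txt' \
    --input_type 'gene' \
    --sample_file 'my_samples.tsv' \
    --outdir 'results' \
    -profile cluster \
    -resume
```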
Parameters¶
Default workflow parameters defined in the configuration files, can be overridden on the command line. For example, to use coordinates to specify a region(s) of interest defined in a bed file, change input_type
to 'regionand supply a bed file to
input_file`.
nextflow run "${structural_variant}"/main.nf \
--input_file "${structural_variant}/input/regions.bed" \
--input_type 'region' \
--sample_type 'all' \
--genome_build 'both' \
-profile cluster \
-resume
Parameters and options

On the command line, prefix workflow parameters with `--`; prefix Nextflow options with `-`.

For a complete list of workflow parameters (configurable parameters have a `params.` prefix), run:
```bash
structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'
nextflow config -flat -profile cluster "${structural_variant}"/main.nf | grep params.
```
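As an alternative to long command lines, Nextflow can also read parameter overrides from a file via its `-params-file` option; a sketch, assuming the parameter names documented above:

```bash
# Illustrative sketch: collect parameter overrides in a YAML file and pass it to
# Nextflow with -params-file (parameter names as listed by the command above,
# without the params. prefix).
cat > my_params.yaml <<'EOF'
input_file: /path/to/my_analysis/regions.bed
input_type: region
sample_type: all
genome_build: both
EOF

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'
nextflow run "${structural_variant}"/main.nf \
    -params-file my_params.yaml \
    -profile cluster \
    -resume
```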
Output¶
| file | content |
|---|---|
| `SVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv` | Manta-called SVs observed in query regions |
| `CNVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv` | Canvas-called CNVs observed in query regions |
| `<build>_adjusted_input.tsv` | Query genes/regions with HGNC and Ensembl IDs, chromosome, coordinates, etc. |
| `inaccessible_vcf.tsv` | VCFs for all queries and genome builds that could not be accessed |
| `pipeline_info/` | Directory containing timestamped pipeline run information (`execution_report_*.html`, `execution_timeline_*.html`, `execution_trace_*.txt`, `pipeline_dag_*.html`) |
SV results (`SVmergedResult_*.tsv`): a tab-delimited text file with a header line and the following columns:

| column | description |
|---|---|
| SAMPLE | Participant platekey ID |
| CHROM | Chromosome |
| POS | Start position of the variant |
| ID | Variant ID |
| REF | Reference allele |
| ALT | Alternate allele |
| QUAL | Quality |
| FILTER | Filter applied to data |
| INFO/SVTYPE | Type of structural variant: duplication (DUP), deletion (DEL), inversion (INV), insertion (INS) or translocation (BND) |
| INFO/SVLEN | Variant size |
| INFO/END | End/continuation position of the variant |
| GT | Genotype (0/1 = heterozygote, 1/1 = homozygote) - germline |
| GQ | Genotype quality - germline |
| PR | Spanning paired-read support for the REF and ALT alleles, in the order listed |
| SR | Split-read support for the REF and ALT alleles, in the order listed, for reads where P(allele\|read) > 0.999 |
| OVERLAP | 'PARTIAL' or 'TOTAL' overlap of the query region by the variant |
| OVERLAP_CNT | Count of query region base pairs overlapped by the variant |
| OVERLAP_PCT | Percentage of the query region overlapped by the variant |
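As an illustration of working with this output, the merged SV table can be filtered with standard command-line tools; here, keeping PASS calls that totally overlap the query region (the input file name is a placeholder following the output naming pattern above):

```bash
# Illustrative only: keep the header plus PASS variants with TOTAL overlap of the
# query region, looking up column positions from the header line so the command
# does not depend on a fixed column order.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $col["FILTER"] == "PASS" && $col["OVERLAP"] == "TOTAL"
' results/SVmergedResult_chr13_BRCA2_ENSG00000139618_GRCh38_germline.tsv \
    > brca2_pass_total_sv.tsv
```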
CNV results (`CNVmergedResult_*.tsv`): a tab-delimited text file with a header line and the following columns (dependent on `dragen=true` or `false`, `cnv_pass_only=true` or `false`, and `germline` or `somatic`):

| column | description |
|---|---|
| SAMPLE | Participant platekey ID |
| CHROM | Chromosome |
| POS | Start position of the variant |
| ID | Variant ID |
| REF | Reference allele |
| ALT | Alternate allele |
| FILTER | Filter applied to data |
| INFO/END | End position of the variant |
| INFO/SVTYPE | CNV for variants with copy number != 2; LOH for variants with copy number = 2 and loss of heterozygosity |
| CN | Copy number |
| GT | Genotype - 'germline', 'germline' + dragen, 'somatic' + dragen |
| MCC | Major chromosome count (equal to copy number for LOH regions) - except 'germline' + dragen or 'germline' + 'GRCh37' |
| OVERLAP | 'PARTIAL' or 'TOTAL' overlap of the query region by the variant |
| OVERLAP_CNT | Count of query region base pairs overlapped by the variant |
| OVERLAP_PCT | Percentage of the query region overlapped by the variant |
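Similarly (again with a placeholder file name), the CNV table can be summarised by counting calls per sample and copy-number state:

```bash
# Illustrative only: count CNV calls per sample and copy number (CN), using the
# header line to locate the SAMPLE and CN columns.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    { counts[$col["SAMPLE"] "\tCN=" $col["CN"]]++ }
    END { for (k in counts) print k "\t" counts[k] }
' results/CNVmergedResult_chr13_BRCA2_ENSG00000139618_GRCh38_germline.tsv | sort
```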
Adjusted input (`<build>_adjusted_input.tsv`): a tab-delimited text file with a header line and the following columns (first four columns only for `input_type = 'region'`, where column four values are region identifiers):

| column | description |
|---|---|
| chr | Chromosome |
| start | Start position of gene region |
| end | End position of gene region |
| gene | HGNC gene symbol |
| ensemblID | Ensembl gene ID |
| internalID | Identifier generated internally, used as a prefix for output files: chr_gene_ensemblID |
| posTag | Position tag (chromosome_start_end) aiding the identifier deduplication step |
| originalInput | Specifies whether your input was a gene symbol ("gene") or an Ensembl ID ("ensembl") |
| entryDuplicate | NA, or a comma-separated string of internalIDs considered the same entry due to their identical genomic position |
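To quickly check how your input was resolved and whether any VCFs could not be accessed, the adjusted input and `inaccessible_vcf.tsv` files can be pretty-printed (paths assume the default `--outdir 'results'` and a GRCh38 run; adjust as needed):

```bash
# Illustrative only: tabulate the adjusted input and the inaccessible-VCF list.
column -t -s$'\t' results/GRCh38_adjusted_input.tsv | head
column -t -s$'\t' results/inaccessible_vcf.tsv | head
```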
Nextflow documentation

For Nextflow command-line documentation, use `nextflow help`, and `nextflow <command> -help` for help on a particular command. See the Nextflow documentation and Nextflow training.