Structural Variant workflow

The Structural Variant workflow allows you to find all structural variants in a list of genes or genomic regions.

The workflow (previously the SV/CNV workflow) extracts canonical structural variants (insertions, deletions, duplications, inversions, translocations) and copy number variants within query regions defined by a list of genes or coordinates, from Illumina Structural Variant VCF files for germline and/or somatic cohorts. The main output is a list of observed Manta-called SVs and Canvas-called CNVs within query regions. CNVs on sex chromosomes depend on sex chromosome karyotype and should be interpreted with caution.

The workflow is written in Nextflow DSL2 and uses containerised R (and the RLabKey API), Python 3, bcftools, and bedtools.

Note

In the RE (HPC), if no participant VCF list is provided, the workflow performs a LabKey query to retrieve a list of standard Illumina VCF file paths for all germline rare disease and cancer participants. The default selection drops Dragen-called genomes, in order to obtain the largest set of genomes called by a single tool. You can use alternative VCF files (e.g. Platypus, Dragen) by supplying a VCF list. In CloudRE (CloudOS), a participant VCF list is required.

Design

Figure. Relationship between workflow processes in Structural Variant 3.0

Processes

  • check the validity of parameter arguments (VALIDATE_ARGS)
  • get coordinates and region information (DEFINE_REGION)
  • get a list of VCF files for each cohort (DEFINE_COHORT)
  • chunk VCF lists (SPLIT)
  • query chunked VCFs for Manta-called structural variants (QUERY_SV)
  • query chunked VCFs for Canvas-called copy-number variants (QUERY_CNV)
  • merge chunked variant outputs (MERGE_CHUNKS)
  • add variant query region overlap information (ADD_COLUMNS)
  • log containerised tool versions (DUMP_VERSIONS)

Usage

RE (HPC)

Warning

Before running the workflow in the RE (HPC), you must configure your R LabKey API access.

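One common way to configure this access is with a netrc file in your home directory, which the Rlabkey package reads when it connects to the LabKey server. The sketch below is illustrative only; it assumes the LabKey host set as the baseUrl in define_cohort.R further down this page, and the placeholder credentials are yours to replace. Follow the RE documentation for the authoritative setup steps.

# Illustrative sketch: Rlabkey typically reads credentials from ~/.netrc.
# The machine entry matches the baseUrl used in define_cohort.R below;
# replace the placeholders with your own LabKey username and password.
cat > ~/.netrc <<'EOF'
machine labkey-embassy.gel.zone
login <your_labkey_username>
password <your_labkey_password>
EOF

# keep the credentials file private
chmod 600 ~/.netrc
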
Using the submission script

Navigate to your gecip or discovery forum folder, create a directory for your analysis, e.g. my_analysis, and copy the submission script:

mkdir my_analysis
cd my_analysis

Copy the submission script only, not the entire directory.

cp /pgen_int_data_resources/workflows/rdp_structural_variant/main/submit.sh .

The submission script simply loads the required Nextflow and Singularity modules and uses the Nextflow command line tool to run the workflow (see Nextflow documentation).

  1. replace <project_code> with your LSF Project Code.
  2. replace <path_to_input_file> with the path to an input file: either a one-gene-per-line text file for --input_type 'gene', or a bed file of regions for --input_type 'region'. Create this input file in your working directory, e.g. my_analysis/. See the example input files below (a minimal sketch of both formats also follows these steps):

    • /pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt for --input_type 'gene', or
    • /pgen_int_data_resources/workflows/rdp_structural_variant/main/input/regions.bed for --input_type 'region'
    #!/bin/bash
    
    #BSUB -P <project_code>
    #BSUB -q inter
    #BSUB -J structural_variant
    #BSUB -o structural_variant_%J.stdout
    #BSUB -e structural_variant_%J.stderr
    
    module purge
    module load tools/singularity/3.8.3 bio/nextflow/22.10.5 lang/Java/17.0.2
    
    structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'
    
    nextflow run "${structural_variant}"/main.nf \
        --input_file '<path_to_input_file>' \
        --input_type 'gene' \
        --sample_type 'all' \
        --genome_build 'both' \
        -profile cluster \
        -resume
    
  3. run the workflow on the default LabKey query sample.

    bsub < submit.sh
    
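If you prefer to write your own input file rather than copying the examples, both formats are simple. The snippet below is an illustrative sketch only; the gene symbol, Ensembl ID, and coordinates are placeholders, not values taken from the example files.

# 'gene' input: one HGNC symbol or Ensembl gene ID per line
cat > genes.txt <<'EOF'
BRCA1
ENSG00000141510
EOF

# 'region' input: headless four-column tab-delimited bed file (chr, start, stop, name)
printf 'chr17\t43000000\t43200000\tmy_region_1\n' > regions.bed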

In the RE (HPC), the default sample (if no sample file is supplied) is derived from a LabKey query based on the arguments to --sample_type, --genome_build, and --dragen. The default LabKey query draws data from three tables: genome_file_paths_and_types, rare_disease_analysis, and cancer_analysis. For reference, the full script used to define the participant list based on --sample_type, --genome_build, and --dragen (defaults: 'all', 'both', false) is included below.

define_cohort.R
#!/usr/bin/env -S Rscript --vanilla

suppressPackageStartupMessages({
library(argparser)
library(tidyverse)
library(Rlabkey)
})

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

p <- arg_parser("Define region.")
p <- argparser::add_argument(p, "--data-version",
help = "GEL data release path."
)
p <- add_argument(p, "--sample-type", help = "Sample type.")
p <- add_argument(p, "--sample-file", help = "Sample file.")
p <- add_argument(p, "--dragen", help = "Dragen.")
p <- add_argument(p, "--excluded-participant-ids",
help = "File with excluded participants."
)
argv <- parse_args(p)

sample_type <- stringr::str_replace(argv$sample_type, "-", "_")

# provided sample
if (!nchar(argv$sample_file) == 0) {
sample_file <- read_tsv(argv$sample_file, show_col_types = FALSE)
sample_file_list <- split(sample_file, sample_file[["cohort_type"]])

map(
    names(sample_file_list),
    ~ write_tsv(
    sample_file_list[[.x]],
    str_interp("${sample_type}_${.x}_cohort.tsv"),
    col_names = TRUE
    )
)

quit(save = "no")
}


main_programme <- file.path("/main-programme", argv$data_version)
excluded_participant_ids <- readr::read_lines(argv$excluded_participant_ids)
dragen <- as.logical(argv$dragen)

# cancer
ca_query <- stringr::str_interp("
SELECT DISTINCT
    participant_id,
    tumour_sample_platekey,
    germline_sample_platekey
FROM cancer_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()

ca_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = ca_query
)

ca_cohort <- ca_df %>%
tidyr::gather(
    key = "type",
    value = "plate_key",
    tumour_sample_platekey,
    germline_sample_platekey,
    -participant_id
) %>%
dplyr::mutate(
    programme = "cancer",
    type = case_when(
    type == "tumour_sample_platekey" ~ "somatic",
    type == "germline_sample_platekey" ~ "germline"
    )
) %>%
dplyr::select(participant_id, programme, type, platekey = plate_key)

# rare disease
rd_query <- stringr::str_interp("
SELECT DISTINCT participant_id, plate_key
FROM rare_disease_analysis
WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
stringr::str_squish()

rd_df <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = rd_query
)

rd_cohort <- rd_df %>%
tibble::add_column(
    programme = "rare_disease",
    type = "germline"
) %>%
select(participant_id, programme, type, platekey = plate_key)

# combine cancer and rare disease cohorts
cohort_pre_filt <- dplyr::bind_rows(rd_cohort, ca_cohort) %>%
dplyr::filter(!is.na(platekey) & platekey != "N/A") %>%
dplyr::distinct()

# vcf file paths
base_query <- stringr::str_interp("
SELECT DISTINCT
    participant_id, platekey, type, genome_build, filename, file_path,
    file_sub_type, file_type, delivery_date
FROM genome_file_paths_and_types
WHERE participant_id NOT IN (${excluded_participant_ids})
AND file_sub_type = 'Structural VCF'
") %>% stringr::str_squish()

data_release_version <- stringr::str_extract(main_programme, "v[0-9]+") %>%
stringr::str_extract(., "[0-9]+") %>%
as.numeric()

if (data_release_version < 10) {
structural_vcf_query <- base_query
} else if (!dragen) {
structural_vcf_query <- stringr::str_c(
    base_query, " AND delivery_version IN ('V2', 'V4')"
)
} else {
structural_vcf_query <- stringr::str_c(
    base_query, " AND delivery_version IN ('Dragen_Pipeline2.0')"
)
}

structural_vcf_paths <- labkey.executeSql(
folderPath = main_programme,
schemaName = "lists",
colNameOpt = "rname",
maxRows = 10000000,
sql = structural_vcf_query
)

if (dragen) {
aux_sv <- structural_vcf_paths %>%
    dplyr::filter(grepl("SV", file_path)) %>%
    dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
    dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)

aux_cnv <- structural_vcf_paths %>%
    dplyr::filter(grepl("cnv|CNV", file_path)) %>%
    dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
    dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)

structural_vcf_paths <-
    dplyr::full_join(aux_sv, aux_cnv) %>%
    dplyr::arrange(platekey)

svcnv_and_tiering_paths <- structural_vcf_paths %>%
    dplyr::rename(
    programme_type = type,
    sv_vcf_filepath = file_path
    )
} else if (!dragen) {
structural_vcf_paths <- structural_vcf_paths %>%
    dplyr::arrange(participant_id, dplyr::desc(delivery_date)) %>%
    dplyr::distinct(participant_id, platekey, genome_build, .keep_all = TRUE)

sub_sv_paths <- structural_vcf_paths %>%
    dplyr::rename(sv_vcf_filepath = file_path) %>%
    dplyr::select(-filename, -file_sub_type, -file_type) %>%
    tibble::add_column(disease_type = NA, disease_sub_type = NA) %>%
    dplyr::select(
    participant_id, platekey, type, disease_type, disease_sub_type,
    genome_build, sv_vcf_filepath
    ) %>%
    dplyr::filter(
    (type == "rare disease germline") |
        (type == "cancer germline")
    )

tiering_svcnv_query <- stringr::str_interp("
    SELECT DISTINCT
    participant_id, tumour_sample_platekey, germline_sample_platekey,
    tumour_sv_vcf, disease_type, disease_sub_type
    FROM cancer_analysis
    WHERE participant_id NOT IN (${excluded_participant_ids})
") %>%
    stringr::str_squish()

tiering_svcnv_paths <- labkey.executeSql(
    folderPath = main_programme,
    schemaName = "lists",
    colNameOpt = "rname",
    maxRows = 10000000,
    sql = tiering_svcnv_query
)

# genomes in cancer_analysis table are GRCh38
sub_tiering_svcnv_paths <- tiering_svcnv_paths %>%
    dplyr::select(-germline_sample_platekey) %>%
    dplyr::rename(
    sv_vcf_filepath = tumour_sv_vcf,
    platekey = tumour_sample_platekey
    ) %>%
    tibble::add_column(
    genome_build = "GRCh38",
    type = "cancer somatic"
    ) %>%
    dplyr::select(
    participant_id, platekey, type, disease_type, disease_sub_type,
    genome_build, sv_vcf_filepath
    )

svcnv_and_tiering_paths <-
    dplyr::bind_rows(sub_sv_paths, sub_tiering_svcnv_paths) %>%
    dplyr::rename(programme_type = type)
}

# bind to input and clean
input_paths_merge_filt <-
dplyr::left_join(
    cohort_pre_filt,
    svcnv_and_tiering_paths,
    by = "platekey",
    suffix = c("", "_drop")
) %>%
dplyr::select(-contains("_drop")) %>%
dplyr::filter(platekey != "N/A") %>%
tidyr::unite(
    programme, type,
    sep = " ", col = "pt_check", remove = FALSE
) %>%
dplyr::mutate(pt_check = stringr::str_replace(pt_check, "_", " ")) %>%
dplyr::filter(pt_check == programme_type) %>%
dplyr::select(-pt_check) %>%
dplyr::mutate(
    cohort_type = dplyr::case_when(
    programme_type == "cancer germline" ~ "germline",
    programme_type == "cancer somatic" ~ "somatic",
    programme_type == "rare disease germline" ~ "germline"
    )
)

if (argv$sample_type %in% c("germline", "ca-germline", "rd-germline")) {
input_paths_merge_filt %>%
    dplyr::filter(
    dplyr::case_when(
        argv$sample_type == "germline" ~ type == argv$sample_type,
        argv$sample_type == "rd-germline" ~
        programme_type == "rare disease germline",
        argv$sample_type == "ca-germline" ~ programme_type == "cancer germline"
    )
    ) %>%
    dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
    tidyr::drop_na() %>%
    readr::write_tsv(
    stringr::str_interp("${sample_type}_cohort.tsv"),
    col_names = TRUE
    )
}

if (argv$sample_type %in% c("somatic", "ca-somatic")) {
input_paths_merge_filt %>%
    dplyr::filter(
    dplyr::case_when(
        argv$sample_type == "somatic" ~ type == argv$sample_type,
        argv$sample_type == "ca-somatic" ~ programme_type == "cancer somatic"
    )
    ) %>%
    dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
    tidyr::drop_na() %>%
    readr::write_tsv(
    stringr::str_interp("${sample_type}_cohort.tsv"),
    col_names = TRUE
    )
}

if (argv$sample_type %in% c("all", "ca-all")) {
for (cohort in c("germline", "somatic")) {
    input_paths_merge_filt %>%
    dplyr::filter(
        dplyr::case_when(
        argv$sample_type == "all" ~ type == cohort,
        argv$sample_type == "ca-all" ~
            (programme == "cancer" & type == cohort)
        )
    ) %>%
    dplyr::select(cohort_type, genome_build, sv_vcf_filepath) %>%
    tidyr::drop_na() %>%
    readr::write_tsv(
        stringr::str_interp("${sample_type}_${cohort}_cohort.tsv"),
        col_names = TRUE
    )
}
}

CloudRE (CloudOS)

Open the CloudOS application from the CloudRE WorkSpace desktop and log in with your credentials.

  • select the Advanced Analytics icon from the side menu, Pipeline & Tools, and the WORKSPACE TOOLS tab
  • type "structural-variant" into the search field and select the gel-structural-variant tile

From the next screen, use the front-end menu to select inputs and generate the Nextflow CLI command.

Warning

In CloudRE (CloudOS), you must supply a VCF list (see Input section below).

  • give the analysis a name
  • select Branch main
  • select Nextflow Profile cloud
  • under Data & Parameters, type or select the required inputs (note: do not put quotes (') around arguments; CloudOS inserts these automatically and will escape any quotes you add with a backslash)

In CloudRE (CloudOS), you are required to provide a tab-delimited sample file. For reference, see the RE (HPC) query above.

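The exact sample file header is not reproduced here. As a rough guide, the default cohort files written by define_cohort.R (shown above) contain cohort_type, genome_build, and sv_vcf_filepath columns, so a sample file of the shape sketched below is a reasonable starting point; treat the column names and paths as an illustrative assumption rather than a specification.

# Illustrative sketch of a tab-delimited sample file; column names are inferred
# from the cohort tables written by define_cohort.R, and the paths are placeholders.
{
    printf 'cohort_type\tgenome_build\tsv_vcf_filepath\n'
    printf 'germline\tGRCh38\t/path/to/participant_1.SV.vcf.gz\n'
    printf 'somatic\tGRCh38\t/path/to/participant_2.somatic.SV.vcf.gz\n'
} > my_samples.tsv
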
Input

Below are the few input parameters you need to edit to define your query. See Parameters for the full list of pipeline parameters and their use on the command line.

--input_type 'gene'|'region'
Retrieve structural variants within regions defined by 'gene'(s) or 'region'(s). Default 'gene'.
--input_file
For input_type = 'gene', a path to a one-per-line text file of query genes which can include both HGNC symbols and Ensembl IDs. For input_type = 'region', a path to a headless four-column tab-delimited bed file (chr, start, stop, name) of query regions. Default '/pgen_int_data_resources/workflows/rdp_structural_variant/main/input/genes.txt', for default input_type = 'gene'.
--sample_type 'all'|'ca-all'|'germline'|'ca-germline'|'rd-germline'|'ca-somatic'
Select structural VCF sample type. This parameter is used to define the LabKey query when no sample file is supplied. Default 'all'.
'all': A combination of germline and somatic, rare disease and cancer genomes
'ca-all': A combination of germline and somatic, cancer genomes only
'germline': Germline only, rare disease and cancer genomes
'ca-germline': Germline only, cancer genomes only
'rd-germline': Germline only, rare disease genomes only
'ca-somatic': Somatic only, cancer genomes only
--sample_file
Either null, or a path to a tab-delimited sample file with a header line. Default null, which performs a LabKey query to collect a list of structural VCFs.
--data_version
This parameter is used to define the LabKey query when no sample file is supplied. Default 'main-programme_v17_2023-03-30'.
--outdir
A path to an output directory. Default 'results', writes all output to a directory called 'results' in the current working directory.
--genome_build 'GRCh38'|'GRCh37'|'both'
This parameter is used to define the LabKey query when no sample file is supplied. Default 'both'.
--dragen
Boolean indicating whether query samples include dragen-aligned VCFs. This parameter is used to define the LabKey query when no sample file is supplied. Default false.

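As an illustration of how these parameters combine on the RE (HPC) command line, the sketch below supplies a sample file (skipping the default LabKey query) together with a gene input file; it mirrors the submission script shown earlier and assumes a my_samples.tsv like the one sketched in the CloudRE section.

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

# illustrative only: query a gene list against your own sample file
nextflow run "${structural_variant}"/main.nf \
    --input_file 'genes.txt' \
    --input_type 'gene' \
    --sample_file 'my_samples.tsv' \
    --outdir 'results' \
    -profile cluster \
    -resume
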
Parameters

Default workflow parameters, defined in the configuration files, can be overridden on the command line. For example, to query regions of interest defined by coordinates in a bed file, change --input_type to 'region' and supply the bed file to --input_file:

nextflow run "${structural_variant}"/main.nf \
    --input_file "${structural_variant}/input/regions.bed" \
    --input_type 'region' \
    --sample_type 'all' \
    --genome_build 'both' \
    -profile cluster \
    -resume

Parameters and options

On the command line, prefix workflow parameters with --; prefix Nextflow options with -.

To see a complete list of workflow parameters (configurable parameters have a params. prefix), run:

structural_variant='/pgen_int_data_resources/workflows/rdp_structural_variant/main'

nextflow config -flat -profile cluster "${structural_variant}"/main.nf | grep params.

Output

file content
SVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv Manta-called SVs observed in query regions
CNVmergedResult_chr<N>_<gene>_<ensembl>_<build>_<cohort>.tsv Canvas-called CNVs observed in query regions
<build>_adjusted_input.tsv Query genes/regions with HGNC and Ensembl IDs, chromosome, coordinates, etc.
inaccessible_vcf.tsv VCFs for all queries and genome builds that could not be accessed
pipeline_info/ Directory containing timestamped pipeline run information (execution_report_*.html, execution_timeline_*.html, execution_trace_*.txt, pipeline_dag_*.html)

Each SVmergedResult_*.tsv output is a tab-delimited text file with a header line and the following columns:

column description
SAMPLE Participant platekey ID
CHROM Chromosome
POS Start position of the variant
ID Variant ID
REF Reference allele
ALT Alternate allele
QUAL Quality
FILTER Filter applied to data
INFO/SVTYPE Type of structural variant: duplication (DUP), deletion (DEL), inversion (INV), insertion (INS) or translocation (BND)
INFO/SVLEN Variant size
INFO/END End/Continuation position of the variant
GT Genotype (0/1 = heterozygote, 1/1 = homozygote) - germline
GQ Genotype quality - germline
PR Spanning paired-read support for the ref and alt alleles in the order listed
SR Split reads for the ref and alt alleles in the order listed, for reads where P(allele|read) > 0.999
OVERLAP 'PARTIAL' or 'TOTAL' overlap of the query region, by the variant
OVERLAP_CNT Count of query region base pairs overlapped by the variant
OVERLAP_PCT Percentage of query region overlapped by the variant

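As a quick, purely illustrative example of working with this output (not part of the workflow itself), the snippet below counts SVs per INFO/SVTYPE in one merged result file, locating the column by its header name; replace the file name with one of your own outputs.

# count Manta-called SVs per INFO/SVTYPE in a merged SV output file
sv_file='results/SVmergedResult_chr17_GENE_ENSG_GRCh38_germline.tsv'  # placeholder name

awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "INFO/SVTYPE") col = i; next }
    { count[$col]++ }
    END { for (type in count) print type, count[type] }
' "${sv_file}"
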
Each CNVmergedResult_*.tsv output is a tab-delimited text file with a header line and the following columns (the exact columns depend on dragen true or false, cnv_pass_only true or false, and germline or somatic cohort):

column description
SAMPLE Participant platekey ID
CHROM Chromosome
POS Start position of the variant
ID Variant ID
REF Reference allele
ALT Alternate allele
FILTER Filter applied to data
INFO/END End position of the variant
INFO/SVTYPE CNV for variants with copy number != 2; LOH for variants with copy number = 2 and loss of heterozygosity
CN Copy number
GT Genotype - 'germline', 'germline' + dragen, 'somatic' + dragen
MCC Major chromosome count (equal to copy number for LOH regions) - except 'germline' + dragen or 'germline' + 'GRCh37'
OVERLAP 'PARTIAL' or 'TOTAL' overlap of the query region, by the variant
OVERLAP_CNT Count of query region base pairs overlapped by the variant
OVERLAP_PCT Percentage of query region overlapped by the variant

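Similarly, as an illustrative sketch only, copy-number calls that totally overlap a query region can be pulled out of a merged CNV result file with a header-aware awk filter; the file name is a placeholder.

# print SAMPLE and CN for Canvas-called CNVs that totally overlap the query region
cnv_file='results/CNVmergedResult_chr17_GENE_ENSG_GRCh38_germline.tsv'  # placeholder name

awk -F'\t' '
    NR == 1 {
        for (i = 1; i <= NF; i++) { if ($i == "SAMPLE") s = i; if ($i == "CN") c = i; if ($i == "OVERLAP") o = i }
        next
    }
    $o == "TOTAL" { print $s, $c }
' "${cnv_file}"
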
The <build>_adjusted_input.tsv output is a tab-delimited text file with a header line and the following columns (only the first four columns for input_type = 'region', where the fourth column holds region identifiers):

column description
chr chromosome
start start position of gene region
end end position of gene region
gene HGNC gene symbol
ensemblID Ensembl gene ID
internalID identifier generated internally which will be used as a prefix for output files: chr_gene_ensemblID
posTag position tag (chromosome_start_end) aiding the identifier deduplication step
originalInput the input type you supplied, either "gene" or "ensembl", referring to an HGNC gene symbol or an Ensembl ID respectively
entryDuplicate NA, or a comma-separated string of internalIDs; the internalIDs listed are treated as the same entry because they share an identical genomic position

Nextflow documentation

For Nextflow command line documentation use nextflow help, and nextflow <command> -help for help on a particular command. See Nextflow documentation and Nextflow training.


Last update: November 17, 2023