SVCNV version 1.2¶

This workflow extracts large structural and copy number variants the specified gene region(s) or input regions from Illumina Structural Variant VCF files for either germline or somatic cohorts.

Description¶

While most of the research is focussed on small variants, large structural variants (SVs) are often overlooked. Here we provide a workflow that focusses on the extraction of SVs from our cohort data. This is the sister workflow to the genevariantWorkflow.

The workflow is written in WDL and is run using the Cromwell engine - version 1.0. It relies on four external tools in order to execute the workflow. These are:

Please find more detailed information on various steps of the workflow in the "Further reading" section to learn more about them.

Briefly, the workflow, submitted to the queue, consists of various steps which will generate subsequent scripts which are submitted to the HPC in the and queue. This may lead up to a few hundred jobs being submitted from your username, but most of these should not run for a very long time (traffic depending). The main output files will be saved in a folder called by default "final_output/data/".

Important notes

Using genes as input: The input should be a list of gene names (each gene on a new line in a .txt file). This can be a mix of HGNC gene symbols and Ensembl gene IDs. The workflow will do the rest for you.
Using regions as input: The input should be a list of regions (each region on a new line with chr/startPos/endPos in tab-delimited format as a .bed file). If using a region file, the fourth column of your file can be used to provide region IDs which will be used in the workflow. If left empty, the workflow itself will generate region IDs (R###). Any overlapping regions will be merged, and the region IDs will be concatenated and used as a new region ID. If you supplied region IDs that are duplicated, the workflow will crash. Making these unique per entry will resolve the issue.
GRCh37 vs GRCh38: If you choose both, when you provide gene names, the workflow will use our mapping file to retrieve coordinates for both GRCh37 and GRCh38 genome builds. However, when you have regions as an input, you will need to take care of the correct locations and specify which genome build you are interested. If you are interested in both build, you will need to run the workflow twice and make sure the coordinates are correct for each build.
The workflow will crash if the same the final output folder already contains files with the same file name. For example, if you ran the workflow for the BRCA1 gene on day 1, and you do so unchanged on day 2 (for instance as part of a list of genes) it will crash trying to copy the final data to the output folder. This is done in order to protect your outputs from accidental overwriting - if you do not need a gene's file any more please remove it before re-running the workflow in the same location.
Please run the workflow on a small number of genes or regions only. The workflow has been limited ten entries. Sometimes the workflow may take up to a full day (or two) to complete, even on a small number of genes.
In order to save disk space, please delete the content of output folders "cromwell-workflow-logs" and "cromwell-executions" after you have checked the run was successful. Those folders simply contain redundant data and files that can be useful for debugging when the run fails. Additionally, the content of folders "svcnvCatch_output/call_logs" and "svcnvCatch_output``/workflow_logs" can also be deleted safely after a successful run, if not needed.
Best practices: run one query (at max ten genes or ten regions) per directory. So, that means you should copy the workflow as indicated below and rename to something relevant for each query you have.
If you need to re-start the workflow use ./makeClean.sh first.

Workflow files¶

The workflow files are located in:

/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v1.2/

The workflow is contained within two files specified by the .wdl suffix. These are:

svcnvCatchCore.wdl (Main workflow script scattering over the genome build, and input regions)
svcnvQueries.wdl (A sub workflow script containing the individual queries)

There is one file that specifies runtime inputs, such as the input file name, and Release Programme Version. This is:

svcnvCatchInputs.json

There is another file that specifies remaining options for the workflow, such as the output dir for overall output and output dir for log-files. This is:

svcnvCatchOptions.json

There are two files which specify which genes or regions need to be used. This is:

gene_list.txt
region_list.bed

There is a single shell script that is used to submit the workflow to the HPC. This is:

submit_script.sh (see instructions below)

There is a file detailing how Cromwell interacts with the LSF scheduler. This is:

cromwell.conf (do not edit)

Finally, there is a file containing file-paths to retrieve the gene coordinates. This is:

GRCH38_GRCH37_files.txt (do not edit)

Instructions¶

Overview¶

Usage (tell me what I need to change to run it)¶

What to edit to run the workflow? And how to start it?

Before you start, please check whether you have configured your R LabKey API for the HPC.

I am interested in SVs and CNVs affecting genes¶

Copy all the files from our /gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v1.2/ folder to your own workflow folder, and cd there.
Edit the gene_list.txt input file (a .txt file containing a list of HGNC gene symbols or Ensembl gene IDs)
- For example:
nano gene_list.txt
- Inside gene_list.txt add the gene names of interest (here BRCA1, TNF and an Ensembl ID ENSG00000227518)
```
BRCA1
ENSG00000227518
TNF
```
- Save the file (Ctrl+O)
Edit submit_script.sh to specify a valid project code. Please check your project code at LSF Project Codes. This will correspond to your respective GECIP or Discovery Forum group.
Edit svcnvCatchInputs.json to specify a valid project code, cohort type, and input type and file.
- As you are interested in genes, set svcnvCatch.input_type": "gene" and "svcnvCatch.input_file": "gene_list.txt"
- Cohort type: You can specify whether you specifically want to screen germline, somatic, or both types of vcf's. Set "svcnvCatch.sample_type": "<type>" to "germline", "rd-germline", "ca-germline", "somatic", or "all".
- Choice of genome build: You can specify if you want GRCh37, GRCh38 or both. Set "svcnvCatch.input_genome_build": "<type>" to "both", "GRCh37", or "GRCh38". Note that all cancer samples are aligned to GRCh38.
- Choice of data release: You can specify the data release. Set "svcnvCatch.release_version": "<vesrion>" to "/main-programme/main-programme_v19_2024-10-31", or another version.
- Location of reference file: You need to specify the path for your working folder where the path for the reference genomes is located. You may choose not to change it as you can read the location in /gel_data_resouces. For instance, "svcnvCatch.GRCH37_and_GRCH38": is set to "/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v1.2/GRCH38_GRCH37_files.txt".
- HPC project code: please check your project code at LSF Project Codes and update "svcnvCatch.project" accordingly. E.g "df_yourcompanyname" OR "re_gecip_yourdomain".
- Set "svcnvCatch.dragen": "true" to query dragen aligned files. Set it to "svcnvCatch.dragen": "false", to query the default files of the 100,000 genome project.
OPTIONALLY, edit file submit_script.sh to specify the main working directory (the default is set to your current working directory) - note that in this case, all relative paths in config and input files need to refer to this directory:
- #BSUB -cwd /filepath/to/your/workflow_folder/
Please make sure you have read the IMPORTANT note in the "Description" section above.
Once you have edited all the above files accordingly, you are ready to run the script.
- In the terminal (HPC), enter the following line:
bsub < submit_script.sh

I am interested in SVs and CNVs within my region of interest¶

Copy all the files from our /gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v1.2/ folder to your own workflow folder, and cd there.
Edit the region_list.bed input file (a tab delimited .bed file containing a your regions of interest)
- For example:
nano region_list.bed
- Inside region_list.bed add your regions of interest. By using the correct chromosome annotation you will retrieve results for GRCh37 (chr = ##) or GRCh38 (chr = chr##).
```
chr21 25800000 25900000 myRegion1_GRCh38  
chr21 25910000 26000000 myRegion2_GRCh38  
21 26010000 26020000 myRegion3_GRCh37  
21 26030000 26040000 myRegion4_GRCh37
```
- Save the file (Ctrl+ *
Edit submit_script.sh to specify a valid project code. Please check your project code at LSF Project Codes. This will correspond to your respective GECIP or Discovery Forum group.
Edit svcnvCatchInputs.json to specify a valid project code, cohort type, and input type and file.
- As you are interested in genes, set svcnvCatch.input_type": "region" and "svcnvCatch.input_file": "region_list.bed"
- Cohort type: You can specify whether you specifically want to screen germline, somatic, or both types of vcf's. Set "svcnvCatch.sample_type": "<type>" to "germline", "rd-germline", "ca-germline", "somatic", or "all".
- Choice of genome build: You can specify if you want GRCh37 or GRCh38. Set "svcnvCatch.input_genome_build": "<type>" to "GRCh37", or "GRCh38". Note that all cancer samples are aligned to GRCh38. Note that the workflow does not map regions between the different build and therefore for region input you cannot set input_genome_build to 'both'.
- Choice of data release: You can specify the data release. Set "svcnvCatch.release_version": "<vesrion>" to "/main-programme/main-programme_v19_2024-10-31", or another version.
- Location of reference file: You need to specify the path for your working folder where the path for the reference genomes is located. You may choose not to change it as you can read the location in /gel_data_resouces. For instance, "svcnvCatch.GRCH37_and_GRCH38": is set to "/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v1.2/GRCH38_GRCH37_files.txt".
- HPC project code: please check your project code at LSF Project Codes and update "svcnvCatch.project" accordingly. E.g "df_yourcompanyname" OR "re_gecip_yourdomain".
OPTIONALLY, edit file submit_script.sh to specify the main working directory (the default is set to your current working directory) - note that in this case, all relative paths in config and input files need to refer to this directory:
- #BSUB -cwd /filepath/to/your/workflow_folder/
Please make sure you have read the IMPORTANT note in the "Description" section above.
Once you have edited all the above files accordingly, you are ready to run the script.
- In the terminal (HPC), enter the following line:
bsub < submit_script.sh

SV and CNV queries details

All queries filter for variants with filter PASS.

Cancer queries include only tumour samples, excluding the matched germline samples.

Samples with more than one alignment had only the most recent one returned.

SV queries:
all SV types are retrieved: duplications (DUP), deletions (DEL), inversions (INV), insertions (INS) and translocations (BND)
output format:

germline: [%SAMPLE] %CHROM %POS %ID %REF %ALT %QUAL %FILTER %INFO/SVTYPE %INFO/SVLEN %INFO/END [%GT] [%GQ] [%PR] [%SR]

somatic: [%SAMPLE] %CHROM %POS %ID %REF %ALT %QUAL %FILTER %INFO/SVTYPE %INFO/SVLEN %INFO/END [%PR] [%SR] %CSQT %CSQR %cosmic %clinvar
CNV queries:
CNVs with copy number different than 2, and CN = 2 where loss of heterozygosity (LOH) have been observed are retrieved.
output format:

all: [%SAMPLE] %CHROM %POS %ID %REF %ALT %INFO/END %INFO/SVTYPE [%RC] [%BC] [%CN]

Inputs and outputs¶

Type	Type	File or Folder	Description	Requires editing
Input	File	gene_list.txt	.txt file containing a list of HGNC gene symbols or Ensembl gene IDs (new line separation).	yes (for genes as input)
Input	File	region_list.bed	.bed file containing chr/start/end positions for the region of interest	yes (for regions as input)
Input	File	svcnvCatchCore.wdl	Main workflow script.	no
Input	File	svcnvQueries.wdl	A sub workflow script, scattering over genome build.	no
Input	File	svcnvCatchInputs.json	Specifies runtime inputs such as the input file name and type, cohort type, genome build and Release Programme Version.	yes (see instructions)
Input	File	svcnvCatchOptions.json	Specifies output dir for overall output and output dir for log-files.	no (optional)
Input	File	cromwell.conf	Configuration file on how Cromwell interacts with the LSF scheduler.	no
Input	File	GRCH38_GRCH37_files.txt	Contains the locations of files with gene identifiers and their respective regions.	no
Input	File	submit_script.sh	The final submission script to run the workflow.	yes (for project code and optionally working directory)
Output	File	svcnvCatch_output/data/adjustedInput~{genome_version}~{timestamp}.tsv	The file that is created based on your input. This is used further downstream in the workflow. Any duplicate regions are filtered out and merged (see below).	NA
Output	File	svcnvCatch_output/data/CNVmergedFailed~{queryID}___~{genome_version}~{cohort_type}~{timestamp}_.tsv	Contains filepaths of vcf's that could not be accessed during the workflow run.	NA
Output	File	svcnvCatch_output/data/CNVmergedResult~{queryID}___~{genome_version}~{cohort_type}~{timestamp}_.tsv	Contains all the positions of CNVs detected for the input region or gene (=queryID).	NA
Output	File	svcnvCatch_output/data/SVmergedFailed~{queryID}___~{genome_version}~{cohort_type}~{timestamp}_.tsv	Contains filepaths of vcf's that could not be accessed during the workflow run.	NA
Output	File	svcnvCatch_output/data/SVmergedResult~{queryID}___~{genome_version}~{cohort_type}~{timestamp}_.tsv	Contains all the positions of SVs detected for the input region or gene (=queryID).	NA
Output	File	svcnvCatch_output/data/coordinates~{genome_version}_.bed	Coordinates file generated and used by the workflow.	NA
Output	File	svcnvCatch_output/data/logfile~{timestamp}.log_	A log file that is generated within the workflow containing all the versions of the software and settings that were used.	NA
Output	Folder	svcnvCatch_output/data/	Will contain the data output from the workflow when it successfully completes.	NA
Output	Folder	svcnvCatch_output/call_logs/	Contains all the log files that are generated during the query.	NA
Output	Folder	svcnvCatch_output/workflow_logs/	Will contain the running workflow log.	NA
Output	Folder	cromwell-workflow-logs/	Will contain the overall workflow log.	NA
Output	Folder	cromwell-executions/	All the temporary files for each run are stored within the genetic_report folder. Once the jobs have been completed, they will be copied over to the final output folder specified by the user. This will also contain the cromwell database for callcaching. In case an identical query is being done, the workflow will check this database to see if it has been done before, and simply extract the data again without running the full script again.	NA

Output breakdown¶

Adjusted Input file (adjustedInput_~{genome_version}_~{timestamp}.tsv):

If your input were a set of regions, this file will only contain the first four columns, where the fourth column are region identifiers.

Column	Description
chr	chromosome
start	start position of gene region
end	end position of gene region
gene	HGNC gene symbol
ensemblID	Ensembl gene ID
internalID	identifier generated internally which will be used as a prefix for output files: `chr_gene_ensemblID`
posTag	position tag (chromosome_start_end) aiding the identifier deduplication step
originalInput	specifies what has been used as an input by the user, either "gene" or "ensembl", referring to either gene symbol or Ensembl ID
entryDuplicate	NA, or containing a comma-separated string of internalIDs. The internalIDs listed here are considered the same entry due to their identical genomic position.

Coordinates file coordinates_~{genome_version}.bed:

Coordinates bed file used by the workflow.

Unrecognised input file unrecognised_input_~{genome_version}_~{timestamp}.tsv:

Similar to the input file specifying the entries which were not detected in the flat file containing all gene and Ensembl IDs. This file will not be generated if all inputs were valid, nor if you selected to use regions as an input.

SV Merged Result SVmergedResult__~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}_.tsv:

Column	Description
platekey	platekey ID of participant
chr	Chromosome
pos	Start position of the variant
ID	Variant ID
REF	Reference allele
ALT	Alternate allele
Qual	quality
Filter	PASS
INFO/SVTYPE	Type of structural variant: duplication (DUP), deletion (DEL), inversion (INV), insertion (INS) or translocation (BND)
INFO/SVLEN	Variant size
INFO/END	End/Continuation position of the variant
GT	[germline] Genotype. 0/1 = heterozygote, 1/1 = homozygote
GQ	[germline] Genotype quality
PR	Spanning paired-read support for the ref and alt alleles in the order listed
SR	Split reads for the ref and alt alleles in the order listed, for reads where P(allele
CSQT	[somatic] Consequence type as predicted by IAE. Format: GenotypeIndex
CSQR	[somatic] Predicted regulatory consequence type. Format: GenotypeIndex
cosmic	[somatic] The numeric identifier for the variant in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. Format: GenotypeIndex
clinvar	[somatic] Clinical significance. Format: GenotypeIndex

CNV Merged Result CNVmergedResult_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv:

Column	Description
platekey	platekey ID of participant
chr	Chromosome
pos	Start position of the variant
ID	Variant ID
REF	Reference allele
ALT	Alternate allele
INFO/END	End position of the variant
INFO/SVTYPE	CNV for variants with copy number != 2. LOH for variants with copy number = 2, and loss of heterozygosity
RC	Average counts per bin
BC	Bin count
CN	Copy number

CNV Failed Result ( CNVmergedFailed_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv:

This file will contain filepaths from vcf's that could not be accessed. If everything was accessed, it will contain a message stating this.

SV Failed Result SVmergedFailed_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}_.tsv:

This file will contain filepaths from vcf's that could not be accessed. If everything was accessed, it will contain a message stating this.

Plots¶

Version v1.1.1 or later outputs plots for each gene/region, somatic/germline and genome build. For each of these combinations a .pdf file is generated for SV and CNV separately. On the plots, the right y-axis summarises vcf information. N is the number of participants with SV/CNV that start and end at the same position. The remaining of the fields give the median across all samples that have that variant. Grey variants do not overlap CDSs while the one in colours do.