Extract SVs/CNVs overlapping genes or regions, version 2.0.1¶
This workflow extracts large structural and copy number variants the specified gene region(s) or input regions from Illumina Structural Variant VCF files for either germline or somatic cohorts.
Major changes¶
- New possible cohort: ca-all so that both cancer somatic and cancer germline can be queried on the same run
- Inclusion of prevalence report (summary_geneName_sampleType.txt) for main cancer indications
- Inclusion of MCC for CNV both germline and somatic on 100kGP genomes. Also, on somatic CNV for dragen realignments
- CNV_pass_only option: CNV are assigned for the whole genome (some parts are neutral CN = 2). Canvas struggles to call (FILTER = PASS) focal CNVs, but removing PASS can retrieve those. Use it with care!
- Removal of Resultsfailed files instead one single inaccessibleVCF.tsv* file list all, if any, vcf could not be accessed (probably due to permission issues, please open a ticket if it happens to you)
- Workflow files are not organised in folders! You only need to change the submit_script.sh and the files in the input/ dir!
Description¶
While most of the research is focussed on small variants, large structural variants (SVs) are often overlooked. Here we provide a workflow that focusses on the extraction of SVs from our cohort data. This is the sister workflow to the genevariantWorkflow.
The workflow is written in WDL and is run using the Cromwell engine - version 1.0. It relies on four external tools in order to execute the workflow. These are:
Please find more detailed information on various steps of the workflow in the "Further reading" section to learn more about them.
Briefly, the workflow, submitted to the final_output/data/
".
Important notes
-
Using genes as input: The input should be a list of gene names (each gene on a new line in a .txt file). This can be a mix of HGNC gene symbols and Ensembl gene IDs. The workflow will do the rest for you.
-
Using regions as input: The input should be a list of regions (each region on a new line with
chr/startPos/endPos
in tab-delimited format as a .bed file). If using a region file, the fourth column of your file can be used to provide region IDs which will be used in the workflow. If left empty, the workflow itself will generate region IDs (R###). Any overlapping regions will be merged, and the region IDs will be concatenated and used as a new region ID. If you supplied region IDs that are duplicated, the workflow will crash. Making these unique per entry will resolve the issue. -
GRCh37 vs GRCh38: If you choose both, when you provide gene names, the workflow will use our mapping file to retrieve coordinates for both GRCh37 and GRCh38 genome builds. However, when you have regions as an input, you will need to take care of the correct locations and specify which genome build you are interested. If you are interested in both build, you will need to run the workflow twice and make sure the coordinates are correct for each build.
-
The workflow will crash if the same the final output folder already contains files with the same file name. For example, if you ran the workflow for the BRCA1 gene on day one, and you do so unchanged on day day (for instance as part of a list of genes) it will crash trying to copy the final data to the output folder. This is done in order to protect your outputs from accidental overwriting - if you do not need a gene's file any more please remove it before re-running the workflow in the same location.
-
Please run the workflow on a small number of genes or regions only. The workflow has been limited 10 entries. The time taken largely depends on the region size and on the traffic on our HPC. Therefore, sometimes the workflow may take up to a full day (or two) to complete, even on a small number of genes.
-
In order to save disk space, please delete the content of output folders "
cromwell-workflow-logs
" and "cromwell-executions
" after you have checked the run was successful. Those folders simply contain redundant data and files that can be useful for debugging when the run fails. Additionally, the content of folders "svcnvCatch_output/call_logs
" and "svcnvCatch_output``/workflow_logs
" can also be deleted safely after a successful run, if not needed. -
Best practices: run one query (at max 10 genes or 10 regions) per directory. So, that means you should copy the workflow as indicated below and rename to something relevant for each query you have.
-
If you need to re-start the workflow use
./makeClean.sh
first.
Workflow files¶
The workflow files are located in:
/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v2.0.1/
The files are organised as follows:
.
├── README.md
├── cromwell.conf
├── input
│ ├── gene_list.txt
│ ├── region_list.bed
│ └── svcnvCatchInputs.json
├── logs
├── makeClean.sh
├── src
│ ├── GRCH38_GRCH37_files.txt
│ ├── defineCohort.R
│ ├── plotResults.R
│ ├── reference_transcripts.txt
│ ├── summariseResults.R
│ ├── svcnvCatchOptions.json
│ └── svcnvQueries.wdl
├── submit_script.sh
└── svcnvCatchCore.wdl
You should only edit files in the input dir and the submit_script.sh!
The workflow is contained within two files specified by the .wdl suffix. These are:
svcnvCatchCore.wdl
(Main workflow script scattering over the genome build, and input regions)svcnvQueries.wdl
(A sub workflow script containing the individual queries)
There is one file that specifies runtime inputs, such as the input file name, and release programme version. This is:
svcnvCatchInputs.json
There is another file that specifies remaining options for the workflow, such as the output dir for overall output and output dir for log-files. This is:
svcnvCatchOptions.json
There are two files which specify which genes or regions need to be used. This is:
gene_list.txt
region_list.bed
There is a single shell script that is used to submit the workflow to the HPC. This is:
submit_script.sh
(see instructions below)
There is a file detailing how Cromwell interacts with the LSF scheduler. This is:
cromwell.conf
(do not edit)
Finally, there is a file containing file-paths to retrieve the gene coordinates. This is:
GRCH38_GRCH37_files.txt
(do not edit)
Instructions¶
Usage (tell me what I need to change to run it)¶
What to edit to run the workflow? And how to start it?
Short answer: only edit the files inside input dir and the submit_script.sh. See below for further details.
Before you start, please check whether you have configured your R LabKey API for the HPC. Most user feedback and troubleshooting linked issues to an incorrect setup of the R LabKey API. We have also changed the documentation based on this feedback to make it easier for all users to set it up without issue.
I am interested in SVs and CNVs affecting genes¶
-
Copy all the files from our
/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v2.0.1/
folder to your own workflow folder, andcd
there. -
Edit the
./input/gene_list.txt
input file (a .txt file containing a list of HGNC gene symbols or Ensembl gene IDs)- For example:
nano ./input/gene_list.txt
- Inside gene_list.txt add the gene names of interest (here BRCA1, TNF and an Ensembl ID ENSG00000227518)
- Save the file (Ctrl+O)
-
Edit
submit_script.sh
to specify a valid project code. Please check your project code at LSF Project Codes. This will correspond to your respective GECIP or Discovery Forum group. -
Edit
./input/svcnvCatchInputs.json
to specify a valid project code, cohort type, and input type and file.- As you are interested in genes, set
svcnvCatch.input_type": "gene"
and"svcnvCatch.input_file": "gene_list.txt"
- Cohort type: You can specify whether you specifically want to screen germline, somatic, or both types of vcf's. Set
"svcnvCatch.sample_type": "<type>"
to"germline"
,"rd-germline"
,"ca-germline"
,"somatic"
,"ca-all"
or"all".
- Choice of genome build: You can specify if you want GRCh37, GRCh38 or both. Set
"svcnvCatch.input_genome_build": "<type>"
to"both"
,"GRCh37"
, or"GRCh38"
. Note that all cancer samples are aligned to GRCh38. - Choice of data release: You can specify the data release. Set
"svcnvCatch.release_version": "<version>"
to"/main-programme/main-programme_v18_2023-12-21"
, or another version. - Location of reference file: You need to specify the path for your working folder where the path for the reference genomes is located. You may choose not to change it as you can read the location in /gel_data_resouces. For instance,
"svcnvCatch.GRCH37_and_GRCH38"
: is set to"/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v2.0.1/src/GRCH38_GRCH37_files.txt".
- HPC project code: please check your project code at LSF Project Codes and update
"svcnvCatch.project"
accordingly. E.g"df_yourcompanyname"
OR"re_gecip_yourdomain"
. -
Set
"svcnvCatch.dragen": "true"
to query dragen aligned files. Set it to"svcnvCatch.dragen": "false"
, to query the default files of the 100,000 genome project. -
Set
"svcnvCatch.CNV_pass_only": "false"
to ignore FILTER and query CNV for all the bins in the genome. This aims to bypass Canvas issues with calling focal CNVs. Use with care! - Set
"svcnvCatch.calculate_prevalence_in_cancer": "true"
to compute prevalences for the Cancer programme. Do not work with dragen files.
- As you are interested in genes, set
-
OPTIONALLY, edit file
submit_script.sh
to specify the main working directory (the default is set to your current working directory) - note that in this case, all relative paths in config and input files need to refer to this directory:#BSUB -cwd /filepath/to/your/workflow_folder/
-
Please make sure you have read the IMPORTANT note in the "Description" section above.
-
Once you have edited all the above files accordingly, you are ready to run the script.
- In the terminal (HPC), enter the following line:
bsub < submit_script.sh
I am interested in SVs and CNVs within my region of interest¶
-
Copy all the files from our
/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v2.0.1/
folder to your own workflow folder, andcd
there. -
Edit the
./input/region_list.bed
input file (a tab delimited .bed file containing a your regions of interest)- For example:
nano ./input/region_list.bed
- Inside region_list.bed add your regions of interest. By using the correct chromosome annotation you will retrieve results for GRCh37 (chr = ##) or GRCh38 (chr = chr##).
chr21 25800000 25900000 myRegion1_GRCh38 chr21 25910000 26000000 myRegion2_GRCh38 21 26010000 26020000 myRegion3_GRCh37 21 26030000 26040000 myRegion4_GRCh37
- Save the file (Ctrl+O)
-
Edit
submit_script.sh
to specify a valid project code. Please check your project code at LSF Project Codes. This will correspond to your respective GECIP or Discovery Forum group. -
Edit
./input/``svcnvCatchInputs.json
to specify a valid project code, cohort type, and input type and file.- As you are interested in genes, set
svcnvCatch.input_type": "region"
and"svcnvCatch.input_file": "region_list.bed"
- Cohort type: You can specify whether you specifically want to screen germline, somatic, or both types of vcf's. Set
"svcnvCatch.sample_type": "<type>"
to"germline"
,"rd-germline"
,"ca-germline"
,"somatic"
,"ca-all",
or"all"
. - Choice of genome build: You can specify if you want GRCh37 or GRCh38. Set
"svcnvCatch.input_genome_build": "<type>"
to"GRCh37"
, or"GRCh38"
. Note that all cancer samples are aligned to GRCh38. Note that the workflow does not map regions between the different build and therefore for region input you cannot set input_genome_build to 'both'. - Choice of data release: You can specify the data release. Set
"svcnvCatch.release_version": "<version>"
to"/main-programme/main-programme_v18_2023-12-21"
, or another version. - Location of reference file: You need to specify the path for your working folder where the path for the reference genomes is located. You may choose not to change it as you can read the location in /gel_data_resouces. For instance,
"svcnvCatch.GRCH37_and_GRCH38"
: is set to"/gel_data_resources/workflows/BRS_tools_svcnvWorkflow/v2.0.1/src/GRCH38_GRCH37_files.txt".
- HPC project code: please check your project code at LSF Project Codes and update
"svcnvCatch.project"
accordingly. E.g"df_yourcompanyname"
OR"re_gecip_yourdomain"
. -
Set
"svcnvCatch.dragen": "true"
to query dragen aligned files. Set it to"svcnvCatch.dragen": "false"
, to query the default files of the 100,000 genome project. -
Set
"svcnvCatch.CNV_pass_only": "false"
to ignore FILTER and query CNV for all the bins in the genome. This aims to bypass Canvas issues with calling focal CNVs. Use with care! - Set
"svcnvCatch.calculate_prevalence_in_cancer": "true"
to compute prevalences for the Cancer programme. Do not work with dragen files.
- As you are interested in genes, set
-
OPTIONALLY, edit file
submit_script.sh
to specify the main working directory (the default is set to your current working directory) - note that in this case, all relative paths in config and input files need to refer to this directory:#BSUB -cwd /filepath/to/your/workflow_folder/
-
Please make sure you have read the IMPORTANT note in the "Description" section above.
-
Once you have edited all the above files accordingly, you are ready to run the script.
- In the terminal (HPC), enter the following line:
bsub < submit_script.sh
SV and CNV queries details":**
All queries filter for variants with filter PASS.
Cancer queries include only tumour samples, excluding the matched germline samples.
Samples with more than one alignment had only the most recent one returned.
-
SV queries:
all SV types are retrieved: duplications (DUP), deletions (DEL), inversions (INV), insertions (INS) and translocations (BND)
output format:germline: [%SAMPLE] %CHROM %POS %ID %REF %ALT %QUAL %FILTER %INFO/SVTYPE %INFO/SVLEN %INFO/END [%GT] [%GQ] [%PR] [%SR]
somatic: [%SAMPLE] %CHROM %POS %ID %REF %ALT %QUAL %FILTER %INFO/SVTYPE %INFO/SVLEN %INFO/END [%PR] [%SR] %CSQT %CSQR %cosmic %clinvar
-
CNV queries:
CNVs with copy number different than 2, and CN = 2 where loss of heterozygosity (LOH) have been observed are retrieved.
output format:all: [%SAMPLE] %CHROM %POS %ID %REF %ALT %FILTER %INFO/END %INFO/SVTYPE [%RC] [%BC] [%CN] [%MCC] [%GT]
- please note that MCC and GT do not apply to all cases, see below for details
Inputs and outputs¶
Type | Type | Path and file | Description | Requires editing |
---|---|---|---|---|
Input | File | ./input/gene_list.txt | .txt file containing a list of HGNC gene symbols or Ensembl gene IDs (new line separation). | yes (for genes as input) |
Input | File | ./input/region_list.bed | .bed file containing chr/start/end positions for the region of interest | yes (for regions as input) |
Input | File | ./svcnvCatchCore.wdl | Main workflow script. | no |
Input | File | ./src/svcnvQueries.wdl | A sub workflow script, scattering over genome build. | no |
Input | File | ./input/svcnvCatchInputs.json | Specifies runtime inputs such as the input file name and type, cohort type, genome build and Release Programme Version. | yes (see instructions) |
Input | File | ./src/svcnvCatchOptions.json | Specifies output dir for overall output and output dir for log-files. | no (optional) |
Input | File | ./cromwell.conf | Configuration file on how Cromwell interacts with the LSF scheduler. | no |
Input | File | ./src/GRCH38_GRCH37_files.txt | Contains the locations of files with gene identifiers and their respective regions. | no |
Input | File | ./submit_script.sh | The final submission script to run the workflow. | yes (for project code and optionally working directory) |
Output | File | ./svcnvCatch_output/data/adjustedInput_~{genome_version}_~{timestamp}.tsv |
The file that is created based on your input. This is used further downstream in the workflow. Any duplicate regions are filtered out and merged (see below). | NA |
Output | File | ./svcnvCatch_output/data/inacessibleVCF_~{timestamp}.tsv |
Contains filepaths of vcf's that could not be accessed during the workflow run. | NA |
Output | File | ./svcnvCatch_output/data/CNVmergedResult_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv |
Contains all the positions of CNVs detected for the input region or gene (=queryID). | NA |
Output | File | ./svcnvCatch_output/data/SVmergedResult_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv |
Contains all the positions of SVs detected for the input region or gene (=queryID). | NA |
Output | File | ./svcnvCatch_output/data/summary_~{queryID}_~{cohort_type}.tsv |
[optional] Contain prevalence for cancer cohort. Does not apply to Dragen files. | NA |
Output | File | ./svcnvCatch_output/data/data_~{queryID}_~{cohort_type}.tsv |
[optional] Combined data to create prevalence report. Does not apply to Dragen files. | NA |
Output | File | ./svcnvCatch_output/data/coordinates_~{genome_version}.bed |
Coordinates file generated and used by the workflow. | NA |
Output | File | ./svcnvCatch_output/data/logfile_~{timestamp}.log |
A log file that is generated within the workflow containing all the versions of the software and settings that were used. | NA |
Output | Folder | ./svcnvCatch_output/data/ |
Will contain the data output from the workflow when it successfully completes. | NA |
Output | Folder | ./svcnvCatch_output/call_logs/ |
Contains all the log files that are generated during the query. | NA |
Output | Folder | ./svcnvCatch_output/workflow_logs/ |
Will contain the running workflow log. | NA |
Output | Folder | ./cromwell-workflow-logs/ |
Will contain the overall workflow log. | NA |
Output | Folder | ./cromwell-executions/ |
All the temporary files for each run are stored within the genetic_report folder. Once the jobs have been completed, they will be copied over to the final output folder specified by the user. This will also contain the cromwell database for callcaching. In case an identical query is being done, the workflow will check this database to see if it has been done before, and simply extract the data again without running the full script again. |
NA |
Output breakdown¶
Adjusted Input file (adjustedInput_~{genome_version}_~{timestamp}.tsv
):
If your input were a set of regions, this file will only contain the first four columns, where the fourth column are region identifiers.
Column | Description |
---|---|
chr | chromosome |
start | start position of gene region |
end | end position of gene region |
gene | HGNC gene symbol |
ensemblID | Ensembl gene ID |
internalID | identifier generated internally which will be used as a prefix for output files: chr_gene_ensemblID |
posTag | position tag (chromosome_start_end) aiding the identifier deduplication step |
originalInput | specifies what has been used as an input by the user, either "gene" or "ensembl", referring to either gene symbol or Ensembl ID |
entryDuplicate | NA, or containing a comma-separated string of internalIDs. The internalIDs listed here are considered the same entry due to their identical genomic position. |
Coordinates file coordinates_~{genome_version}.bed
):
Coordinates bed file used by the workflow.
Unrecognised input file unrecognised_input_~{genome_version}_~{timestamp}.tsv
:
Similar to the input file specifying the entries which were not detected in the flat file containing all gene and Ensembl IDs. This file will not be generated if all inputs were valid, nor if you selected to use regions as an input.
SV Merged Result (SVmergedResult_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv
):
Column | Description |
---|---|
platekey | platekey ID of participant |
chr | Chromosome |
pos | Start position of the variant |
ID | Variant ID |
REF | Reference allele |
ALT | Alternate allele |
Qual | quality |
Filter | PASS |
INFO/SVTYPE | Type of structural variant: duplication (DUP), deletion (DEL), inversion (INV), insertion (INS) or translocation (BND) |
INFO/SVLEN | Variant size |
INFO/END | End/Continuation position of the variant |
GT | [germline] Genotype. 0/1 = heterozygote, 1/1 = homozygote |
GQ | [germline] Genotype quality |
PR | Spanning paired-read support for the ref and alt alleles in the order listed |
SR | Split reads for the ref and alt alleles in the order listed, for reads where P(allele |
CSQT | [somatic] Consequence type as predicted by IAE. Format: GenotypeIndex |
CSQR | [somatic] Predicted regulatory consequence type. Format: GenotypeIndex |
cosmic | [somatic] The numeric identifier for the variant in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. Format: GenotypeIndex |
clinvar | [somatic] Clinical significance. Format: GenotypeIndex |
CNV Merged Result (CNVmergedResult_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv
):
Column | Description |
---|---|
platekey | platekey ID of participant |
chr | Chromosome |
pos | Start position of the variant |
ID | Variant ID |
REF | Reference allele |
ALT | Alternate allele |
INFO/END | End position of the variant |
INFO/SVTYPE | CNV for variants with copy number != 2. LOH for variants with copy number = 2, and loss of heterozygosity |
RC | Average counts per bin |
BC | Bin count |
CN | Copy number |
MCC | [except germline_dragen or germline_GRCh37] Major chromosome count (equal to copy number for LOH regions) |
GT | [germline] Genotype. "." [germline+dragen] Genotype. [somatic+dragen] Genotype. |
inaccessible VCF files (inaccessibleVCF_~{timestamp}.tsv
):
This file will contain filepaths from VCFs that could not be accessed. If everything was accessed, it will contain a message stating this.
Prevalence reports (Cancer only) (summary_~{queryID}_~{cohort_type}.tsv
):
This file will prevalences for main cancer indications. It is only generated when requested in the svcnvCatchInput.json
.
Data for prevalence reports (Cancer only) (SVmergedFailed_~{queryID}_~{genome_version}_~{cohort_type}_~{timestamp}.tsv
):
This file will contain filepaths from VCFs that could not be accessed. If everything was accessed, it will contain a message stating this.
Plots¶
Version v1.1.1 or later outputs plots for each gene/region, somatic/germline and genome build. For each of these combinations a .pdf file is generated for SV and CNV separately. On the plots, the right y-axis summarises vcf information. N is the number of participants with SV/CNV that start and end at the same position. The remaining of the fields give the median across all samples that have that variant. Grey variants do not overlap CDSs while the one in colours do.