Gene centric SNV report for cancer participants¶
Deprecated
The cancer_tier_and_domain_variants table in LabKey provides a readily accessible and up-to-date version of the data produced by this pipeline. We recommend querying the cancer_tier_and_domain_variants table directly rather than using this script.
Version control¶
version | data release | ClinVar |
---|---|---|
v1.0-beta | DR11 | March, 2021 |
Summary¶
This page describes how to generate a "gene report" for SNVs found in participants of the Genomics England cancer cohort.
The report contains counts of all participants with a small genic variant of moderate or high impact, as identified by the set of SO terms listed in the Description section. It presents a breakdown of participant counts per disease type (censored when <5) and the percentage by cancer type, as well as a breakdown of the variants by somatic/germline status and the most common variants for that gene.
The R script to generate this report can be found in:
~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R
Description¶
The gene report is based on counts of small genic variants of moderate or high impact in cancer patients, as identified by the set of SO terms below.
SO term | Consequence type |
---|---|
SO:0001893 | transcript_ablation |
SO:0001574 | splice_acceptor_variant |
SO:0001575 | splice_donor_variant |
SO:0001587 | stop_gained |
SO:0001589 | frameshift_variant |
SO:0001578 | stop_lost |
SO:0002012 | start_lost |
SO:0001889 | transcript_amplification |
SO:0001821 | inframe_insertion |
SO:0001822 | inframe_deletion |
SO:0001650 | inframe_variant |
SO:0001583 | missense_variant |
SO:0001630 | splice_region_variant |
The report defines deleterious variants as those that either:
- cause loss of function, i.e. have one of the following consequences:
  - splice_acceptor_variant
  - splice_donor_variant
  - start_lost
  - stop_lost
  - stop_gained
  - frameshift_variant
  - inframe_insertion
  - inframe_variant
- are reported as pathogenic/likely pathogenic in ClinVar (for the version, see the Version control section above)
Finally, some participants have more than one sample, and some carry more than one mutation in the query gene. The script nonetheless counts each participant only once, except in the per-variant counts, where a participant is counted once for each distinct variant they carry.
Hypothetical example:¶
Say that for a given query gene, we have the following variants in our database:
participant | sample | variant |
---|---|---|
1 | 1.1 | c.196del |
1 | 1.2 | - |
1 | 1.3 | c.180A>T |
1 | 1.3 | c.196del |
2 | 1.1 | c.196del |
The resulting counts will be as follows:
- Participants with mutations on the query gene: 2
- Participants per variant: c.196del: 2, c.180A>T: 1
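The counting rule in this example can be sketched in base R. This is only an illustration on synthetic rows copied from the hypothetical table; the data frame and variable names are assumptions and do not reflect how 01.functions.R is implemented.

```r
# Synthetic rows copied from the hypothetical table (not real data).
variants <- data.frame(
  participant = c(1, 1, 1, 1, 2),
  sample      = c("1.1", "1.2", "1.3", "1.3", "1.1"),
  variant     = c("c.196del", NA, "c.180A>T", "c.196del", "c.196del"),
  stringsAsFactors = FALSE
)

# Drop samples that carry no variant on the query gene.
with_variant <- variants[!is.na(variants$variant), ]

# Participant-level count: each participant is counted once,
# however many samples or variants they have.
n_participants <- length(unique(with_variant$participant))

# Variant-level counts: de-duplicate on (participant, variant) so that a
# variant seen in several samples of one participant is counted once.
per_participant <- unique(with_variant[, c("participant", "variant")])
variant_counts <- table(per_participant$variant)

print(n_participants)
print(variant_counts)
```

De-duplicating on the participant/variant pair before tabulating is what keeps c.196del at 2 rather than 3, even though it appears in two samples of participant 1.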
Usage¶
The code was developed to work with R 4.0.2. Inside the Research Environment (RE), open a terminal and start RStudio with that version of R.
Single queries¶
Then inside RStudio, source the script, load the SNVdb (only once), and query for your genes of interest, one at a time:
source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")
db <- loadSNVdb()
brca1 <- queryGene(gene_name = "BRCA1")
brca2 <- queryGene(gene_name = "BRCA2")
The output will be two files: one with summary counts (summary_gene_name.txt) and one with the full data to conduct further analysis as required (data_gene_name.tsv), where gene_name is the queried gene.
Multiple queries¶
Save your list of genes in a flat file with one gene name per line, e.g. gene_list.txt.
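As an illustration, such a file could be created from within R. The one-gene-per-line layout is an assumption based on the use of readLines() in this section, and the gene names are simply those used as examples elsewhere on this page.

```r
# Write one gene symbol per line (assumed layout; readLines() returns
# one element per line of the file).
writeLines(c("BRCA1", "BRCA2", "NTRK3"), "gene_list.txt")

# Read the list back to check its contents.
gene_list <- readLines("gene_list.txt")
print(gene_list)
```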
Then inside RStudio, source the script, load the SNVdb (only once), and load your gene_list.txt:
# Source the script and load the SNVdb (only once)
source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")
db <- loadSNVdb()
# %>% and bind_rows() come from dplyr, write_tsv() from readr
library(dplyr)
library(readr)
# Read the gene list (one gene name per line)
gene_list <- readLines("gene_list.txt")
# Query all genes in the gene list, concatenate and save the results
data <- lapply(gene_list, function(x) queryGene(x, saveData = FALSE)) %>% bind_rows()
write_tsv(data, "my_data.tsv")
The output will be one file with summary counts per gene (summary_gene_name.txt) and a single file with the full data, concatenated across all genes, for further analysis as required (my_data.tsv).
Input¶
loadSNVdb() does not require any input. Save its output in a variable named db (as above) so that the SNVdb is loaded only once and you do not need to refer to it again in later calls.
queryGene() will accept the following arguments:
argument | type | default | description |
---|---|---|---|
gene_name | required | - | string; name of the (ONE) gene to be queried. Use uppercase letters. E.g. "BRCA1" or "NTRK3". |
variant | optional | NULL | character; one string or a vector of strings with the changes observed at the protein level. This is the description part of HGVS simple. E.g.: c("c.1961del", "c.3708T>G") |
relevance_type | optional | c("(likely)pathogenic", "other", "LoF", "path_LoF") | character; one string or a vector of strings with the variant relevance. Select at least one of the default values. For any deleterious variant use: c("(likely)pathogenic", "LoF", "path_LoF"). Types are: LoF: variants that cause a loss of function, i.e. a variant with one of the following consequences according to CellBase: splice_acceptor_variant, splice_donor_variant, start_lost, stop_lost, stop_gained, frameshift_variant, inframe_insertion, inframe_variant; (likely)pathogenic: pathogenic or likely pathogenic in ClinVar; path_LoF: LoF AND (likely)pathogenic; other: neither LoF nor (likely)pathogenic. |
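The four relevance_type values can be read as a classification of each variant. The sketch below is one possible reading, assuming the categories are mutually exclusive (so "LoF" means loss of function without a ClinVar pathogenic or likely pathogenic annotation). classify_relevance() is a hypothetical helper written for illustration; it is not part of 01.functions.R.

```r
# Consequence types treated as loss of function (LoF) in this report.
lof_consequences <- c(
  "splice_acceptor_variant", "splice_donor_variant", "start_lost",
  "stop_lost", "stop_gained", "frameshift_variant",
  "inframe_insertion", "inframe_variant"
)

# Hypothetical helper (not part of 01.functions.R): maps each variant to
# one of the four relevance_type values, assuming they do not overlap.
classify_relevance <- function(consequence, clinvar_pathogenic) {
  is_lof <- consequence %in% lof_consequences
  ifelse(is_lof & clinvar_pathogenic, "path_LoF",
    ifelse(is_lof, "LoF",
      ifelse(clinvar_pathogenic, "(likely)pathogenic", "other")))
}

# Synthetic inputs covering all four categories.
relevance <- classify_relevance(
  consequence = c("stop_gained", "missense_variant",
                  "frameshift_variant", "missense_variant"),
  clinvar_pathogenic = c(TRUE, TRUE, FALSE, FALSE)
)
print(relevance)
```

Under this reading, asking for any deleterious variant amounts to keeping every category except "other".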
Output¶
summary_gene_name.txt presents summary information split into three parts:
- Overview counts, including
  - N: total number of patients with any small genic variant of moderate or high impact on the query gene
  - D: number of patients with any deleterious mutation on the query gene
  - S: total number of patients with any somatic deleterious mutation on the query gene
  - G: total number of patients with any germline deleterious mutation on the query gene
- Breakdown by disease type, including
  - n.deleterious: number of patients with any deleterious mutation on the query gene, per disease type
  - p.deleterious: percentage of patients with any deleterious mutation on the query gene, per disease type
  - n.total: number of patients with any non-synonymous, splice site or RNA gene variant on the query gene, per disease type
  - p.total: percentage of patients with any non-synonymous, splice site or RNA gene variant on the query gene, per disease type
- Most common variants across all disease types, including
  - change: as seen at the protein level
  - consequence: as predicted by CellBase
  - clinical_relevance: pathogenic or likely pathogenic if listed in ClinVar as such; empty otherwise
  - n: counts
Note that only counts of five or more are included in the report, across all of the above parts.
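Censoring of small counts can be sketched as follows, assuming the threshold given in the Summary (counts below five are censored). The disease type labels, the numbers and the mask_small_counts() helper are all made up for illustration and are not taken from 01.functions.R.

```r
# Synthetic per-disease-type counts (not real data).
counts <- data.frame(
  disease_type  = c("TYPE_A", "TYPE_B", "TYPE_C"),
  n.deleterious = c(12, 3, 8),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace counts below the threshold with NA so that
# they are not reported.
mask_small_counts <- function(n, threshold = 5) {
  ifelse(n < threshold, NA, n)
}

counts$n.deleterious <- mask_small_counts(counts$n.deleterious)
print(counts)
```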
Example of summary file (the numbers shown below are synthetic):
data_gene_name.tsv contains all queried variants (filtered accordingly if the variant and relevance_type arguments have been used; participants with multiple samples appear once per sample), in a table format where:
- participant_id: Genomics England unique participant identifier
- tumour_sample_platekey: somatic sample identifier
- disease_type: cancer type
- type: somatic/germline
- gene: gene_name
- change: protein level alteration
- consequence: as predicted by CellBase
- clinical_significance: pathogenic or likely pathogenic if listed in ClinVar as such
- relevance: a combination of consequence and clinical_significance (see Input for more details)
Details on loadSNVdb¶
The SNVdb loaded by the function is a compilation of all non-synonymous, splice site and RNA gene SNV variants found per sample for the participants in the cancer programme. Somatic variants are listed for all genes, whereas germline variants are listed only for genes with an indication of cancer predisposition in Genomics England PanelApp. The variants are provided as one .csv file per sample, generated by our internal cancer analysis team. The paths to the individual .csv files are provided in the cancer_analysis table in LabKey.
The data is complemented with a recent version (March, 2021) of ClinVar pathogenic and likely pathogenic variants.
Exporting your gene centric SNV report¶
The gene centric SNV report has been designed so that it does not contain any identifiable data, for example by masking counts less than five. For this reason, you should be able to export your report. All exports must go via Airlock; you must not copy the report by hand.
Help and support¶
Please reach out via the Genomics England Service Desk for any issues related to running this script, including "gene-centric-cancer-SNV-report" in the title/description of your inquiry.