Gene centric SNV report for cancer participants¶

Deprecated

The cancer_tier_and_domain_variants table in LabKey provides a readily accessible and up-to-date version of the data provided by this pipeline. We recommend querying the cancer_tier_and_domain_variants table directly, rather than using this script.

Version control¶

version	data release	ClinVar
v1.0-beta	DR11	March, 2021

Summary¶

This page describes how to generate a "gene report" for SNVs found in participants of the Genomics England cancer cohort.

The report contains counts of all participants with a small genic variant of moderate or high impact, as identified by the set of SO terms listed in the Description section. It presents a breakdown of participant counts per disease type (censored when <5) and the percentage by cancer type, as well as a breakdown of the variants split by somatic/germline and the most common variants for that gene.

The R script to generate this report can be found in:

~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R

Description¶

The gene report is based on counts of small genic variants of moderate or high impact in cancer patients, as identified by the set of SO terms below.

SO term	Consequence type
SO:0001893	transcript_ablation
SO:0001574	splice_acceptor_variant
SO:0001575	splice_donor_variant
SO:0001587	stop_gained
SO:0001589	frameshift_variant
SO:0001578	stop_lost
SO:0002012	start_lost
SO:0001889	transcript_amplification
SO:0001821	inframe_insertion
SO:0001822	inframe_deletion
SO:0001650	inframe_variant
SO:0001583	missense_variant
SO:0001630	splice_region_variant

The report defines deleterious variants as those that either

cause loss-of-function, i.e. have one of the following consequences:
1. splice_acceptor_variant
2. splice_donor_variant
3. start_lost
4. stop_lost
5. stop_gained
6. frameshift_variant
7. inframe_insertion
8. inframe_variant
or are reported as
pathogenic/likely pathogenic in ClinVar (for the version, see the Version control table above).

Finally, some participants have more than one sample, and some carry more than one mutation in the query gene. Nonetheless, the script counts each participant only once, except in the per-variant counts, where a participant is counted once for each distinct variant they carry.

Hypothetical example:¶

Say that for a given query gene, we have the following variants in our database:

participant	sample	variant
1	1.1	c.196del
1	1.2	-
1	1.3	c.180A>T
1	1.3	c.196del
2	1.1	c.196del

The resulting counts will be as follows:

Two participants with mutations in the query gene.
Per-variant counts: c.196del: 2, c.180A>T: 1
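
The counting rule in this example can be reproduced with a few lines of base R. This is only a minimal sketch of the logic, not the code used in 01.functions.R; the data frame and the object names (variants, n_participants, variant_counts) are made up for illustration.

```r
# Hypothetical data mirroring the table above (not read from the SNVdb).
variants <- data.frame(
  participant = c(1, 1, 1, 1, 2),
  sample      = c("1.1", "1.2", "1.3", "1.3", "1.1"),
  variant     = c("c.196del", NA, "c.180A>T", "c.196del", "c.196del"),
  stringsAsFactors = FALSE
)

# Drop samples without a variant in the query gene.
with_variant <- variants[!is.na(variants$variant), ]

# Each participant is counted once, however many samples or variants they have.
n_participants <- length(unique(with_variant$participant))

# For per-variant counts, a participant is counted once per distinct variant,
# so the same variant seen in two samples of one participant counts once.
per_variant <- unique(with_variant[, c("participant", "variant")])
variant_counts <- table(per_variant$variant)

print(n_participants)
print(variant_counts)
```

Here participant 1 carries c.196del in two samples but contributes only once to the count for that variant.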

Usage¶

The code was developed to work with R/4.0.2, so inside the RE, open a terminal and run:

$ module load R/4.0.2
$ rstudio

Single queries¶

Then inside RStudio, source the script, load the SNVdb (only once), and query for your genes of interest, one at a time:

source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")

db <- loadSNVdb()
brca1 <- queryGene(gene_name = "BRCA1")
brca2 <- queryGene(gene_name = "BRCA2")

The output will be two files: one with summary counts (summary_gene_name.txt) and one with the full data to conduct further analysis as required (data_gene_name.tsv), where gene_name is the queried gene.

Multiple queries¶

Save your list of genes in a flat file, one gene per line, e.g. gene_list.txt:

BRCA1
BRCA2
NTRK1

Then inside RStudio, source the script, load the SNVdb (only once), and load your gene_list.txt:

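# Source the script, then load the SNVdb and your gene_list.txt
source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")
library(dplyr)  # provides %>% and bind_rows(); harmless if the sourced script already attached it
library(readr)  # provides write_tsv()
db <- loadSNVdb()
gene_list <- readLines("gene_list.txt")

# Query every gene in gene_list without saving per-gene data files,
# then concatenate the results and save them in a single file
data <- lapply(gene_list, function(x) queryGene(x, saveData = FALSE)) %>% bind_rows()
write_tsv(data, "my_data.tsv")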

The output will be one file with summary counts per gene (summary_gene_name.txt) and a single file with the full data, concatenated for all genes (my_data.tsv), for further analysis as required.

Input¶

loadSNVdb() does not require any input. Saving its output in a variable named db (as above) means the database only needs to be loaded once per session.

queryGene() will accept the following arguments:

argument	type	default	description
gene_name	required	-	string argument, name of the (ONE) gene to be queried for. Use uppercase letters. E.g. "BRCA1" or "NTRK3".
variant	optional	NULL	character, accepts one string or a vector of strings with the changes of interest at the coding DNA level (HGVS c. notation). This is the description part of HGVS simple. E.g.: c("c.1961del", "c.3708T>G")
relevance_type	optional	c("(likely)pathogenic", "other", "LoF", "path_LoF")	character, accepts one string or a vector of strings with the variant relevance. Select at least one of the default values. For any deleterious variant use: c("(likely)pathogenic", "LoF", "path_LoF"). Types are: * LoF: variants that cause a loss-of-function, i.e. a variant with one of the following consequences according to CellBase: splice_acceptor_variant, splice_donor_variant, start_lost, stop_lost, stop_gained, frameshift_variant, inframe_insertion, inframe_variant. * (likely)pathogenic: pathogenic or likely pathogenic in ClinVar. * path_LoF: LoF AND (likely)pathogenic. * other: neither LoF nor (likely)pathogenic.
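
The four relevance types can be read as a simple decision rule on the consequence and the ClinVar annotation. The sketch below illustrates that rule only; classify_relevance() and lof_consequences are hypothetical names, not functions or objects from 01.functions.R, and it assumes the ClinVar annotation is stored as "Pathogenic" or "Likely pathogenic" (or missing).

```r
# Consequences treated as loss-of-function, as listed in the table above.
lof_consequences <- c(
  "splice_acceptor_variant", "splice_donor_variant", "start_lost",
  "stop_lost", "stop_gained", "frameshift_variant",
  "inframe_insertion", "inframe_variant"
)

# Hypothetical helper: assign one relevance type per variant.
# Assumes clinical_significance is "Pathogenic", "Likely pathogenic" or NA.
classify_relevance <- function(consequence, clinical_significance) {
  is_lof  <- consequence %in% lof_consequences
  is_path <- tolower(clinical_significance) %in%
    c("pathogenic", "likely pathogenic")  # NA never matches, so it gives FALSE
  ifelse(is_lof & is_path, "path_LoF",
    ifelse(is_lof, "LoF",
      ifelse(is_path, "(likely)pathogenic", "other")))
}

relevance <- classify_relevance(
  consequence = c("frameshift_variant", "missense_variant",
                  "missense_variant", "stop_gained"),
  clinical_significance = c("Pathogenic", NA, "Likely pathogenic", NA)
)
print(relevance)
```

Under this rule, a deleterious variant is any variant whose type is not "other", which matches the vector recommended above for relevance_type.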

Output¶

summary_gene_name.txt presents summary information split into three parts:

1. Overview counts, including
   N: total number of patients with any small genic variant of moderate or high impact in the query gene
   D: number of patients with any deleterious mutation in the query gene
   S: total number of patients with any somatic deleterious mutation in the query gene
   G: total number of patients with any germline deleterious mutation in the query gene
2. Breakdown by disease type, including
   n.deleterious: total number of patients with any deleterious mutation in the query gene, per disease type
   p.deleterious: percentage of patients with any deleterious mutation in the query gene, per disease type
   n.total: total number of patients with any non-synonymous, splice site or RNA gene variant in the query gene, per disease type
   p.total: percentage of patients with any non-synonymous, splice site or RNA gene variant in the query gene, per disease type
3. Most common variants across all disease types, including
   change: the alteration in HGVS coding DNA (c.) notation
   consequence: as predicted by CellBase
   clinical_significance: pathogenic or likely pathogenic if listed in ClinVar as such; empty otherwise
   n: counts

Note that only counts > five are included in the report, across all of the above parts.

Example of summary file (the numbers shown below are synthetic):

Gene: BRCA1
Variant(s):
Relevance: (likely)pathogenic, other, LoF, path_LoF

N: Total participants with at least one mutation on the query gene
D: Number of participants with at least one mutation on the query gene that causes loss-of-function and/or is pathogenic/likely pathogenic in ClinVar (March 2021)
S: Number of participants with at least one (likely) pathogenic/LoF somatic mutation on the query gene
G: Number of participants with at least one (likely) pathogenic/LoF germline mutation on the query gene. (Only reported if gene is internally listed as relevant for solid/blood tumour(s).)
Deleterious: a variant that is at least one of the following: likely pathogenic/ pathogenic on ClinVar, or have a consequence that according to CellBase causes loss-of-function, i.e. splice_acceptor_variant, splice_donor_variant, stop_gained, frameshift_variant, start_lost, stop_lost, inframe_deletion, inframe_insertion.)

N = 500
D = 300

S = 150
G = 75

###### Counts per indications (all with n > five)
disease_type    n.deleterious   p.deleterious   n.total p.total
OVARIAN                         33  10.1    67  12.5
ENDOMETRIAL_CARCINOMA           77  6.2     99  9.7
COLORECTAL                      55  2.5     132 7.8
...

###### Variants (all with n > five)
change  consequence clinical_significance   n
c.1961del       frameshift_variant  Pathogenic  33
c.3708T>G       missense_variant                12
c.68_69del      frameshift_variant               7
c.1846_1848del  inframe_deletion                 4

data_gene_name.tsv contains all queried variants (filtered accordingly if the variant and relevance_type arguments have been used, but with every sample of a participant listed separately), in a table format where:

1. participant_id: Genomics England unique participant identifier
2. tumour_sample_platekey: somatic sample identifier
3. disease_type: cancer type
4. type: somatic/germline
5. gene: gene_name
6. change: the alteration in HGVS coding DNA (c.) notation
7. consequence: as predicted by CellBase
8. clinical_significance: pathogenic or likely pathogenic if listed in ClinVar as such
9. relevance: a combination of consequence (7) and clinical_significance (8); see Input for more details
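
Because the data file keeps one row per sample and variant, any further analysis needs to collapse rows back to participants before counting. The following is a minimal sketch of that step on synthetic rows; the values are invented, only four of the columns described above are included, and the object names are illustrative.

```r
# Synthetic rows shaped like data_gene_name.tsv (values are made up).
data <- data.frame(
  participant_id         = c(101, 101, 102, 103, 104),
  tumour_sample_platekey = c("S1", "S2", "S3", "S4", "S5"),
  disease_type           = c("OVARIAN", "OVARIAN", "OVARIAN",
                             "COLORECTAL", "COLORECTAL"),
  relevance              = c("LoF", "LoF", "other",
                             "path_LoF", "(likely)pathogenic"),
  stringsAsFactors = FALSE
)

# Keep deleterious variants only (every relevance type except "other").
deleterious <- data[data$relevance %in%
                      c("(likely)pathogenic", "LoF", "path_LoF"), ]

# Collapse to one row per participant and disease type, so that a participant
# with several samples or variants is counted once.
per_participant <- unique(deleterious[, c("participant_id", "disease_type")])
n_deleterious <- table(per_participant$disease_type)

print(n_deleterious)
```

Counts derived this way are not masked, so small counts would still need to be censored before any export.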

Details on loadSNVdb¶

The SNVdb loaded by the function is a compilation of all non-synonymous, splice site and RNA gene small variants found per sample for the participants in the cancer programme. Somatic variants are listed for all genes; germline variants are listed only for genes with an indication of cancer predisposition, as listed in Genomics England PanelApp. The variants are provided as one CSV file per sample, generated by our internal cancer analysis team. The paths to the individual .csv files are provided in the cancer_analysis table in LabKey.

The data is complemented with a recent version (March, 2021) of ClinVar pathogenic and likely pathogenic variants.

Exporting your gene centric SNV report¶

The gene centric SNV report has been designed so that it does not contain any identifiable data, for example by masking counts less than five. For this reason, you should be able to export your report. All exports must go through the Airlock; you must not copy the report by hand.

Help and support¶

Please reach out via the Genomics England Service Desk for any issues related to running this script, including "gene-centric-cancer-SNV-report" in the title/description of your inquiry.