Gene centric SNV report for cancer participants

Deprecated

The cancer_tier_and_domain_variants table in LabKey provides a readily accessible and up-to-date version of the data provided by this pipeline. We recommend querying the cancer_tier_and_domain_variants table directly, rather than using this script.

Version control

version     data release   ClinVar
v1.0-beta   DR11           March, 2021

Summary

This page describes how to generate a "gene report" for SNVs found in participants of the Genomics England cancer cohort.

The report contains counts of all participants with a small genic variant of moderate or high impact, as identified by a set of SO terms presented in the Description section. It presents a breakdown of participant counts per disease type (censored when <5) and percentage by cancer type, as well as a breakdown of the variants by somatic/germline status and the most common variants for that gene.

The R script to generate this report can be found in:

~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R

Description

The gene report is based on counts of small genic variants of moderate or high impact in cancer participants, as identified by the set of SO terms below.

SO term      Consequence type
SO:0001893   transcript_ablation
SO:0001574   splice_acceptor_variant
SO:0001575   splice_donor_variant
SO:0001587   stop_gained
SO:0001589   frameshift_variant
SO:0001578   stop_lost
SO:0002012   start_lost
SO:0001889   transcript_amplification
SO:0001821   inframe_insertion
SO:0001822   inframe_deletion
SO:0001650   inframe_variant
SO:0001583   missense_variant
SO:0001630   splice_region_variant

The report defines deleterious variants as those that either

  • cause loss-of-function, i.e. have one of the following consequences:

    1. splice_acceptor_variant
    2. splice_donor_variant
    3. start_lost
    4. stop_lost
    5. stop_gained
    6. frameshift_variant
    7. inframe_insertion
    8. inframe_variant

    or

  • are reported as pathogenic/likely pathogenic in ClinVar (for the version, see the Version control table above)

Finally, some participants will have more than one sample. Similarly, some will carry more than one mutation on the query gene. Nonetheless, the script counts each participant only once, except in the per-variant counts, where a participant is counted once for each distinct variant they carry.

Hypothetical example:

Say that for a given query gene, we have the following variants in our database:

participant   sample   variant
1             1.1      c.196del
1             1.2      -
1             1.3      c.180A>T
1             1.3      c.196del
2             1.1      c.196del

The resulting counts will be as follows:

  • two patients with mutations on the query gene.
  • c.196del: 2, c.180A>T: 1
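
The counting rule in this example can be reproduced with a few lines of base R. This is only a minimal sketch on the toy table above; the data frame and variable names are made up for illustration and are not taken from 01.functions.R.

```r
# Toy data mirroring the hypothetical table above (not real participant data).
# A missing variant (the "-" row) is encoded as NA.
variants <- data.frame(
  participant = c(1, 1, 1, 1, 2),
  sample      = c("1.1", "1.2", "1.3", "1.3", "1.1"),
  variant     = c("c.196del", NA, "c.180A>T", "c.196del", "c.196del"),
  stringsAsFactors = FALSE
)

# Keep only rows where a variant on the query gene was found.
with_variant <- variants[!is.na(variants$variant), ]

# Participants with at least one variant: each participant is counted once,
# however many samples or variants they have.
n_participants <- length(unique(with_variant$participant))

# Per-variant counts: one count per distinct (participant, variant) pair,
# so the same variant seen in two samples of one participant counts once.
pairs <- unique(with_variant[, c("participant", "variant")])
variant_counts <- table(pairs$variant)

print(n_participants)
print(variant_counts)
```

Deduplicating on the participant (or on the participant and variant pair) before counting is what keeps repeated samples from inflating the totals.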

Usage

The code was developed to work with R/4.0.2, so inside the RE, open a terminal and run:

$ module load R/4.0.2
$ rstudio

Single queries

Then inside RStudio, source the script, load the SNVdb (only once), and query for your genes of interest, one at a time:

source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")

db <- loadSNVdb()
brca1 <- queryGene(gene_name = "BRCA1")
brca2 <- queryGene(gene_name = "BRCA2")

The output will be two files: one with summary counts (summary_gene_name.txt) and one with the full data to conduct further analysis as required (data_gene_name.tsv), where gene_name is the queried gene.

Multiple queries

Prepare a list of genes in a flat file, one gene per line, e.g. gene_list.txt:

BRCA1
BRCA2
NTRK1

Then inside RStudio, source the script, load the SNVdb (only once), and load your gene_list.txt:

# Load the packages that provide %>%, bind_rows() and write_tsv()
library(dplyr)
library(readr)

# Source script, load SNVdb and your gene_list.txt
source("~/gel_data_resources/example_scripts/gene-centric-snv-report/v1.0/scripts/01.functions.R")
db <- loadSNVdb()
gene_list <- readLines("gene_list.txt")

## Query all genes in the gene_list, concatenate and save results.
data <- lapply(gene_list, function(x) queryGene(x, saveData = FALSE)) %>% bind_rows()
write_tsv(data, "my_data.tsv")

The output will be one file with summary counts per gene (summary_gene_name.txt) and one file with the full data, concatenated for all genes, for further analysis as required (my_data.tsv).
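
Because gene_name expects uppercase gene symbols, it can help to tidy a hand-written gene list before querying. The helper below is a hypothetical sketch (clean_gene_list is not part of 01.functions.R); it trims whitespace, forces uppercase, and drops blank lines and duplicates.

```r
# Hypothetical helper, not part of 01.functions.R: tidy a raw gene list.
clean_gene_list <- function(genes) {
  genes <- toupper(trimws(genes))  # strip surrounding whitespace, force uppercase
  genes <- genes[nzchar(genes)]    # drop blank lines
  unique(genes)                    # keep each gene once, in first-seen order
}

# Example input as it might come back from readLines("gene_list.txt").
raw_genes <- c("brca1", " BRCA2 ", "", "NTRK1", "BRCA1")
gene_list <- clean_gene_list(raw_genes)
print(gene_list)
```

The cleaned vector can then be passed to the lapply() call in place of the raw readLines() result.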

Input

loadSNVdb() does not require any input. Assign its output to a variable named db (as above) so that the database is loaded only once and you do not need to refer to it again.

queryGene() will accept the following arguments:

  • gene_name (required, no default): string; the name of the single gene to query. Use uppercase letters, e.g. "BRCA1" or "NTRK3".

  • variant (optional, default NULL): character; one string or a vector of strings giving the changes of interest. This is the description part of the simple HGVS string (the examples here use coding DNA c. notation), e.g. c("c.1961del", "c.3708T>G").

  • relevance_type (optional, default c("(likely)pathogenic", "other", "LoF", "path_LoF")): character; one string or a vector of strings giving the variant relevance. Select at least one of the default values. For any deleterious variant use c("(likely)pathogenic", "LoF", "path_LoF"). The types are:

    * LoF: variants that cause a loss-of-function, i.e. a variant with one of the following consequences according to CellBase: splice_acceptor_variant, splice_donor_variant, start_lost, stop_lost, stop_gained, frameshift_variant, inframe_insertion, inframe_variant.
    * (likely)pathogenic: pathogenic or likely pathogenic in ClinVar.
    * path_LoF: LoF AND (likely)pathogenic.
    * other: neither LoF nor (likely)pathogenic.
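
To make the four relevance types concrete, here is a minimal sketch of how they could be derived from a consequence and a ClinVar significance. The function classify_relevance is hypothetical and is not the script's own implementation; it also assumes the ClinVar labels are spelled "Pathogenic" and "Likely pathogenic", with an empty string or NA when the variant is not listed.

```r
# Consequences treated as loss-of-function, copied from the LoF list above.
lof_consequences <- c(
  "splice_acceptor_variant", "splice_donor_variant", "start_lost",
  "stop_lost", "stop_gained", "frameshift_variant",
  "inframe_insertion", "inframe_variant"
)

# Hypothetical helper, not part of 01.functions.R.
# Assumes ClinVar labels "Pathogenic" / "Likely pathogenic" (case-insensitive).
classify_relevance <- function(consequence, clinical_significance) {
  is_lof  <- consequence %in% lof_consequences
  is_path <- !is.na(clinical_significance) &
    tolower(clinical_significance) %in% c("pathogenic", "likely pathogenic")
  ifelse(is_lof & is_path, "path_LoF",
    ifelse(is_lof, "LoF",
      ifelse(is_path, "(likely)pathogenic", "other")))
}

relevance <- classify_relevance(
  consequence = c("frameshift_variant", "missense_variant",
                  "frameshift_variant", "missense_variant"),
  clinical_significance = c("Pathogenic", "Likely pathogenic", "", NA)
)
print(relevance)
```

Under this reading, the three deleterious types are exactly those where at least one of the two conditions holds, which matches the recommended value c("(likely)pathogenic", "LoF", "path_LoF").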

Output

summary_gene_name.txt presents summary information split into three parts:

  1. Overview counts, including
    N: total number of patients with any small genic variant of moderate or high impact on the query gene
    D: number of patients with any deleterious mutation on the query gene
    S: total number of patients with any somatic deleterious mutation on the query gene
    G: total number of patients with any germline deleterious mutation on the query gene
  2. Breakdown by disease type, including
    n.deleterious: total number of patients with any deleterious mutation on the query gene, per disease type
    p.deleterious: percent of patients with any deleterious mutation on the query gene, per disease type
    n.total: total number of patients with any non-synonymous, splice site or RNA gene variant on the query gene, per disease type
    p.total: percent of patients with any non-synonymous, splice site or RNA gene variant on the query gene, per disease type
  3. Most common variants across all disease types, including
    change: the variant in HGVS notation (coding DNA c. notation in the example below)
    consequence: as predicted by CellBase
    clinical_significance: pathogenic or likely pathogenic if listed in ClinVar as such; empty otherwise
    n: counts
Note that only counts > five are included in the report, across all of the above parts.
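
The masking rule in the note above amounts to a simple filter on the count column. The sketch below is hypothetical (mask_small_counts is not part of 01.functions.R). This page describes the cutoff both as "censored when <5" and as "only counts > five", so the threshold is left as a parameter: min_count = 5 matches the first reading and min_count = 6 the second.

```r
# Hypothetical helper, not part of 01.functions.R: keep only rows whose count
# reaches min_count. The default of 5 is an assumption (see the lead-in).
mask_small_counts <- function(counts, count_col = "n", min_count = 5) {
  counts[counts[[count_col]] >= min_count, , drop = FALSE]
}

# Synthetic per-variant counts, for illustration only.
variant_counts <- data.frame(
  change = c("c.1961del", "c.3708T>G", "c.68_69del", "c.1846_1848del"),
  n      = c(33, 12, 7, 4),
  stringsAsFactors = FALSE
)

reportable <- mask_small_counts(variant_counts)
print(reportable)
```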

Example of summary file (the numbers shown below are synthetic):
Gene: BRCA1
Variant(s):
Relevance: (likely)pathogenic, other, LoF, path_LoF

N: Total participants with at least one mutation on the query gene
D: Number of participants with at least one mutation on the query gene that causes loss-of-function and/or is pathogenic/likely pathogenic in ClinVar (March 2021)
S: Number of participants with at least one (likely) pathogenic/LoF somatic mutation on the query gene
G: Number of participants with at least one (likely) pathogenic/LoF germline mutation on the query gene. (Only reported if gene is internally listed as relevant for solid/blood tumour(s).)
Deleterious: a variant that is at least one of the following: likely pathogenic/ pathogenic on ClinVar, or have a consequence that according to CellBase causes loss-of-function, i.e. splice_acceptor_variant, splice_donor_variant, stop_gained, frameshift_variant, start_lost, stop_lost, inframe_deletion, inframe_insertion.)

N = 500
PL = 300

S = 150
G = 75

###### Counts per indications (all with n > five)
disease_type    n.deleterious   p.deleterious   n.total p.total
OVARIAN                         33  10.1    67  12.5
ENDOMETRIAL_CARCINOMA           77  6.2     99  9.7
COLORECTAL                      55  2.5     132 7.8
...

###### Variants (all with n > five)
change  consequence clinical_significance   n
c.1961del       frameshift_variant  Pathogenic  33
c.3708T>G       missense_variant                12
c.68_69del      frameshift_variant               7
c.1846_1848del  inframe_deletion                 4

data_gene_name.tsv contains all queried variants (filtered accordingly if the variant and relevance_type arguments have been used; where a participant has multiple samples, each sample is listed), in a table format where:

  1. participant_id: Genomics England unique participant identifier
  2. tumour_sample_platekey: somatic sample identifier
  3. disease_type: cancer type
  4. type: somatic/ germline
  5. gene: gene_name
  6. change: the variant in HGVS notation (coding DNA c. notation in the examples on this page)
  7. consequence: as predicted by CellBase
  8. clinical_significance: pathogenic or likely pathogenic if listed in ClinVar as such
  9. relevance: a combination of 7 and 8 (see Input for more details)
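
As an illustration of the further analysis this file allows, the sketch below counts distinct participants per disease type. The rows are made up and only reuse the column names listed above; with a real query you would presumably read the file itself, e.g. read.delim("data_BRCA1.tsv"), instead of the inline text.

```r
# Toy rows using the nine column names listed above; all values are made up.
toy_tsv <- paste(
  "participant_id\ttumour_sample_platekey\tdisease_type\ttype\tgene\tchange\tconsequence\tclinical_significance\trelevance",
  "P1\tS1\tOVARIAN\tsomatic\tBRCA1\tc.1961del\tframeshift_variant\tPathogenic\tpath_LoF",
  "P1\tS2\tOVARIAN\tsomatic\tBRCA1\tc.1961del\tframeshift_variant\tPathogenic\tpath_LoF",
  "P2\tS3\tOVARIAN\tgermline\tBRCA1\tc.68_69del\tframeshift_variant\t\tLoF",
  "P3\tS4\tCOLORECTAL\tsomatic\tBRCA1\tc.3708T>G\tmissense_variant\t\tother",
  sep = "\n"
)

# Parse the tab-separated text into a data frame.
dat <- read.delim(text = toy_tsv, stringsAsFactors = FALSE)

# The file has one row per sample and variant, so deduplicate on participant
# and disease type before counting participants.
per_participant <- unique(dat[, c("participant_id", "disease_type")])
participants_per_disease <- table(per_participant$disease_type)
print(participants_per_disease)
```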

Details on loadSNVdb

The SNVdb loaded by the function is a compilation of all non-synonymous, splice site and RNA gene SNV variants found per sample for the participants in the cancer programme. Somatic variants are listed for all genes; germline variants are listed only for genes with an indication of cancer predisposition, as listed by Genomics England PanelApp. The variants are listed per sample in .csv files generated by our internal cancer analysis team. The paths for the individual .csv files are provided in the cancer_analysis table in LabKey.

The data is complemented with a recent version (March, 2021) of ClinVar pathogenic and likely pathogenic variants.

Exporting your gene centric SNV report

The gene centric SNV report has been designed so that it does not contain any identifiable data, for example by masking counts less than five. For this reason, you should be able to export your report. All exports must go through Airlock; you must not copy the report by hand.

Help and support

Please reach out via the Genomics England Service Desk for any issues related to running this script, including "gene-centric-cancer-SNV-report" in the title/description of your inquiry.