Somatic SVs and CNVs for a specific gene

getSVCNVperGene is an R package to query somatic samples for tiered structural and copy number variants in genes of interest. This package was developed by Genomics England.

The getSVCNVperGene package queries interpretation JSON files for somatic cancer samples, which contain tiered SVs and CNVs found in the tumour samples for our cancer participants.

If your research requires you to query other SVs and CNVs, we would recommend that you review our Structural Variant workflow that queries VCFs directly for any sample in the Genomics England database.

Due to an update to the tiering pipeline between releases 15 and 16 of the Main Programme, the current version of the getSVCNVperGene R package (version 0.94) is not compatible with the SV and CNV tiering JSON files produced for releases v16 and above. The instructions below let you query data from Main Programme releases up to and including v15. If you are starting a new research project, we recommend that you use the separate resource described at the end of this page.

Instructions

To run the package you will need to:

  1. Load the package
  2. Run the package

Load the package

The newest version of this package is available on the Genomics England cluster for R version 4.1.0. To run it, you will need to load GRON and R, as below:

module load bio/gron/0.6.1
module load lang/R/4.1.0-foss-2019b
R

Then load the library. This package was updated in August 2022, so make sure that you load version 0.94:

library(Rlabkey)
library(getSVCNVperGene0.94)

The documentation for the functions getSV and getCNV explains how the functions can be called.

Run the package

There are two functions:

  1. getSV queries structural variants.
  2. getCNV queries copy number variants.

getSV inputs

| argument | type | default | description |
| --- | --- | --- | --- |
| gene | required | - | string argument, name of a gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
| fusionOnly | optional | FALSE | logical argument, if TRUE select only SVs that result in the fusion of genes. Fusions here are defined as any SV whose breakpoints are located in the coding regions of two different genes, one of them being a gene given to the argument gene above. |
| diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
| participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
| plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument giving the file path of the Main Programme release to query. E.g. "/main-programme/main-programme_v15_2022-05-26" for release v15. |
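
The participantID and plateKey arguments each expect a plain-text file with one identifier per line. If your identifiers come from an earlier analysis step, a small helper can write such a file. The sketch below is in Python; the function name write_id_file is hypothetical and is not part of getSVCNVperGene.

```python
from pathlib import Path


def write_id_file(ids, path):
    """Write identifiers to a text file, one per line.

    Hypothetical helper, not part of getSVCNVperGene. Blank entries and
    duplicates are dropped; the original order is kept.
    Returns the number of identifiers written.
    """
    seen = set()
    cleaned = []
    for raw in ids:
        value = str(raw).strip()
        if value and value not in seen:
            seen.add(value)
            cleaned.append(value)
    # One identifier per line, each terminated by a newline
    Path(path).write_text("".join(f"{value}\n" for value in cleaned))
    return len(cleaned)
```

The resulting file can then be supplied as participantID or as plateKey, but not as both in the same call.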

The following example returns tiered structural variants in KRAS, PTEN and BRAF that result in gene fusions, restricted to breast cancer samples:

getSV(gene = c("KRAS", "PTEN", "BRAF"), fusionOnly = TRUE, diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")

getSV output format

The function outputs a tab-separated value file, with 18 columns, which are:

column name description
query Gene name queried for
plate Genomics England identifier for sequenced sample, aka platekey
participant_id Genomics England identifier for participants
disease_type Tumour type
disease_sub_type Tumour sub-type
type SV type, BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta
gene_name_bp1 If breakpoint 1 (bp1) falls in a coding region, this field has the gene name of the corresponding coding region. If bp1 falls in a non-coding region, this field says 'empty'
gene_name_bp2 If breakpoint 2 (bp2) falls in a coding region, this field has the gene name of the corresponding coding region. If bp2 falls in a non-coding region, this field says 'empty'
coordinates_bp1 Coordinates of bp1 in the format chr:start:end. Bp1 is defined as the breakpoint that occurs on the lower-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp1 is the one with the lower position
coordinates_bp2 Coordinates of bp2 in the format chr:start:end. Bp2 is defined as the breakpoint that occurs on the higher-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp2 is the one with the higher position
gene_txs_bp1 Additional annotation for gene on bp1, if any
gene_txs_bp2 Additional annotation for gene on bp2, if any
size Size of SV in basepairs. NA for translocations
tier Here, tier represents the level of actionability of the gene involved in the SV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 potentially actionable, i.e. other oncogenes. Tier3 are all other genes
mate Information for translocation, that indicates how the two chromosomes are merged. NA for other SV types
additionalTextualVariantAnnotations Additional information
actions Information about actionability, if any
cytobands Chromosomal cytoband information
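
Because the output is a plain tab-separated file, it can also be post-processed outside R. The following Python sketch reads such a file, splits the chr:start:end breakpoint coordinates into their parts, and optionally keeps a single tier. It assumes the file starts with a header row carrying the column names listed above. The function names are hypothetical, and the exact tier labels written by the package are not confirmed here, so the label is supplied by the caller and compared case-insensitively.

```python
import csv


def parse_bp_coordinates(text):
    """Split a 'chr:start:end' string into (chromosome, start, end)."""
    chrom, start, end = text.split(":")
    return chrom, int(start), int(end)


def read_sv_rows(handle, tier=None):
    """Yield getSV output rows as dicts, optionally keeping a single tier.

    Assumes a header row with the column names described above.
    Adds two keys, 'bp1' and 'bp2', holding the parsed coordinates.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Skip rows from other tiers when a tier label was requested
        if tier is not None and row["tier"].strip().lower() != tier.lower():
            continue
        row["bp1"] = parse_bp_coordinates(row["coordinates_bp1"])
        row["bp2"] = parse_bp_coordinates(row["coordinates_bp2"])
        yield row
```

A call might look like `with open(path, newline="") as fh: rows = list(read_sv_rows(fh, tier=label))`, where path is wherever you saved the getSV output and label is a tier value as it appears in your file.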

getCNV inputs

| argument | type | default | description |
| --- | --- | --- | --- |
| gene | required | - | string argument, name of a gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
| gain | optional | NULL | logical argument, if TRUE selects only gains, if FALSE selects only losses, if NULL selects all |
| diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
| participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
| plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument giving the file path of the Main Programme release to query. E.g. "/main-programme/main-programme_v15_2022-05-26" for release v15. |

The following example returns tiered copy number variants in KRAS, PTEN and BRAF for breast cancer samples:

getCNV(gene = c("KRAS", "PTEN", "BRAF"), diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")

getCNV output format

The function outputs a tab-separated value file, with 12 columns, which are:

column name description
query Gene name queried for
plate Genomics England identifier for sequenced sample, aka platekey
participant_id Genomics England identifier for participants
disease_type Tumour type
disease_sub_type Tumour sub-type
tier Here, tier represents the level of actionability of the genes involved in the CNV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 potentially actionable, i.e. other oncogenes. Tier3 are all other genes, although not included
coordinates CNV coordinates in the format: chr:start-end
cn copy number of the region indicated in coordinates. Note that cn = 2 corresponds to LOH
type 'CNV' is the old annotation, indicating cn is different from what is expected. 'GAIN' cn > 2, 'LOSS' cn < 2, 'LOH' cn = 2
genes annotation
cytobands Chromosomal cytoband information
size Size of the CNV in basepairs
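
The coordinates and type columns follow simple rules that can be expressed in a few lines. The Python sketch below parses the chr:start-end format and maps a copy number to the GAIN, LOSS or LOH label as described in the table above. The function names are hypothetical and the code is illustrative only; it is not taken from the package.

```python
def parse_cnv_coordinates(text):
    """Split a 'chr:start-end' string into (chromosome, start, end)."""
    chrom, span = text.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def expected_type(cn):
    """Map a copy number to the label described for the 'type' column.

    GAIN for cn > 2, LOSS for cn < 2, and LOH for cn == 2, since rows
    reported with cn = 2 in this output correspond to LOH.
    """
    if cn > 2:
        return "GAIN"
    if cn < 2:
        return "LOSS"
    return "LOH"
```

Rows carrying the older 'CNV' annotation in the type column can be relabelled this way from their cn value.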

For Main Programme Research Releases 16 and above

Changes to the tiering pipeline in release 16 introduced a breaking change in the format of the output JSON files. To address this, and to make the tiered complex variant data for releases 16+ more accessible, we have created SQLite3 resources that contain the same information that the getSVCNVperGene package would generate. These resources are located in:

/gel_data_resources/main_programme/tiering_data_cancer/GRCh38/tiered_svcnv_data

The data in these resources can be accessed from either R or Python using the relevant SQLite connectors. As SQLite3 is a common database file format, other languages can also connect to it; however, we have only tested, and can only support, the R and Python implementations.

As the sqlite3 module is part of the Python standard library, you simply need to import it, create a connection and query the database with SQL. An example of this is included below:

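import sqlite3
import pandas as pd

# The path must point to the SQLite database file to be queried
conn = sqlite3.connect("/gel_data_resources/main_programme/tiering_data_cancer/GRCh38")

# sql_query is a string containing the SQL statement to run
df = pd.read_sql_query(sql_query, conn)
conn.close()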
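
Table names in these resources appear to be release-specific (the example query further down uses v17_cnv), so it can help to list the available tables before writing a query. The sketch below uses only the standard library; it opens a database file read-only through an SQLite URI filename and reads the table names from sqlite_master. The helper names open_readonly and list_tables are hypothetical, and db_path stands for the database file you want to inspect.

```python
import sqlite3


def open_readonly(db_path):
    """Open an SQLite database file in read-only mode via a URI filename."""
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def list_tables(conn):
    """Return the names of all tables in the connected database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Opening the file read-only guards against accidental writes to a shared resource; any attempted INSERT or UPDATE on such a connection raises sqlite3.OperationalError.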

In R, you will need to install the DBI and RSQLite packages before running a query such as:

library(DBI)

# The path must point to the SQLite database file to be queried
con <- dbConnect(RSQLite::SQLite(),
  "/gel_data_resources/main_programme/tiering_data_cancer/GRCh38",
  flags = RSQLite::SQLITE_RO)

# sql_query is a character string containing the SQL statement to run
df <- dbGetQuery(con, sql_query)
dbDisconnect(con)

Once in a dataframe, the data can be more easily processed or written to a file.

In both examples a query in SQL such as:

SELECT
  DISTINCT disease_sub_type,
  COUNT(disease_sub_type) AS count_of_disease_sub_types
FROM
  v17_cnv
WHERE
  gene = 'BRCA1' AND
  type = 'LOH'
GROUP BY disease_sub_type;

can be saved as a string to the variable sql_query and used in the script.
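
When the gene of interest changes between runs, a parameterised query avoids editing the SQL text by hand. The following Python sketch wraps the query above in a function and passes the gene through a ? placeholder. The table name v17_cnv and the columns gene, type and disease_sub_type are taken from the example query; the function name count_loh_by_subtype is hypothetical.

```python
import sqlite3

# Same query as above, with the gene supplied at run time via a placeholder
LOH_BY_SUBTYPE = """
SELECT disease_sub_type, COUNT(disease_sub_type) AS count_of_disease_sub_types
FROM v17_cnv
WHERE gene = ? AND type = 'LOH'
GROUP BY disease_sub_type
ORDER BY disease_sub_type
"""


def count_loh_by_subtype(conn, gene):
    """Return {disease_sub_type: count} of LOH rows for one gene."""
    return dict(conn.execute(LOH_BY_SUBTYPE, (gene,)).fetchall())
```

Given an open connection conn, `count_loh_by_subtype(conn, "BRCA1")` would return the same counts as the query shown above, as a dictionary keyed by disease sub-type.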

Note on variant calling

Structural variants (SVs) and long indels (>50bp) are called with Manta (version 0.28.0), which combines paired and split-read evidence for SV discovery and scoring. Copy number variants (CNVs) are called with Canvas (version 1.3.1), which uses coverage and minor allele frequencies to assign copy number. These tools filter out the following variant calls:

  • Manta-called SVs with a normal sample depth near one or both variant break-ends three times higher than the chromosomal mean
  • Manta-called SVs with somatic quality score < 30
  • Manta-called somatic deletions and duplications with length > 10kb
  • Manta-called somatic small variants (<1kb) where the fraction of reads with MAPQ0 around either break-end is > 0.4
  • Canvas-called somatic CNVs with length < 10kb
  • Canvas-called somatic CNVs with quality score < 10
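
The filter rules above can be read as simple predicates. The Python sketch below restates them for illustration only; it is not the pipeline's implementation. All function and parameter names are hypothetical, 1kb is taken to be 1,000 bp, and "three times higher than the chromosomal mean" is interpreted as a depth-to-mean ratio above 3, which is an assumption.

```python
KB = 1_000  # assumption: 1kb = 1,000 basepairs


def manta_call_is_filtered(sv_type, length_bp, somatic_score,
                           normal_depth_ratio, mapq0_fraction):
    """Return True if a Manta call would be removed by the rules listed above.

    length_bp may be None for translocations, which have no size.
    normal_depth_ratio is the normal-sample depth near a break-end divided
    by the chromosomal mean depth (an interpretation of the first rule).
    """
    if normal_depth_ratio > 3:
        return True
    if somatic_score < 30:
        return True
    if sv_type in {"DEL", "DUP"} and length_bp is not None and length_bp > 10 * KB:
        return True
    if length_bp is not None and length_bp < 1 * KB and mapq0_fraction > 0.4:
        return True
    return False


def canvas_call_is_filtered(length_bp, quality):
    """Return True if a Canvas CNV call would be removed by the rules above."""
    return length_bp < 10 * KB or quality < 10
```

Note that the two length thresholds are complementary: Manta deletions and duplications longer than 10kb are dropped, while Canvas CNVs shorter than 10kb are dropped.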

Last update: November 3, 2023