Somatic SVs and CNVs for a specific gene¶
getSVCNVperGene is an R package to query somatic samples for tiered structural and copy number variants in genes of interest. This package was developed by Genomics England.
The getSVCNVperGene package queries interpretation JSON files for somatic cancer samples, which contain tiered SVs and CNVs found in the tumour samples for our cancer participants.
If your research requires you to query other SVs and CNVs, we would recommend that you review our Structural Variant workflow that queries VCFs directly for any sample in the Genomics England database.
Due to an update to the Tiering pipeline between releases 15 and 16 of the Main Programme, the current version of the getSVCNVperGene R package, version 0.94, is not compatible with the SV and CNV tiering JSON files produced for releases v16 and above. The instructions below allow you to query data provided for Main Programme releases up to and including v15. If you are starting a new research project, we recommend that you use the separate resource described at the end of this page.
Instructions¶
To run the package you will need to:
- Load the package
- Run the package
Load the package¶
The newest version of this package is available in the Genomics England cluster for R version 4.1.0. To run it, you will need to load GRON and R, as below:
For using the package:
And then, load the library:
(We updated this package in August 2022. Please make sure that you load version 0.94.)
The documentation for the functions getSV and getCNV explains how the functions can be called.
Run the package¶
There are two functions:
- getSV queries structural variants.
- getCNV queries copy number variants.
getSV inputs¶
argument | type | default | description |
---|---|---|---|
gene | required | - | string argument, name of gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
fusionOnly | optional | FALSE | logical argument, select only SVs that result in the fusion of genes. Fusions here are defined as any SV whose breakpoints are located in the coding regions of two different genes, one of them being the gene name given to the argument gene above. |
diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument containing the filepath for the Release Programme version. E.g. "/main-programme/main-programme_v15_2022-05-26" for release version /main-programme/main-programme_v15_2022-05-26. |
The following example returns tiered structural variants in KRAS, PTEN and BRAF that result in gene fusions, restricted to breast cancer samples:
getSV(gene = c("KRAS", "PTEN", "BRAF"), fusionOnly = TRUE, diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")
getSV output format¶
The function outputs a tab-separated value file, with 18 columns, which are:
column name | description |
---|---|
query | Gene name queried for |
plate | Genomics England identifier for sequenced sample, aka platekey |
participant_id | Genomics England identifier for participants |
disease_type | Tumour type |
disease_sub_type | Tumour sub-type |
type | SV type, BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta |
gene_name_bp1 | If breakpoint 1 (bp1) falls in a coding region, this field has the gene name of the corresponding coding region. If bp1 falls in a non-coding region, this field says 'empty' |
gene_name_bp2 | If breakpoint 2 (bp2) falls in a coding region, this field has the gene name of the corresponding coding region. If bp2 falls in a non-coding region, this field says 'empty' |
coordinates_bp1 | Coordinates of bp1 in the format chr:start:end. Bp1 is defined as the breakpoint on the lower-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp1 is the one with the lowest position |
coordinates_bp2 | Coordinates of bp2 in the format chr:start:end. Bp2 is defined as the breakpoint on the higher-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp2 is the one with the highest position |
gene_txs_bp1 | Additional annotation for gene on bp1, if any |
gene_txs_bp2 | Additional annotation for gene on bp2, if any |
size | Size of SV in basepairs. NA for translocations |
tier | Here, tier represents the level of actionability of the gene involved in the SV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 potentially actionable, i.e. other oncogenes. Tier3 are all other genes |
mate | Information for translocation, that indicates how the two chromosomes are merged. NA for other SV types |
additionalTextualVariantAnnotations | Additional information |
actions | Information about actionability, if any |
cytobands | Chromosomal cytoband information |
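The tab-separated output can also be post-processed outside R. The sketch below is a minimal illustration in Python using only the standard library: it reads rows with the column names listed above and keeps those that match the fusion definition (both breakpoints annotated with a gene, the two genes differ, and one of them is the queried gene). The sample rows, the platekeys and the `is_fusion` and `read_fusions` helpers are made up for illustration and are not part of the package. It also assumes each gene_name field holds a single gene name.

```python
import csv
import io

# Made-up rows for illustration only; real output files contain all 18 columns.
SAMPLE_TSV = (
    "query\tplate\ttype\tgene_name_bp1\tgene_name_bp2\n"
    "BRAF\tPLATE_A\tBND\tBRAF\tKIAA1549\n"
    "BRAF\tPLATE_B\tDEL\tBRAF\tBRAF\n"
    "BRAF\tPLATE_C\tINV\tBRAF\tempty\n"
)


def is_fusion(row):
    """True if both breakpoints fall in different genes, one being the queried gene."""
    gene1, gene2 = row["gene_name_bp1"], row["gene_name_bp2"]
    if "empty" in (gene1, gene2):
        # At least one breakpoint is in a non-coding region.
        return False
    return gene1 != gene2 and row["query"] in (gene1, gene2)


def read_fusions(handle):
    """Read a tab-separated getSV-style table and return only the fusion rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if is_fusion(row)]


fusions = read_fusions(io.StringIO(SAMPLE_TSV))
for row in fusions:
    print(row["plate"], row["gene_name_bp1"], row["gene_name_bp2"])
```

To use this on a real output file, pass an open file handle to `read_fusions` in place of the in-memory sample.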
getCNV inputs¶
argument | type | default | description |
---|---|---|---|
gene | required | - | string argument, name of gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
gain | optional | NULL | logical argument, if TRUE selects only gain, if FALSE selects only losses, if NULL selects all |
diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument containing the filepath for the Release Programme version. E.g. use "/main-programme/main-programme_v15_2022-05-26" for release version /main-programme/main-programme_v15_2022-05-26. |
The following example returns tiered copy number variants in KRAS, PTEN and BRAF, restricted to breast cancer samples:
getCNV(gene = c("KRAS", "PTEN", "BRAF"), diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")
getCNV output format¶
The function outputs a tab-separated value file, with 12 columns, which are:
column name | description |
---|---|
query | Gene name queried for |
plate | Genomics England identifier for sequenced sample, aka platekey |
participant_id | Genomics England identifier for participants |
disease_type | Tumour type |
disease_sub_type | Tumour sub-type |
tier | Here, tier represents the level of actionability of the genes involved in the CNV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes, although these are not included |
coordinates | CNV coordinates in the format: chr:start-end |
cn | copy number of the region indicated in coordinates. Note that cn = 2 corresponds to LOH |
type | 'CNV' is the old annotation, indicating cn is different from what is expected. 'GAIN' cn > 2, 'LOSS' cn < 2, 'LOH' cn = 2 |
genes | annotation |
cytobands | Chromosomal cytoband information |
size | Size of CNV in basepairs |
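The `coordinates`, `cn` and `type` columns follow simple rules that can be sketched in a few lines of Python. The example below is a hedged illustration only: the helper names `parse_cnv_coordinates` and `classify_cn` are hypothetical and the coordinate string is made up. The classification mirrors the table above (cn > 2 is a gain, cn < 2 is a loss, and a reported region with cn = 2 is LOH), so it only applies to rows that already appear in the getCNV output.

```python
def parse_cnv_coordinates(coordinates):
    """Split a 'chr:start-end' string into (chromosome, start, end)."""
    chrom, span = coordinates.split(":", 1)
    start, end = span.split("-", 1)
    return chrom, int(start), int(end)


def classify_cn(cn):
    """Map a copy number from the getCNV output to GAIN, LOSS or LOH."""
    if cn > 2:
        return "GAIN"
    if cn < 2:
        return "LOSS"
    # In this output a reported region with cn == 2 corresponds to LOH.
    return "LOH"


# Made-up coordinate string for illustration.
chrom, start, end = parse_cnv_coordinates("chr1:1000-5000")
print(chrom, start, end, classify_cn(4))
```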
For Main Programme Research Releases 16 and above¶
Changes in the tiering pipeline in release 16 introduced a breaking change in the format of the output JSON files. To counter this, and to increase the accessibility of the tiered complex variant data for Releases 16+, we have created SQLite3 resources which contain the same information that would be generated by the getSVCNVperGene package. These resources are located in:
/gel_data_resources/main_programme/tiering_data_cancer/GRCh38/tiered_svcnv_data
The data in these resources can be accessed from both R and Python using the relevant SQLite connectors. As SQLite3 is a common database file format, other languages can also connect to it; however, we have only tested, and can only support, R and Python implementations.
As the sqlite3 module is part of the Python standard library, you simply need to import it, create a connection and query the database with SQL. An example of this is included below:
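import sqlite3
import pandas as pd
# Path as given in this guide; adjust it to the SQLite file you want to query
conn = sqlite3.connect("/gel_data_resources/main_programme/tiering_data_cancer/GRCh38")
# sql_query is a string holding the SQL statement to run
df = pd.read_sql_query(sql_query, conn)
conn.close()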
For R, you need to install the additional DBI and RSQLite libraries before proceeding with a query such as:
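library(DBI)
# Open a read-only connection; SQLITE_RO is qualified because RSQLite is not attached
con <- dbConnect(RSQLite::SQLite(),
"/gel_data_resources/main_programme/tiering_data_cancer/GRCh38",
flags = RSQLite::SQLITE_RO)
# sql_query is a string holding the SQL statement to run
df <- dbGetQuery(con, sql_query)
dbDisconnect(con)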
Once in a dataframe, the data can be more easily processed or output to a file.
In both examples a query in SQL such as:
SELECT
DISTINCT disease_sub_type,
COUNT(disease_sub_type) AS count_of_disease_sub_types
FROM
v17_cnv
WHERE
gene = 'BRCA1' AND
type = 'LOH'
GROUP BY disease_sub_type;
can be stored as a string, e.g. sql_str, and used in the script in place of the query variable.
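To try the query pattern without access to the real resource, the sketch below runs a variant of it against a small in-memory SQLite database. This is an illustration under stated assumptions: the toy table reuses the name `v17_cnv` and only the three columns that appear in the query above, the rows and subtype labels are made up, and the `count_sub_types` helper is hypothetical. The query drops `DISTINCT` (it is redundant with `GROUP BY`), adds an `ORDER BY` for stable output, and uses `?` placeholders so the gene and type are passed as parameters rather than pasted into the SQL string.

```python
import sqlite3

SQL_STR = """
SELECT
    disease_sub_type,
    COUNT(disease_sub_type) AS count_of_disease_sub_types
FROM
    v17_cnv
WHERE
    gene = ? AND
    type = ?
GROUP BY disease_sub_type
ORDER BY disease_sub_type;
"""


def count_sub_types(conn, gene, cnv_type):
    """Count rows per disease_sub_type for one gene and one CNV type."""
    return conn.execute(SQL_STR, (gene, cnv_type)).fetchall()


# Toy in-memory database with made-up rows; the real schema may have more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v17_cnv (gene TEXT, type TEXT, disease_sub_type TEXT)")
conn.executemany(
    "INSERT INTO v17_cnv VALUES (?, ?, ?)",
    [
        ("BRCA1", "LOH", "subtype_a"),
        ("BRCA1", "LOH", "subtype_a"),
        ("BRCA1", "LOH", "subtype_b"),
        ("BRCA1", "GAIN", "subtype_a"),
        ("KRAS", "LOH", "subtype_b"),
    ],
)

counts = count_sub_types(conn, "BRCA1", "LOH")
print(counts)
conn.close()
```

Against a database file on disk, the standard library also supports opening a read-only connection with the URI form, for example `sqlite3.connect("file:<path to database>?mode=ro", uri=True)`.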
Note on variant calling¶
Structural variant (SV) and long indel (>50bp) calling is performed with Manta (version 0.28.0), which combines paired and split-read evidence for SV discovery and scoring. Copy number variants (CNVs) are called with Canvas (version 1.3.1), which employs coverage and minor allele frequencies to assign copy number. These tools filter out the following variant calls:
- Manta-called SVs with a normal sample depth near one or both variant break-ends three times higher than the chromosomal mean
- Manta-called SVs with somatic quality score < 30
- Manta-called somatic deletions and duplications with length > 10kb
- Manta-called somatic small variants (<1kb) where the fraction of reads with MAPQ0 around either break-end is > 0.4
- Canvas-called somatic CNVs with length < 10kb
- Canvas-called somatic CNVs with quality score < 10
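The filtering rules listed above can be expressed as simple predicates. The Python sketch below is only an illustration of the thresholds and is not the pipeline's implementation: the `SVCall` and `CNVCall` classes and all of their field names are hypothetical, and 1kb is taken to mean 1,000 bp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SVCall:
    # Hypothetical record; field names are not taken from Manta output.
    sv_type: str               # e.g. "DEL", "DUP", "INV", "BND"
    length: Optional[int]      # in bp; None for translocations
    somatic_score: float       # somatic quality score
    normal_depth_ratio: float  # highest normal depth near a break-end / chromosomal mean
    mapq0_fraction: float      # highest fraction of MAPQ0 reads around a break-end


@dataclass
class CNVCall:
    # Hypothetical record; field names are not taken from Canvas output.
    length: int     # in bp
    quality: float  # quality score


def sv_is_filtered(sv):
    """Return True if a Manta-called SV would be removed by the rules above."""
    if sv.normal_depth_ratio > 3:
        return True
    if sv.somatic_score < 30:
        return True
    if sv.sv_type in ("DEL", "DUP") and sv.length is not None and sv.length > 10_000:
        return True
    if sv.length is not None and sv.length < 1_000 and sv.mapq0_fraction > 0.4:
        return True
    return False


def cnv_is_filtered(cnv):
    """Return True if a Canvas-called CNV would be removed by the rules above."""
    return cnv.length < 10_000 or cnv.quality < 10


print(sv_is_filtered(SVCall("DEL", 20_000, 50, 1.0, 0.0)))
print(cnv_is_filtered(CNVCall(50_000, 20)))
```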