The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

Somatic SVs and CNVs for a specific gene

Deprecated

Due to an update to the Tiering pipeline that occurred between releases 15 and 16 of the 100kGP, the current version of the R getSVCNVperGene package, version 0.94, is no longer compatible with the SV and CNV tiering JSON files produced for releases v16 and above. The instructions below will allow you to query data provided for 100kGP releases prior to and including v15.

getSVCNVperGene is an R package to query somatic samples for tiered structural and copy number variants in genes of interest. This package was developed by Genomics England.

The getSVCNVperGene package queries interpretation JSON files for somatic cancer samples, which contain tiered SVs and CNVs found in the tumour samples for our cancer participants.

If your research requires you to query other SVs and CNVs, we would recommend that you review our Structural Variant workflow that queries VCFs directly for any sample in the Genomics England database.

Instructions

To run the package you will need to:

  1. Load the package
  2. Run the package

Load the package

The newest version of this package is available on the Genomics England cluster for R version 4.1.0. To run it, load GRON and R as below:

module load bio/gron/0.6.1
module load lang/R/4.1.0-foss-2019b
R

Then load the library:

(We updated this package in August 2022. Make sure you load version 0.94.)

library(Rlabkey)
library(getSVCNVperGene0.94)

The documentation for the functions getSV and getCNV explains how the functions can be called.

Run the package

There are two functions:

  1. getSV queries structural variants.
  2. getCNV queries copy number variants.

getSV inputs

| argument | type | default | description |
| --- | --- | --- | --- |
| gene | required | - | string argument, name of a gene or a vector of gene names, e.g. gene=c('EGFR', 'MET') |
| fusionOnly | optional | FALSE | logical argument; if TRUE, select only SVs that result in the fusion of genes. Fusions here are defined as any SV whose breakpoints are located in the coding regions of two different genes, one of which is a gene given in the gene argument above. |
| diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
| participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
| plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument, the file path for the release programme version, e.g. "/main-programme/main-programme_v15_2022-05-26" for release v15. |

The following example returns tiered structural variants in KRAS, PTEN and BRAF that result in gene fusions, restricted to breast cancer samples:

getSV(gene = c("KRAS", "PTEN", "BRAF"), fusionOnly = TRUE, diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")
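The participantID and plateKey arguments each take the path to a plain-text file with one identifier per line. A minimal sketch of preparing such a file in R is given here. The identifiers are made-up placeholders, the file name is arbitrary, and the getSV call is guarded so that it only runs where the package is installed:

```r
# Placeholder participant IDs, invented for illustration only.
participant_ids <- c("111000001", "111000002", "111000003")

# Write one ID per line, the layout the participantID argument expects.
id_file <- file.path(tempdir(), "participant_ids.txt")
writeLines(participant_ids, id_file)

# Read the file back to confirm it holds one ID per line.
ids_on_disk <- readLines(id_file)

# Only attempt the query where the package is available (i.e. on the cluster).
if (requireNamespace("getSVCNVperGene0.94", quietly = TRUE)) {
  library(Rlabkey)
  library(getSVCNVperGene0.94)
  getSV(
    gene = c("KRAS", "PTEN", "BRAF"),
    participantID = id_file,  # do not combine with plateKey
    release_version = "/main-programme/main-programme_v15_2022-05-26"
  )
}
```

The same approach should work for plateKey, with platekeys in place of participant IDs.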

getSV output format

The function outputs a tab-separated value file, with 18 columns, which are:

| column name | description |
| --- | --- |
| query | Gene name queried for |
| plate | Genomics England identifier for the sequenced sample, also known as the platekey |
| participant_id | Genomics England identifier for participants |
| disease_type | Tumour type |
| disease_sub_type | Tumour sub-type |
| type | SV type: BND (translocation), DEL (deletion), DUP (duplication), INV (inversion) or INS (insertion), as defined by Manta |
| gene_name_bp1 | If breakpoint 1 (bp1) falls in a coding region, this field has the gene name of the corresponding coding region. If bp1 falls in a non-coding region, this field says 'empty' |
| gene_name_bp2 | If breakpoint 2 (bp2) falls in a coding region, this field has the gene name of the corresponding coding region. If bp2 falls in a non-coding region, this field says 'empty' |
| coordinates_bp1 | Coordinates of bp1 in the format chr:start:end. Bp1 is the breakpoint on the lower-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp1 is the one with the lowest position |
| coordinates_bp2 | Coordinates of bp2 in the format chr:start:end. Bp2 is the breakpoint on the higher-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp2 is the one with the highest position |
| gene_txs_bp1 | Additional annotation for the gene on bp1, if any |
| gene_txs_bp2 | Additional annotation for the gene on bp2, if any |
| size | Size of the SV in basepairs. NA for translocations |
| tier | The level of actionability of the gene involved in the SV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes |
| mate | Information for translocations that indicates how the two chromosomes are merged. NA for other SV types |
| additionalTextualVariantAnnotations | Additional information |
| actions | Information about actionability, if any |
| cytobands | Chromosomal cytoband information |
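As a rough illustration of working with these columns, the sketch below applies the fusion definition given for fusionOnly (both breakpoints in coding regions of two different genes) and splits coordinates_bp1 into its chr:start:end parts. The data frame is mock data with invented values and only a subset of the columns. The real output would first need to be read in, for example with read.delim, from wherever the file was written.

```r
# Mock rows using a subset of the documented column names; all values are invented.
sv <- data.frame(
  query = c("BRAF", "BRAF", "KRAS"),
  type = c("DUP", "DEL", "INV"),
  gene_name_bp1 = c("KIAA1549", "BRAF", "KRAS"),
  gene_name_bp2 = c("BRAF", "BRAF", "empty"),
  coordinates_bp1 = c("chr7:138800000:138800001",
                      "chr7:140700000:140700001",
                      "chr12:25200000:25200001"),
  stringsAsFactors = FALSE
)

# A fusion needs both breakpoints in coding regions of two different genes.
sv$is_fusion <- sv$gene_name_bp1 != "empty" &
  sv$gene_name_bp2 != "empty" &
  sv$gene_name_bp1 != sv$gene_name_bp2

# Split "chr:start:end" into separate columns.
parts <- do.call(rbind, strsplit(sv$coordinates_bp1, ":", fixed = TRUE))
sv$chr_bp1 <- parts[, 1]
sv$start_bp1 <- as.integer(parts[, 2])
sv$end_bp1 <- as.integer(parts[, 3])

# Keep only the rows that meet the fusion definition.
fusions <- sv[sv$is_fusion, ]
print(fusions[, c("query", "type", "gene_name_bp1", "gene_name_bp2")])
```

The chromosome naming (chr7 rather than 7) is an assumption made for the mock data; check it against your own output before relying on it.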

getCNV inputs

| argument | type | default | description |
| --- | --- | --- | --- |
| gene | required | - | string argument, name of a gene or a vector of gene names, e.g. gene=c('EGFR', 'MET') |
| gain | optional | NULL | logical argument; if TRUE select only gains, if FALSE select only losses, if NULL select all |
| diseaseType | optional | NULL | string argument, a disease_type or a vector of disease types. If not given, return all disease types. |
| participantID | optional | NULL | a file with one participantID per line. If not given, query all participants. |
| plateKey | optional | NULL | a file with one sample platekey per line; note that participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v15_2022-05-26" | string argument, the file path for the release programme version, e.g. "/main-programme/main-programme_v15_2022-05-26" for release v15. |

The following example returns tiered copy number variants in KRAS, PTEN and BRAF in breast cancer samples:

getCNV(gene = c("KRAS", "PTEN", "BRAF"), diseaseType = "Breast", release_version = "/main-programme/main-programme_v15_2022-05-26")

getCNV output format

The function outputs a tab-separated value file, with 12 columns, which are:

| column name | description |
| --- | --- |
| query | Gene name queried for |
| plate | Genomics England identifier for the sequenced sample, also known as the platekey |
| participant_id | Genomics England identifier for participants |
| disease_type | Tumour type |
| disease_sub_type | Tumour sub-type |
| tier | The level of actionability of the genes involved in the CNV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes, although these are not included |
| coordinates | CNV coordinates in the format chr:start-end |
| cn | Copy number of the region indicated in coordinates. Note that cn = 2 corresponds to LOH |
| type | 'CNV' is the old annotation, indicating that cn differs from what is expected. 'GAIN' means cn > 2, 'LOSS' means cn < 2 and 'LOH' means cn = 2 |
| genes | Annotation |
| cytobands | Chromosomal cytoband information |
| size | Size of the CNV in basepairs |
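The relationship between cn and type described above can be sketched as follows, together with splitting the coordinates column into its chr:start-end parts. The data frame is mock data with invented values and only a subset of the columns. The helper classify_cn is a hypothetical name, not part of the package, and the older 'CNV' annotation is not handled.

```r
# Mock rows using a subset of the documented column names; all values are invented.
cnv <- data.frame(
  query = c("PTEN", "KRAS", "BRAF"),
  coordinates = c("chr10:87800000-87900000",
                  "chr12:25200000-25300000",
                  "chr7:140700000-140900000"),
  cn = c(1L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map copy number to the documented type labels.
# GAIN is cn > 2, LOSS is cn < 2, and cn = 2 is reported as LOH.
classify_cn <- function(cn) {
  ifelse(cn > 2, "GAIN", ifelse(cn < 2, "LOSS", "LOH"))
}
cnv$type_from_cn <- classify_cn(cnv$cn)

# Split "chr:start-end" into separate columns.
pattern <- "^([^:]+):([0-9]+)-([0-9]+)$"
cnv$chr <- sub(pattern, "\\1", cnv$coordinates)
cnv$start <- as.integer(sub(pattern, "\\2", cnv$coordinates))
cnv$end <- as.integer(sub(pattern, "\\3", cnv$coordinates))

# Keep only the gains, similar in spirit to calling getCNV with gain = TRUE.
gains <- cnv[cnv$type_from_cn == "GAIN", ]
print(gains[, c("query", "coordinates", "cn", "type_from_cn")])
```

Filtering after the fact like this is only an illustration. Where possible, passing gain = TRUE or gain = FALSE to getCNV keeps the selection inside the package.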