
Somatic SVs and CNVs for a specific gene

The getSVCNVperGene package retrieves tiered structural and copy number variants from genes of interest within somatic samples processed through the Genomics England Interpretation pipeline.

The newly released version 2 of getSVCNVperGene supports 100kGP (or main programme) data releases and now includes compatibility with NHS-GMS.

For 100kGP data release version 15 and below, the package iterates through interpreted JSON files to extract variant information directly. For newer Main Programme releases (version 16 and higher) and for NHS-GMS, the package queries a pre-compiled database, so it does not need to parse multiple JSON files.

If your research requires you to query other SVs and CNVs, we would recommend that you review our Structural Variant workflow that queries VCFs directly for any sample in the Genomics England database.

Installing the package

We offer the package as a source tar file that can be installed directly:

/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene_2.0.0.tar.gz

Alternatively, we provide a Singularity image with the package and its dependencies pre-installed:

/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene.sif

Installing from source

The package relies on several third-party R packages that must be installed and loaded in the session before use.

library(tidyverse)
library(Rlabkey)
library(RSQLite)

If you are using a version of Rlabkey after 3.0.0, please run the command below:

labkey.setWafEncoding(FALSE)

You can install the compiled package in the Research Environment using the following steps:

# Set default library location
.libPaths(c(.libPaths(), "/tools/aws-workspace-apps/ce/R/4.0.2"))

# Install package from source
install.packages("/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene_2.0.0.tar.gz")

# Load package
library(getSVCNVperGene)

To uninstall the package, take the following steps:

detach("package:getSVCNVperGene", unload = TRUE)
remove.packages("getSVCNVperGene")

Using the Singularity container

You can run the containerised version of the package as follows:

module load singularity/4.1.1

singularity exec /gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene.sif Rscript <SCRIPT>
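As an illustration of what could be passed in place of <SCRIPT>, here is a minimal sketch of an R script. The file name query_sv.R, the comma-separated gene argument and the helper parse_genes are assumptions made for this example and are not part of the package. The query itself only runs when the package is available and at least one gene was supplied.

```r
# query_sv.R: hypothetical script name, passed in place of <SCRIPT>.
# Assumed usage: Rscript query_sv.R EGFR,MET

# Hypothetical helper: turn comma-separated command-line arguments
# into a character vector of gene names.
parse_genes <- function(args) {
  if (length(args) == 0) {
    stop("Supply at least one gene name, e.g. EGFR,MET")
  }
  genes <- trimws(unlist(strsplit(args, ",", fixed = TRUE)))
  genes[nzchar(genes)]
}

args <- commandArgs(trailingOnly = TRUE)

# Run the query only when genes were supplied and the package is installed,
# i.e. inside the container or after installing from source.
if (length(args) > 0 && requireNamespace("getSVCNVperGene", quietly = TRUE)) {
  library(tidyverse)
  library(Rlabkey)
  library(RSQLite)
  library(getSVCNVperGene)
  getSV(gene = parse_genes(args))
}
```

With this layout, the gene list would follow the script name on the singularity exec command line.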

For more information on running Singularity containers in the Research Environment, please refer to this page.

Using the package

Using with 100kGP releases 15 and earlier

A change in the tiering pipeline between 100kGP data releases 15 and 16 altered the format of the output JSON files, so JSON files generated for release 15 and earlier differ in structure from those produced for release 16 and later. The new JSON format also applies to NHS-GMS samples. As a result, when querying variants from release 15 and earlier, the package follows a different internal process and yields a different output format from the one used for release 16 and later.

The main functions of the package and their input parameters remain identical, regardless of the release being queried. However, the output format will vary. Detailed descriptions of both input and output are provided below.
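Because the output columns depend on the release, it can help to check which layout to expect before processing results. The sketch below assumes that release paths follow the naming pattern used in the release_version examples on this page (a "_v<number>" segment for Main Programme releases, and "nhs-gms" in NHS-GMS paths). The helper expected_output_format is hypothetical; it restates the rule described above and is not the package's internal logic.

```r
# Hypothetical helper: report which output layout to expect for one release path.
expected_output_format <- function(release_version) {
  # NHS-GMS releases use the newer format.
  if (grepl("nhs-gms", release_version, fixed = TRUE)) {
    return("release16_plus")
  }
  # Extract the version number from a path such as
  # "/main-programme/main-programme_v18_2023-12-21" (assumed naming pattern).
  match <- regmatches(release_version, regexpr("_v[0-9]+", release_version))
  if (length(match) == 0) {
    stop("Could not find a version number in: ", release_version)
  }
  version <- as.integer(sub("_v", "", match, fixed = TRUE))
  if (version >= 16) "release16_plus" else "release15_and_earlier"
}

expected_output_format("/main-programme/main-programme_v18_2023-12-21")
```

The labels "release16_plus" and "release15_and_earlier" are arbitrary names chosen for this example.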

Limited support for releases < 16

We will not be actively supporting feature requests for functionality relating to releases prior to release 16.

There are two functions:

  1. getSV queries structural variants.
  2. getCNV queries copy number variants.

Inputs

getSV arguments:

| argument | type | default | description |
|---|---|---|---|
| gene | required | - | String argument: a gene name or a vector of gene names, e.g. gene=c('EGFR', 'MET'). |
| fusionOnly | optional | FALSE | Logical argument: select only SVs that result in the fusion of genes. Fusions here are defined as any SV whose breakpoints are located in the coding regions of two different genes, one of them being a gene given to the gene argument above. |
| diseaseType | optional | NULL | String argument: a disease type or a vector of disease types. If given, this argument should correspond to diseases included in the cancer_analysis table under disease_type for 100K samples, or clinical_indication_full_name for NHS-GMS. If not given, return all diseases. |
| participantID | optional | NULL | String argument: a participant ID or a vector of participant IDs. If not given, query all participants. |
| plateKey | optional | NULL | String argument: a platekey ID or a vector of platekeys. participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v18_2023-12-21" | String argument containing the filepath for the Release Programme version, e.g. "/main-programme/main-programme_v18_2023-12-21" for Main Programme release version 18. |

The following example retrieves tiered structural variants in KRAS, PTEN and BRAF from breast cancer samples, restricted to SVs that may result in the fusion of genes.

getSV(gene = c("KRAS", "PTEN", "BRAF"), fusionOnly = TRUE, diseaseType = "Breast")

By default the package will query the most recent 100kGP release. To query for variants from a different release, such as an NHS-GMS release, use the release_version parameter. For example, the following command retrieves all BRCA1 SVs identified in NHS-GMS cancer participants:

getSV(gene = "BRCA1", release_version = "/nhs-gms/nhs-gms-release_v3_2024-03-18")
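Since participantID and plateKey should not be used together, a small guard can catch the mistake before a query is launched. This is a minimal sketch: check_sample_filters is a hypothetical helper, not a function of the package, and the platekeys shown are placeholders rather than real identifiers.

```r
# Hypothetical helper: validate the sample filters before calling getSV or getCNV.
check_sample_filters <- function(participantID = NULL, plateKey = NULL) {
  if (!is.null(participantID) && !is.null(plateKey)) {
    stop("Use either participantID or plateKey, not both.")
  }
  invisible(TRUE)
}

# Placeholder platekeys for illustration only.
platekeys <- c("PLATEKEY_A", "PLATEKEY_B")

# Passes silently, so the query could go ahead.
check_sample_filters(plateKey = platekeys)
# getSV(gene = "EGFR", plateKey = platekeys)
```

Supplying both filters makes the helper stop with an error instead of sending an ambiguous query.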

getCNV arguments:

| argument | type | default | description |
|---|---|---|---|
| gene | required | - | String argument: a gene name or a vector of gene names, e.g. gene=c('EGFR', 'MET'). |
| gain | optional | NULL | Logical argument: if TRUE, selects only gains; if FALSE, selects only losses; if NULL, selects all. |
| diseaseType | optional | NULL | String argument: a disease type or a vector of disease types. If given, this argument should correspond to diseases included in the cancer_analysis table under disease_type for 100K samples, or clinical_indication_full_name for NHS-GMS. If not given, return all diseases. |
| participantID | optional | NULL | String argument: a participant ID or a vector of participant IDs. If not given, query all participants. |
| plateKey | optional | NULL | String argument: a platekey ID or a vector of platekeys. participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
| release_version | optional | "/main-programme/main-programme_v18_2023-12-21" | String argument containing the filepath for the Release Programme version, e.g. "/main-programme/main-programme_v18_2023-12-21" for Main Programme release version 18. |

The following example retrieves tiered copy number variants in KRAS, PTEN and BRAF from breast cancer samples:

getCNV(gene = c("KRAS", "PTEN", "BRAF"), diseaseType = "Breast")

By default the package will query the most recent 100kGP data release. To query for variants from a different release, such as an NHS-GMS release, use the release_version parameter. For example, the following command retrieves all BRCA1 CNVs identified in NHS-GMS cancer participants:

getCNV(gene = "BRCA1", release_version = "/nhs-gms/nhs-gms-release_v3_2024-03-18")

Outputs

For release 16 and later, and for NHS-GMS, getSV outputs a tab-separated values file with the following columns:

| column name | description |
|---|---|
| variant_origin | "somatic" for all variants. |
| report_event_id | Unique identifier for each event. |
| chromosome | The name of the chromosome where the primary breakpoint, bp1, of the SV is located. For translocations, bp1 refers to the breakpoint on the chromosome with the lower number, while bp2 refers to the breakpoint on the chromosome with the higher number. For all other SVs, bp2 does not apply. |
| start | Start position of the SV at bp1. |
| end | End position of the SV at bp1. |
| size | Size of SV in basepairs. NA for translocations. |
| cytobands | Chromosomal cytoband information. |
| SVTYPE | Type of structural variant: BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta. |
| bp1_txs | The Ensembl transcript ID overlapping bp1. |
| bp2_txs | The Ensembl transcript ID overlapping bp2. Only applicable for translocations. |
| gene_name_bp1 | The HGNC gene name of the gene(s) overlapping bp1. |
| gene_name_bp2 | The HGNC gene name of the gene(s) overlapping bp2. Only applicable for translocations. |
| variant_domain | The Domain assigned to the variant by the tiering pipeline. Can be DOMAIN1, DOMAIN2, or DOMAIN3. |
| gene | The HGNC gene symbol of gene(s) involved in the variant. |
| ensembl_id | The Ensembl gene ID of gene(s) involved in the variant. |
| bp1_location | A combination of the Ensembl transcript ID and gene feature overlapping bp1 in the format ENST_feature. For example, ENST00000256078_intron. |
| bp2_location | A combination of the Ensembl transcript ID and gene feature overlapping bp2 in the format ENST_feature. Only applicable for translocations. |
| secondaryChromosome | The name of the chromosome where the secondary breakpoint, bp2, of the SV is located. Only applicable for translocations. |
| secondaryStart | Start position of the SV at bp2. Only applicable for translocations. |
| secondaryEnd | End position of the SV at bp2. Only applicable for translocations. |
| orientation | Orientation of translocations: start_start, start_end, or end_end. Only applicable for translocations. |
| mate | Indicates how the two chromosomes are merged at bp1. Only applicable for translocations. |
| MATEALT | Indicates how the two chromosomes are merged at bp2. Only applicable for translocations. |
| tumour_sample_platekey | Genomics England identifier for the sequenced sample, aka platekey. |
| disease_type | 100K: cancer type of the tumour sample. NHS-GMS: clinical indication tested for. |
| participant_id | Genomics England identifier for participants. |
| disease_sub_type | Subtype of tumour sample. Not applicable for NHS-GMS samples. |
| fusion_inferred | Indicates whether a putative fusion has been identified, based on whether both breakpoints fall into coding regions. Only applicable for translocations. |
| fusion_outcomes | Fusion frame prediction annotation for potential gene fusions. Only available for translocations from NHS-GMS samples that have gone through versions of the interpretation pipeline that include this feature. |
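The bp1_location and bp2_location columns pack a transcript ID and a gene feature into one string (for example ENST00000256078_intron), so it can be useful to separate them. The following is a minimal sketch; split_location is a hypothetical helper and not part of the package. It splits at the first underscore only, on the assumption that the transcript ID itself contains no underscore.

```r
# Hypothetical helper: split "ENST..._feature" strings into two columns.
split_location <- function(location) {
  has_sep <- grepl("_", location, fixed = TRUE)
  data.frame(
    # Everything before the first underscore.
    transcript = ifelse(has_sep, sub("_.*$", "", location), NA_character_),
    # Everything after the first underscore.
    feature = ifelse(has_sep, sub("^[^_]*_", "", location), NA_character_),
    stringsAsFactors = FALSE
  )
}

split_location("ENST00000256078_intron")
```

Values without an underscore, including NA, yield NA in both columns rather than an error.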

For releases 15 and earlier, getSV outputs the following columns:

| column name | description |
|---|---|
| query | Gene name queried for. |
| plate | Genomics England identifier for the sequenced sample, aka platekey. |
| participant_id | Genomics England identifier for participants. |
| disease_type | Tumour type. |
| disease_sub_type | Tumour sub-type. |
| type | SV type: BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta. |
| gene_name_bp1 | If breakpoint 1 (bp1) falls in a coding region, this field contains the gene name of the corresponding coding region. If bp1 falls in a non-coding region, this field says 'empty'. |
| gene_name_bp2 | If breakpoint 2 (bp2) falls in a coding region, this field contains the gene name of the corresponding coding region. If bp2 falls in a non-coding region, this field says 'empty'. |
| coordinates_bp1 | Coordinates of bp1 in the format chr:start:end. Bp1 is defined as the breakpoint on the lower-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp1 is the one with the lowest position. |
| coordinates_bp2 | Coordinates of bp2 in the format chr:start:end. Bp2 is defined as the breakpoint on the higher-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp2 is the one with the highest position. |
| gene_txs_bp1 | Additional annotation for the gene on bp1, if any. |
| gene_txs_bp2 | Additional annotation for the gene on bp2, if any. |
| size | Size of SV in basepairs. NA for translocations. |
| tier | Here, tier represents the level of actionability of the gene involved in the SV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes. |
| mate | Information for translocations that indicates how the two chromosomes are merged. NA for other SV types. |
| additionalTextualVariantAnnotations | Additional information. |
| actions | Information about actionability, if any. |
| cytobands | Chromosomal cytoband information. |
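
For the release 15 and earlier output, breakpoint coordinates arrive as a single chr:start:end string. The sketch below shows one way to separate them into columns. parse_bp_coordinates is a hypothetical helper and the coordinates in the example are made-up placeholders, not real variants.

```r
# Hypothetical helper: split "chr:start:end" strings into separate columns.
parse_bp_coordinates <- function(coords) {
  parts <- strsplit(coords, ":", fixed = TRUE)
  # Every entry must have exactly three colon-separated fields.
  if (any(lengths(parts) != 3)) {
    stop("Expected coordinates in the format chr:start:end")
  }
  data.frame(
    chr = vapply(parts, `[`, character(1), 1),
    start = as.integer(vapply(parts, `[`, character(1), 2)),
    end = as.integer(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Placeholder coordinates for illustration only.
parse_bp_coordinates(c("chr7:100:200", "chr12:5:9"))
```

The helper stops on any entry that does not have three fields, including NA, so missing values should be filtered out first.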

For release 16 and later, and for NHS-GMS, getCNV outputs a tab-separated values file with the following columns:

| column name | description |
|---|---|
| variant_origin | "somatic" for all variants. |
| report_event_id | Unique identifier for each event. |
| chromosome | Chromosome harbouring the CNV. |
| start | Start position of CNV. |
| end | End position of CNV. |
| size | Size of CNV in basepairs. |
| numberOfCopies | Copy number of CNV. |
| cytobands | Chromosomal cytoband information. |
| SVTYPE | Type of structural variant. Can be CNV or LOH. |
| gene_name_bp1 | NA for all CNVs. |
| gene_name_bp2 | NA for all CNVs. |
| variant_domain | The Domain assigned to the variant by the tiering pipeline. Can be DOMAIN1, DOMAIN2, or DOMAIN3. |
| gene | The HGNC gene symbol of gene(s) involved in the variant. |
| ensembl_id | The Ensembl gene ID of gene(s) involved in the variant. |
| bp1_location | NA for all CNVs. |
| bp2_location | NA for all CNVs. |
| tumour_sample_platekey | Genomics England identifier for the sequenced sample, aka platekey. |
| disease_type | 100K: cancer type of the tumour sample. NHS-GMS: clinical indication tested for. |
| participant_id | Genomics England identifier for participants. |
| disease_sub_type | Subtype of tumour sample. Not applicable for NHS-GMS samples. |

Omitted NHS-GMS samples

CNVs from two NHS-GMS participants will not be included in the results from getCNV. These results were omitted from the underlying database due to inconsistent formatting of their tiered CNV JSON files. If you wish to retrieve tiered CNVs from these participants please refer to their JSONs directly. Further information on these omitted participants can be found at /gel_data_resources/workflows/get_svcnv_per_gene/omitted_samples/omitted_samples_v1.tsv.

For releases 15 and earlier, getCNV outputs the following columns:

| column name | description |
|---|---|
| query | Gene name queried for. |
| plate | Genomics England identifier for the sequenced sample, aka platekey. |
| participant_id | Genomics England identifier for participants. |
| disease_type | Tumour type. |
| disease_sub_type | Tumour sub-type. |
| tier | Here, tier represents the level of actionability of the genes involved in the CNV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes, although these are not included. |
| coordinates | CNV coordinates in the format chr:start-end. |
| cn | Copy number of the region indicated in coordinates. Note that cn = 2 corresponds to LOH. |
| type | 'CNV' is the old annotation, indicating that cn is different from what is expected. 'GAIN': cn > 2, 'LOSS': cn < 2, 'LOH': cn = 2. |
| genes | Annotation. |
| cytobands | Chromosomal cytoband information. |
| size | Size of CNV in basepairs. |
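
The relationship between cn and type described for the release 15 and earlier output can be written as a small function, alongside a parser for the chr:start-end coordinates. This is a minimal sketch: classify_cn and parse_cnv_coordinates are hypothetical helpers, not part of the package, and the coordinates in the example are placeholders.

```r
# Hypothetical helper: map a copy number to the GAIN / LOSS / LOH labels
# described in the table (cn > 2, cn < 2 and cn = 2 respectively).
classify_cn <- function(cn) {
  ifelse(cn > 2, "GAIN", ifelse(cn < 2, "LOSS", "LOH"))
}

# Hypothetical helper: split "chr:start-end" strings into separate columns.
parse_cnv_coordinates <- function(coords) {
  pattern <- "^([^:]+):([0-9]+)-([0-9]+)$"
  m <- regmatches(coords, regexec(pattern, coords))
  # A successful match has four elements: the full match plus three groups.
  if (any(lengths(m) != 4)) {
    stop("Expected coordinates in the format chr:start-end")
  }
  data.frame(
    chr = vapply(m, `[`, character(1), 2),
    start = as.integer(vapply(m, `[`, character(1), 3)),
    end = as.integer(vapply(m, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

classify_cn(c(0, 2, 5))
parse_cnv_coordinates("chr10:1000-2000")  # placeholder coordinates
```

Rows still carrying the older 'CNV' annotation could be relabelled by applying classify_cn to the cn column, assuming cn is numeric after the file has been read.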

Help and support

Please reach out via the Genomics England Service Desk for any issues related to running this script, including "somatic-tiered-SVCNV-package" in the title/description of your inquiry.