Somatic SVs and CNVs for a specific gene¶
The getSVCNVperGene package retrieves tiered structural variants (SVs) and copy number variants (CNVs) affecting genes of interest in somatic samples processed through the Genomics England Interpretation pipeline.
The newly released version 2 of getSVCNVperGene supports 100kGP (or main programme) data releases and now includes compatibility with NHS-GMS.
For 100kGP data release version 15 and below, the package iterates through interpreted JSON files to extract variant information directly. For newer Main Programme releases (version 16 and higher) and for NHS-GMS, the package queries a pre-compiled database instead of parsing multiple JSON files.
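The release-dependent behaviour described above can be sketched as a small helper. This is an illustration only: `describe_backend` is a hypothetical function, not part of getSVCNVperGene, and it assumes the release number can be read from the `_v<number>` part of the release path.

```r
# Hypothetical helper (not part of getSVCNVperGene): report which retrieval
# route the documentation describes for a given release path.
describe_backend <- function(release_version) {
  # NHS-GMS releases are served from the pre-compiled database
  if (grepl("nhs-gms", release_version, fixed = TRUE)) {
    return("database")
  }
  # Extract the number following "_v", e.g. "main-programme_v19_2024-10-31"
  m <- regmatches(release_version, regexpr("_v[0-9]+", release_version))
  if (length(m) == 0) {
    stop("Could not find a release number in: ", release_version)
  }
  version <- as.integer(sub("_v", "", m, fixed = TRUE))
  # Release 16 and later use the database; earlier releases parse JSON files
  if (version >= 16) "database" else "json"
}

describe_backend("/main-programme/main-programme_v19_2024-10-31")
describe_backend("/nhs-gms/nhs-gms-release_v3_2024-03-18")
```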
If your research requires you to query other SVs and CNVs, we would recommend that you review our Structural Variant workflow that queries VCFs directly for any sample in the Genomics England database.
Installing the package¶
We offer the package as a source tar file that can be installed directly:
/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene_2.0.0.tar.gz
Alternatively, we provide a Singularity image with the package and its dependencies pre-installed:
/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene.sif
Installing from source¶
The package relies on several third-party R packages that must be installed and loaded in the session before use.
If you are using a version of Rlabkey after 3.0.0, please run the command below:
You can install the compiled package in the Research Environment using the following steps:
# Set default library location
.libPaths(c(.libPaths(), "/tools/aws-workspace-apps/ce/R/4.0.2"))
# Install package from source
install.packages("/gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene_2.0.0.tar.gz")
# Load package
library(getSVCNVperGene)
To uninstall the package, take the following steps:
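A minimal sketch of uninstalling with base R. This is standard R usage rather than a procedure specific to this package, and it assumes the package was installed into a library that is on your current `.libPaths()`.

```r
pkg <- "getSVCNVperGene"

# Detach the package first if it is attached in the current session
if (paste0("package:", pkg) %in% search()) {
  detach(paste0("package:", pkg), unload = TRUE, character.only = TRUE)
}

# Locate the installed copy (empty result if it is not installed)
pkg_path <- find.package(pkg, quiet = TRUE)

# Remove it from the library it was found in
if (length(pkg_path) > 0) {
  remove.packages(pkg, lib = dirname(pkg_path[1]))
}
```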
Using the Singularity container¶
You can run the containerised version of the package as follows:
module load singularity/4.1.1
singularity exec /gel_data_resources/workflows/get_svcnv_per_gene/2.0.0/getSVCNVperGene.sif Rscript <SCRIPT>
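As an illustration, the script passed in place of `<SCRIPT>` could look like the sketch below. The file name `run_query.R` is hypothetical, and the availability check is only there so that the script fails gracefully when run outside the container.

```r
# run_query.R: hypothetical script name to pass in place of <SCRIPT>
pkg_available <- requireNamespace("getSVCNVperGene", quietly = TRUE)

if (pkg_available) {
  library(getSVCNVperGene)
  # Query tiered SVs for one gene using the default release
  getSV(gene = "BRCA1")
} else {
  message("getSVCNVperGene is not installed in this R library")
}
```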
For more information on running Singularity containers in the Research Environment, please refer to this page.
Using the package¶
Using with 100kGP releases 15 and earlier
A change in the tiering pipeline between 100kGP data releases 15 and 16 altered the structure of the output JSON files. The new JSON format also applies to NHS-GMS samples. As a result, when querying variants from release 15 and earlier, the package follows a different internal process and produces a different output format than it does for release 16 and later.
The main functions of the package and their input parameters remain identical, regardless of the release being queried. However, the output format will vary. Detailed descriptions of both input and output are provided below.
Limited support for releases < 16
We will not be actively supporting feature requests for functionality relating to releases prior to release 16.
There are two functions:
- getSV queries structural variants.
- getCNV queries copy number variants.
Inputs¶
Arguments for getSV:
argument | required/optional | default | description |
---|---|---|---|
gene | required | - | string argument, name of gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
fusionOnly | optional | FALSE | logical argument, select only SVs that result in the fusion of genes. Fusions here are defined as any SV whose breakpoints are located in the coding regions of two different genes, one of them being the gene name given to the argument gene above. |
diseaseType | optional | NULL | string argument, a disease type or a vector of disease types. If given, this argument should correspond to diseases included in the cancer_analysis table under disease_type for 100K samples or clinical_indication_full_name for NHS-GMS. If not given, return all diseases. |
participantID | optional | NULL | string argument, a participant ID or a vector of participant IDs. If not given, query all participants. |
plateKey | optional | NULL | string argument, a platekey ID or a vector of platekeys; participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
release_version | optional | "/main-programme/main-programme_v19_2024-10-31" | string argument giving the path of the data release to query. E.g. "/main-programme/main-programme_v19_2024-10-31" for Main Programme release version 19. |
The following example retrieves tiered structural variants in KRAS, PTEN and BRAF from breast cancer samples, restricted to variants that result in the fusion of genes.
getSV(gene = c("KRAS", "PTEN", "BRAF"), fusionOnly = TRUE, diseaseType = "Breast")
By default the package will query the most recent 100kGP release. To query for variants from a different release, such as an NHS-GMS release, use the release_version parameter. For example, the following command retrieves all BRCA1 SVs identified in NHS-GMS cancer participants:
getSV(gene = "BRCA1", release_version = "/nhs-gms/nhs-gms-release_v3_2024-03-18")
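Because participantID and plateKey should not be used together, it can help to check the filters before running a query. The sketch below is illustrative: `check_sample_filters` is a hypothetical helper, not a function provided by getSVCNVperGene, and the ID shown is a placeholder.

```r
# Hypothetical helper (not part of getSVCNVperGene): validate sample filters
# before handing them to getSV() or getCNV().
check_sample_filters <- function(participantID = NULL, plateKey = NULL) {
  if (!is.null(participantID) && !is.null(plateKey)) {
    stop("Supply either participantID or plateKey, not both.")
  }
  invisible(TRUE)
}

# Passes: only one of the two filters is set (placeholder ID)
check_sample_filters(participantID = "<PARTICIPANT_ID>")
```

Once the check passes, the same arguments can be passed on to getSV or getCNV unchanged.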
Arguments for getCNV:
argument | required/optional | default | description |
---|---|---|---|
gene | required | - | string argument, name of gene or a vector of gene names. E.g. gene=c('EGFR', 'MET') |
gain | optional | NULL | logical argument, if TRUE selects only gains, if FALSE selects only losses, if NULL selects all |
diseaseType | optional | NULL | string argument, a disease type or a vector of disease types. If given, this argument should correspond to diseases included in the cancer_analysis table under disease_type for 100K samples or clinical_indication_full_name for NHS-GMS. If not given, return all diseases. |
participantID | optional | NULL | string argument, a participant ID or a vector of participant IDs. If not given, query all participants. |
plateKey | optional | NULL | string argument, a platekey ID or a vector of platekeys; participantID and plateKey should NOT be used simultaneously. If not given, query all samples. |
release_version | optional | "/main-programme/main-programme_v19_2024-10-31" | string argument giving the path of the data release to query. E.g. "/main-programme/main-programme_v19_2024-10-31" for Main Programme release version 19. |
The following example retrieves tiered copy number variants in KRAS, PTEN and BRAF from breast cancer samples:
getCNV(gene = c("KRAS", "PTEN", "BRAF"), diseaseType = "Breast")
By default the package will query the most recent 100kGP data release. To query for variants from a different release, such as an NHS-GMS release, use the release_version parameter. For example, the following command retrieves all BRCA1 CNVs identified in NHS-GMS cancer participants:
getCNV(gene = "BRCA1", release_version = "/nhs-gms/nhs-gms-release_v3_2024-03-18")
Outputs¶
For release 16 and later and for NHS-GMS, getSV outputs a tab-separated values (TSV) file with the following columns:
column name | description |
---|---|
variant_origin | "somatic" for all variants. |
report_event_id | Unique identifier for each event. |
chromosome | The name of the chromosome where the primary breakpoint, bp1, of the SV is located. For translocations, bp1 refers to the breakpoint on the chromosome with the lower number, while bp2 refers to the breakpoint on the chromosome with the higher number of the chromosomes involved in the variant. For all other SVs, bp2 does not apply. |
start | Start position of the SV at bp1. |
end | End position of the SV at bp1. |
size | Size of SV in basepairs. NA for translocations. |
cytobands | Chromosomal cytoband information. |
SVTYPE | Type of structural variant. BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta. |
bp1_txs | The Ensembl transcript ID overlapping bp1. |
bp2_txs | The Ensembl transcript ID overlapping bp2. Only applicable for translocations. |
gene_name_bp1 | The HGNC gene name of the gene(s) overlapping bp1. |
gene_name_bp2 | The HGNC gene name of the gene(s) overlapping bp2. Only applicable for translocations. |
variant_domain | The Domain assigned to the variant by the tiering pipeline. Can be DOMAIN1, DOMAIN2, or DOMAIN3. |
gene | The HGNC gene symbol of gene(s) involved in the variant. |
ensembl_id | The Ensembl gene ID of gene(s) involved in the variant. |
bp1_location | A combination of the Ensembl transcript ID and gene feature overlapping bp1 in the format ENST_feature. For example, ENST00000256078_intron. |
bp2_location | A combination of the Ensembl transcript ID and gene feature overlapping bp2 in the format ENST_feature. Only applicable for translocations. |
secondaryChromosome | The name of the chromosome where the secondary breakpoint, bp2, of the SV is located. Only applicable for translocations. |
secondaryStart | Start position of the SV at bp2. Only applicable for translocations. |
secondaryEnd | End position of the SV at bp2. Only applicable for translocations. |
orientation | Orientation of translocations, can be start_start, start_end, or end_end. Only applicable for translocations. |
mate | Indicates how the two chromosomes are merged at bp1. Only applicable for translocations. |
MATEALT | Indicates how the two chromosomes are merged at bp2. Only applicable for translocations. |
tumour_sample_platekey | Genomics England identifier for sequenced sample, aka platekey. |
disease_type | 100K: cancer type of the tumour sample, NHS-GMS: clinical indication tested for. |
participant_id | Genomics England identifier for participants |
disease_sub_type | Subtype of tumour sample. Not applicable for NHS-GMS samples. |
fusion_inferred | Indicates whether a putative fusion has been identified based on whether both breakpoints involved fall into coding regions. Only applicable for translocations. |
fusion_outcomes | Fusion frame prediction annotation for potential gene fusions. Only available for translocations from NHS-GMS samples that have gone through versions of the interpretation pipeline that include this feature. |
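As a rough sketch of working with this output, the example below reads a tab-separated file and keeps only translocations, the rows for which the bp2 columns apply. The file is a mock written to a temporary path, and all of its values are invented for illustration; only the column names come from the table above.

```r
# Mock rows standing in for a getSV output file; all values are invented
mock <- data.frame(
  SVTYPE = c("BND", "DEL", "BND"),
  gene = c("BRAF", "PTEN", "KRAS"),
  chromosome = c("7", "10", "12"),
  secondaryChromosome = c("9", NA, "17"),
  stringsAsFactors = FALSE
)
tsv_path <- tempfile(fileext = ".tsv")
write.table(mock, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-separated file, keeping chromosome names as character
sv <- read.delim(
  tsv_path,
  stringsAsFactors = FALSE,
  colClasses = c(chromosome = "character", secondaryChromosome = "character")
)

# Translocations are labelled BND in the SVTYPE column
translocations <- sv[sv$SVTYPE == "BND", ]

# Count translocations per gene
table(translocations$gene)
```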
For releases 15 and earlier, getSV outputs the following columns:
column name | description |
---|---|
query | Gene name queried for |
plate | Genomics England identifier for sequenced sample, aka platekey |
participant_id | Genomics England identifier for participants |
disease_type | Tumour type |
disease_sub_type | Tumour sub-type |
type | SV type, BND (translocation), DEL (deletion), DUP (duplication), INV (inversion), INS (insertion), as defined by Manta |
gene_name_bp1 | If breakpoint 1 (bp1) falls in a coding region, this field has the gene name of the corresponding coding region. If bp1 falls in a non-coding region, this field says 'empty' |
gene_name_bp2 | If breakpoint 2 (bp2) falls in a coding region, this field has the gene name of the corresponding coding region. If bp2 falls in a non-coding region, this field says 'empty' |
coordinates_bp1 | Coordinates of bp1 in the format chr:start:end. Bp1 is defined as the breakpoint on the lower-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp1 is the one with the lowest position |
coordinates_bp2 | Coordinates of bp2 in the format chr:start:end. Bp2 is defined as the breakpoint on the higher-numbered of the two chromosomes involved in the SV. If both breakpoints are on the same chromosome, bp2 is the one with the highest position |
gene_txs_bp1 | Additional annotation for gene on bp1, if any |
gene_txs_bp2 | Additional annotation for gene on bp2, if any |
size | Size of SV in basepairs. NA for translocations |
tier | Here, tier represents the level of actionability of the gene involved in the SV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 potentially actionable, i.e. other oncogenes. Tier3 are all other genes |
mate | Information for translocation, that indicates how the two chromosomes are merged. NA for other SV types |
additionalTextualVariantAnnotations | Additional information |
actions | Information about actionability, if any |
cytobands | Chromosomal cytoband information |
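The breakpoint coordinate columns pack three values into one string, so splitting them is a common first step. The sketch below assumes the chr:start:end format described above; `parse_bp` is a hypothetical helper and the example coordinates are invented.

```r
# Hypothetical helper: split "chr:start:end" strings into separate columns
parse_bp <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  data.frame(
    chr = vapply(parts, `[`, character(1), 1),
    start = as.integer(vapply(parts, `[`, character(1), 2)),
    end = as.integer(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

# Invented example values in the documented format
parse_bp(c("chr7:140700000:140700001", "chr12:25200000:25200001"))
```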
For release 16 and later and for NHS-GMS, getCNV outputs a tab-separated values (TSV) file with the following columns:
column name | description |
---|---|
variant_origin | "somatic" for all variants. |
report_event_id | Unique identifier for each event. |
chromosome | Chromosome harbouring the CNV. |
start | Start position of CNV. |
end | End position of CNV. |
size | Size of CNV in basepairs. |
numberOfCopies | Copy number of CNV. |
cytobands | Chromosomal cytoband information. |
SVTYPE | Type of structural variant. Can be CNV or LOH. |
gene_name_bp1 | NA for all CNVs. |
gene_name_bp2 | NA for all CNVs. |
variant_domain | The Domain assigned to the variant by the tiering pipeline. Can be DOMAIN1, DOMAIN2, or DOMAIN3. |
gene | The HGNC gene symbol of gene(s) involved in the variant. |
ensembl_id | The Ensembl gene ID of gene(s) involved in the variant. |
bp1_location | NA for all CNVs. |
bp2_location | NA for all CNVs. |
tumour_sample_platekey | Genomics England identifier for sequenced sample, aka platekey. |
disease_type | 100K: cancer type of the tumour sample, NHS-GMS: clinical indication tested for. |
participant_id | Genomics England identifier for participants. |
disease_sub_type | Subtype of tumour sample. Not applicable for NHS-GMS samples. |
Omitted NHS-GMS samples
CNVs from two NHS-GMS participants will not be included in the results from getCNV. These results were omitted from the underlying database due to inconsistent formatting of their tiered CNV JSON files. If you wish to retrieve tiered CNVs from these participants please refer to their JSONs directly. Further information on these omitted participants can be found at /gel_data_resources/workflows/get_svcnv_per_gene/omitted_samples/omitted_samples_v1.tsv.
For releases 15 and earlier, getCNV outputs the following columns:
column name | description |
---|---|
query | Gene name queried for |
plate | Genomics England identifier for sequenced sample, aka platekey |
participant_id | Genomics England identifier for participants |
disease_type | Tumour type |
disease_sub_type | Tumour sub-type |
tier | Here, tier represents the level of actionability of the genes involved in the CNV. Tier1 means actionable, i.e. genes for which some therapy exists. Tier2 means potentially actionable, i.e. other oncogenes. Tier3 covers all other genes, although these are not included |
coordinates | CNV coordinates in the format: chr:start-end |
cn | copy number of the region indicated in coordinates. Note that cn = 2 corresponds to LOH |
type | 'CNV' is the old annotation, indicating cn is different from what is expected. 'GAIN' cn > 2, 'LOSS' cn < 2, 'LOH' cn = 2 |
genes | annotation |
cytobands | Chromosomal cytoband information |
size | Size of CNV in basepairs |
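The coordinate format and copy number thresholds in this table can be applied as in the sketch below. Both helpers are hypothetical and the example rows are invented; the thresholds (cn > 2 gain, cn < 2 loss, cn = 2 LOH) are taken from the type and cn descriptions above.

```r
# Hypothetical helper: label a copy number using the thresholds in the table
classify_cn <- function(cn) {
  ifelse(cn > 2, "GAIN", ifelse(cn < 2, "LOSS", "LOH"))
}

# Hypothetical helper: split "chr:start-end" into its three parts
parse_cnv_coordinates <- function(x) {
  m <- regmatches(x, regexec("^([^:]+):([0-9]+)-([0-9]+)$", x))
  data.frame(
    chr = vapply(m, `[`, character(1), 2),
    start = as.numeric(vapply(m, `[`, character(1), 3)),
    end = as.numeric(vapply(m, `[`, character(1), 4)),
    stringsAsFactors = FALSE
  )
}

# Invented example rows using the documented column names
cnv <- data.frame(
  coordinates = c("chr10:87000000-88000000", "chr7:1000-5000"),
  cn = c(0, 5),
  stringsAsFactors = FALSE
)
cnv <- cbind(cnv, parse_cnv_coordinates(cnv$coordinates))
cnv$label <- classify_cn(cnv$cn)
cnv$width <- cnv$end - cnv$start
```

Strings that do not match the expected format yield NA in the parsed columns rather than an error.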
Help and support¶
Please reach out via the Genomics England Service Desk for any issues related to running this script, including "somatic-tiered-SVCNV-package" in the title/description of your inquiry.