Cancer tiering

The cancer_tier_and_domain_variants table in LabKey lists variants of potential clinical significance in cancer. These can include oncogenic variants that destabilise the protein function of a tumour suppressor, protective variants that disrupt an oncogene, and variants of unknown significance. You can use the table to query cancer-relevant genotypes.

The table is based on the gene-centric SNV report for cancer participants. It is an aggregation of the non-synonymous, splice site and RNA gene small variants found per sample for the participants in the cancer programme.

The CSV files generated by our internal cancer analysis team were used as the source. The path to each participant's CSV file is provided in the cancer_analysis LabKey table.

This table only contains domain and tiered variants, and is therefore not a complete picture of all variants in a sample.

The majority of the mutations included are somatic and have been designated Domain 1, 2 or 3. The table also contains some germline variants, limited to those indicative of cancer predisposition as listed in Genomics England PanelApp.
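As a minimal sketch of how the somatic/germline and domain distinction might be used once rows have been exported from the table, the helper below filters a list of row dictionaries by the origin and domain columns described in the data dictionary. The function name `select_variants`, the in-memory row format and the toy rows are assumptions for illustration only. How the values are actually stored (for example `1` versus `"1"`, or whether germline rows carry a domain) is not stated here, so the comparison is deliberately tolerant.

```python
def select_variants(rows, origin=None, domains=None):
    """Return the rows matching an origin and/or a set of domains.

    `rows` is any iterable of dicts keyed by column name (hypothetical
    export format). Passing None for a filter disables it.
    """
    wanted_domains = None if domains is None else {str(d) for d in domains}
    selected = []
    for row in rows:
        # Compare origin case-insensitively; stored casing is an assumption.
        if origin is not None and str(row["origin"]).lower() != origin.lower():
            continue
        # Compare domains as strings so 1 and "1" are treated alike.
        if wanted_domains is not None and str(row["domain"]) not in wanted_domains:
            continue
        selected.append(row)
    return selected


# Toy rows with placeholder gene names; not real data.
example_rows = [
    {"gene": "GENE_A", "origin": "somatic", "domain": 1},
    {"gene": "GENE_B", "origin": "somatic", "domain": 3},
    # None is a placeholder: the text does not say what domain germline rows hold.
    {"gene": "GENE_C", "origin": "germline", "domain": None},
]

somatic_domain_1_2 = select_variants(example_rows, origin="somatic", domains=[1, 2])
```

With the toy rows above, only the `GENE_A` row is both somatic and in Domain 1 or 2.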

Variant annotation

Each variant carries information from CellBase (cellbase_consequence) and ClinVar (clinical_significance) after annotation with ClinVar (version 2022-08-24). An assessment of variant loss of function is included in the relevance column. A variant is marked as LoF (loss of function) when its CellBase consequence type matches one of those in Table 1, as (likely) pathogenic when it is recorded as such in ClinVar, and as path_LoF when both are true. All remaining variants are marked as other.

SO term	Consequence type
`SO:0001893`	transcript_ablation
`SO:0001574`	splice_acceptor_variant
`SO:0001575`	splice_donor_variant
`SO:0001587`	stop_gained
`SO:0001589`	frameshift_variant
`SO:0001578`	stop_lost
`SO:0002012`	start_lost
`SO:0001821`	inframe_insertion
`SO:0001822`	inframe_deletion
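
The relevance rules described above can be sketched as a small function. This is an illustration under stated assumptions, not the cancer analysis team's implementation: the function name `classify_relevance`, the exact label strings it returns, and the set of ClinVar strings treated as (likely) pathogenic are all assumptions. Only the list of consequence types is taken from Table 1.

```python
from typing import Optional

# Consequence types listed in Table 1 (treated as loss of function).
LOF_CONSEQUENCES = {
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "inframe_insertion",
    "inframe_deletion",
}

# Assumption: ClinVar values counted as "(likely) pathogenic".
# Exact matching avoids catching values such as
# "Conflicting interpretations of pathogenicity".
PATHOGENIC_VALUES = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def classify_relevance(cellbase_consequence: Optional[str],
                       clinical_significance: Optional[str]) -> str:
    """Return a relevance label following the rules described in the text."""
    is_lof = (cellbase_consequence or "").strip() in LOF_CONSEQUENCES
    is_pathogenic = (clinical_significance or "").strip().lower() in PATHOGENIC_VALUES
    if is_lof and is_pathogenic:
        return "path_LoF"
    if is_lof:
        return "LoF"
    if is_pathogenic:
        return "(likely)pathogenic"
    return "other"
```

For example, under these assumptions a `stop_gained` variant with a ClinVar value of `Pathogenic` is labelled `path_LoF`, while a `missense_variant` with no ClinVar entry is labelled `other`.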

Data overview

snvdb is 5,001,535 rows long with 21 columns, where each row represents a small variant within a tumour_sample_platekey.

Figure: value counts (>5) per type, origin, domain and clinical_significance. The counts are over the total variants in the data and are not normalised to the sample count. Please note that the y-axes differ between panels (log scale or 1E6 multiplier).
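
Per-column value counts of the kind summarised above could be reproduced on an exported subset with the standard library alone. The sketch below is an assumption-laden illustration: the function name `value_counts` and the row format (a list of dicts keyed by column name) are hypothetical, and the default threshold of 5 is read from the "(>5)" in the text. How the original counts were produced is not stated.

```python
from collections import Counter


def value_counts(rows, column, threshold=5):
    """Count the values in `column`, keeping values seen more than `threshold` times.

    Counts are over all rows (variants), not normalised per sample.
    """
    counts = Counter(row[column] for row in rows)
    # most_common() orders values from most to least frequent.
    return {value: n for value, n in counts.most_common() if n > threshold}


# Toy rows for illustration only; not real data.
toy_rows = [{"origin": "somatic"}] * 7 + [{"origin": "germline"}] * 2
origin_counts = value_counts(toy_rows, "origin")
```

With the toy rows above, only `somatic` exceeds the threshold, so `germline` is dropped from the result.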

Data dictionary

Field	Enumerations/Data Type	Description
`participant_id`	participantId, xs:string	Participant Identifier (supplied by Genomics England).
`tumour_sample_platekey`	varchar	Concatenation of Plate ID and Well ID: a unique identifier for a processed well for the tumour sample.
`csv_version`	string	Version of the source CSV. Over the course of the 100K Genomes Project, tiering and cancer domains have been updated, creating minor differences between reports generated by the cancer analysis team.
`disease_type`	string	The cancer type of the tumour sample submitted to Genomics England. Note: Some of the genomic analysis performed by the pipeline makes it possible to identify what cancer (disease type) the sample is from, and therefore correct potential errors in the disease type that was registered by the GMC. As a result, the disease type in this table can be different from the disease type found in cancer_participant_disease.
`disease_sub_type`	string	The subtype of the cancer in question, recorded against a limited set of supplied enumerations.
`type`	string	sample type: `PRIMARY`, `METASTASES` or `RECURRENCE_OF_PRIMARY_TUMOUR`, as reported in .csv files.
`study_abbreviation`	string	TCGA study abbreviations, based on av_tumour histology and ICD10 codes for the given participant. For any participants where there is no data in av_tumour, the TCGA code was deduced from ICD10 codes in hes_apc.
`match_rank`	Enumerations: `1` = Information in `cancer_analysis`, `av_tumour` and `hes_apc` all agree with one another. `2` = Information in `cancer_analysis` and `av_tumour` agree. `3` = Information in `av_tumour` and `hes_apc` agree. `4` = Information in `cancer_analysis` and `hes_apc` agree. `5` = No linkage: either there is no data in `av_tumour` or `hes_apc`, or there is no agreement between all three.	A categorical value describing the relationship between data in `cancer_analysis`, `av_tumour` and `hes_apc`.
`origin`	string	The origin of the mutation: "somatic" or "germline".
`gene`	string	HUGO Gene Nomenclature
`transcript`	string	ENSEMBL transcript ID (ENST#)
`cellbase_consequence`	string	Cellbase consequence type
`change`	string	HGVS coding DNA reference sequence
`protein_change`	string	protein coding change
`chr`	string	chromosome, named as: 1-22, X, Y
`pos`	numeric	Position on chromosome (1-based)
`ref`	varchar	Reference Allele sequence, the same provided in vcf
`alt`	varchar	Alternate Allele sequence, the same provided in vcf
`domain`	Enumerations: `1` = Domain 1 `2` = Domain 2 `3` = Domain 3	Domain 1: variants in a virtual panel of potentially actionable genes. Domain 2: variants in a virtual panel of cancer-related genes as curated in the Sanger Cancer Gene Census. Domain 3: small variants in genes not included in domains 1 and 2.
`clinical_significance`	string	ClinVar clinical significance (version 2022-02-05) matched on gene and HGVS change.
`relevance`	string	LoF: the cellbase_consequence type confers protein loss of function. (likely)pathogenic: the entry has been annotated pathogenic or likely pathogenic by ClinVar. path_LoF: the cellbase_consequence type confers loss of function and the variant is also annotated as (likely)pathogenic by ClinVar. Other: ClinVar does not mark this variant as (likely) pathogenic, and its consequence type does not confer protein loss of function.
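
When working with exported rows in Python, the data dictionary above can be mirrored as a typed record. The sketch below is one possible way to do this, not an official schema: the class name `TieredVariant`, the helper `parse_row`, and the choice to treat `match_rank`, `pos` and `domain` as optional integers are assumptions, and the parser expects integer-like text (a value such as `"12345.0"` would need extra handling).

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class TieredVariant:
    """One row of the table, with field names taken from the data dictionary."""
    participant_id: str
    tumour_sample_platekey: str
    csv_version: str
    disease_type: str
    disease_sub_type: str
    type: str
    study_abbreviation: str
    match_rank: Optional[int]
    origin: str
    gene: str
    transcript: str
    cellbase_consequence: str
    change: str
    protein_change: str
    chr: str
    pos: Optional[int]
    ref: str
    alt: str
    domain: Optional[int]
    clinical_significance: str
    relevance: str


# Assumption: these columns hold integer-like values.
_INT_FIELDS = {"match_rank", "pos", "domain"}


def _to_int(value):
    """Convert integer-like input to int; empty or missing values become None."""
    if value is None or str(value).strip() == "":
        return None
    return int(value)


def parse_row(row: dict) -> TieredVariant:
    """Build a TieredVariant from a dict keyed by column name.

    Missing text columns become empty strings.
    """
    values = {}
    for f in fields(TieredVariant):
        raw = row.get(f.name)
        if f.name in _INT_FIELDS:
            values[f.name] = _to_int(raw)
        else:
            values[f.name] = "" if raw is None else str(raw)
    return TieredVariant(**values)
```

A row such as `{"gene": "GENE_A", "pos": "12345", "domain": "2"}` (placeholder values) would then parse to a record with `pos == 12345`, `domain == 2` and the remaining text fields empty.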

Help and support

Please reach out via the Genomics England Service Desk for any queries regarding the cancer tier and domain variants. We would welcome your feedback so that we can improve on our data offering.