Cancer analysis

The cancer_analysis table contains one entry for every sequenced tumour sample, with the results of the Genomics England interpretation pipeline. Once samples have been sequenced and had variants called by the Illumina pipeline, Genomics England passes them through its interpretation pipeline, which applies further QC, annotates the called variants and performs analyses such as estimating tumour mutational burden and computing mutational signatures.

Samples are uniquely identified by their tumour_sample_platekey and are matched to the information for their germline sample, as well as disease type, quality control measures, tumour mutational burden (somatic_coding_variants_per_mb), signatures and the paths to the BAM and VCF files. Note that one participant may have more than one tumour sample, for the same or different tumours.

TCGA study and histology

Details of histology codes and TCGA studies are included. Histology codes are taken directly from the av_tumour table, while TCGA studies are deduced from the av_tumour histology codes or, if these are not available, from ICD10 codes in hes_apc. To help you understand the quality of this information, the difference between the date in av_tumour or hes_apc and the sampling date is included, along with a match_rank code indicating how well these tables match. Details of this analysis are found in this document.

Tumour mutational burden (TMB)

For each tumour sample, TMB is calculated as the total number of non-synonymous small somatic variants divided by the total length of coding sequence (32.61 Mb). Small somatic variants are somatic SNVs and indels smaller than 50 bp. TMB is reported in the somatic_coding_variants_per_mb column.
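As a worked illustration of the definition above, the following minimal sketch divides a variant count by the 32.61 Mb coding length. The function name and the example count are hypothetical and only illustrate the arithmetic; this is not the pipeline's own code.

```python
# Total length of coding sequence used as the TMB denominator, in megabases
# (value taken from the definition above).
CODING_SEQUENCE_MB = 32.61


def tumour_mutational_burden(n_nonsynonymous_small_variants):
    """Return TMB in variants per Mb for one tumour sample.

    The argument is the number of non-synonymous small somatic variants
    (SNVs and indels smaller than 50 bp). The function name is hypothetical.
    """
    if n_nonsynonymous_small_variants < 0:
        raise ValueError("variant count cannot be negative")
    return n_nonsynonymous_small_variants / CODING_SEQUENCE_MB


# Hypothetical sample with 3261 qualifying variants: roughly 100 variants per Mb.
example_tmb = tumour_mutational_burden(3261)
print(round(example_tmb, 2))
```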

Somatic mutational signatures

Somatic mutations are the consequence of multiple mutational processes that the human body is subjected to throughout life. Each process generates a characteristic combination of mutation types, called a mutational signature. Genomics England computes mutational signatures using the R package nnls. For further information on how the signatures are computed, see Alexandrov et al., 2013.
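To give an idea of what a non-negative least squares fit does in this setting, the sketch below estimates how much each signature contributes to a sample's mutation counts. It is only an illustration: it uses a simple projected gradient descent in plain Python rather than the R nnls package used by the pipeline, and the three signatures, the four mutation channels and the counts are made-up toy values.

```python
def fit_exposures(signatures, catalogue, iterations=2000):
    """Estimate non-negative signature exposures for one sample.

    signatures: list of signature profiles, each a list of per-channel proportions.
    catalogue:  observed mutation counts per channel for the sample.
    Minimises the squared difference between the catalogue and the weighted
    sum of signatures, keeping every weight >= 0 (projected gradient descent).
    """
    n_sig = len(signatures)
    n_channels = len(catalogue)
    # A safe step size: 1 / (sum of squared entries) never exceeds 1 / Lipschitz constant.
    step = 1.0 / sum(v * v for sig in signatures for v in sig)
    exposures = [0.0] * n_sig
    for _ in range(iterations):
        fitted = [
            sum(exposures[k] * signatures[k][c] for k in range(n_sig))
            for c in range(n_channels)
        ]
        residual = [fitted[c] - catalogue[c] for c in range(n_channels)]
        gradient = [
            sum(signatures[k][c] * residual[c] for c in range(n_channels))
            for k in range(n_sig)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        exposures = [max(0.0, exposures[k] - step * gradient[k]) for k in range(n_sig)]
    return exposures


# Toy data (hypothetical): three signatures over four mutation channels.
toy_signatures = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
# Counts built from 10 mutations of the first signature and 5 of the third.
toy_catalogue = [7.5, 1.5, 3.0, 3.0]

toy_exposures = fit_exposures(toy_signatures, toy_catalogue)
print([round(e, 3) for e in toy_exposures])
```

In this toy case the fit recovers a zero contribution for the second signature, which is the behaviour the non-negativity constraint is meant to give: signatures that are not needed to explain the counts are not assigned negative or spurious weights.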

For more information on QC metrics and how variants are called for the tumour sample, please refer to the Cancer Analysis Technical Information Document.

Small variant annotation

SNVs and small indels were normalised (left aligned, trimmed, multi-allelic variants decomposed) and annotated using CellBase with the Ensembl (version 90/GRCh38), COSMIC (version v86/GRCh38) and ClinVar (October 2018 release) databases. CellBase takes advantage of the data integrated in its database to implement a rich and high-performance variant annotator (with 99.9991% concordance with Ensembl VEP consequence types across 1000 Genomes phase 3 variants). Only variants annotated with the following consequence types in canonical transcripts (see List of canonical transcripts v1.10) are reported:

SO term Consequence type
SO:0001893 transcript_ablation
SO:0001574 splice_acceptor_variant
SO:0001575 splice_donor_variant
SO:0001587 stop_gained
SO:0001589 frameshift_variant
SO:0001578 stop_lost
SO:0002012 start_lost
SO:0001889 transcript_amplification
SO:0001821 inframe_insertion
SO:0001822 inframe_deletion
SO:0001650 inframe_variant
SO:0001583 missense_variant
SO:0001630 splice_region_variant
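
If you want to apply the same restriction to your own annotations, the list above can be expressed as a simple lookup. This is a minimal sketch: the function name and the idea of passing the consequence types of a canonical transcript as a list of strings are assumptions for illustration, not part of the pipeline.

```python
# Consequence types reported by the pipeline, copied from the table above.
REPORTED_CONSEQUENCE_TYPES = frozenset({
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "transcript_amplification",
    "inframe_insertion",
    "inframe_deletion",
    "inframe_variant",
    "missense_variant",
    "splice_region_variant",
})


def is_reported(consequence_types):
    """Return True if any consequence type of a variant is in the reported set.

    consequence_types: iterable of SO term names annotated on the canonical
    transcript (hypothetical input format).
    """
    return any(term in REPORTED_CONSEQUENCE_TYPES for term in consequence_types)


print(is_reported(["missense_variant"]))                     # in the list
print(is_reported(["synonymous_variant", "intron_variant"]))  # not in the list
```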

A note on germline flagged variants

We have observed that pathogenic variants have been called in some germline VCFs, and that these are present in many of the cancer participants. We have provided a list of known germline variants that are filtered out in our own clinical tiering pipelines. You can find more information, including the location of this list, on this page: faq.


This information can be found in LabKey in the cancer_analysis table, under the quick view tab. cancer_analysis connects to other tables via participant_id. If discrepancies are found between cancer_analysis and other tables, the data in cancer_analysis should be treated as the most reliable, since the interpretation pipeline ensures that it is validated.
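
As an illustration of linking on participant_id, the sketch below joins a few made-up cancer_analysis rows to rows from another table. The row values, the other table and its example_value column are hypothetical; only participant_id, tumour_sample_platekey and somatic_coding_variants_per_mb are column names taken from this page. Because one participant may have several tumour samples, the join keeps one output row per tumour sample.

```python
def join_on_participant(cancer_rows, other_rows):
    """Left-join cancer_analysis rows to another table on participant_id.

    Assumes other_rows has at most one row per participant (an assumption made
    for this sketch). Samples without a match keep only their own columns.
    """
    other_by_participant = {row["participant_id"]: row for row in other_rows}
    joined = []
    for sample in cancer_rows:
        merged = dict(other_by_participant.get(sample["participant_id"], {}))
        # cancer_analysis values take precedence if both tables share a column.
        merged.update(sample)
        joined.append(merged)
    return joined


# Made-up rows for illustration only.
cancer_analysis_rows = [
    {"participant_id": 1001, "tumour_sample_platekey": "TUMOUR_A", "somatic_coding_variants_per_mb": 2.5},
    {"participant_id": 1001, "tumour_sample_platekey": "TUMOUR_B", "somatic_coding_variants_per_mb": 7.1},
    {"participant_id": 1002, "tumour_sample_platekey": "TUMOUR_C", "somatic_coding_variants_per_mb": 0.9},
]
other_table_rows = [
    {"participant_id": 1001, "example_value": "x"},  # hypothetical column
]

joined_rows = join_on_participant(cancer_analysis_rows, other_table_rows)
for row in joined_rows:
    print(row["participant_id"], row["tumour_sample_platekey"], row.get("example_value"))
```

Letting the cancer_analysis values overwrite any shared columns follows the guidance above that cancer_analysis is the more reliable source when tables disagree.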