100kGP Cancer-specific clinical data

Some tables in LabKey contain data specific to cancer participants. All tables and their fields are described in our data dictionary.

Primary and secondary data tables

Primary clinical data were collected when participants were enrolled in the programme.

Secondary clinical data were obtained from third parties such as NHSE.

Cancer data are presented at the participant level or the sample level. All tumour samples have a matched germline sample. A participant may have more than one tumour sample; these may be samples taken at different time points, samples from two different tumours or, rarely, biological replicates. Biological replicates are often part of the TRACERx study, which is not available to commercial users.
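As a rough illustration of the participant-to-sample relationship described above, the following sketch lists participants with more than one tumour sample. The column names (`participant_id`, `tumour_sample_platekey`, `germline_sample_platekey`) and the inline example rows are assumptions made for illustration; check the data dictionary for the actual field names before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical extract: column names and values are illustrative assumptions.
EXAMPLE_TSV = (
    "participant_id\ttumour_sample_platekey\tgermline_sample_platekey\n"
    "P001\tT-A1\tG-A1\n"
    "P001\tT-A2\tG-A1\n"
    "P002\tT-B1\tG-B1\n"
)


def tumours_per_participant(tsv_text):
    """Map each participant to the list of tumour samples recorded for them."""
    samples = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        samples[row["participant_id"]].append(row["tumour_sample_platekey"])
    return dict(samples)


def multi_tumour_participants(tsv_text):
    """Keep only participants with more than one tumour sample."""
    per_participant = tumours_per_participant(tsv_text)
    return {p: s for p, s in per_participant.items() if len(s) > 1}


print(multi_tumour_participants(EXAMPLE_TSV))
```

A count alone does not say whether the extra samples are temporal samples, different tumours or replicates; that has to be worked out from the clinical tables.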

Central tables

LabKey table	Description	Primary or secondary	CloudOS tsv filename
`cancer_analysis`	provides a list of cancer samples that have been sequenced and had variants called by the Illumina pipeline. Genomics England passes the samples through its Interpretation Pipeline, which applies further QC, annotates the called variants and performs analyses such as estimating tumour mutation burden and computing mutational signatures. This information is then made available in the cancer analysis table, where each entry corresponds to one tumour sample that has been sequenced and interpreted. Samples are categorised by their registration disease and disease subtype. The table holds data for all cancer participants whose genomes have been through Genomics England bioinformatics interpretation and passed quality checks, including: sex, ethnicity, disease recruited for and diagnosis; tumour ID, build of latest genome, QC status of latest genome and path to latest genomes; as well as file paths to the genomes. This table includes information derived from `laboratory_sample` and `cancer_participant_tumour`.		`gel_cancer_analysis_100k.tsv`

Some key data included in the `cancer_analysis` table are explained below.

Global Tumour Mutation Burden: the number of somatic non-synonymous small variants per megabase of coding sequence (32.61 Mb). This metric was calculated using `somatic_small_variants_annotation_vcf` as input (described further down), and all non-PASS variants were removed from the calculation.

Tumour purity: the tumour purity (cancer cell fraction) as calculated by Ccube.

Mutational Signatures: the table includes the relative proportions of the different mutational signatures demonstrated by the tumour. Analysis of large sequencing datasets (10,952 exomes and 1,048 whole-genomes from 40 distinct tumour types) has allowed patterns of relative contextual frequencies of different SNVs to be grouped into specific mutational signatures. Using mathematical methods (decomposition by non-negative least squares), the contribution of each of these signatures to the overall mutation burden observed in a tumour can be derived. Further details of the 30 different mutational signatures used for this analysis, their prevalence in different tumour types and proposed aetiology can be found on the Sanger Institute website.

Cancer PCA QC Statistics: the cancer analysis pipeline employs a sequencing quality control check which selects several important statistics associated with the sequencing returned by the sequencing provider, and uses them to check whether the sample in question is an outlier with respect to previous samples that have been run through the pipeline. It is, in effect, a safety net that can spot issues that have occurred at the tissue collection stage (i.e. at the GMC (Genomic Medicine Centre)) or at the library preparation step (i.e. at the sequencing provider), both of which may affect the final genomic analysis returned to the clinician.

Somatic small variants annotation vcf filepaths: the `somatic_small_variants_annotation_vcf` column contains file paths pointing to VCFs containing Genomics England flags for potential false positive variants as well as additional annotations (see the VCF header for details). SIFT and PolyPhen scores, as well as the new `PONnoise50SNV` flag, were added. The flags used for annotation are:

i. `CommonGermlineVariant`: variants with a population germline allele frequency above 1% in an early subset of the Genomics England dataset.
ii. `CommonGnomADVariant`: variants with a population germline allele frequency above 1% in the gnomAD dataset.
iii. `RecurrentSomaticVariant`: recurrent somatic variants with a frequency above 5% in an early subset of the Genomics England dataset.
iv. `SimpleRepeat`: variants overlapping simple repeats as defined by Tandem Repeats Finder.
v. `BCNoiseIndel`: small indels in regions with high levels of sequencing noise, where at least 10% of the basecalls in a window extending 50 bases to either side of the indel's call have been filtered out by Strelka due to poor quality.
vi. `PONnoise50SNV`: SNVs resulting from systematic mapping and calling artefacts.

The following methodology was used for the `PONnoise50SNV` flag: the ratio of tumour allele depths at each somatic SNV site was tested, using Fisher's exact test, to see whether it is significantly different from the ratio of allele depths at this site in a panel of normals (PoN). The PoN was composed of a cohort of 7000 non-tumour genomes from the Genomics England dataset, and at each genomic site only individuals not carrying the relevant alternate allele were included in the count of allele depths. The mpileup function in bcftools v1.9 was used to count allele depths in the PoN; to replicate Strelka filters, duplicate reads were removed and quality thresholds were set at mapping quality ≥ 5 and base quality ≥ 5. All somatic SNVs with a Fisher's exact test phred score < 50 were filtered. This threshold minimised the loss of true positive variants while still gaining a significant improvement in the specificity of SNV calling, as calculated from a TRACERx truth set. A presentation entitled PONnoise50SNV: SNVs resulting from systematic mapping and calling artefacts, which further outlines the methodology, can be found in the Publications and other useful links table located on our Further reading and documentation page.

Alignment BAM files generated by Isaac Genome Alignment Software: we have a paper written by Research Network members discussing the issue of reference bias in the computation of variant allele frequencies (VAFs) by the Illumina Isaac pipeline (caused by preferential soft clipping of reads supporting alternate alleles).
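Two of the calculations described for `cancer_analysis` can be sketched in a few lines: the tumour mutation burden (variant count divided by 32.61 Mb) and the phred-scaled Fisher's exact test behind the `PONnoise50SNV` flag. This is a minimal sketch and not the pipeline's implementation. The function names are hypothetical, and the use of a two-sided test and the layout of the 2x2 table (reference and alternate depths in tumour versus PoN) are assumptions, since the text above does not specify them.

```python
import math

CODING_MB = 32.61  # size of the coding sequence quoted in the table description


def tumour_mutation_burden(n_pass_nonsynonymous):
    """Somatic non-synonymous PASS small variants per megabase of coding sequence."""
    return n_pass_nonsynonymous / CODING_MB


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d

    def weight(k):
        # Unnormalised hypergeometric probability of seeing k in the top-left cell.
        return math.comb(col1, k) * math.comb(total - col1, row1 - k)

    observed = weight(a)
    lowest, highest = max(0, row1 - (total - col1)), min(row1, col1)
    # Sum every table that is no more likely than the observed one (exact integers).
    tail = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return tail / math.comb(total, row1)


def phred(p_value):
    """Phred-scale a p-value: -10 * log10(p)."""
    return math.inf if p_value == 0 else -10 * math.log10(p_value)


def is_pon_noise(tumour_ref, tumour_alt, pon_ref, pon_alt, threshold=50):
    """True when tumour allele depths are not clearly different from the PoN."""
    p_value = fisher_exact_two_sided(tumour_ref, tumour_alt, pon_ref, pon_alt)
    return phred(p_value) < threshold


print(round(tumour_mutation_burden(326), 2))
print(is_pon_noise(30, 10, 300, 100))  # same allele ratio in tumour and PoN
print(is_pon_noise(20, 20, 5000, 0))   # alternate allele absent from the PoN
```

An SNV whose allele ratio in the tumour resembles the ratio in the panel of normals gets a low phred score and would be flagged; a site where the PoN shows essentially no alternate reads scores well above 50 and is kept.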

Cancer participants

LabKey table	Description	CloudOS tsv filename
`cancer_participant_disease`	data about participants' cancer disease type and subtype.	`gel_cancer_disease_100k.tsv`
`cancer_participant_tumour`	data that characterises the tumour, e.g. staging and grading; morphology and location; recurrence at time of enrolment; and the basis of diagnosis.	`gel_cancer_tumour_100k.tsv`
`cancer_care_plan`	information from participants' NHS cancer care plan on their treatment and care intent, in particular outcomes of MDT meetings and coded connected data (e.g. diagnoses from scans).	`gel_cancer_care_plan_100k.tsv`
`cancer_invest_imaging`	coded data on imaging investigations characterising the scan, its modality, anatomical site and outcome; as well as the outcome of the imaging report in free text form.	`gel_cancer_imaging_100k.tsv`
`cancer_participant_tumour_metastatic_site`	the site of any metastatic disease in the body at diagnosis (if applicable).	`gel_cancer_tumour_metastases_100k.tsv`
`cancer_risk_factor_cancer_specific`	data on specific risk factors related to particular cancer types. This table was compiled with input from Research Network members.	`gel_cancer_risk_factor_100k.tsv`
`cancer_risk_factor_general`	data on general cancer risk factors, namely smoking status, height, weight and alcohol consumption. This table was compiled with input from Research Network members.	`gel_cancer_risk_factor_general_100k.tsv`
`cancer_surgery`	details of the surgical procedures performed, as well as the specific location of the intervention.	`gel_cancer_surgery_100k.tsv`

Tumour samples

LabKey table	Description	CloudOS tsv filename
`cancer_invest_circulating_tumour_marker`	biomarker measurements specific to particular cancer types (ovarian or prostate).	`gel_circulating_tumour_marker_100k.tsv`
`cancer_invest_sample_pathology`	full pathology reports and other related data from participants' tumour samples, around the time of diagnosis and characterisation of the cancer. Much of this information is also found in the `clinic_sample` and `cancer_participant_tumour` tables.	`gel_cancer_pathology_100k.tsv`
`cancer_specific_pathology`	pathology data specific to participants' cancer type. This may provide additional data to the `cancer_invest_sample_pathology` and `cancer_participant_tumour` tables.	`gel_cancer_specific_pathology_100k.tsv`
`cancer_systemic_anti_cancer_therapy`	details the regimen and intent of the participants' chemotherapy.

Consolidated data

LabKey table	Description	Primary or secondary	CloudOS tsv filename
`cancer_staging_consolidated`	combines staging information from our primary clinical data (`cancer_participant_tumour`) and secondary clinical data from PHE/NCRAS (`sact` and `av_tumour`) to give a stage for each sample we have sequenced and fully interpreted on our database (`cancer_analysis`). The staging information may be in the form of combined TNM, each TNM component, or other standards such as AJCC or Dukes. The genomic data are rematched to the clinical data using a correspondence dictionary between disease type (genomic data) and ICD code (clinical data), created and validated internally. In addition, the clinical stage information must be dated within one year of the date the sample was collected. The column names have been preserved as found in the original datasets, except for `tumour_pseudo_id`, which is found in both `sact` and `av_tumour` and has therefore been given a prefix with the dataset name. For each staging dataset used, when more than one entry was available for the same patient, the one closest to the clinical data collection has been kept.		`phe_gel_cancer_tumour_linkage_100k.tsv`
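The two date rules described for `cancer_staging_consolidated` (a one-year window and keeping the closest entry) can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the table: the record fields (`stage`, `stage_date`) are hypothetical, "one year" is taken as 365 days, and the sample collection date is used as the reference point for both rules.

```python
from datetime import date

MAX_GAP_DAYS = 365  # assumption: "one year" interpreted as 365 days


def closest_stage(sample_date, staging_records):
    """Return the staging record closest to sample_date within the window, or None."""

    def gap(record):
        return abs((record["stage_date"] - sample_date).days)

    in_window = [r for r in staging_records if gap(r) <= MAX_GAP_DAYS]
    return min(in_window, key=gap, default=None)


# Hypothetical staging entries for one participant.
records = [
    {"stage": "2", "stage_date": date(2015, 1, 1)},
    {"stage": "3A", "stage_date": date(2016, 3, 1)},
    {"stage": "3B", "stage_date": date(2016, 12, 1)},
]

print(closest_stage(date(2016, 6, 1), records))
```

Returning `None` when nothing falls inside the window makes it explicit that some sequenced samples may have no usable stage.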

TNM

The TNM Classification of Malignant Tumours (TNM) is a cancer staging notation system that uses alphanumeric codes to describe the stage of a cancer that originates from a solid tumour.

T describes the size of the original (primary) tumour and whether it has invaded nearby tissues.
N describes nearby (regional) lymph nodes that are involved.
M describes distant metastasis.

The code for a particular cancer is made up of these three parts along with other parameters and modifiers.
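As a rough illustration of the notation, the sketch below splits a simple code such as T2N1M0 into its three parts. The pattern is a deliberate simplification and an assumption made for illustration: it ignores prefixes and the other parameters and modifiers mentioned above, and it is not the validation applied to any table described on this page.

```python
import re

# Simplified pattern: T, N and M components only, no prefixes or extra modifiers.
TNM_PATTERN = re.compile(
    r"^T(?P<T>[0-4][a-d]?|is|X)\s*"
    r"N(?P<N>[0-3][a-c]?|X)\s*"
    r"M(?P<M>[01][a-c]?|X)$"
)


def parse_tnm(code):
    """Split a simple TNM code into its T, N and M parts; None if it does not match."""
    match = TNM_PATTERN.match(code.strip())
    return match.groupdict() if match else None


print(parse_tnm("T2N1M0"))
print(parse_tnm("T1a N0 MX"))
print(parse_tnm("stage II"))
```

Values that do not fit this simplified pattern return `None` rather than a guess, so they can be reviewed by hand.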

Bioinformatics analysis

LabKey table	Description	Primary or secondary	CloudOS tsv filename
`cancer_100K_genomes_realigned_on_pipeline_2`	Cancer genomes re-processed through Pipeline 2.0 (which uses Dragen v3.2.22 for alignment and germline variant calling + Strelka 2.9.9 for somatic small variants + Canvas 1.39 for somatic CNV + Manta 1.5 for somatic SVs). Also contains `somatic_small_variants_annotation_vcf` files and tumour in normal contamination (TINC) results for a subset of ~800 haematological samples.		`gel_dragen_realigned_100k_genomes_100k.tsv`

NHSE-NCRAS cancer clinical data

Data from the third party NHSE, including data from the National Cancer Registration and Analysis Service (NCRAS), describing cancer patients' medical history. NCRAS is responsible for cancer registration in England to support cancer epidemiology, public health, service monitoring and research.

Cancer Registration (AV) is the systematic collection of data about cancer and tumour diseases. In England, this data collection is managed by NCRAS. Every year, NCRAS collects information on over 300,000 cases of cancer, including patient details (such as name, address, age, sex and date of birth), as well as detailed data about the type of cancer, how advanced it is and the treatment the patient receives. At Genomics England, identifiable information is stripped from the data, which are then associated with the patient's participant_id so that they can be linked to other clinical data and to the genomic data.

This dataset brings together data from more than 500 local and regional datasets to build a picture of an individual's treatment from diagnosis.

tumour_ids in AV tables are assigned to participants by NCRAS and do not link to the tumour_ids assigned by GEL for sequencing and clinical data. Whilst the two may refer to the same cancer, you should be cautious when linking them.

Bug in rtds table

There is a bug in the NCRAS radiotherapy table, rtds, in 100kGP releases 17 and 18. Approximately 8% of all records in this table are missing dates. This is caused by an error in translating the three-letter month abbreviation Sep into a numbered date.
This will be fixed in release 19, due later in 2024.
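To see how much an analysis is affected, one option is to measure the share of records with an empty date before relying on the table. The sketch below is only an illustration: the column name `event_date` and the example rows are hypothetical, so substitute the actual date field from the data dictionary.

```python
import csv
import io

# Hypothetical rows; `event_date` is an assumed column name.
EXAMPLE_TSV = (
    "participant_id\tevent_date\n"
    "P001\t2015-08-14\n"
    "P002\t\n"
    "P003\t2015-10-02\n"
    "P004\t2016-01-20\n"
)


def missing_date_fraction(tsv_text, date_column):
    """Fraction of rows whose date column is empty or absent."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(date_column) or "").strip())
    return missing / len(rows)


print(missing_date_fraction(EXAMPLE_TSV, "event_date"))
```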

LabKey table	Description	CloudOS tsv filename
`av_patient`	demographics from the Cancer Registration and information about death, when applicable by the last day of data collection for the AV tables.	`ncras_cancer_patient_100k.tsv`
`av_tumour`	medical information about the tumour, including hormonal status (PR, ER and HER2), date of diagnosis, site, morphological and behaviour ICD10 codes as well as histology and grade. Table's `anon_tumour_id` is used to link treatment tables also available in NCRAS. One row per tumour (`av*` table specific `anon_tumour_id`), per participant at the point of registration of that cancer/tumour with NCRAS.	`ncras_cancer_tumour_100k.tsv`
`av_treatment`	treatment received by each participant. One participant may receive more than one treatment; treatments include surgery, chemotherapy, immunotherapy and radiotherapy.	`ncras_cancer_treatment_100k.tsv`
`av_rtd`	routes to diagnosis; these routes have been determined using a model that combines AV data with HES data, Cancer Waiting Times (CWT) data and data from the cancer screening programmes. Using these datasets, cancers registered in England that were diagnosed between 2006 and 2016 are categorised into one of eight Routes to Diagnosis.	`ncras_cancer_route_to_diagnosis_100k.tsv`
`av_imd`	income deprivation domain; measures the proportion of the population experiencing deprivation relating to low income. The definition of low income used includes both those people that are out-of-work and those that are in work but who have low earnings.	`ncras_cancer_index_of_multiple_deprivation_100k.tsv`
`cwt`	the National Cancer Waiting Times Monitoring Data Set supports the continued management and monitoring of waiting times.	`ncras_cancer_waiting_times_100k.tsv`
`ncras_did`	diagnostic imaging dataset; a central collection of detailed information about diagnostic imaging tests carried out on NHS patients, extracted from local radiology information systems. The DID captures information about referral source, details of the test (type of test and body site), demographic information such as GP registered practice, patient postcode, ethnicity, gender and date of birth, plus data items about different events (date of imaging request, date of imaging, date of reporting), which allow calculation of time intervals.	`ncras_diagnostic_imaging_metadata_100k.tsv`
`rtds`	radiotherapy dataset; is an existing standard (SCCI0111) that has required all NHS Acute Trust providers of radiotherapy services in England to collect and submit standardised data monthly against a nationally defined data set since 2009. The purpose of the standard is to collect consistent and comparable data across all NHS Acute Trust providers of radiotherapy services in England in order to provide intelligence for service planning, commissioning, clinical practice and research and the operational provision of radiotherapy services across England. Data are available from 01/04/2009.	`ncras_radiotherapy_100k.tsv`
`sact`	systemic anti-cancer therapy; contains clinical management data on patients receiving cancer chemotherapy, and newer agents that have anti-cancer effects, in or funded by the NHS in England. It covers chemotherapy treatment for all solid tumour and haematological malignancies, including those in clinical trials. It relates to all cancer patients, both adult and paediatric, in acute inpatient, day case and outpatient settings, and delivery in the community. Data are available for regimens between 11/09/16 and 15/12/17, with cycles ending by 15/02/18.	`ncras_systemic_anti_cancer_therapy_curated_100k.tsv`
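Because `anon_tumour_id` links `av_tumour` to the NCRAS treatment tables, the join can be sketched as below. Only `participant_id` and `anon_tumour_id` come from the descriptions above; the other fields (`site`, `treatment_type`) and the example values are hypothetical.

```python
from collections import defaultdict


def treatments_by_tumour(tumours, treatments):
    """Attach to each tumour row the treatment rows sharing its anon_tumour_id."""
    by_id = defaultdict(list)
    for treatment in treatments:
        by_id[treatment["anon_tumour_id"]].append(treatment)
    return [
        {**tumour, "treatments": by_id.get(tumour["anon_tumour_id"], [])}
        for tumour in tumours
    ]


# Hypothetical rows standing in for av_tumour and av_treatment.
tumours = [
    {"participant_id": "P001", "anon_tumour_id": "A1", "site": "C50"},
    {"participant_id": "P001", "anon_tumour_id": "A2", "site": "C18"},
]
treatments = [
    {"anon_tumour_id": "A1", "treatment_type": "surgery"},
    {"anon_tumour_id": "A1", "treatment_type": "radiotherapy"},
]

for row in treatments_by_tumour(tumours, treatments):
    print(row["anon_tumour_id"], [t["treatment_type"] for t in row["treatments"]])
```

Tumours without any matching treatment keep an empty list, which mirrors a left join and avoids silently dropping them.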

LUCADA

The National Lung Cancer Audit (LUCADA) looks at the care delivered during referral, diagnosis, treatment and outcomes for people diagnosed with lung cancer and mesothelioma. The data items in the LUCADA dataset have been compiled to meet the requirements of audit, and are not to be confused with the data items identified as Lung Cancer in the National Cancer dataset. The audit focuses on measuring the care given to lung cancer patients from diagnosis to the primary treatment package, assessing against standards and bringing about necessary improvements. The project supports the Calman Hine recommendations, the National Cancer Plan and other national guidance (e.g. NICE guidance) as it emerges.

The audit follows patients diagnosed between 01/01/2005 and 31/12/2013 (the vital status of each patient can be followed up with linkage to Cancer Registration data).

LabKey table	Description	Primary or secondary	CloudOS tsv filename
`lucada_2013`	contains, for 56 participants, data on the national lung cancer audit 2013.		`ncras_lung_cancer_dataset_2013_100k.tsv`
`lucada_2014`	contains, for 18 participants, data on the national lung cancer audit 2014.		`ncras_lung_cancer_dataset_2014_100k.tsv`

Cancer-specific GEL curated datasets - pilot

Genomics England is striving to improve the clinical data provided for its researchers. We understand the value of accurate and granular clinical data, especially in the context of cancer.

To deliver this, we are planning a series of pilot datasets, aiming to incorporate additional clinical data provided by the Public Health England cancer registry (NCRAS). Genomics England will aim to deliver cancer-specific datasets, with the initial focus on providing a broad pathological understanding. This will aim to incorporate data points such as molecular mutations and resection margins in pathology reports. The focus will then move to radiological imaging reports and, finally, to live, up-to-date clinical data. In addition, we are also including the date each participant was last seen alive (data provided up to October 2020) and dates and causes of death to aid with outcomes.

It must be stressed that this work is a development process, and we are working in unison with NCRAS to progress it. Whilst we do not possess the extensive experience and resources of Public Health England, we are developing a natural-language-based algorithm for focused data extraction. NCRAS have a team dedicated to curating clinical data, and the gold standard remains the NCRAS curated tables. However, for this dataset to improve and move forward, Genomics England is keen for feedback and for you to highlight areas for improvement.

You will note subtle differences in the structure of the table compared to the curated NCRAS tables, and thus additional data dictionaries have been provided. Genomics England hopes to continue developing this uncurated live dataset with feedback and looks forward to hearing your thoughts. Please send us your thoughts and suggestions via the Genomics England Service Desk, including "cancer_specific_datasets_pilot" in the title of your enquiry.

LabKey table	Description	Primary or secondary	CloudOS tsv filename
`sact_uncurated`	the raw feed from NCRAS that goes into their curation process to produce the `sact` table (both under the NCRAS section). This table extracts chemotherapy (SACT) information for cancer participants in the 100,000 Genomes Project from unlinked and unprocessed NCRAS chemotherapy data from 2008 until March 2021. It is likely to contain some errors; however, it contains clinical therapy data that are not yet available in the curated NCRAS registries, such as SNOMED CT diagnosis codes alongside ICD10. Importantly, this uncurated SACT table does not provide tumour IDs, so you must match this dataset to other NCRAS registries using dates. Please refer to the background and use caveats in the quality notes section of this release note.		`phe_systemic_anti_cancer_therapy_un_curated_100k.tsv`
`pathology_reports`	full-text pathology reports pertaining to participants from the 100,000 Genomes Project across all cancer types. Where available, multiple reports per participant are provided from before, around and after the WGS sample.
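Because `sact_uncurated` carries no tumour IDs, any link to a tumour has to be inferred from dates. The sketch below shows one possible heuristic, offered as an assumption and not as a recommended or validated method: each regimen is assigned to the participant's most recent tumour diagnosed on or before the regimen start date. The field names `start_date` and `diagnosis_date` are hypothetical.

```python
from datetime import date


def assign_tumour(regimen, tumours):
    """Pick the participant's latest tumour diagnosed on or before the regimen start."""
    candidates = [
        t for t in tumours
        if t["participant_id"] == regimen["participant_id"]
        and t["diagnosis_date"] <= regimen["start_date"]
    ]
    best = max(candidates, key=lambda t: t["diagnosis_date"], default=None)
    return best["anon_tumour_id"] if best else None


# Hypothetical tumour registrations for one participant.
tumours = [
    {"participant_id": "P001", "anon_tumour_id": "A1", "diagnosis_date": date(2012, 5, 1)},
    {"participant_id": "P001", "anon_tumour_id": "A2", "diagnosis_date": date(2017, 2, 1)},
]

print(assign_tumour({"participant_id": "P001", "start_date": date(2014, 3, 10)}, tumours))
print(assign_tumour({"participant_id": "P001", "start_date": date(2010, 1, 1)}, tumours))
```

A regimen that predates every registered tumour is left unassigned, which keeps uncertain links visible instead of forcing a match.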