100kGP Cancer-specific clinical data

Some tables in LabKey contain data specific to cancer participants.

CloudOS files

Names of the equivalent file in CloudOS are stated after the table name in brackets.

Primary and secondary data tables

Primary clinical data were collected when participants were enrolled in the programme; tables are tagged with .

Secondary clinical data were obtained from third parties such as NHSE; tables are tagged with .

The central table is cancer_analysis (gel_cancer_analysis_100k.tsv), which contains all tumour samples that have been sequenced, had variants called and successfully passed through the Genomics England interpretation pipeline. For each tumour sample, the table gives the matched germline information, as well as information about the tumour, sequencing quality control metrics, tumour mutational burden (somatic_coding_variants_per_mb), signatures and the paths to the BAM and VCF files. One participant may have more than one tumour sample.
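
As a minimal sketch of working with a table of this shape, the snippet below counts tumour samples per participant in a tab-separated extract and lists participants with more than one sample. The rows are made up, and the column name tumour_sample_platekey is an assumption for illustration; check the real headers in gel_cancer_analysis_100k.tsv before adapting it.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for gel_cancer_analysis_100k.tsv. The rows are made up,
# and tumour_sample_platekey is an assumed column name.
SAMPLE_TSV = (
    "participant_id\ttumour_sample_platekey\tsomatic_coding_variants_per_mb\n"
    "P001\tT001\t1.2\n"
    "P001\tT002\t3.4\n"
    "P002\tT003\t0.8\n"
)


def samples_per_participant(handle):
    """Count rows (tumour samples) per participant in a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["participant_id"] for row in reader)


counts = samples_per_participant(io.StringIO(SAMPLE_TSV))

# Participants with more than one tumour sample in the extract.
multi_sample = sorted(p for p, n in counts.items() if n > 1)
print(multi_sample)
```

To run this against a real extract, pass an open file handle instead of the in-memory string.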

Cancer clinical data

Cancer data are presented at the participant level or the sample level. All tumour samples have a matched germline sample. One participant might have more than one tumour sample; these may be samples taken at different time points, samples from two different tumours or, rarely, biological replicates. The latter are often part of TracerX, which is not available to commercial users.

Cancer participants

  • cancer_participant_disease (gel_cancer_disease_100k.tsv) data about participants' cancer disease type and subtype.
  • cancer_participant_tumour (gel_cancer_tumour_100k.tsv) data that characterises the tumour, e.g. staging and grading; morphology and location; recurrence at time of enrolment; and the basis of diagnosis.
  • cancer_care_plan (gel_cancer_care_plan_100k.tsv) information from participants' NHS cancer care plan on their treatment and care intent, in particular outcomes of MDT meetings and coded connected data (e.g. diagnoses from scans).
  • cancer_invest_imaging (gel_cancer_imaging_100k.tsv) coded data on imaging investigations characterising the scan, its modality, anatomical site and outcome; as well as the outcome of the imaging report in free text form.
  • cancer_participant_tumour_metastatic_site (gel_cancer_tumour_metastases_100k.tsv) the site of any metastatic disease in the body at diagnosis.
  • cancer_risk_factor_cancer_specific (gel_cancer_risk_factor_100k.tsv) data on specific risk factors related to particular cancer types. This table was compiled with input from Research Network members.
  • cancer_risk_factor_general (gel_cancer_risk_factor_general_100k.tsv) data on general cancer risk factors, namely smoking status, height, weight and alcohol consumption. This table was compiled with input from Research Network members.
  • cancer_surgery (gel_cancer_surgery_100k.tsv) details of the surgical procedures performed, as well as the specific location of the intervention.

Tumour samples

  • cancer_invest_circulating_tumour_marker (gel_circulating_tumour_marker_100k.tsv) biomarker measurements specific to particular cancer types (ovarian or prostate).
  • cancer_invest_sample_pathology (gel_cancer_pathology_100k.tsv) full pathology reports and other related data on participants' tumour samples, covering diagnosis and characterisation of the cancer. Much of this information is also found in the clinic_sample and cancer_participant_tumour tables.
  • cancer_specific_pathology (gel_cancer_specific_pathology_100k.tsv) pathology data specific to participants' cancer type. This may provide additional data to the cancer_invest_sample_pathology and cancer_participant_tumour tables.
  • cancer_systemic_anti_cancer_therapy details the regimen and intent of the participants' chemotherapy.

Consolidated data

cancer_staging_consolidated (phe_gel_cancer_tumour_linkage_100k.tsv) combines staging information from our primary clinical data (cancer_participant_tumour) and secondary clinical data from PHE/NCRAS (sact and av_tumour) to give a stage for each sample we have sequenced and fully interpreted on our database (cancer_analysis). The staging information may be in the form of combined TNM, its individual components, or other standards such as AJCC or Dukes.

The genomic data are rematched to the clinical data using a correspondence dictionary between disease type (genomic data) and ICD code (clinical data), created and validated internally. In addition, the clinical stage must have been recorded within one year of the date the sample was collected.
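
The two matching conditions can be sketched as follows. This is an illustration only, not the internal implementation: the dictionary excerpt, the disease type labels and the function name are hypothetical, and "one year" is assumed to mean 365 days in either direction.

```python
from datetime import date

# Assumption: "one year" is interpreted as 365 days before or after collection.
MAX_GAP_DAYS = 365

# Hypothetical excerpt of a disease type -> ICD-10 prefix dictionary.
# The real correspondence dictionary is internal and is not reproduced here.
DISEASE_TO_ICD_PREFIXES = {
    "BREAST": ("C50",),
    "COLORECTAL": ("C18", "C19", "C20"),
}


def stage_matches_sample(disease_type, icd_code, stage_date, sample_date):
    """Return True if a clinical stage record may be used for a tumour sample."""
    prefixes = DISEASE_TO_ICD_PREFIXES.get(disease_type, ())
    # str.startswith accepts a tuple; an empty tuple never matches.
    same_disease = icd_code.startswith(prefixes)
    close_in_time = abs((stage_date - sample_date).days) <= MAX_GAP_DAYS
    return same_disease and close_in_time


print(stage_matches_sample("BREAST", "C50.9", date(2016, 1, 10), date(2016, 6, 1)))
```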

The column names have been preserved as found in the original datasets, except for tumour_pseudo_id, which is found in both sact and av_tumour and has therefore been prefixed with the dataset name.


The TNM Classification of Malignant Tumours (TNM) is a cancer staging notation system that describes the stage of a cancer that originates from a solid tumour with alphanumeric codes.

  • T describes the size of the original (primary) tumour and whether it has invaded nearby tissues.
  • N describes nearby (regional) lymph nodes that are involved.
  • M describes distant metastasis.

The code for a particular cancer is made up of these three parts along with other parameters and modifiers.
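
A minimal sketch of splitting a combined TNM string into its three components is shown below. The pattern is a simplification written for illustration: it handles common forms such as "T2N1M0" or "pT1cN0M0" and returns None for anything else, so it should not be taken as a full parser for the staging values found in the tables.

```python
import re

# Simplified pattern: optional c/p/y prefix letters, then T, N and M categories.
# Real staging strings can carry modifiers that this pattern does not cover.
TNM_PATTERN = re.compile(
    r"^(?P<prefix>[cpy]*)T(?P<t>[0-4][a-d]?|is|X)"
    r"N(?P<n>[0-3][a-c]?|X)"
    r"M(?P<m>[01][a-c]?|X)$",
    re.IGNORECASE,
)


def parse_tnm(code):
    """Split a combined TNM string into T, N and M parts, or return None."""
    match = TNM_PATTERN.match(code.strip().replace(" ", ""))
    if match is None:
        return None
    return {"T": match["t"], "N": match["n"], "M": match["m"]}


print(parse_tnm("pT1cN0M0"))
```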

NHSE-NCRAS cancer clinical data

These data come from the third party NHSE, including data from the National Cancer Registration and Analysis Service (NCRAS), and describe cancer patients' medical history. The NCRAS is responsible for cancer registration in England to support cancer epidemiology, public health, service monitoring and research.

Cancer Registration (AV) is the systematic collection of data about cancer and tumour diseases. In England, this data collection is managed by NCRAS. Every year, NCRAS collects information on over 300,000 cases of cancer, including patient details (name, address, age, sex and date of birth), as well as detailed data about the type of cancer, how advanced it is and the treatment the patient receives. At Genomics England the data are stripped of identifiable information and associated with the patient's participant_id so that they can be linked to other clinical data and to the genomic data.

AV tables gather data for patients diagnosed with cancer between 1 January 1995 and 31 December 2017. This dataset brings together data from more than 500 local and regional datasets to build a picture of an individual's treatment from diagnosis.

tumour_ids in AV tables are assigned to participants by NCRAS and do not link to the tumour_ids assigned by GEL for sequencing and clinical data. Whilst the two may refer to the same cancer, you should be cautious when linking them.
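
One cautious way to relate the two sets of records is to join on participant_id only and treat every AV tumour for that participant as a candidate rather than a confirmed match. The sketch below illustrates this with made-up rows; the identifier values and the function name are hypothetical.

```python
from collections import defaultdict

# Made-up rows for illustration; the identifier values are hypothetical.
gel_tumours = [
    {"participant_id": "P001", "tumour_id": "GEL-1"},
    {"participant_id": "P002", "tumour_id": "GEL-2"},
]
av_tumours = [
    {"participant_id": "P001", "tumour_id": "AV-9"},
    {"participant_id": "P001", "tumour_id": "AV-10"},
]


def candidate_links(gel_rows, av_rows):
    """Map each GEL tumour to all AV tumour_ids of the same participant.

    The tumour_ids themselves are never compared, because they come from
    different identifier schemes.
    """
    av_by_participant = defaultdict(list)
    for row in av_rows:
        av_by_participant[row["participant_id"]].append(row["tumour_id"])
    return {
        (g["participant_id"], g["tumour_id"]): av_by_participant.get(
            g["participant_id"], []
        )
        for g in gel_rows
    }


links = candidate_links(gel_tumours, av_tumours)
print(links)
```

Candidates found this way still need checking against other fields, such as site or date of diagnosis, before they are treated as the same cancer.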

  • av_imd (ncras_cancer_index_of_multiple_deprivation_100k.tsv; income deprivation domain) measures the proportion of the population experiencing deprivation relating to low income. The definition of low income used includes both those people that are out-of-work and those that are in work but who have low earnings.
  • av_patient (ncras_cancer_patient_100k.tsv) demographics from the Cancer Registration and information about death, when applicable by the last day of data collection for the AV tables.
  • av_rtd (ncras_cancer_route_to_diagnosis_100k.tsv; routes to diagnosis) These routes have been determined using a model that combines AV data with HES data, Cancer Waiting Times (CWT) data and data from the cancer screening programmes.
  • av_treatment (ncras_cancer_treatment_100k.tsv) treatment received by each participant. One participant may receive more than one treatment, which can include surgery, chemotherapy, immunotherapy and radiotherapy.
  • av_tumour (ncras_cancer_tumour_100k.tsv) medical information about the tumour, including hormonal status (PR, ER and HER2), date of diagnosis, site, morphological and behaviour ICD10 codes as well as histology and grade.


The National Lung Cancer Audit (LUCADA) looks at the care delivered during referral, diagnosis, treatment and outcomes for people diagnosed with lung cancer and mesothelioma. The data items in the LUCADA dataset have been compiled to meet the requirements of audit, and are not to be confused with the data items identified as Lung Cancer in the National Cancer dataset. The audit focuses on measuring the care given to lung cancer patients from diagnosis to the primary treatment package, assessing against standards and bringing about necessary improvements. The project supports the Calman Hine recommendations, the National Cancer Plan and other national guidance (e.g. NICE guidance) as it emerges.

The audit follows patients diagnosed between 01/01/2005 and 31/12/2013 (the vital status of each patient can be followed up with linkage to Cancer Registration data).

  • lucada_2013 (ncras_lung_cancer_dataset_2013_100k.tsv) contains, for 56 participants, data on the national lung cancer audit 2013.
  • lucada_2014 (ncras_lung_cancer_dataset_2014_100k.tsv) contains, for 18 participants, data on the national lung cancer audit 2014.

Other data from NHSE

  • cwt (ncras_cancer_waiting_times_100k.tsv) the National Cancer Waiting Times Monitoring Data Set supports the continued management and monitoring of waiting times.
  • ncras_did (ncras_diagnostic_imaging_metadata_100k.tsv; diagnostic imaging dataset) a central collection of detailed information about diagnostic imaging tests carried out on NHS patients, extracted from local radiology information systems. The DID captures information about referral source, details of the test (type of test and body site), demographic information such as GP registered practice, patient postcode, ethnicity, gender and date of birth, plus data items about different events (date of imaging request, date of imaging, date of reporting), which allow calculation of time intervals. Data are available for patients diagnosed between 1 January 2013 and 31 December 2015.
  • rtds (ncras_radiotherapy_100k.tsv; radiotherapy dataset) is an existing standard (SCCI0111) that has required all NHS Acute Trust providers of radiotherapy services in England to collect and submit standardised data monthly against a nationally defined data set since 2009. The purpose of the standard is to collect consistent and comparable data across all these providers in order to provide intelligence for service planning, commissioning, clinical practice, research and the operational provision of radiotherapy services across England. Data are available from 01/04/2009.
  • sact (ncras_systemic_anti_cancer_therapy_curated_100k.tsv; systemic anti-cancer therapy) contains clinical management data on patients receiving cancer chemotherapy, and newer agents that have anti-cancer effects, in or funded by the NHS in England. It covers chemotherapy treatment for all solid tumours and haematological malignancies, including those in clinical trials. It relates to all cancer patients, both adult and paediatric, in acute inpatient, day case and outpatient settings, and to delivery in the community. Data are available for regimens between 11/09/16 and 15/12/17, with cycles ending by 15/02/18.
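
The time intervals mentioned for ncras_did can be derived from the three event dates. The following is a minimal sketch with made-up dates; the function and key names are hypothetical, and the date columns in the real file may be named and formatted differently.

```python
from datetime import date


def imaging_intervals(request_date, imaging_date, report_date):
    """Return the waiting intervals, in days, between the three DID events."""
    return {
        "request_to_imaging_days": (imaging_date - request_date).days,
        "imaging_to_report_days": (report_date - imaging_date).days,
    }


# Made-up dates for illustration only.
intervals = imaging_intervals(
    request_date=date(2014, 3, 3),
    imaging_date=date(2014, 3, 17),
    report_date=date(2014, 3, 20),
)
print(intervals)
```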

