Skip to content

Cancer analysis histology and TGCA study

The cancer_analysis table includes the results of bioinformatic sequence analysis, as well as categorisation of the cancer based on histology to TGCA studies.

The disease subtype attempts to classify tumours according to tumour type and was made at the time of registration. Thus, some participants may well have had their histological diagnosis adjusted over the course of treatment based on histology of resection specimens. This has lead to a need to provide additional histological data on cancer genomes alongside a consideration of updating the classification of tumours registered within the 100,000 Genome Project.

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. In an attempt to continue to develop Genomics England cancer classification and to align with TCGA, we have now added histology from the Public Health England National Cancer Register and allocated a matched TCGA code where possible. The process to match vast volumes of data require clear processes and these have been documented below in the release note.

Overview and purpose

To classify tumours based on histology, we match details in the cancer_analysis first with av_tumour, which is considered the main source of truth for tumours, and then hes_apc. The match_rank criteria is as follows:

match_rank meaning
1 Information in cancer_analysis, av_tumour and apc all agree with one another.
2 Information in cancer_analysis and av_tumour agree
3 Information in av_tumour and apc agree
4 Information in cancer_analysis and apc agree
5 There is no linkage between cancer_analysis, av_tumour and apc

Pipeline and logic


  1. A disease_type_translations file created by domain experts consisting of four columns: Study.Abbreviation, Study.Name, gel_ca.disease_type, ICD10, ICD.O.3. Here Study.Abbreviation and Study.Name refer to the TCGA code and corresponding name. gel_ca.disease_type refers to the cancer_analysis disease type that this TCGA code falls under.
  2. The av_tumour LabKey table - This table is needed to find the relevant ICD10 and ICD.O.3 codes for a given participant in cancer_analysis. Based on disease_type_translations, these ICD10 codes and ICD.O.3 codes are then mapped to TCGA codes. There may be multiple rows in av_tumour for a given participant, so the row with diagnosisdatebest closest to tumour_clinical_sample_time is chosen.
  3. The apc LabKey table - For any participants where there is no data for in av_tumour, ICD10 codes are extracted from apc and mapped to possible TCGA codes using 'disease_type_translations' file. Again, there may be multiple rows in apc for a given participant so the row with opdate_01 closest to tumour_clinical_sample_time is chosen.
  4. The most up-to-date cancer_analysis table from LabKey.
  5. The participant list for the Data Release version at hand.


Prepare the data:

  1. Read in cancer_analysis the original cancer_analysis table and keep all the columns.
  2. Read in the whole disease_type_translations
  3. Read in av_tumour and apc. The columns needed are as follows:

    table field use
    av_tumour * participant_id
    * diagnosisdatebest
    av_tumour * site_coded_3char
    * histology_coded
    * histology_coded_desc
    apc * participant_id
    * opdate_01
    apc * diag_01 cross-reference
  4. Filter apc based on diag_01 code. Firstly select all rows that have a diag_01 code containing a 'C'. Then remove any trailing X | - | D | - | A characters. and convert the diag_01 codes to 3 character length. Now av_tumour, apc and 'disease_type_translations' all have 3 character ICD10 codes.

Compare av_tumour and cancer analysis

  1. Join av_tumour to cancer_analysis on participant_id. This provides a table with all the cancer analysis columns plus diagnosisdatebest, site_coded_3char, histology_coded, histology_coded_desc from av_tumour. However there are extra rows for given participants based on multiple rows in av_tumour for that participant.
  2. Create a new column named av_tum.date_difference representing the absolute difference in days between diagnosisdatebest and tumour_clinical_sample_time. Replace any null values with a placeholder value of 100000 days.
  3. For instances where we have multiple rows in av_tumour, select the rows in the joined table with the minimum av_tum.date_difference. This provides a table with the original cancer_analysis data and any data from av_tumour closest to the tumour_clinical_sample_time. Refer to this joined table as 'ca_av_tum'.

Note: Not all the participants in cancer_analysis appear in av_tumour, or there is no ICD10 code and histology_code in av_tumour for some participants in cancer_analysis. These are identifiable by looking at null diagnosisdatebest in 'ca_av_tum' and these are the participants where additional information from apc is required.

Map the ICD10 codes and ICD.O.3 codes from av_tumour to TCGA codes

  1. Join the 'disease_type_translations' table with the 'ca_av_tum' on the following columns:

    disease_type_translations columns ca_av_tum columns
    ICD10 site_coded_3char
    ICD.O.3 histology_coded

    The disease_type_translations file contains mappings to TCGA codes for many combinations of ICD10 and ICD.O.3 and thus mappings to corresponding disease_types based on these TCGA codes. This means that given a participant in cancer_analysis, if there is data for that participant in av_tumour, then the joined table will have corresponding TCGA code and the relevant disease_type based on this TCGA code. This acts as a point of comparison between the disease_type in cancer_analysis and the additional data provided by av_tumour.

  2. Temporarily create a av_tum.gel_match column acknowledging whether the data in av_tumour agrees with the data in cancer_analysis. Return TRUE if they are in agreement and FALSE otherwise. Note, null values will also return FALSE which is not a problem. This will later be used to aid the match_rank system.

Compare apc and cancer analysis

Repeat similar steps as above but for apc instead of av_tumour.

  1. Join apc to cancer_analysis on participant_id.
  2. Create a new column named apc.date_difference representing the absolute difference in days between opdate_01 and tumour_clinical_sample_time. Replace null values with a placeholder value of 100000 days.
  3. For instances where there are multiple rows in apc for a participant in cancer_analysis, select the rows in the joined table with minimum apc.date_difference. Refer to this joined table as 'ca_apc'.

Note: ICD10 codes from diag_01 in apc is the only additional data that can be provided by apc. There are some instances where no ICD.O.3 code is needed to determine a unique TCGA code however there are also ambiguous cases where one ICD10 code could map to multiple TCGA codes. In these instances, domain experts manually went through these cases and selected a unique TCGA code based on the information for that participant.

Map ICD10 codes from apc to TCGA codes

  1. Reformat the 'disease_type_translations' table so that it contains all the possible TCGA codes for a given ICD10 code.

  2. Join 'disease_type_translations' to 'ca_apc' on the ICD10 columns:

    disease_type_translations column ca_apc column
    ICD10 diag_01

    This results in a table with TCGA codes for cancer_analysis participants based on ICD10 codes in apc.

  3. Temporarily create an apc.gel_match column acknowledging whether the data in apc agrees with the data in cancer_analysis. Return TRUE if they agree and FALSE otherwise.

Combine ca_apc and ca_av_tum and deduce match_rankings

  1. Join ca_apc and ca_av_tum on all the columns in cancer_analysis and drop diagnosisdatebest and opdate_01. This results in a table with all the columns in cancer analysis and the following fields:

    table field
    ca_av_tum * site_coded_3char
    * histology_coded
    * histology_coded_desc
    * av_tum.Study.Abbreviation
    * av_tum.Study.Name
    * av_tum.gel_ca.disease_type
    * av_tum.date_difference
    * av_tum.gel_match
    ca_apc * diag_01
    * apc.Study.Abbreviation
    * apc.Study.Name
    * apc.gel_ca.disease_type
    * apc.date_difference
    * apc.gel_match

    Where .Study.Abbreviation, Study.Name, gel_ca.disease_type refers to the joined TCGA code, name and disease_type from the disease_type_translations file.

  2. Now create new columns in the joined table taking the Study.Abbreviation from ca_av_tum and filling in any null values with the Study.Abbreviation from 'ca_apc'. This prioritises matches with av_tumour and fills in any gaps with matches in apc.

  3. Repeat the above for Study.Name and gel_ca.disease_type.
  4. Create a new column named match_rank. Fill it in based on the following conditions:
match_rank condition meaning
1 if av_tum.gel_match == TRUE and apc.gel_match == TRUE Information in cancer_analysis, av_tumour and apc all agree with one another.
2 Omitting all rows with match_rank 1:
if av_tum.gel_match == TRUE
Information in cancer_analysis and av_tumour agree
3 Omitting all rows with match_rank 1 and 2:
if av_tum.gel_ca.disease_type == apc.gel_ca.disease_type
Information in av_tumour and apc agree
4 Omitting all rows with match_rank 1,2 and 3:
if disease_type == apc.gel_ca.disease_type
Information in cancer_analysis and apc agree
5 Omit rows with match_rank 1,2,3 and 4 leaves data where there is no linkage between cancer_analysis, av_tumour and apc. No linkage - either there is no data in av_tumour or apc, or there is no agreement between all 3.

Drop unneeded columns leaving the following schema

from cancer_analysis:

  • all columns

from av_tumour:

  • histology_coded
  • histology_coded_desc

from disease_type_translations:

  • Study.Name
  • Study.Abbreviation

calculated columns:

  • av_tum.date_difference
  • apc.date_difference
  • match_rank
Code lists
Study Abbreviation Study Name GEL Disease type
GBM Glioblastoma multiforme ADULT_GLIOMA
LGG Brain Lower Grade Glioma ADULT_GLIOMA
NEURO Neurological other ADULT_GLIOMA
BLAD Bladder other BLADDER
BLCA Bladder Urothelial Carcinoma BLADDER
BRCA Breast invasive carcinoma BREAST
COAD Colon adenocarcinoma COLORECTAL
CRC Colorectal other COLORECTAL
GINET Colorectal and Upper GI neuroendocrine COLORECTAL
READ Rectum adenocarcinoma COLORECTAL
ACC Adrenocortical carcinoma ENDOCRINE
ENDO Endocrine other ENDOCRINE
THCA Thyroid carcinoma ENDOCRINE
UCEC Uterine Corpus Endometrial Carcinoma ENDOMETRIAL_CARCINOMA
UCESC Uterine Corpus Endometrial Serous Carcinoma ENDOMETRIAL_CARCINOMA
AML Acute myeloid Leukaemia HAEMONC
DLBC Lymphoid Neoplasm Diffuse Large B-cell Lymphoma HAEMONC
HAEM Haematological malignancy other HAEMONC
HPB Hepatopancreatobiliary other HEPATOPANCREATOBILIARY
LIHC Liver hepatocellular carcinoma HEPATOPANCREATOBILIARY
LCNEC Lung neuroendocrine LUNG
LUAD Lung adenocarcinoma LUNG
LUG Lung other LUNG
LUSC Lung squamous cell carcinoma LUNG
MESO Mesothelioma LUNG
HNSC Head and Neck squamous cell carcinoma ORAL_OROPHARYNGEAL
DERM Other malignant neoplasms of skin OTHER
EDOC Ovarian endometrioid adenocarcinoma OVARIAN
HGSOC High grade Ovarian serous cystadenocarcinoma OVARIAN
LGSOC Low grade Ovarian serous carcinoma OVARIAN
OVAR Ovarian other OVARIAN
PRAD Prostate adenocarcinoma PROSTATE
PROS Prostate other PROSTATE
KICH Kidney Chromophobe RENAL
KIRC Kidney renal clear cell carcinoma RENAL
KIRP Kidney renal papillary cell carcinoma RENAL
NEPH Renal other RENAL
EAC Esophageal adenocarcinoma UPPER_GASTROINTESTINAL
ESCA Esophageal squamous cell carcinoma UPPER_GASTROINTESTINAL
INSIT In-situ and pre-malignant lesions
Other Other/ Unmatched/ NA

Last update: November 8, 2023