Cancer analysis histology and TGCA study¶

The 100kGP cancer_analysis table includes the results of bioinformatic sequence analysis, as well as categorisation of the cancer based on histology to TGCA studies.

The disease subtype attempts to classify tumours according to tumour type and was made at the time of registration. Thus, some participants may well have had their histological diagnosis adjusted over the course of treatment based on histology of resection specimens. This has lead to a need to provide additional histological data on cancer genomes alongside a consideration of updating the classification of tumours registered within the 100,000 Genome Project.

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. In an attempt to continue to develop Genomics England cancer classification and to align with TCGA, we have now added histology from the Public Health England National Cancer Register and allocated a matched TCGA code where possible. The process to match vast volumes of data require clear processes and these have been documented below in the release note.

Overview and purpose¶

To classify tumours based on histology, we match details in the cancer_analysis first with av_tumour, which is considered the main source of truth for tumours, and then hes_apc. The match_rank criteria is as follows:

`match_rank`	meaning
1	Information in `cancer_analysis`, `av_tumour` and `apc` all agree with one another.
2	Information in `cancer_analysis` and `av_tumour` agree
3	Information in `av_tumour` and `apc` agree
4	Information in `cancer_analysis` and `apc` agree
5	There is no linkage between `cancer_analysis`, `av_tumour` and `apc`

Pipeline and logic¶

Prerequisites¶

A disease_type_translations file created by domain experts consisting of four columns: Study.Abbreviation, Study.Name, gel_ca.disease_type, ICD10, ICD.O.3. Here Study.Abbreviation and Study.Name refer to the TCGA code and corresponding name. gel_ca.disease_type refers to the cancer_analysis disease type that this TCGA code falls under.
The av_tumour LabKey table - This table is needed to find the relevant ICD10 and ICD.O.3 codes for a given participant in cancer_analysis. Based on disease_type_translations, these ICD10 codes and ICD.O.3 codes are then mapped to TCGA codes. There may be multiple rows in av_tumour for a given participant, so the row with diagnosisdatebest closest to tumour_clinical_sample_time is chosen.
The apc LabKey table - For any participants where there is no data for in av_tumour, ICD10 codes are extracted from apc and mapped to possible TCGA codes using 'disease_type_translations' file. Again, there may be multiple rows in apc for a given participant so the row with opdate_01 closest to tumour_clinical_sample_time is chosen.
The most up-to-date cancer_analysis table from LabKey.
The participant list for the Data Release version at hand.

Logic¶

Prepare the data:¶

Read in cancer_analysis the original cancer_analysis table and keep all the columns.
Read in the whole disease_type_translations

Read in av_tumour and apc. The columns needed are as follows:

table	field	use
av_tumour	* participant_id * diagnosisdatebest	identifier
av_tumour	* site_coded_3char * histology_coded * histology_coded_desc	cross-reference
apc	* participant_id * opdate_01	identifier
apc	* diag_01	cross-reference

Filter apc based on diag_01 code. Firstly select all rows that have a diag_01 code containing a 'C'. Then remove any trailing X | - | D | - | A characters. and convert the diag_01 codes to 3 character length. Now av_tumour, apc and 'disease_type_translations' all have 3 character ICD10 codes.

Compare av_tumour and cancer analysis¶

Join av_tumour to cancer_analysis on participant_id. This provides a table with all the cancer analysis columns plus diagnosisdatebest, site_coded_3char, histology_coded, histology_coded_desc from av_tumour. However there are extra rows for given participants based on multiple rows in av_tumour for that participant.
Create a new column named av_tum.date_difference representing the absolute difference in days between diagnosisdatebest and tumour_clinical_sample_time. Replace any null values with a placeholder value of 100000 days.
For instances where we have multiple rows in av_tumour, select the rows in the joined table with the minimum av_tum.date_difference. This provides a table with the original cancer_analysis data and any data from av_tumour closest to the tumour_clinical_sample_time. Refer to this joined table as 'ca_av_tum'.

Note: Not all the participants in cancer_analysis appear in av_tumour, or there is no ICD10 code and histology_code in av_tumour for some participants in cancer_analysis. These are identifiable by looking at null diagnosisdatebest in 'ca_av_tum' and these are the participants where additional information from apc is required.

Map the ICD10 codes and ICD.O.3 codes from av_tumour to TCGA codes¶

Join the 'disease_type_translations' table with the 'ca_av_tum' on the following columns:

disease_type_translations columns ca_av_tum columns

ICD10 site_coded_3char

ICD.O.3 histology_coded

The disease_type_translations file contains mappings to TCGA codes for many combinations of ICD10 and ICD.O.3 and thus mappings to corresponding disease_types based on these TCGA codes. This means that given a participant in cancer_analysis, if there is data for that participant in av_tumour, then the joined table will have corresponding TCGA code and the relevant disease_type based on this TCGA code. This acts as a point of comparison between the disease_type in cancer_analysis and the additional data provided by av_tumour.
Temporarily create a av_tum.gel_match column acknowledging whether the data in av_tumour agrees with the data in cancer_analysis. Return TRUE if they are in agreement and FALSE otherwise. Note, null values will also return FALSE which is not a problem. This will later be used to aid the match_rank system.

Compare apc and cancer analysis¶

Repeat similar steps as above but for apc instead of av_tumour.

Join apc to cancer_analysis on participant_id.
Create a new column named apc.date_difference representing the absolute difference in days between opdate_01 and tumour_clinical_sample_time. Replace null values with a placeholder value of 100000 days.
For instances where there are multiple rows in apc for a participant in cancer_analysis, select the rows in the joined table with minimum apc.date_difference. Refer to this joined table as 'ca_apc'.

Note: ICD10 codes from diag_01 in apc is the only additional data that can be provided by apc. There are some instances where no ICD.O.3 code is needed to determine a unique TCGA code however there are also ambiguous cases where one ICD10 code could map to multiple TCGA codes. In these instances, domain experts manually went through these cases and selected a unique TCGA code based on the information for that participant.

Map ICD10 codes from apc to TCGA codes¶

Reformat the 'disease_type_translations' table so that it contains all the possible TCGA codes for a given ICD10 code.
Join 'disease_type_translations' to 'ca_apc' on the ICD10 columns:

disease_type_translations column ca_apc column

ICD10 diag_01

This results in a table with TCGA codes for cancer_analysis participants based on ICD10 codes in apc.
Temporarily create an apc.gel_match column acknowledging whether the data in apc agrees with the data in cancer_analysis. Return TRUE if they agree and FALSE otherwise.

Combine ca_apc and ca_av_tum and deduce match_rankings¶

Join ca_apc and ca_av_tum on all the columns in cancer_analysis and drop diagnosisdatebest and opdate_01. This results in a table with all the columns in cancer analysis and the following fields:

table	field
ca_av_tum	* site_coded_3char * histology_coded * histology_coded_desc * av_tum.Study.Abbreviation * av_tum.Study.Name * av_tum.gel_ca.disease_type * av_tum.date_difference * av_tum.gel_match
ca_apc	* diag_01 * apc.Study.Abbreviation * apc.Study.Name * apc.gel_ca.disease_type * apc.date_difference * apc.gel_match

Where .Study.Abbreviation, Study.Name, gel_ca.disease_type refers to the joined TCGA code, name and disease_type from the disease_type_translations file.

Now create new columns in the joined table taking the Study.Abbreviation from ca_av_tum and filling in any null values with the Study.Abbreviation from 'ca_apc'. This prioritises matches with av_tumour and fills in any gaps with matches in apc.
Repeat the above for Study.Name and gel_ca.disease_type.
Create a new column named match_rank. Fill it in based on the following conditions:

match_rank	condition	meaning
1	if av_tum.gel_match == TRUE and apc.gel_match == TRUE	Information in `cancer_analysis`, `av_tumour` and `apc` all agree with one another.
2	Omitting all rows with match_rank 1: if av_tum.gel_match == TRUE	Information in `cancer_analysis` and `av_tumour` agree
3	Omitting all rows with match_rank 1 and 2: if av_tum.gel_ca.disease_type == apc.gel_ca.disease_type	Information in `av_tumour` and `apc` agree
4	Omitting all rows with match_rank 1,2 and 3: if disease_type == apc.gel_ca.disease_type	Information in `cancer_analysis` and `apc` agree
5	Omit rows with match_rank 1,2,3 and 4 leaves data where there is no linkage between cancer_analysis, av_tumour and apc.	No linkage - either there is no data in av_tumour or apc, or there is no agreement between all 3.

Drop unneeded columns leaving the following schema¶

from cancer_analysis:

all columns

from av_tumour:

histology_coded
histology_coded_desc

from disease_type_translations:

Study.Name
Study.Abbreviation

calculated columns:

av_tum.date_difference
apc.date_difference
match_rank

Code lists¶

Study Abbreviation	Study Name	Genomics England Disease type
GBM	Glioblastoma multiforme	ADULT_GLIOMA
LGG	Brain Lower Grade Glioma	ADULT_GLIOMA
NEURO	Neurological other	ADULT_GLIOMA
BLAD	Bladder other	BLADDER
BLCA	Bladder Urothelial Carcinoma	BLADDER
BRCA	Breast invasive carcinoma	BREAST
COAD	Colon adenocarcinoma	COLORECTAL
CRC	Colorectal other	COLORECTAL
GINET	Colorectal and Upper GI neuroendocrine	COLORECTAL
READ	Rectum adenocarcinoma	COLORECTAL
ACC	Adrenocortical carcinoma	ENDOCRINE
ENDO	Endocrine other	ENDOCRINE
THCA	Thyroid carcinoma	ENDOCRINE
UCEC	Uterine Corpus Endometrial Carcinoma	ENDOMETRIAL_CARCINOMA
UCESC	Uterine Corpus Endometrial Serous Carcinoma	ENDOMETRIAL_CARCINOMA
UTE	Uterine other	ENDOMETRIAL_CARCINOMA
AML	Acute myeloid Leukaemia	HAEMONC
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma	HAEMONC
HAEM	Haematological malignancy other	HAEMONC
LYMP	Lymphoid	HAEMONC
MYEL	Myeloid	HAEMONC
CHOL	Cholangiocarcinoma	HEPATOPANCREATOBILIARY
HPB	Hepatopancreatobiliary other	HEPATOPANCREATOBILIARY
LIHC	Liver hepatocellular carcinoma	HEPATOPANCREATOBILIARY
PAAD	Pancreatic adenocarcinoma	HEPATOPANCREATOBILIARY
LCNEC	Lung neuroendocrine	LUNG
LUAD	Lung adenocarcinoma	LUNG
LUG	Lung other	LUNG
LUSC	Lung squamous cell carcinoma	LUNG
MESO	Mesothelioma	LUNG
THYM	Thymoma	LUNG
SKCM	Skin Cutaneous Melanoma	MALIGNANT_MELANOMA
UVM	Uveal Melanoma	MALIGNANT_MELANOMA
HAN	Head and Neck other	ORAL_OROPHARYNGEAL
HNSC	Head and Neck squamous cell carcinoma	ORAL_OROPHARYNGEAL
DERM	Other malignant neoplasms of skin	OTHER
EDOC	Ovarian endometrioid adenocarcinoma	OVARIAN
HGSOC	High grade Ovarian serous cystadenocarcinoma	OVARIAN
LGSOC	Low grade Ovarian serous carcinoma	OVARIAN
OVAR	Ovarian other	OVARIAN
PRAD	Prostate adenocarcinoma	PROSTATE
PROS	Prostate other	PROSTATE
KICH	Kidney Chromophobe	RENAL
KIRC	Kidney renal clear cell carcinoma	RENAL
KIRP	Kidney renal papillary cell carcinoma	RENAL
NEPH	Renal other	RENAL
SARC	Sarcoma	SARCOMA
TEST	Testicular other	TESTICULAR_GERM_CELL_TUMOURS
TGCT	Testicular Germ Cell Tumors	TESTICULAR_GERM_CELL_TUMOURS
EAC	Esophageal adenocarcinoma	UPPER_GASTROINTESTINAL
ESCA	Esophageal squamous cell carcinoma	UPPER_GASTROINTESTINAL
SMIN	Small Intestine Neoplasm	UPPER_GASTROINTESTINAL
STAD	Stomach adenocarcinoma	UPPER_GASTROINTESTINAL
INSIT	In-situ and pre-malignant lesions
Other	Other/ Unmatched/ NA