Cancer analysis histology and TGCA study¶
The 100kGP cancer_analysis
table includes the results of bioinformatic sequence analysis, as well as categorisation of the cancer based on histology to TGCA studies.
The disease subtype attempts to classify tumours according to tumour type and was made at the time of registration. Thus, some participants may well have had their histological diagnosis adjusted over the course of treatment based on histology of resection specimens. This has lead to a need to provide additional histological data on cancer genomes alongside a consideration of updating the classification of tumours registered within the 100,000 Genome Project.
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. In an attempt to continue to develop Genomics England cancer classification and to align with TCGA, we have now added histology from the Public Health England National Cancer Register and allocated a matched TCGA code where possible. The process to match vast volumes of data require clear processes and these have been documented below in the release note.
Overview and purpose¶
To classify tumours based on histology, we match details in the cancer_analysis
first with av_tumour
, which is considered the main source of truth for tumours, and then hes_apc
. The match_rank
criteria is as follows:
match_rank |
meaning |
---|---|
1 | Information in cancer_analysis , av_tumour and apc all agree with one another. |
2 | Information in cancer_analysis and av_tumour agree |
3 | Information in av_tumour and apc agree |
4 | Information in cancer_analysis and apc agree |
5 | There is no linkage between cancer_analysis , av_tumour and apc |
Pipeline and logic¶
Prerequisites¶
A disease_type_translations
file created by domain experts consisting of four columns: Study.Abbreviation, Study.Name, gel_ca.disease_type, ICD10, ICD.O.3. Here Study.Abbreviation and Study.Name refer to the TCGA code and corresponding name. gel_ca.disease_type refers to the cancer_analysis disease type that this TCGA code falls under.- The
av_tumour
LabKey table - This table is needed to find the relevant ICD10 and ICD.O.3 codes for a given participant in cancer_analysis. Based ondisease_type_translations
, these ICD10 codes and ICD.O.3 codes are then mapped to TCGA codes. There may be multiple rows inav_tumour
for a given participant, so the row with diagnosisdatebest closest to tumour_clinical_sample_time is chosen. - The
apc
LabKey table - For any participants where there is no data for inav_tumour
, ICD10 codes are extracted fromapc
and mapped to possible TCGA codes using 'disease_type_translations' file. Again, there may be multiple rows inapc
for a given participant so the row with opdate_01 closest to tumour_clinical_sample_time is chosen. - The most up-to-date
cancer_analysis
table from LabKey. - The participant list for the Data Release version at hand.
Logic¶
Prepare the data:¶
- Read in
cancer_analysis
the original cancer_analysis table and keep all the columns. - Read in the whole
disease_type_translations
-
Read in
av_tumour
andapc
. The columns needed are as follows:table field use av_tumour * participant_id
* diagnosisdatebestidentifier av_tumour * site_coded_3char
* histology_coded
* histology_coded_desccross-reference apc * participant_id
* opdate_01identifier apc * diag_01 cross-reference -
Filter
apc
based on diag_01 code. Firstly select all rows that have a diag_01 code containing a 'C'. Then remove any trailing X | - | D | - | A characters. and convert the diag_01 codes to 3 character length. Nowav_tumour
,apc
and 'disease_type_translations' all have 3 character ICD10 codes.
Compare av_tumour and cancer analysis¶
- Join
av_tumour
tocancer_analysis
on participant_id. This provides a table with all the cancer analysis columns plus diagnosisdatebest, site_coded_3char, histology_coded, histology_coded_desc fromav_tumour
. However there are extra rows for given participants based on multiple rows inav_tumour
for that participant. - Create a new column named av_tum.date_difference representing the absolute difference in days between diagnosisdatebest and tumour_clinical_sample_time. Replace any null values with a placeholder value of 100000 days.
- For instances where we have multiple rows in
av_tumour
, select the rows in the joined table with the minimum av_tum.date_difference. This provides a table with the original cancer_analysis data and any data fromav_tumour
closest to the tumour_clinical_sample_time. Refer to this joined table as 'ca_av_tum'.
Note: Not all the participants in cancer_analysis appear in av_tumour
, or there is no ICD10 code and histology_code in av_tumour
for some participants in cancer_analysis. These are identifiable by looking at null diagnosisdatebest in 'ca_av_tum' and these are the participants where additional information from apc
is required.
Map the ICD10 codes and ICD.O.3 codes from av_tumour to TCGA codes¶
-
Join the 'disease_type_translations' table with the 'ca_av_tum' on the following columns:
disease_type_translations columns ca_av_tum columns ICD10 site_coded_3char ICD.O.3 histology_coded The
disease_type_translations
file contains mappings to TCGA codes for many combinations of ICD10 and ICD.O.3 and thus mappings to correspondingdisease_type
s based on these TCGA codes. This means that given a participant in cancer_analysis, if there is data for that participant inav_tumour
, then the joined table will have corresponding TCGA code and the relevant disease_type based on this TCGA code. This acts as a point of comparison between the disease_type incancer_analysis
and the additional data provided byav_tumour
. -
Temporarily create a
av_tum.gel_match
column acknowledging whether the data inav_tumour
agrees with the data in cancer_analysis. Return TRUE if they are in agreement and FALSE otherwise. Note, null values will also return FALSE which is not a problem. This will later be used to aid the match_rank system.
Compare apc and cancer analysis¶
Repeat similar steps as above but for apc
instead of av_tumour
.
- Join
apc
tocancer_analysis
on participant_id. - Create a new column named apc.date_difference representing the absolute difference in days between opdate_01 and tumour_clinical_sample_time. Replace null values with a placeholder value of 100000 days.
- For instances where there are multiple rows in
apc
for a participant incancer_analysis
, select the rows in the joined table with minimum apc.date_difference. Refer to this joined table as 'ca_apc'.
Note: ICD10 codes from diag_01 in apc
is the only additional data that can be provided by apc
. There are some instances where no ICD.O.3 code is needed to determine a unique TCGA code however there are also ambiguous cases where one ICD10 code could map to multiple TCGA codes. In these instances, domain experts manually went through these cases and selected a unique TCGA code based on the information for that participant.
Map ICD10 codes from apc to TCGA codes¶
-
Reformat the 'disease_type_translations' table so that it contains all the possible TCGA codes for a given ICD10 code.
-
Join 'disease_type_translations' to 'ca_apc' on the ICD10 columns:
disease_type_translations column ca_apc column ICD10 diag_01 This results in a table with TCGA codes for cancer_analysis participants based on ICD10 codes in
apc
. -
Temporarily create an apc.gel_match column acknowledging whether the data in
apc
agrees with the data in cancer_analysis. Return TRUE if they agree and FALSE otherwise.
Combine ca_apc and ca_av_tum and deduce match_rankings¶
-
Join ca_apc and ca_av_tum on all the columns in cancer_analysis and drop diagnosisdatebest and opdate_01. This results in a table with all the columns in cancer analysis and the following fields:
table field ca_av_tum * site_coded_3char
* histology_coded
* histology_coded_desc
* av_tum.Study.Abbreviation
* av_tum.Study.Name
* av_tum.gel_ca.disease_type
* av_tum.date_difference
* av_tum.gel_matchca_apc * diag_01
* apc.Study.Abbreviation
* apc.Study.Name
* apc.gel_ca.disease_type
* apc.date_difference
* apc.gel_matchWhere
.Study.Abbreviation, Study.Name, gel_ca.disease_type refers to the joined TCGA code, name and disease_type from the disease_type_translations file. -
Now create new columns in the joined table taking the Study.Abbreviation from ca_av_tum and filling in any null values with the Study.Abbreviation from 'ca_apc'. This prioritises matches with
av_tumour
and fills in any gaps with matches inapc
. - Repeat the above for Study.Name and gel_ca.disease_type.
- Create a new column named match_rank. Fill it in based on the following conditions:
match_rank | condition | meaning |
---|---|---|
1 | if av_tum.gel_match == TRUE and apc.gel_match == TRUE | Information in cancer_analysis , av_tumour and apc all agree with one another. |
2 | Omitting all rows with match_rank 1: if av_tum.gel_match == TRUE |
Information in cancer_analysis and av_tumour agree |
3 | Omitting all rows with match_rank 1 and 2: if av_tum.gel_ca.disease_type == apc.gel_ca.disease_type |
Information in av_tumour and apc agree |
4 | Omitting all rows with match_rank 1,2 and 3: if disease_type == apc.gel_ca.disease_type |
Information in cancer_analysis and apc agree |
5 | Omit rows with match_rank 1,2,3 and 4 leaves data where there is no linkage between cancer_analysis, av_tumour and apc. | No linkage - either there is no data in av_tumour or apc, or there is no agreement between all 3. |
Drop unneeded columns leaving the following schema¶
from cancer_analysis:
- all columns
from av_tumour:
- histology_coded
- histology_coded_desc
from disease_type_translations:
- Study.Name
- Study.Abbreviation
calculated columns:
- av_tum.date_difference
- apc.date_difference
- match_rank
Code lists¶
Study Abbreviation | Study Name | GEL Disease type |
---|---|---|
GBM | Glioblastoma multiforme | ADULT_GLIOMA |
LGG | Brain Lower Grade Glioma | ADULT_GLIOMA |
NEURO | Neurological other | ADULT_GLIOMA |
BLAD | Bladder other | BLADDER |
BLCA | Bladder Urothelial Carcinoma | BLADDER |
BRCA | Breast invasive carcinoma | BREAST |
COAD | Colon adenocarcinoma | COLORECTAL |
CRC | Colorectal other | COLORECTAL |
GINET | Colorectal and Upper GI neuroendocrine | COLORECTAL |
READ | Rectum adenocarcinoma | COLORECTAL |
ACC | Adrenocortical carcinoma | ENDOCRINE |
ENDO | Endocrine other | ENDOCRINE |
THCA | Thyroid carcinoma | ENDOCRINE |
UCEC | Uterine Corpus Endometrial Carcinoma | ENDOMETRIAL_CARCINOMA |
UCESC | Uterine Corpus Endometrial Serous Carcinoma | ENDOMETRIAL_CARCINOMA |
UTE | Uterine other | ENDOMETRIAL_CARCINOMA |
AML | Acute myeloid Leukaemia | HAEMONC |
DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | HAEMONC |
HAEM | Haematological malignancy other | HAEMONC |
LYMP | Lymphoid | HAEMONC |
MYEL | Myeloid | HAEMONC |
CHOL | Cholangiocarcinoma | HEPATOPANCREATOBILIARY |
HPB | Hepatopancreatobiliary other | HEPATOPANCREATOBILIARY |
LIHC | Liver hepatocellular carcinoma | HEPATOPANCREATOBILIARY |
PAAD | Pancreatic adenocarcinoma | HEPATOPANCREATOBILIARY |
LCNEC | Lung neuroendocrine | LUNG |
LUAD | Lung adenocarcinoma | LUNG |
LUG | Lung other | LUNG |
LUSC | Lung squamous cell carcinoma | LUNG |
MESO | Mesothelioma | LUNG |
THYM | Thymoma | LUNG |
SKCM | Skin Cutaneous Melanoma | MALIGNANT_MELANOMA |
UVM | Uveal Melanoma | MALIGNANT_MELANOMA |
HAN | Head and Neck other | ORAL_OROPHARYNGEAL |
HNSC | Head and Neck squamous cell carcinoma | ORAL_OROPHARYNGEAL |
DERM | Other malignant neoplasms of skin | OTHER |
EDOC | Ovarian endometrioid adenocarcinoma | OVARIAN |
HGSOC | High grade Ovarian serous cystadenocarcinoma | OVARIAN |
LGSOC | Low grade Ovarian serous carcinoma | OVARIAN |
OVAR | Ovarian other | OVARIAN |
PRAD | Prostate adenocarcinoma | PROSTATE |
PROS | Prostate other | PROSTATE |
KICH | Kidney Chromophobe | RENAL |
KIRC | Kidney renal clear cell carcinoma | RENAL |
KIRP | Kidney renal papillary cell carcinoma | RENAL |
NEPH | Renal other | RENAL |
SARC | Sarcoma | SARCOMA |
TEST | Testicular other | TESTICULAR_GERM_CELL_TUMOURS |
TGCT | Testicular Germ Cell Tumors | TESTICULAR_GERM_CELL_TUMOURS |
EAC | Esophageal adenocarcinoma | UPPER_GASTROINTESTINAL |
ESCA | Esophageal squamous cell carcinoma | UPPER_GASTROINTESTINAL |
SMIN | Small Intestine Neoplasm | UPPER_GASTROINTESTINAL |
STAD | Stomach adenocarcinoma | UPPER_GASTROINTESTINAL |
INSIT | In-situ and pre-malignant lesions | |
Other | Other/ Unmatched/ NA |