Cancer analysis histology and TGCA study¶
cancer_analysis table includes the results of bioinformatic sequence analysis, as well as categorisation of the cancer based on histology to TGCA studies.
The disease subtype attempts to classify tumours according to tumour type and was made at the time of registration. Thus, some participants may well have had their histological diagnosis adjusted over the course of treatment based on histology of resection specimens. This has lead to a need to provide additional histological data on cancer genomes alongside a consideration of updating the classification of tumours registered within the 100,000 Genome Project.
The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterised over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. In an attempt to continue to develop Genomics England cancer classification and to align with TCGA, we have now added histology from the Public Health England National Cancer Register and allocated a matched TCGA code where possible. The process to match vast volumes of data require clear processes and these have been documented below in the release note.
Overview and purpose¶
To classify tumours based on histology, we match details in the
cancer_analysis first with
av_tumour, which is considered the main source of truth for tumours, and then
match_rank criteria is as follows:
|5||There is no linkage between
Pipeline and logic¶
A disease_type_translationsfile created by domain experts consisting of four columns: Study.Abbreviation, Study.Name, gel_ca.disease_type, ICD10, ICD.O.3. Here Study.Abbreviation and Study.Name refer to the TCGA code and corresponding name. gel_ca.disease_type refers to the cancer_analysis disease type that this TCGA code falls under.
av_tumourLabKey table - This table is needed to find the relevant ICD10 and ICD.O.3 codes for a given participant in cancer_analysis. Based on
disease_type_translations, these ICD10 codes and ICD.O.3 codes are then mapped to TCGA codes. There may be multiple rows in
av_tumourfor a given participant, so the row with diagnosisdatebest closest to tumour_clinical_sample_time is chosen.
apcLabKey table - For any participants where there is no data for in
av_tumour, ICD10 codes are extracted from
apcand mapped to possible TCGA codes using 'disease_type_translations' file. Again, there may be multiple rows in
apcfor a given participant so the row with opdate_01 closest to tumour_clinical_sample_time is chosen.
- The most up-to-date
cancer_analysistable from LabKey.
- The participant list for the Data Release version at hand.
Prepare the data:¶
- Read in
cancer_analysisthe original cancer_analysis table and keep all the columns.
- Read in the whole
apc. The columns needed are as follows:
table field use av_tumour * participant_id
identifier av_tumour * site_coded_3char
cross-reference apc * participant_id
identifier apc * diag_01 cross-reference
apcbased on diag_01 code. Firstly select all rows that have a diag_01 code containing a 'C'. Then remove any trailing X | - | D | - | A characters. and convert the diag_01 codes to 3 character length. Now
apcand 'disease_type_translations' all have 3 character ICD10 codes.
Compare av_tumour and cancer analysis¶
cancer_analysison participant_id. This provides a table with all the cancer analysis columns plus diagnosisdatebest, site_coded_3char, histology_coded, histology_coded_desc from
av_tumour. However there are extra rows for given participants based on multiple rows in
av_tumourfor that participant.
- Create a new column named av_tum.date_difference representing the absolute difference in days between diagnosisdatebest and tumour_clinical_sample_time. Replace any null values with a placeholder value of 100000 days.
- For instances where we have multiple rows in
av_tumour, select the rows in the joined table with the minimum av_tum.date_difference. This provides a table with the original cancer_analysis data and any data from
av_tumourclosest to the tumour_clinical_sample_time. Refer to this joined table as 'ca_av_tum'.
Note: Not all the participants in cancer_analysis appear in
av_tumour, or there is no ICD10 code and histology_code in
av_tumour for some participants in cancer_analysis. These are identifiable by looking at null diagnosisdatebest in 'ca_av_tum' and these are the participants where additional information from
apc is required.
Map the ICD10 codes and ICD.O.3 codes from av_tumour to TCGA codes¶
Join the 'disease_type_translations' table with the 'ca_av_tum' on the following columns:
disease_type_translations columns ca_av_tum columns ICD10 site_coded_3char ICD.O.3 histology_coded
disease_type_translationsfile contains mappings to TCGA codes for many combinations of ICD10 and ICD.O.3 and thus mappings to corresponding
disease_types based on these TCGA codes. This means that given a participant in cancer_analysis, if there is data for that participant in
av_tumour, then the joined table will have corresponding TCGA code and the relevant disease_type based on this TCGA code. This acts as a point of comparison between the disease_type in
cancer_analysisand the additional data provided by
Temporarily create a
av_tum.gel_matchcolumn acknowledging whether the data in
av_tumouragrees with the data in cancer_analysis. Return TRUE if they are in agreement and FALSE otherwise. Note, null values will also return FALSE which is not a problem. This will later be used to aid the match_rank system.
Compare apc and cancer analysis¶
Repeat similar steps as above but for
apc instead of
- Create a new column named apc.date_difference representing the absolute difference in days between opdate_01 and tumour_clinical_sample_time. Replace null values with a placeholder value of 100000 days.
- For instances where there are multiple rows in
apcfor a participant in
cancer_analysis, select the rows in the joined table with minimum apc.date_difference. Refer to this joined table as 'ca_apc'.
Note: ICD10 codes from diag_01 in
apc is the only additional data that can be provided by
apc. There are some instances where no ICD.O.3 code is needed to determine a unique TCGA code however there are also ambiguous cases where one ICD10 code could map to multiple TCGA codes. In these instances, domain experts manually went through these cases and selected a unique TCGA code based on the information for that participant.
Map ICD10 codes from apc to TCGA codes¶
Reformat the 'disease_type_translations' table so that it contains all the possible TCGA codes for a given ICD10 code.
Join 'disease_type_translations' to 'ca_apc' on the ICD10 columns:
disease_type_translations column ca_apc column ICD10 diag_01
This results in a table with TCGA codes for cancer_analysis participants based on ICD10 codes in
Temporarily create an apc.gel_match column acknowledging whether the data in
apcagrees with the data in cancer_analysis. Return TRUE if they agree and FALSE otherwise.
Combine ca_apc and ca_av_tum and deduce match_rankings¶
Join ca_apc and ca_av_tum on all the columns in cancer_analysis and drop diagnosisdatebest and opdate_01. This results in a table with all the columns in cancer analysis and the following fields:
table field ca_av_tum * site_coded_3char
ca_apc * diag_01
.Study.Abbreviation, Study.Name, gel_ca.disease_type refers to the joined TCGA code, name and disease_type from the disease_type_translations file.
Now create new columns in the joined table taking the Study.Abbreviation from ca_av_tum and filling in any null values with the Study.Abbreviation from 'ca_apc'. This prioritises matches with
av_tumourand fills in any gaps with matches in
- Repeat the above for Study.Name and gel_ca.disease_type.
- Create a new column named match_rank. Fill it in based on the following conditions:
|1||if av_tum.gel_match == TRUE and apc.gel_match == TRUE||Information in
|2||Omitting all rows with match_rank 1:
if av_tum.gel_match == TRUE
|3||Omitting all rows with match_rank 1 and 2:
if av_tum.gel_ca.disease_type == apc.gel_ca.disease_type
|4||Omitting all rows with match_rank 1,2 and 3:
if disease_type == apc.gel_ca.disease_type
|5||Omit rows with match_rank 1,2,3 and 4 leaves data where there is no linkage between cancer_analysis, av_tumour and apc.||No linkage - either there is no data in av_tumour or apc, or there is no agreement between all 3.|
Drop unneeded columns leaving the following schema¶
- all columns
|Study Abbreviation||Study Name||GEL Disease type|
|LGG||Brain Lower Grade Glioma||ADULT_GLIOMA|
|BLCA||Bladder Urothelial Carcinoma||BLADDER|
|BRCA||Breast invasive carcinoma||BREAST|
|GINET||Colorectal and Upper GI neuroendocrine||COLORECTAL|
|UCEC||Uterine Corpus Endometrial Carcinoma||ENDOMETRIAL_CARCINOMA|
|UCESC||Uterine Corpus Endometrial Serous Carcinoma||ENDOMETRIAL_CARCINOMA|
|AML||Acute myeloid Leukaemia||HAEMONC|
|DLBC||Lymphoid Neoplasm Diffuse Large B-cell Lymphoma||HAEMONC|
|HAEM||Haematological malignancy other||HAEMONC|
|LIHC||Liver hepatocellular carcinoma||HEPATOPANCREATOBILIARY|
|LUSC||Lung squamous cell carcinoma||LUNG|
|SKCM||Skin Cutaneous Melanoma||MALIGNANT_MELANOMA|
|HAN||Head and Neck other||ORAL_OROPHARYNGEAL|
|HNSC||Head and Neck squamous cell carcinoma||ORAL_OROPHARYNGEAL|
|DERM||Other malignant neoplasms of skin||OTHER|
|EDOC||Ovarian endometrioid adenocarcinoma||OVARIAN|
|HGSOC||High grade Ovarian serous cystadenocarcinoma||OVARIAN|
|LGSOC||Low grade Ovarian serous carcinoma||OVARIAN|
|KIRC||Kidney renal clear cell carcinoma||RENAL|
|KIRP||Kidney renal papillary cell carcinoma||RENAL|
|TGCT||Testicular Germ Cell Tumors||TESTICULAR_GERM_CELL_TUMOURS|
|ESCA||Esophageal squamous cell carcinoma||UPPER_GASTROINTESTINAL|
|SMIN||Small Intestine Neoplasm||UPPER_GASTROINTESTINAL|
|INSIT||In-situ and pre-malignant lesions|
|Other||Other/ Unmatched/ NA|