100kGP General clinical data¶
A number of tables in LabKey are common to all participants, both cancer and rare disease participants. These cover data about the participants themselves, the samples, results of bioinformatic analyses and participant medical history.
CloudOS files
Names of the equivalent file in CloudOS are stated after the table name in brackets.
Primary and secondary data tables
Primary clinical data were collected when participants were enrolled in the programme; tables are tagged with .
Secondary clinical data were obtained from third parties such as NHSE; tables are tagged with .
Participant information tables¶
The participant_summary
table contains data for all rare disease and cancer participants, both relatives and probands. The table was generated using the following tables/calculations:
field | origin table(s) |
---|---|
participant_stated_gender | participant |
participant_karyotyped_sex | rare_disease_analysis, cancer_analysis |
participant_phenotyped_sex | participant |
yob | participant |
genotyped_ancestry | aggregate_gvcf_sample_stats (filtered on pred_ancestry value >= 0.8) |
predominant_ancestry | aggregate_gvcf_sample_stats |
predominant_ancestry_value | aggregate_gvcf_sample_stats |
programme | participant |
participant_type | participant |
affection_status | rare_disease_interpreted |
date_of_consent | participant |
cancer_diagnosis_date | cancer_staging_consolidated (calculated MINDATE(diagnosis_date) group by participant_id) |
cancer_diagnosis_age | calculated (DATEDIF(yob, cancer_diagnosis_date, year)) |
cancer_study_name | cancer_analysis |
cancer_study_abbreviation | cancer_analysis |
cancer_disease_type | cancer_participant_disease |
cancer_disease_sub_type | cancer_participant_disease |
disease_diagnosis_date | rare_diseases_participant_disease |
disease_diagnosis_age | calculated (DATEDIF(yob, disease_diagnosis_date, year)) |
normalised_disease_group | rare_diseases_participant_disease |
normalised_disease_sub_group | rare_diseases_participant_disease |
normalised_specific_disease | rare_diseases_participant_disease |
hpo_term | rare_diseases_participant_phenotype (filtered for participants where hpo_present = 'yes') |
death_date | death_details, mortality, rare_diseases_pedigree_member |
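The two calculated fields in the table above (cancer_diagnosis_date and cancer_diagnosis_age) can be sketched as follows. This is a minimal illustration, not the pipeline used to build participant_summary: the rows are invented, the in-memory dictionaries stand in for the LabKey tables, and reading DATEDIF(yob, cancer_diagnosis_date, year) as a plain difference of calendar years is an assumption.

```python
from datetime import date

# Hypothetical rows standing in for cancer_staging_consolidated.
staging_rows = [
    {"participant_id": 1, "diagnosis_date": date(2015, 6, 1)},
    {"participant_id": 1, "diagnosis_date": date(2013, 2, 14)},
    {"participant_id": 2, "diagnosis_date": date(2016, 9, 30)},
]
# Hypothetical year-of-birth lookup standing in for participant.yob.
yob_by_participant = {1: 1960, 2: 1975}


def earliest_diagnosis_dates(rows):
    """MINDATE(diagnosis_date) grouped by participant_id."""
    earliest = {}
    for row in rows:
        pid, d = row["participant_id"], row["diagnosis_date"]
        if d is None:
            continue  # ignore rows without a recorded date
        if pid not in earliest or d < earliest[pid]:
            earliest[pid] = d
    return earliest


def diagnosis_age(yob, diagnosis_date):
    """Calendar-year difference (assumed meaning of DATEDIF(..., year))."""
    return diagnosis_date.year - yob


cancer_diagnosis_date = earliest_diagnosis_dates(staging_rows)
cancer_diagnosis_age = {
    pid: diagnosis_age(yob_by_participant[pid], d)
    for pid, d in cancer_diagnosis_date.items()
}
```

Because only the year of birth is available, an age computed this way can differ by one year from the participant's true age at diagnosis.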
- participant (gel_participant_100k.tsv) contains demographics (such as relatives or ethnicity); points of contact with the Project (e.g. handling Genomic Medicine Centre or Trust); and a record of the status of their clinical review. Note that some participants in this table may not have sequence data available yet.
- death_details (gel_gmc_death_details_100k.tsv) contains participant deaths submitted by GMCs, likely less complete than the data collected by ONS and NHSE.
- domain_assignment (gel_domain_assignment_100k.tsv) contains data describing the disease type to which they were recruited; the disease panel applied to their genome; the GECIP domain to which their genome has been assigned for the purposes of administering the GECIP publication moratorium; as well as the end date of the GECIP moratorium associated with their genome(s).
Sampling tables¶
Biological sampling tables¶
- clinic_sample (gel_clinic_sample_100k.tsv) describes the taking and handling of participant samples at the Genomic Medicine Centres, i.e. in the clinic, as well as the type of samples obtained. Because of the complexities of handling and managing tumour tissue samples in a clinical setting, there are many fields that are cancer-specific.
- clinic_sample_quality_check_result (gel_clinic_sample_qc_results_100k.tsv) describes the quality control of obtaining and handling participant samples at the Genomic Medicine Centres, i.e. in the clinic.
- laboratory_sample (gel_laboratory_sample_100k.tsv) describes the handling of samples at the biorepository and in preparation for sequencing, as well as the type of biological sample.
- laboratory_sample_omics_availability (gel_laboratory_sample_omics_100k.tsv) contains information on other biological samples that are available in our biobank for our participants as of the latest data release. Please note that these samples have been neither sequenced nor analysed by Genomics England.
Sequencing data tables¶
- plated_sample (gel_plated_sample_100k.tsv) contains, for each sequenced sample, the plate key and plate id, along with the Illumina QC date and status and a few other QC fields.
Results of bioinformatic analysis¶
These tables contain data from and information about Genomics England interpretation pipelines.
- aggregate_gvcf_sample_stats (gel_rare_disease_and_germline_genomic_variant_call_format_sample_statistics_100k.tsv) contains the samples that have been used to create the aggregate vcf files (/gel_data_resources/main_programme/aggregated_illumina_gvcf/GRCH38/20190228/) and their QC metrics. These files contain the aggregated variant calls.
- genome_file_paths_and_types contains the folder location for the bam and vcf files for each participant.
- sequencing_report (gel_sequencing_report_100k.tsv) contains data describing the sequencing of their genome(s) and associated output, as well as the sample type that the sequence is from, e.g. rare disease germline, cancer somatic, etc.
- panels_applied (gel_panels_applied_100k.tsv) contains the name and version of the panel(s) that was applied to their genome.
Participant medical history¶
Secondary Clinical Data is available for 100kGP participants with current consent. Clinical data is not available for participants that have withdrawn from the 100kGP or were otherwise ineligible. Up to Data Release V14 any participant on child consent who turned 16 without reconsenting as an adult was deemed ineligible and removed from the release. For Data Release V14 a decision was made to reinclude these participants with a caveat that where a participant was consented as a child but turned 16 without re-consenting as an adult, only data collected under that consent (i.e. all that collected prior to their 16th birthday) is released.
Some participants' secondary data records may have a sudden end point that does not correspond to an end of treatment. For participants who were consented as children and turned 16 without re-consenting, the column secondary_data_received_until_year in the table participant contains the year in which they turned 16; for all other participants this column is null.
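When working with time-to-event data, an abrupt end to a participant's records is better treated as the end of data collection than as the end of care. A minimal sketch of that idea follows; the function name and the release_year parameter are hypothetical, and only secondary_data_received_until_year comes from the participant table.

```python
from typing import Optional


def last_covered_year(secondary_data_received_until_year: Optional[int],
                      release_year: int) -> int:
    """Last calendar year for which secondary data can be expected.

    release_year is a hypothetical parameter: the last year covered by the
    data release being analysed.
    """
    if secondary_data_received_until_year is None:
        # Null means no consent-related cut-off applies to this participant.
        return release_year
    return min(secondary_data_received_until_year, release_year)
```

For example, last_covered_year(2019, 2023) gives 2019, whereas a participant with a null value is covered up to the release year.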
Hospital Episodes Statistics from NHSE¶
Hospital Episodes Statistics (HES) contain details of all admissions, outpatient appointments, critical care and A&E attendances at NHS hospitals in England. Each record is collected during a patient's time in hospital and is submitted to allow hospitals to be paid for the care they deliver. HES data are designed to enable secondary use of these administrative data, that is, use for non-clinical purposes.
It is a records-based system that covers all NHS trusts in England, including acute hospitals, primary care trusts and mental health trusts. HES information is stored as a large collection of separate records and Genomics England receives regular partial exports of HES data held for each of the participants within the 100,000 Genomes Project, which are linked with their Participant ID. HES data are presented in LabKey as separate datasets:
- hes_ae (nhs_d_hospital_episodes_statistics_accident_and_emergency_100k.tsv; accident and emergency) contains historic records of A&E attendances of Genomics England main programme participants.
- hes_apc (nhs_d_hospital_episodes_statistics_admitted_patient_care_100k.tsv; admitted patient care) contains historic records of admissions into secondary care of Genomics England main programme participants.
- hes_cc (nhs_d_hospital_episodes_statistics_critical_care_100k.tsv; critical care) contains historic records of admissions into critical care of Genomics England main programme participants.
- hes_op (nhs_d_hospital_episodes_statistics_outpatient_100k.tsv; outpatient) contains historic records of outpatient attendances of Genomics England main programme participants.
- ecds (nhs_d_emergency_care_dataset_100k.tsv) is the main dataset of urgent and emergency care of Genomics England main programme participants. It expands hes_ae and will replace it entirely in the future.
The HES data are presented in LabKey with each row representing a separate period of care for that participant. Therefore, each participant may have one or more rows of data. Often there will be empty fields, due to the way the data is structured.
Some data points, such as diagnoses and treatments, are split across multiple columns since there can be multiple entries per visit. There are also columns that concatenate these values together, making them easier to search.
Concatenated columns available
The concatenated columns in each of the tables are shown in the table below:
Table | Concatenated Column Name | Source Columns |
---|---|---|
ecds | care_professional_tier_all | care_professional_tier_01 - care_professional_tier_10 |
ecds | classification_all | classification_01 - classification_04 |
ecds | comorbidities_all | comorbidities_01 - comorbidities_10 |
ecds | diagnosis_code_all | diagnosis_code_01 - diagnosis_code_12 |
ecds | diagnosis_qualifier_all | diagnosis_qualifier_01 - diagnosis_qualifier_12 |
ecds | drug_alcohol_code_all | drug_alcohol_code_01 - drug_alcohol_code_04 |
ecds | investigation_code_all | investigation_code_01 - investigation_code_12 |
ecds | treatment_code_all | treatment_code_01 - treatment_code_12 |
hes_apc | acpdisp_all | acpdisp_1 - acpdisp_9 |
hes_apc | acpdqind_all | acpdqind_1 - acpdqind_9 |
hes_apc | acploc_all | acploc_1 - acploc_9 |
hes_apc | acpout_all | acpout_1 - acpout_9 |
hes_apc | acpsour_all | acpsour_1 - acpsour_9 |
hes_apc | acpspef_all | acpspef_1 - acpspef_9 |
hes_apc | diag_all | diag_01 - diag_20 |
hes_apc | opertn_all | opertn_01 - opertn_24 |
hes_ae | diag_all | diag_01 - diag_12 |
hes_ae | diag2_all | diag2_01 - diag2_12 |
hes_ae | diaga_all | diaga_01 - diaga_12 |
hes_ae | diags_all | diags_01 - diags_12 |
hes_ae | invest_all | invest_01 - invest_12 |
hes_ae | invest2_all | invest2_01 - invest2_12 |
hes_ae | treat2_all | treat2_01 - treat2_12 |
hes_ae | treat_all | treat_01 - treat_12 |
hes_op | diag_all | diag_01 - diag_12 |
hes_op | opertn_all | opertn_01 - opertn_24 |
mortality | icd10_multiple_cause_all | icd10_multiple_cause_01 - icd10_multiple_cause_15 |
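To illustrate how a concatenated column relates to its numbered source columns, the sketch below builds a diag_all-style value from a row and searches it for a code prefix. It is an assumption-laden example: the '|' separator, the helper names and the sample row are all hypothetical, so check the separator actually used in the table you are querying.

```python
def source_columns(prefix, last, width=2):
    """Names such as diag_01 ... diag_20 (width=2) or acpdisp_1 ... acpdisp_9 (width=1)."""
    return [f"{prefix}_{i:0{width}d}" for i in range(1, last + 1)]


def concatenate(row, columns, sep="|"):
    """Join the non-empty source values; the '|' separator is an assumption."""
    return sep.join(row[c] for c in columns if row.get(c))


def has_code(concatenated, code_prefix, sep="|"):
    """True if any entry in the concatenated value starts with code_prefix."""
    return any(v.startswith(code_prefix) for v in concatenated.split(sep) if v)


# Hypothetical hes_apc-like row with two populated diagnosis columns.
row = {"diag_01": "C509", "diag_02": "I10", "diag_03": ""}
diag_all = concatenate(row, source_columns("diag", 20))
```

Searching the single concatenated value avoids repeating the same test across up to 24 separate columns.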
Diagnosis and treatment codes
ICD-10¶
ICD-10 is a classification of diseases that allows systematic recording, analysis, interpretation and comparison of mortality and morbidity data. It is the international standard diagnostic classification for all general epidemiological and many health-management purposes. Although the ICD is primarily designed for the classification of diseases and injuries with a formal diagnosis, not every problem or reason for coming into contact with health services can be categorised in this way.
ICD-10 codes must be used in the manner set forth in Volume 2: Instruction Manual of the International Statistical Classification of Diseases and Related Health Problems, Tenth Revision. You are responsible for ensuring that the codes are properly used in this manner.
For more information on ICD-10, please see the 'International statistical classification of diseases and related health problems (ICD-10)' document.
ICD codes and code descriptions are deposited in the Research Environment under the folder: /gel_data_resources/licenced_resources/ICD10
ICD-O-3¶
The International Classification of Diseases for Oncology (ICD-O) is internationally recognised as the definitive classification of neoplasms. It is used by cancer registries throughout the world to record incidence of malignancy and survival rates, and the data produced are used to inform cancer control, research activity, treatment planning and health economics. The classification of neoplasms used in ICD-O links closely to the definitions of neoplasms used in the WHO/IARC Classification of Tumours series, which are compiled by consensus groups of international experts and, as such, the classification is underpinned by the highest level of scientific evidence and opinion.
ICD-O consists of two axes (or coding systems), which together describe the tumour:
- the topographical code, which describes the anatomical site of origin (or organ system) of the tumour
- the morphological code, which describes the cell type (or histology) of the tumour, together with the behaviour (malignant or benign).
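As an illustration of the morphological axis, ICD-O-3 morphology codes are conventionally written as a four-digit histology followed by a slash and a one-digit behaviour code, for example 8500/3. The sketch below splits such a code; it assumes that written form, and how the codes are actually stored in a given table (with or without the slash, for instance) is not stated here and should be checked first.

```python
import re
from typing import NamedTuple

# Behaviour digits as defined by ICD-O-3.
BEHAVIOUR = {
    "0": "benign",
    "1": "uncertain whether benign or malignant",
    "2": "in situ",
    "3": "malignant, primary site",
    "6": "malignant, metastatic site",
    "9": "malignant, uncertain whether primary or metastatic",
}


class Morphology(NamedTuple):
    histology: str
    behaviour_code: str
    behaviour: str


def parse_morphology(code: str) -> Morphology:
    """Split a morphology code of the assumed form 'NNNN/B'."""
    match = re.fullmatch(r"(\d{4})/(\d)", code.strip())
    if match is None:
        raise ValueError(f"not an ICD-O-3 morphology code: {code!r}")
    histology, behaviour_code = match.groups()
    return Morphology(histology, behaviour_code,
                      BEHAVIOUR.get(behaviour_code, "unknown"))
```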
SNOMED¶
SNOMED was started in 1965 as a Systematised Nomenclature of Pathology (SNOP) and was further developed into a logic-based health care terminology. SNOMED CT was created in 1999 by the merger, expansion and restructuring of two large-scale terminologies: SNOMED Reference Terminology (SNOMED RT) and the Clinical Terms Version 3 (CTV3) (formerly known as the Read codes), developed by the NHS.
Other medical history tables¶
- cen (nhs_d_cohort_event_notification_100k.tsv; cohort event notification) indicates, for Genomics England main programme participants, whether the participant has a deceased, cancelled cypher or other event related to programme participation.
- did (nhs_d_diagnostic_imaging_metadata_100k.tsv) contains historic diagnostic imaging records of Genomics England main programme participants.
- did_bridge (nhs_d_diagnostic_imaging_linkage_100k.tsv) is the linking file of participants to DID records.
- ons (office_of_national_statistics_mortality_100k.tsv) lists the Office of National Statistics' cause of death records for the Genomics England main programme participants.
Mental health data¶
Mental Health Datasets contain historic data on patients receiving care in NHS specialist mental health services. Note that mental health (MH) data is split into mental health minimum data (mhmd_), mental health learning disabilities dataset (mhldds_) and mental health services dataset (mhsds_).
- mhmd_* (Mental Health Minimum Dataset) - 2011-2014
- mhldds_* (Mental Health Learning Disabilities Dataset) - 2014-2016
- mhsds_* (Mental Health Services Dataset) - 2016 onwards
MHMD and MHLDDS each consist of three tables (event, record and episode), covering the periods 2011-2014 and 2014-2016 respectively. MHSDS is based on a new data model and consists of 35 tables covering the period 2016-2019. Thus, events that happened until March 2014 are found in mhmd_*, those between April 2014 and April 2016 in mhldds_*, and those after April 2016 in mhsds_*.
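The date ranges above can be turned into a small lookup that tells you which family of tables to query for a given event date. This is a sketch only: the function name is hypothetical, and treating 1 April 2014 and 1 April 2016 as the exact switch-over dates is an assumption, since the text gives the boundaries only to the month.

```python
from datetime import date


def mental_health_table_prefix(event_date: date) -> str:
    """Return the table prefix expected to hold an event on event_date."""
    if event_date < date(2014, 4, 1):   # until March 2014
        return "mhmd_"
    if event_date < date(2016, 4, 1):   # April 2014 up to the 2016 switch-over
        return "mhldds_"
    return "mhsds_"                     # 2016 onwards
```

For events close to a boundary it is safer to query both neighbouring datasets.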
To make the new MHSDS dataset more accessible we have also generated four curated overview tables as well as a flag table showing which participants have data in individual MHSDS tables.
Curated mental health tables¶
The curated tables have been generated by joining the main columns of interest from the individual MHSDS tables into the following four overview tables:
- mhsds_curated_participant: Overview of general participant information; demographics, death details, GP registrations and psychosis indicators, as well as details of any care plans created for a participant.
- mhsds_curated_community: Overview of community (outpatient) care. This includes details on referrals, discharge agreements and care contacts with associated care activities.
- mhsds_curated_inpatient: Overview of inpatient care. This includes details of hospital spells, ward stays, delayed discharge periods and associated clinicians and care professionals.
- mhsds_curated_assessment_diagnoses_and_cluster: Overview of scored assessments and clustering tool assessments completed, patient diagnoses and allocated care clusters.
The below diagram shows which tables feed into the overview tables:
These curated tables give an overview of the data available across the entire dataset, giving a subset of the main data. Some rows have been removed from the curated tables due to data quality issues. For master_patient_index results in the Participant curated table, only the most recent data for each participant has been included.
Where column names are shared between datasets in the curated tables (e.g. codeproc), the column names have been prefixed with the initials of their dataset name, e.g. those with shared column names from mhsds_indirect_activity have been prefixed with ia_.
In the data dictionary, you will see the column 'Column Origin (Curated only)', which links together the curated columns in the format
An additional flag table has also been generated to provide an easy way to view which participants are present in which of the individual tables.
mhsds_dataset_flags contains, for each participant in MHSDS, a flag showing which MHSDS tables they appear in. This only applies to the individual tables, not the curated tables.
Each row represents a participant with true/false Boolean columns for each of the 35 mhsds tables. If a participant is not present in this table then they have no data available in the MHSDS dataset.
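A row of this flag table can be reduced to the list of tables worth querying for that participant. The sketch below assumes each Boolean column is named after the table it describes and that the identifier column is called participant_id; both are assumptions, and the sample row is invented.

```python
def tables_with_data(flag_row, id_column="participant_id"):
    """Names of the flag columns set to True in one mhsds_dataset_flags row."""
    return sorted(
        name
        for name, present in flag_row.items()
        if name != id_column and present is True
    )


# Hypothetical row: only three of the 35 flag columns are shown.
example_row = {
    "participant_id": 123,
    "mhsds_care_contact": True,
    "mhsds_ward_stay": False,
    "mhsds_referral_to_treatment": True,
}
available = tables_with_data(example_row)
```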
mhmd (2011-2014) and mhldds (2014-2016) individual tables¶
- mh_bridge (nhs_d_mental_health_linkage_100k.tsv) is the linking file of participants to MHMD records.
- mhldds_episode (nhs_d_mental_health_learning_and_disability_data_set_episodes_100k.tsv) contains historic records of mental health (MH) related admissions of Genomics England main programme participants. Episode and event tables link to the records table via mhm_mhmds_spell_id.
- mhldds_event (nhs_d_mental_health_learning_and_disability_data_set_events_100k.tsv) contains historic records of MH related admissions of Genomics England main programme participants. Episode and event tables link to the records table via mhm_mhmds_spell_id.
- mhldds_record (nhs_d_mental_health_learning_and_disability_data_set_records_100k.tsv) contains historic records of MH related admissions of Genomics England main programme participants. One record per spell per patient in a provider.
- mhmd_v4_episode (nhs_d_mental_health_minimum_dataset_episodes_100k.tsv) contains historic records of MH related admissions of Genomics England main programme participants. Episode and event tables link to the records table via mhm_mhmds_spell_id.
- mhmd_v4_event (nhs_d_mental_health_minimum_dataset_events_100k.tsv) contains historic records of MH related admissions of Genomics England main programme participants. Episode and event tables link to the records table via mhm_mhmds_spell_id.
- mhmd_v4_record (nhs_d_mental_health_minimum_dataset_records_100k.tsv) contains historic records of MH related admissions of Genomics England main programme participants. One record per spell per patient in a provider.
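The link between the record, episode and event tables can be sketched as a simple in-memory join on mhm_mhmds_spell_id. Only that column name comes from the descriptions above; the helper names and the sample rows (including episode_no and event_no) are hypothetical.

```python
from collections import defaultdict

SPELL_KEY = "mhm_mhmds_spell_id"  # linking column named in the table descriptions


def group_by_spell(rows):
    """Index episode or event rows by their spell identifier."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[SPELL_KEY]].append(row)
    return grouped


def attach_children(records, episodes, events):
    """Return one dict per record with its matching episodes and events."""
    episodes_by_spell = group_by_spell(episodes)
    events_by_spell = group_by_spell(events)
    return [
        {
            "record": record,
            "episodes": episodes_by_spell.get(record[SPELL_KEY], []),
            "events": events_by_spell.get(record[SPELL_KEY], []),
        }
        for record in records
    ]


# Hypothetical rows; every column other than the spell id is invented.
records = [{SPELL_KEY: "S1"}, {SPELL_KEY: "S2"}]
episodes = [{SPELL_KEY: "S1", "episode_no": 1}, {SPELL_KEY: "S1", "episode_no": 2}]
events = [{SPELL_KEY: "S2", "event_no": 1}]
linked = attach_children(records, episodes, events)
```

Because the record tables hold one row per spell, each record can carry several episodes and events, so the join is one-to-many.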
mhsds (>2016) individual tables¶
- mhsds_bridge: Linking file of participant_id to mhsds_id (known as uniqmhsdspersid in individual tables).
- mhsds_master_patient_index: Provides patient information, demographics and death details for participants present in the MHSDS dataset. One record per participant per recording period (2016/17, 2017/18, 2018/19), giving a maximum of 3 records per participant.
- mhsds_gp_practice_registration: Carries details of GP Practice registrations for participants present in the MHSDS dataset. One record per change of GP practice registration.
- mhsds_patient_indicators: Carries details of specific indicators relating to the patient, including psychosis.
- mhsds_care_coordinator: Carries details of the mental health care coordinator assigned to a patient. One record per assignment.
- mhsds_care_plan_type: Carries details of Care Plans created for a patient by their responsible organisation. One record per care plan created for the patient.
- mhsds_crisis_plan: The predecessor (pre 2018) to Care Plan Type. Carries details of Crisis Plans created for a patient by their responsible organisation. One record per crisis plan created for the patient. Other care plan types are not included.
- mhsds_care_plan_agreement: Carries details of any agreements to Care Plans by a patient, team or organisation. One record per care plan agreement.
- mhsds_service_or_team_referral: Carries details of the referral that a patient is subject to. This is an instance where a patient is referred to specialist care. One record for each referral.
- mhsds_service_or_team_type_referred_to: Carries details of the service or team that a patient is referred to. One record for each service or team referred to.
- mhsds_other_reason_for_referral: Carries additional details about why a patient has been referred to a specific service. One record per additional referral.
- mhsds_referral_to_treatment: Carries the referral to treatment details for the patient's referral. One record for each Referral To Treatment period.
- mhsds_onward_referral: Carries details about any onward referral of a patient. One record per onward referral.
- mhsds_discharge_plan_agreement: Carries details about any agreements to a Discharge Plan by a person, team or organisation. One record per agreement of a discharge plan.
- mhsds_care_contact: Carries details about any contacts with a patient which have taken place as part of a referral. One record per care contact.
- mhsds_care_activity: Carries details about any Care Activity undertaken at a Care Contact. One record per care activity.
- mhsds_other_in_attendance: Carries details about any other people in attendance at a Care Contact. One record per other person in attendance at a Care Contact.
- mhsds_indirect_activity: Carries details of indirect activity which takes place as a result of the referral. One record for each instance of indirect activity taking place.
- mhsds_responsible_clinician_assignment: Carries details of the assignment of a Mental Health Responsible Clinician to the patient. One record per assigned Mental Health Responsible Clinician.
- mhsds_hospital_provider_spell: Carries details of each Hospital Provider Spell for a patient. One record per hospital provider spell. A spell is a continuous period of inpatient care under a single Hospital Provider, starting with a hospital admission and ending with a discharge from hospital.
- mhsds_ward_stay: Carries additional details of Ward Stays which occurred during a Hospital Provider Spell for the patient. One record per ward stay.
- mhsds_assigned_care_professional: Carries details of the Care Professional assigned responsibility for the care of the patient. One record per care professional admitted care episode.
- mhsds_delayed_discharge: Carries details of the patient's Mental Health Delayed Discharge Periods which occurred during a Hospital Provider Spell. One record per instance of a patient being subject to a mental health delayed discharge period.
- mhsds_hospital_provider_spell_commissioner: Carries details of each Commissioner Assignment Period during a Hospital Provider Spell. One record per commissioner assignment period.
- mhsds_medical_history_previous_diagnosis: Carries details of any previous diagnoses for a patient which are stated by the patient or recorded in medical notes. These do not necessarily have to have been diagnosed by the organisation submitting the data. One record per previous diagnosis.
- mhsds_provisional_diagnosis: Carries details of a provisional diagnosis recorded for a patient. One record per provisional diagnosis.
- mhsds_primary_diagnosis: Carries details of the primary diagnosis recorded for the patient. Only one record is permitted for the primary diagnosis per patient.
- mhsds_secondary_diagnosis: Carries details of a secondary diagnosis recorded for a patient. One record for each secondary diagnosis.
- mhsds_coded_scored_assessment_referral: Carries details of scored assessments that are issued and completed as part of a referral to a mental health service, but do not take place at a specific contact. One record per coded scored assessment question or dimension captured outside of a Care Contact.
- mhsds_coded_scored_assessment_act: Carries details of scored assessments that are issued and completed as part of a specific Care Activity. One record per coded scored assessment question or dimension captured as part of a specific Care Activity.
- mhsds_coded_scored_assessment_cont: Replaced by coded_score_assessment_act in 2018. Carries details of scored assessments that are issued and completed as part of a specific Care Activity. One record per coded scored assessment question or dimension captured as part of a specific Care Activity.
- mhsds_care_programme_approach_care_episode: Carries details of the periods of time the patient spent on Care Programme Approach. One record per CPA Care Episode.
- mhsds_care_programme_approach_review: Carries details of the Care Programme Approach (CPA) reviews undertaken for the patient. One record permitted for the most recent CPA Review that has taken place.
- mhsds_clustering_tool_assessment: Carries details of all clustering tool assessments for all patients. One record per Clustering Tool Assessment.
- mhsds_coded_score_assessment_clustering_tool: Carries details of scored assessments that are issued and completed as part of a Clustering Tool assessment. One record per coded scored assessment question or dimension captured as part of a Clustering Tool assessment.
- mhsds_care_cluster: Carries details of the Care Cluster resulting from a clustering tool assessment. One record per period of time that a patient was allocated to a Care Cluster.
Secondary Data - COVID¶
This data describes extracts from the NHSE microbiology results database, known as SGSS, following linkage to external cohort studies. Linkage is made by NHS number, and all positive and negative results are included.
Please be aware that:
- NHSE is integrating data from a large number of NHS laboratories and third party organisations, at a very rapid rate.
- Not all laboratories are reporting negative results.
- It is possible that duplicate entries may exist, because some laboratories' results may reach SGSS via several different routes.
- Results from NHS providers are all integrated at present.
- Data from the Milton Keynes Superlab and similar Academic/Industry partnerships have now been integrated. These initiatives are at present being used predominantly to test Health Care Workers. 'Healthcare Worker Testing' is a category in SGSS that is included in the result set. Note that this category automatically clears the inpatient indicator (as this category of tests is for hospital staff, not patients).
- As of 16th March 2020, when the UK entered the 'delay' phase of the outbreak, testing was largely restricted to those referred to hospital, who are likely to be on the severe end of the disease spectrum. Admission to hospital for infection control reasons alone has not been practiced in the delay phase. Therefore, positive results from those for whom there is evidence (from the microbiology record) of hospitalisation are likely to be derived from cases of clinically significant COVID disease. All tests with a sample date earlier than 16th March 2020 are excluded from the returned results.
- The SGSS database does not contain clinical information.
- The date of death is obtained by the Office for National Statistics linkage on the date of the extract.
- When we extract weekly data, we will re-extract all records linking to the NHS number list in the external cohort.
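Since duplicate entries may exist, it can be worth collapsing repeated results before counting tests per participant. The sketch below keeps the first occurrence of each combination of key fields; the column names participant_id, specimen_date and result are hypothetical and need to be replaced with the columns present in the extract, and deciding which fields define a duplicate is a judgement call.

```python
def drop_duplicate_results(rows, key_fields=("participant_id", "specimen_date", "result")):
    """Keep the first row for each combination of key_fields, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical rows: the first two differ only in the reporting route.
covid_rows = [
    {"participant_id": 1, "specimen_date": "2020-04-02", "result": "positive", "route": "A"},
    {"participant_id": 1, "specimen_date": "2020-04-02", "result": "positive", "route": "B"},
    {"participant_id": 2, "specimen_date": "2020-04-05", "result": "negative", "route": "A"},
]
deduplicated = drop_duplicate_results(covid_rows)
```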