
NHS GMS data release change summary

nhs-gms-release_v4_2024-08-22

Changes to existing tables

cancer_analysis
Each participant has only one row per cancer case. GMS data release v4 adds a filter on referral status that excludes statuses that are not active. This fixes an issue in GMS data release v3 that caused duplicate tumour_uid values to be present.
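When moving an analysis from v3 to v4, it may be worth confirming that tumour_uid is unique in your own export of cancer_analysis. The following is a minimal sketch, assuming the rows have been loaded as dictionaries (for example with csv.DictReader). The function name and the sample values are hypothetical and are not part of the release.

```python
from collections import Counter


def duplicated_tumour_uids(rows):
    """Return the tumour_uid values that occur on more than one row, sorted."""
    counts = Counter(row["tumour_uid"] for row in rows)
    return sorted(uid for uid, n in counts.items() if n > 1)


# Hypothetical rows standing in for an export of cancer_analysis.
rows = [
    {"participant_id": "example_1", "tumour_uid": "t-001"},
    {"participant_id": "example_1", "tumour_uid": "t-001"},  # duplicate, as could occur in v3
    {"participant_id": "example_2", "tumour_uid": "t-002"},
]

print(duplicated_tumour_uids(rows))  # prints ['t-001']
```

An empty result means every tumour_uid appears on exactly one row.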

report_outcome_questionnaire
This table was previously called gmc_exit_questionnaire. It has been renamed to report_outcome_questionnaire to align more closely with what the questionnaires are called in GMS.
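Scripts that run against more than one release need to pick the table name that matches the release. The sketch below shows one way to do this, assuming the release is identified by its integer version number. The helper name is hypothetical.

```python
def questionnaire_table(release_version):
    """Return the questionnaire table name for an NHS GMS release number."""
    # The table was renamed in release v4; earlier releases use the old name.
    if release_version >= 4:
        return "report_outcome_questionnaire"
    return "gmc_exit_questionnaire"


print(questionnaire_table(3))  # prints gmc_exit_questionnaire
print(questionnaire_table(4))  # prints report_outcome_questionnaire
```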

LabKey UI datatype changes
There have been improvements to the datatypes in the LabKey UI for the following tables.

| Table | Field | Previous datatype | Updated datatype |
|---|---|---|---|
| sample | collection_date | varchar | timestamp (yyyy-MM-dd format) |
| sample | din_value_glh | integer | decimal |
| sample | percentage_dna_glh | integer | decimal |
| panels_applied | panel_identifier | integer | varchar |
| tiering_data | father_affected | boolean | varchar |
| tiering_data | mother_affected | boolean | varchar |
| exomiser | father_affected | boolean | varchar |
| exomiser | mother_affected | boolean | varchar |
| exomiser | poly_phen | varchar | decimal |
| exomiser | mutation_taster | varchar | decimal |
| exomiser | sift | varchar | decimal |
| av_patient | embarkation | boolean | varchar |
| sact | administration_date | varchar | timestamp (yyyy-MM-dd format) |
| sact | date_of_final_treatment | varchar | timestamp (yyyy-MM-dd format) |
| sact | chemo_radiation | varchar | boolean |
| sact | regimen_mod_stopped_early | varchar | boolean |
| sact | regimen_mod_time_delay | varchar | boolean |
| sact | start_date_of_cycle | varchar | timestamp (yyyy-MM-dd format) |
| sact | start_date_of_regimen | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| sact | date_decision_to_treat | varchar | timestamp (yyyy-MM-dd format) |
| rtds | proceduredate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| rtds | timeofexposure | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (HH:mm:ss format) |
| rtds | treatmentstartdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| rtds | earliestclinappropriatedate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| rtds | decisiontotreatdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| rtds | apptdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| av_treatment | eventdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| v_tumour | diagnosisdate1 | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| v_tumour | diagnosisdate2 | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| v_tumour | diagnosisdatebest | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| v_tumour | statusofregistration | boolean | varchar |
| v_tumour | breslow | varchar | decimal |
| v_tumour | first_hosp_date | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
| v_tumour | date_first_surgery | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
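
Several date fields above moved from a date-and-time display (yyyy-MM-dd HH:mm:ss) to a date-only display (yyyy-MM-dd). Code that parses exported values may therefore meet either form, depending on the release. The sketch below is one hedged way to accept both, assuming the values arrive as text. The function name is hypothetical, and the Python format codes are the equivalents of the patterns shown in the table.

```python
from datetime import date, datetime


def parse_release_date(value):
    """Parse 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss' text into a date."""
    # Try the date-only form (v4 display) first, then the older date-time form.
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {value!r}")


# Both forms resolve to the same calendar date.
print(parse_release_date("2024-08-22") == date(2024, 8, 22))           # prints True
print(parse_release_date("2024-08-22 00:00:00") == date(2024, 8, 22))  # prints True
```

Any time-of-day component is discarded, which matches the date-only datatype but would not be appropriate for a field such as timeofexposure.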

nhs-gms-release_v3_2024-03-18

New data sets

This release includes secondary clinical data, i.e. medical history, from the National Cancer Registration and Analysis Service (NCRAS). The information is provided in the following tables:

  • av_imd: income deprivation domain
  • av_patient: demographics from the Cancer Registration and information about death
  • av_rtd: routes to diagnosis
  • av_treatment: treatment received for each participant
  • av_tumour: medical information about the tumour
  • rtds: radiotherapy dataset
  • sact: systemic anti-cancer therapy

More information on these datasets can be found on the Cancer-specific clinical data page.

Changes to existing tables

participant

The participant table now contains two additional fields:

  • Category: this provides the category given to the referral (Cancer/Rare Diseases)
  • Referral id: this provides the id of the referral submitted to GMS

observation

The observation table now contains two additional fields:

  • Normalised Hpo Id
  • Normalised Hpo Term

gmc_exit_questionnaire

The Additional Comments and Publications columns in the gmc_exit_questionnaire table now contain information. Personally Identifiable Data (PID) has been masked by replacing it with '---'.

nhs-gms-release_v2_2023-02-28

Some tables were already present in the 100K data and therefore follow a similar format. Relative to the 100K format, the following changes apply to the similarly named tables in the NHS GMS data.

While various subtle changes have been made to the NHS GMS tables, we list some of the most important ones below. For example, with NHS GMS release v2 we have reintroduced the cancer_analysis table.

General Changes

  • This release sees the introduction of partial referrals. A partial referral is a referral for more than one participant in which only a proportion of the participants have consented to research. This may limit some of the data available for the participants within a partial referral who did consent to research. We are only releasing data for partial referral participants who did consent to research.
  • We have had to change our approach to encrypting participant, referral, and sample IDs. As a result, IDs unfortunately cannot be matched between NHS GMS release v1 and v2. As the first release contained a relatively minimal dataset, we hope the impact of this change remains minimal. The majority of the participants included in NHS GMS release v1 will be part of release v2, but under different participant and referral IDs.

Bioinformatics data

genome_file_paths_and_types

  • Structural Variant (SV) VCFs for rare disease participants (*.diploidSV.vcf.gz) are now provided per individual instead of as a single VCF containing the SVs of all individuals in a given family. In NHS GMS release v1 these were still provided on a per-family basis. The change was made because of the introduction of partial referrals, to keep SVs available for study as far as possible.
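
To pick the per-individual SV VCFs out of a list of file paths, the file name pattern above can be used directly. This is a minimal sketch, assuming the paths have already been read into a list (for example from genome_file_paths_and_types). The function name and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

SV_VCF_PATTERN = "*.diploidSV.vcf.gz"


def select_sv_vcfs(paths):
    """Keep only the paths whose name matches the SV VCF pattern."""
    # fnmatchcase is case-sensitive on every platform.
    return [p for p in paths if fnmatchcase(p, SV_VCF_PATTERN)]


# Hypothetical paths, for illustration only.
paths = [
    "/example/referral_a/sample_1.diploidSV.vcf.gz",
    "/example/referral_a/sample_1.vcf.gz",
    "/example/referral_a/sample_1.cram",
]

print(select_sv_vcfs(paths))  # prints ['/example/referral_a/sample_1.diploidSV.vcf.gz']
```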

cancer_analysis

  • Introduction of referral_id as a case reference ID. Participants can be part of multiple referrals.
  • NHS GMS columns clinical_indication_code and clinical_indication_full_name will provide detailed information on the tumour type (also found in the referral table).
  • 100K column tumour_id has been replaced with tumour_uid for NHS-GMS. The tumour_uid will enable the linking of tumour morphology and topography data across clinical tables.
  • 100K column tumour_clinical_sample_time has been replaced with tumour_sample_clinical_sample_date_time and the germline equivalent added as germline_sample_clinical_sample_date_time. However, this data is no longer submitted for every referral, so is absent for many samples.
  • NHS GMS columns somatic_tinc_vcf and somatic_tinc_sv_vcf are currently empty in the cancer_analysis table. This is not an error and is subject to change in future releases, but we decided to already include the column for this data.
  • 100K columns analysis_csv_filepath and analysis_html_filepath have been replaced with cancer_report_reported_variants_csv and cancer_report_supplementary_html, respectively. In addition, we have now also provided the smaller summarised report in the cancer_report_html column.
  • The annotated VCFs, csv's and html's can now be found in a single interpretation folder to increase visibility of data belonging to the same interpretation request.
  • While we expect that this table will receive more additions to increase its utility, we look forward to suggestions from the Research Community as to what may be useful columns or information.
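
For code written against the 100K cancer_analysis table, the column replacements described above can be captured in a small lookup. The sketch below is illustrative only, assuming rows are handled as dictionaries. The names COLUMN_RENAMES_100K_TO_GMS and to_gms_columns are hypothetical. It aligns column names only and makes no claim that the values are comparable between the two programmes.

```python
# 100K column name -> NHS GMS column name, as listed for cancer_analysis.
COLUMN_RENAMES_100K_TO_GMS = {
    "tumour_id": "tumour_uid",
    "tumour_clinical_sample_time": "tumour_sample_clinical_sample_date_time",
    "analysis_csv_filepath": "cancer_report_reported_variants_csv",
    "analysis_html_filepath": "cancer_report_supplementary_html",
}


def to_gms_columns(row):
    """Return a copy of row with 100K column names replaced by GMS names."""
    # Columns without a listed replacement keep their original name.
    return {COLUMN_RENAMES_100K_TO_GMS.get(name, name): value
            for name, value in row.items()}


# Hypothetical 100K-style row.
print(to_gms_columns({"tumour_id": "example", "platekey": "example_key"}))
# prints {'tumour_uid': 'example', 'platekey': 'example_key'}
```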

gmc_exit_questionnaire

  • While no changes have been made to this table, we want to reiterate that the columns additional_comments and publications have been intentionally made NA in this release as well. This remains subject to change in future releases.

Clinical Data

Three new fields have been added to this release of the clinical datasets:

  • referral.date_submitted: this provides the date when the referral was first submitted to GMS
  • plated_sample.date_of_dispatch: this provides the date when the plated sample was dispatched to the sequencing facility
  • referral.category: this provides the category given to the referral (Cancer/Rare Diseases)

Several of the extraneous guid fields have been removed from this release of the clinical datasets, specifically:

  • condition.uid
  • observation_component.uid
  • participant.uid
  • referral.uid
  • referral_participant.uid
  • referral_test.uid
  • tumour_morphology.uid
  • tumour_topography.uid

nhs-gms-release_v1_2022-06-15

This data release represents the baseline for subsequent releases.

Some tables were already present in the 100K data and therefore follow a similar format. Relative to the 100K format, the following changes apply to the similarly named tables in the NHS GMS data.

  • The participant_id values have changed format and are now strings of the form ppXXXXXXXXXXX
  • sequencing_report and genome_file_paths_and_types
    • Column family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces family_id.
    • Column laboratory_sample_id has been removed and will not be available for NHS GMS data.
    • The discrepancy between plate_key and platekey has been resolved. From now on, only platekey is used.
    • Column associated_interpretation_request_id has been included. From now on researchers will have a better view of which CRAM files have been used for a given interpretation request.
    • Joint-called VCFs are now readily available in /gel_data_resources/ and can be queried from either table.
    • Column data_format has been included. Singletons go through the same pipeline as multi-member families and are therefore considered 'joint-called'. Samples called without other family members are marked as single_sample in the data_format column.
    • More granularity has been provided in the file_sub_type column (i.e. more types).
    • Column delivery_date has been streamlined across the table and now only contains YYYY-MM-DD. Time stamps have been removed.
  • tiering_data and exomiser
    • Columns rare_diseases_family_id and family_id have been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of rare_diseases_family_id and family_id.
    • The discrepancy between sample_id and platekey has been resolved. From now on, only platekey is used.
    • The discrepancy between genome_build and assembly has been resolved. From now on, only genome_build is used.
    • Columns full_brothers_affected and full_sisters_affected have been removed. This has been replaced by full_siblings_affected and indicates the number of affected full siblings.
    • Column participant_phenotypic_sex will be NA in this release. This is subject to change in future releases.
  • panels_applied
    • Column rare_diseases_family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of rare_diseases_family_id.
    • The discrepancy between sample_id and platekey has been resolved. From now on, only platekey is used.
  • tiered_variants_frequency
    • A large number of columns will not be available for the initial release. The primary reason is that they are currently unavailable in our backend systems, because of changes between the 100K pipeline and the NHS GMS pipeline. This is subject to change in future releases.
  • gmc_exit_questionnaire
    • Column family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of family_id.
    • The discrepancy between genome_build and assembly has been resolved. From now on, only genome_build is used.
    • Columns additional_comments and publications have been intentionally made NA in this release. This is subject to change in future releases.
  • participant, plated_sample and sample
    • A large number of the columns will not be available for this initial release. This is subject to change in future releases.
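
The new participant_id format listed above (ppXXXXXXXXXXX) lends itself to a simple shape check, for example to tell NHS GMS identifiers apart from other IDs in a mixed input file. The sketch below derives the expected length from the template itself. It assumes that each X stands for exactly one alphanumeric character, which the release notes do not state, so treat the character class as a placeholder. The function name is hypothetical.

```python
import re

PARTICIPANT_ID_TEMPLATE = "ppXXXXXXXXXXX"

# Assumption: every X in the template is one alphanumeric character.
_PARTICIPANT_ID_RE = re.compile(
    re.escape(PARTICIPANT_ID_TEMPLATE[:2])
    + "[A-Za-z0-9]{%d}" % PARTICIPANT_ID_TEMPLATE.count("X")
)


def looks_like_gms_participant_id(value):
    """Return True if value has the 'pp' prefix and the template's length."""
    return _PARTICIPANT_ID_RE.fullmatch(value) is not None


# Build a hypothetical ID of the right shape from the template.
example_id = PARTICIPANT_ID_TEMPLATE.replace("X", "0")
print(looks_like_gms_participant_id(example_id))  # prints True
print(looks_like_gms_participant_id("12345"))     # prints False
```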

The data model for a number of the clinical tables differs from that in the 100,000 Genomes Project main programme releases. The list below outlines where to find the equivalent data in the main programme release.

  • condition, observation and observation_component
    • Data found in these tables can be found in the main programme tables rare_disease_participant_disease and rare_disease_participant_phenotype
  • referral and referral_participant
    • For NHS GMS, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of family_id, and the referral tables replace the utility of the rare_disease_pedigree, rare_disease_pedigree_member and rare_disease_family tables.
    • The concept of pedigree_member doesn't exist in NHS GMS; only data on currently consented individuals is included.
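
The correspondence described above can be kept as a lookup when porting queries between the two programmes. This is a minimal sketch whose table names are taken from the list above. The variable and function names are hypothetical, and the mapping says nothing about how individual columns correspond.

```python
_CLINICAL = (
    "rare_disease_participant_disease",
    "rare_disease_participant_phenotype",
)
_FAMILY = (
    "rare_disease_pedigree",
    "rare_disease_pedigree_member",
    "rare_disease_family",
)

# NHS GMS table -> main programme tables holding the equivalent data.
MAIN_PROGRAMME_EQUIVALENTS = {
    "condition": _CLINICAL,
    "observation": _CLINICAL,
    "observation_component": _CLINICAL,
    "referral": _FAMILY,
    "referral_participant": _FAMILY,
}


def main_programme_tables(gms_table):
    """Return the main programme tables for a GMS table, or an empty tuple."""
    return MAIN_PROGRAMME_EQUIVALENTS.get(gms_table, ())


print(main_programme_tables("observation"))
# prints ('rare_disease_participant_disease', 'rare_disease_participant_phenotype')
```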