NHS GMS data release change summary¶
nhs-gms-release_v4_2024-08-22¶
Changes to existing tables¶
cancer_analysis
Each participant has only one row per cancer case. GMS data release v4 includes an additional filter on the referral status to exclude statuses that are not active. This fixes an issue in GMS data release v3 that caused duplicate tumour_uid values to be present.
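Analyses carried over from release v3 may want to confirm that no duplicate tumour_uid values remain. The following is a minimal sketch, assuming the table has been exported to a list of row dictionaries; the example rows and the helper name `duplicate_tumour_uids` are hypothetical and not part of the release.

```python
from collections import Counter


def duplicate_tumour_uids(rows):
    """Return the tumour_uid values that occur on more than one row."""
    counts = Counter(row["tumour_uid"] for row in rows)
    return sorted(uid for uid, n in counts.items() if n > 1)


# Hypothetical rows standing in for an export of cancer_analysis.
rows = [
    {"participant_id": "pp-a", "tumour_uid": "t-001"},
    {"participant_id": "pp-a", "tumour_uid": "t-001"},  # duplicate row
    {"participant_id": "pp-b", "tumour_uid": "t-002"},
]

print(duplicate_tumour_uids(rows))
```

An empty result indicates that each tumour_uid appears on exactly one row.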
report_outcome_questionnaire
This table was previously called gmc_exit_questionnaire and has been renamed report_outcome_questionnaire so that it aligns more closely with what the questionnaires are called in GMS.
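Scripts that run against more than one release need to pick the table name that matches the release. A small sketch of one way to do this is shown here; the function name `questionnaire_table` and the use of an integer release number are assumptions made for illustration.

```python
def questionnaire_table(release_version: int) -> str:
    """Return the questionnaire table name used by a given GMS release number."""
    # The table was renamed in release v4; releases v1 to v3 use the older name.
    if release_version >= 4:
        return "report_outcome_questionnaire"
    return "gmc_exit_questionnaire"


print(questionnaire_table(3))
print(questionnaire_table(4))
```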
LabKey UI datatype changes
There have been improvements to the datatypes in the LabKey UI for the following tables.
Table | Field | Previous datatype | Updated datatype |
---|---|---|---|
sample | collection_date | varchar | timestamp (yyyy-MM-dd format) |
sample | din_value_glh | integer | decimal |
sample | percentage_dna_glh | integer | decimal |
panels_applied | panel_identifier | integer | varchar |
tiering_data | father_affected | boolean | varchar |
tiering_data | mother_affected | boolean | varchar |
exomiser | father_affected | boolean | varchar |
exomiser | mother_affected | boolean | varchar |
exomiser | poly_phen | varchar | decimal |
exomiser | mutation_taster | varchar | decimal |
exomiser | sift | varchar | decimal |
av_patient | embarkation | boolean | varchar |
sact | administration_date | varchar | timestamp (yyyy-MM-dd format) |
sact | date_of_final_treatment | varchar | timestamp (yyyy-MM-dd format) |
sact | chemo_radiation | varchar | boolean |
sact | regimen_mod_stopped_early | varchar | boolean |
sact | regimen_mod_time_delay | varchar | boolean |
sact | start_date_of_cycle | varchar | timestamp (yyyy-MM-dd format) |
sact | start_date_of_regimen | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
sact | date_decision_to_treat | varchar | timestamp (yyyy-MM-dd format) |
rtds | proceduredate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
rtds | timeofexposure | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (HH:mm:ss format) |
rtds | treatmentstartdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
rtds | earliestclinappropriatedate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
rtds | decisiontotreatdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
rtds | apptdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_treatment | eventdate | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_tumour | diagnosisdate1 | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_tumour | diagnosisdate2 | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_tumour | diagnosisdatebest | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_tumour | statusofregistration | boolean | varchar |
av_tumour | breslow | varchar | decimal |
av_tumour | first_hosp_date | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
av_tumour | date_first_surgery | timestamp (yyyy-MM-dd HH:mm:ss format) | timestamp (yyyy-MM-dd format) |
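Code written against release v3 may expect date fields in the longer yyyy-MM-dd HH:mm:ss form. The sketch below shows a tolerant parser that accepts either form, assuming the values arrive as plain strings in an export; how values are actually serialised by a given LabKey client is not stated here, and the helper name `to_date` is hypothetical.

```python
from datetime import date, datetime


def to_date(value: str) -> date:
    """Parse 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss' text into a date."""
    for fmt in ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"unrecognised date format: {value!r}")


print(to_date("2024-03-18"))
print(to_date("2024-03-18 00:00:00"))
```

Both calls return the same date, so downstream comparisons do not depend on which release the value came from.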
nhs-gms-release_v3_2024-03-18¶
New data sets¶
This release includes secondary clinical data, i.e. medical history, from the National Cancer Registration and Analysis Service (NCRAS). The information is provided in the following tables:
- av_imd: income deprivation domain
- av_patient: demographics from the Cancer Registration and information about death
- av_rtd: routes to diagnosis
- av_treatment: treatment received for each participant
- av_tumour: medical information about the tumour
- rtds: radiotherapy dataset
- sact: systemic anti-cancer therapy
More information on these datasets can be found on the Cancer-specific clinical data page.
Changes to existing tables¶
participant
The participant table now contains two additional fields:
- Category: this provides the category given to the referral (Cancer/Rare Diseases)
- Referral id: this provides the id of the referral submitted to GMS
observation
The observation table now contains two additional fields:
- Normalised Hpo Id
- Normalised Hpo Term
gmc_exit_questionnaire
The Additional Comments and Publications columns in the gmc_exit_questionnaire table now contain information. Personal Identifiable Data (PID) has been masked by replacing it with '---'.
nhs-gms-release_v2_2023-02-28¶
Some tables were already present in the 100K data and therefore follow a similar format. The changes listed below are relative to the 100K format of the similarly named tables.
While various subtle changes have been made to the NHS GMS tables, we list only the most important ones below. For example, NHS GMS release v2 reintroduces the cancer_analysis table.
General Changes¶
- This release sees the introduction of partial referrals. Partial referrals are referrals for more than one participant, and for which only a proportion of the participants have consented for research. This may limit some of the data available for the participants within a partial referral who did consent for research. We are only releasing data for partial referral participants who did consent for research.
- We have had to make a change to our approach of encrypting participant, referral, and sample IDs. Therefore, you will unfortunately not be able to find the same IDs between NHS-GMS release v1 and v2. As the first release contained a relatively minimal dataset, we hope the impact of this change remains minimal. The majority of the participants included in NHS GMS release v1 will be part of release v2, but under different participant and referral IDs.
Bioinformatics data¶
genome_file_paths_and_types
- Structural Variant (SV) VCFs for rare disease participants (*.diploidSV.vcf.gz) are now provided per individual instead of as a single VCF containing the SVs of all individuals in a given family. In NHS-GMS release v1 these were still provided per family; with the introduction of partial referrals, providing them per individual keeps SVs available for study as far as possible.
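The filename pattern above can be used to pick the per-individual SV VCFs out of a list of file paths. This is a minimal sketch, assuming the paths have already been retrieved as strings (for example from genome_file_paths_and_types); the example paths and the helper name `sv_vcf_paths` are hypothetical.

```python
from fnmatch import fnmatchcase

# Filename pattern quoted in the release notes for rare disease SV VCFs.
SV_PATTERN = "*.diploidSV.vcf.gz"


def sv_vcf_paths(paths):
    """Keep only the paths whose name matches the SV VCF pattern."""
    # fnmatchcase is case-sensitive on every platform; '*' also matches '/'.
    return [p for p in paths if fnmatchcase(p, SV_PATTERN)]


# Hypothetical paths for illustration only.
paths = [
    "/hypothetical/dir/sample_A.diploidSV.vcf.gz",
    "/hypothetical/dir/sample_A.vcf.gz",
    "/hypothetical/dir/sample_B.diploidSV.vcf.gz.tbi",
]

print(sv_vcf_paths(paths))
```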
cancer_analysis
* Introduction of referral_id as a case reference ID. Participants can be part of multiple referrals.
* NHS GMS columns clinical_indication_code and clinical_indication_full_name provide detailed information on the tumour type (also found in the referral table).
* 100K column tumour_id has been replaced with tumour_uid for NHS-GMS. The tumour_uid enables the linking of tumour morphology and topography data across clinical tables.
* 100K column tumour_clinical_sample_time has been replaced with tumour_sample_clinical_sample_date_time, and the germline equivalent has been added as germline_sample_clinical_sample_date_time. However, this data is no longer submitted for every referral, so it is absent for many samples.
* NHS GMS columns somatic_tinc_vcf and somatic_tinc_sv_vcf are currently empty in the cancer_analysis table. This is not an error: the columns have been included ahead of the data, and this is subject to change in future releases.
* 100K columns analysis_csv_filepath and analysis_html_filepath have been replaced with cancer_report_reported_variants_csv and cancer_report_supplementary_html, respectively. In addition, the smaller summarised report is now provided in the cancer_report_html column.
* The annotated VCFs, CSV files and HTML files can now be found in a single interpretation folder to increase visibility of data belonging to the same interpretation request.
* While we expect that this table will receive more additions to increase its utility, we look forward to suggestions from the Research Community as to what may be useful columns or information.
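When adapting a 100K query to the NHS GMS cancer_analysis table, the column replacements listed above can be collected in a lookup. The sketch below only translates column names; it does not convert values, and identifiers such as tumour_uid are not assumed to match 100K tumour_id values. The names `CANCER_ANALYSIS_RENAMES` and `to_gms_columns` are hypothetical.

```python
# 100K column name -> NHS GMS column name, taken from the list above.
CANCER_ANALYSIS_RENAMES = {
    "tumour_id": "tumour_uid",
    "tumour_clinical_sample_time": "tumour_sample_clinical_sample_date_time",
    "analysis_csv_filepath": "cancer_report_reported_variants_csv",
    "analysis_html_filepath": "cancer_report_supplementary_html",
}


def to_gms_columns(columns):
    """Translate 100K column names to their NHS GMS equivalents, leaving others unchanged."""
    return [CANCER_ANALYSIS_RENAMES.get(name, name) for name in columns]


print(to_gms_columns(["participant_id", "tumour_id", "analysis_csv_filepath"]))
```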
gmc_exit_questionnaire
* While no changes have been made to this table, we want to reiterate that the columns additional_comments and publications have been intentionally made NA in this release as well. This remains subject to change in future releases.
Clinical Data¶
Three new fields have been added to this release of the clinical datasets:
- referral.date_submitted: this provides the date when the referral was first submitted to GMS
- plated_sample.date_of_dispatch: this provides the date when the plated sample was dispatched to the sequencing facility
- referral.category: this provides the category given to the referral (Cancer/Rare Diseases)

Several of the extraneous guid fields have been removed from this release of the clinical datasets, specifically:
- condition.uid
- observation_component.uid
- participant.uid
- referral.uid
- referral_participant.uid
- referral_test.uid
- tumour_morphology.uid
- tumour_topography.uid
nhs-gms-release_v1_2022-06-15¶
This data release represents the baseline for subsequent releases.
Some tables were already present in the 100K data and therefore follow a similar format. The changes listed below are relative to the 100K format of the similarly named tables.
- The participant_id values have changed format and are now a string with the following logic: ppXXXXXXXXXXX
- sequencing_report and genome_file_paths_and_types
  - Column family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces family_id.
  - Column laboratory_sample_id has been removed and will not be available for NHS GMS data.
  - Discrepancy between plate_key vs platekey has been streamlined. From now on, only references to platekey are used.
  - Column associated_interpretation_request_id has been included. From now on researchers will have a better view on which CRAM files have been used for a given interpretation request.
  - Joint-called VCFs are now readily available in /gel_data_resources/ and can be queried from either table.
  - Column data_format has been included. Within our pipeline, singletons will go through the same pipeline as multi-member families and are thus considered 'joint-called' even when it concerns a singleton. Samples called without other family members are marked as single_sample in the data_format column.
  - More granularity has been provided in the file_sub_type column (i.e. more types).
  - Column delivery_date has been streamlined across the table and now only contains YYYY-MM-DD. Time stamps have been removed.
- tiering_data and exomiser
  - Columns rare_diseases_family_id and family_id have been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of rare_diseases_family_id and family_id.
  - Discrepancy between sample_id vs platekey has been streamlined. From now on, only references to platekey are used.
  - Discrepancy between genome_build vs assembly has been streamlined. From now on, only references to genome_build are used.
  - Columns full_brothers_affected and full_sisters_affected have been removed. They have been replaced by full_siblings_affected, which indicates the number of affected full siblings.
  - Column participant_phenotypic_sex will be NA in this release. This is subject to change in future releases.
- panels_applied
  - Column rare_diseases_family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of family_id.
  - Discrepancy between sample_id vs platekey has been streamlined. From now on, only references to platekey are used.
- tiered_variants_frequency
  - A large number of columns will not be available for the initial release. The primary reason is that they are currently unavailable in our backend systems, because of changes made between the 100K pipeline and the NHS GMS pipeline. This is subject to change in future releases.
- gmc_exit_questionnaire
  - Column family_id has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of family_id.
  - Discrepancy between genome_build vs assembly has been streamlined. From now on, only references to genome_build are used.
  - Columns additional_comments and publications have been intentionally made NA in this release. This is subject to change in future releases.
- participant, plated_sample and sample
  - A large number of the columns will not be available for this initial release. This is subject to change in future releases.
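The participant_id format given at the top of the list above (ppXXXXXXXXXXX) can be checked with a regular expression. The sketch below derives the expected length from that template and assumes that each X stands for a decimal digit, which the release notes do not state explicitly; the helper name `looks_like_gms_participant_id` is hypothetical.

```python
import re

# Template quoted in the release notes; each X is ASSUMED to be one decimal digit.
TEMPLATE = "ppXXXXXXXXXXX"
PARTICIPANT_ID_RE = re.compile(r"pp\d{%d}" % TEMPLATE.count("X"))


def looks_like_gms_participant_id(value: str) -> bool:
    """Return True if value matches the templated participant_id shape."""
    return PARTICIPANT_ID_RE.fullmatch(value) is not None


# Build a made-up ID of the right shape rather than using a real one.
example_id = "pp" + "0" * TEMPLATE.count("X")
print(looks_like_gms_participant_id(example_id))
print(looks_like_gms_participant_id("123456789"))
```

If the X positions turn out to allow other characters, only the character class in the pattern needs to change.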
The data model for a number of the clinical tables differs from that in the 100,000 Genomes Project main programme releases. The list below outlines where you would find the equivalent data in the main programme release.
- condition, observation and observation_component
  - Data found in these tables can be found in the main programme tables rare_disease_participant_disease and rare_disease_participant_phenotype
- referral and referral_participant
  - For NHS GMS, cases are referred to as referrals and family members will be part of a single referral. Effectively, referral_id replaces the utility of family_id, and the referral tables replace the utility of the rare_disease_pedigree, rare_disease_pedigree_member and rare_disease_family tables
  - The concept of pedigree_member doesn't exist in NHS GMS; only data on currently consented individuals is included