NHS GMS data release change summary
nhs-gms-release_v2_2023-02-28
Some NHS GMS tables also exist in the 100,000 Genomes Project (100K) data and therefore follow a similar format. The changes listed below are relative to the 100K format of the similarly named tables.
Many subtle changes have been made to the NHS GMS tables; the most important ones are listed below. For example, NHS GMS release v2 reintroduces the cancer_analysis table.
General Changes
- This release introduces partial referrals. A partial referral is a referral for more than one participant in which only some of the participants have consented for research. This may limit some of the data available for the participants in a partial referral who did consent for research. We only release data for partial referral participants who consented for research.
- We have had to change our approach to encrypting participant, referral, and sample IDs. As a result, you will unfortunately not be able to find the same IDs in NHS GMS release v1 and v2. As the first release contained a relatively minimal dataset, we hope the impact of this change is small. The majority of the participants included in NHS GMS release v1 are also part of release v2, but under different participant and referral IDs.
genome_file_paths_and_types
- Structural variant (SV) VCFs for rare disease participants (`*.diploidSV.vcf.gz`) are now provided per individual instead of as a single VCF containing the SVs of all individuals in a family. In NHS GMS release v1 these were still provided per family; with the introduction of partial referrals, we moved to per-individual files to keep SVs available for study as much as possible.
cancer_analysis
- Introduction of `referral_id` as a case reference ID. Participants can be part of multiple referrals.
- NHS GMS columns `clinical_indication_code` and `clinical_indication_full_name` provide detailed information on the tumour type (also found in the `referral` table).
- 100K column `tumour_id` has been replaced with `tumour_uid` for NHS GMS. The `tumour_uid` enables the linking of tumour morphology and topography data across clinical tables.
- 100K column `tumour_clinical_sample_time` has been replaced with `tumour_sample_clinical_sample_date_time`, and the germline equivalent has been added as `germline_sample_clinical_sample_date_time`. However, this data is no longer submitted for every referral, so it is absent for many samples.
- NHS GMS columns `somatic_tinc_vcf` and `somatic_tinc_sv_vcf` are currently empty in the `cancer_analysis` table. This is not an error: we decided to include the columns ahead of the data, and this is subject to change in future releases.
- 100K columns `analysis_csv_filepath` and `analysis_html_filepath` have been replaced with `cancer_report_reported_variants_csv` and `cancer_report_supplementary_html`, respectively. In addition, the smaller summarised report is now provided in the `cancer_report_html` column.
- The annotated VCFs, CSV files and HTML files can now be found in a single interpretation folder to increase visibility of data belonging to the same interpretation request.
- While we expect that this table will receive more additions to increase its utility, we look forward to suggestions from the Research Community as to which columns or information would be useful.
gmc_exit_questionnaire
- While no changes have been made to this table, we want to reiterate that the columns `additional_comments` and `publications` have been intentionally set to NA in this release as well. This remains subject to change in future releases.
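When porting 100K queries to NHS GMS, the `cancer_analysis` column replacements described in the cancer_analysis section above can be captured as a simple lookup. The following is a minimal sketch in Python, assuming rows are handled as plain dictionaries; the helper name `rename_100k_columns` and the example values are hypothetical. Renaming a column does not imply that its values are comparable between the two programmes (for example `tumour_id` and `tumour_uid`).

```python
# Column replacements for cancer_analysis as listed in the release notes
# (100K name -> NHS GMS name).
CANCER_ANALYSIS_RENAMES = {
    "tumour_id": "tumour_uid",
    "tumour_clinical_sample_time": "tumour_sample_clinical_sample_date_time",
    "analysis_csv_filepath": "cancer_report_reported_variants_csv",
    "analysis_html_filepath": "cancer_report_supplementary_html",
}


def rename_100k_columns(row):
    """Return a copy of `row` with 100K column names mapped to NHS GMS names.

    Columns without a listed replacement are kept under their original name.
    """
    return {CANCER_ANALYSIS_RENAMES.get(key, key): value for key, value in row.items()}


# Hypothetical example row; the values are placeholders, not real data.
example = {
    "tumour_id": "T1",
    "analysis_csv_filepath": "/path/report.csv",
    "participant_id": "pp1",
}
renamed = rename_100k_columns(example)
print(sorted(renamed))
```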
Clinical Data
Three new fields have been added to this release of the clinical datasets:
* `referral.date_submitted`: the date when the referral was first submitted to GMS
* `plated_sample.date_of_dispatch`: the date when the plated sample was dispatched to the sequencing facility
* `referral.category`: the category given to the referral (Cancer/Rare Diseases)

Several of the extraneous GUID fields have been removed from this release of the clinical datasets, specifically:
* `condition.uid`
* `observation_component.uid`
* `participant.uid`
* `referral.uid`
* `referral_participant.uid`
* `referral_test.uid`
* `tumour_morphology.uid`
* `tumour_topography.uid`
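Existing queries written against release v1 may still select some of the removed fields. As a rough sketch, assuming the fields a query uses are available as `table.column` strings, a small check like the one below can flag them; the function name `find_removed_fields` and the example input are hypothetical.

```python
# Fields removed from the clinical datasets in release v2, per the list above.
REMOVED_FIELDS = {
    "condition.uid",
    "observation_component.uid",
    "participant.uid",
    "referral.uid",
    "referral_participant.uid",
    "referral_test.uid",
    "tumour_morphology.uid",
    "tumour_topography.uid",
}


def find_removed_fields(fields):
    """Return, in sorted order, the requested fields that were removed in v2."""
    return sorted(set(fields) & REMOVED_FIELDS)


# Hypothetical list of fields used by an existing query.
requested = ["referral.uid", "referral.date_submitted", "participant.uid"]
print(find_removed_fields(requested))
```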
nhs-gms-release_v1_2022-06-15
This data release represents the baseline for subsequent releases.
Some NHS GMS tables also exist in the 100K data and therefore follow a similar format. The changes listed below are relative to the 100K format of the similarly named tables.
- The `participant_id`s have changed format and are now strings of the form `ppXXXXXXXXXXX`.
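A format check can help catch 100K-style IDs that have been mixed into an NHS GMS analysis. The sketch below builds a regular expression from the pattern given above. It assumes that each `X` stands for a single decimal digit, which the release notes do not state, so the character class may need adjusting; the function name is hypothetical.

```python
import re

# Pattern exactly as given in the release notes.
TEMPLATE = "ppXXXXXXXXXXX"

# Assumption: each X is one decimal digit. The notes do not specify this.
PARTICIPANT_ID_RE = re.compile(r"pp\d{%d}" % TEMPLATE.count("X"))


def looks_like_gms_participant_id(value):
    """Return True if `value` is a string matching the assumed NHS GMS ID format."""
    return isinstance(value, str) and PARTICIPANT_ID_RE.fullmatch(value) is not None
```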
sequencing_report and genome_file_paths_and_types
- Column `family_id` has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, `referral_id` replaces `family_id`.
- Column `laboratory_sample_id` has been removed and will not be available for NHS GMS data.
- The discrepancy between `plate_key` and `platekey` has been resolved. From now on, only `platekey` is used.
- Column `associated_interpretation_request_id` has been included. Researchers now have a better view of which CRAM files have been used for a given interpretation request.
- Joint-called VCFs are now readily available in `/gel_data_resources/` and can be queried from either table.
- Column `data_format` has been included. Within our pipeline, singletons go through the same pipeline as multi-member families and are thus considered 'joint-called' even when it concerns a singleton. Samples called without other family members are marked as `single_sample` in the `data_format` column.
- More granularity has been provided in the `file_sub_type` column (i.e. more types).
- Column `delivery_date` has been standardised across the table and now only contains dates in `YYYY-MM-DD` format. Time stamps have been removed.
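When comparing `delivery_date` with values from other sources that still carry a time stamp, it can help to reduce those values to the same `YYYY-MM-DD` form first. A minimal sketch, assuming the other values are ISO 8601-style strings such as `2022-06-15 10:30:00` (an assumption, as the notes do not describe the earlier format); the function name is hypothetical.

```python
from datetime import datetime


def to_delivery_date(value):
    """Reduce an ISO 8601-style date or timestamp string to YYYY-MM-DD."""
    # fromisoformat accepts both date-only and date-plus-time strings.
    return datetime.fromisoformat(value).date().isoformat()
```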
tiering_data and exomiser
- Columns `rare_diseases_family_id` and `family_id` have been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, `referral_id` replaces the utility of `rare_diseases_family_id` and `family_id`.
- The discrepancy between `sample_id` and `platekey` has been resolved. From now on, only `platekey` is used.
- The discrepancy between `genome_build` and `assembly` has been resolved. From now on, only `genome_build` is used.
- Columns `full_brothers_affected` and `full_sisters_affected` have been removed. They have been replaced by `full_siblings_affected`, which indicates the number of affected full siblings.
- Column `participant_phenotypic_sex` will be NA in this release. This is subject to change in future releases.
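To compare 100K records with the new `full_siblings_affected` column, the two removed 100K columns can be combined into one count. The sketch below assumes that the new column corresponds to the sum of affected full brothers and affected full sisters, which the release notes imply but do not state explicitly. It treats a missing value in either input as unknown.

```python
def full_siblings_affected(full_brothers_affected, full_sisters_affected):
    """Combine the two 100K counts into a single affected-full-siblings count.

    Returns None when either input is missing, since the total is then unknown.
    """
    if full_brothers_affected is None or full_sisters_affected is None:
        return None
    return full_brothers_affected + full_sisters_affected
```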
panels_applied
- Column `rare_diseases_family_id` has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, `referral_id` replaces the utility of `family_id`.
- The discrepancy between `sample_id` and `platekey` has been resolved. From now on, only `platekey` is used.
tiered_variants_frequency
- A large number of columns will not be available for the initial release. The primary reason is that they are currently unavailable in our backend systems, as changes have been made between the 100K pipeline and the NHS GMS pipeline. This is subject to change in future releases.
gmc_exit_questionnaire
- Column `family_id` has been removed. From now on, cases are referred to as referrals and family members will be part of a single referral. Effectively, `referral_id` replaces the utility of `family_id`.
- The discrepancy between `genome_build` and `assembly` has been resolved. From now on, only `genome_build` is used.
- Columns `additional_comments` and `publications` have been intentionally set to NA in this release. This is subject to change in future releases.
participant, plated_sample and sample
- A large number of the columns will not be available for this initial release. This is subject to change in future releases.
The data model for a number of the clinical tables differs from that in the 100,000 Genomes Project main programme releases. The sections below outline where to find the equivalent data in the main programme release.
condition, observation and observation_component
- Data found in these tables can be found in the main programme tables `rare_disease_participant_disease` and `rare_disease_participant_phenotype`.
referral and referral_participant
- For NHS GMS, cases are referred to as referrals and family members will be part of a single referral. Effectively, `referral_id` replaces the utility of `family_id`, and the referral tables replace the utility of the `rare_disease_pedigree`, `rare_disease_pedigree_member` and `rare_disease_family` tables.
- The concept of `pedigree_member` doesn't exist in NHS GMS; only data on currently consented individuals is included.
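Because `referral_id` takes over the role of `family_id`, the members of a case can be recovered by grouping participants per referral. The following is a minimal sketch, assuming rows exported from the `referral_participant` table expose `referral_id` and `participant_id` columns (an assumption about the table layout); the example rows are placeholders. A participant can appear under more than one referral.

```python
from collections import defaultdict


def participants_by_referral(rows):
    """Group participant IDs by referral ID.

    `rows` is an iterable of dicts with `referral_id` and `participant_id` keys.
    Returns a dict mapping each referral ID to a sorted list of unique participants.
    """
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["referral_id"]].add(row["participant_id"])
    return {referral: sorted(members) for referral, members in grouped.items()}


# Placeholder rows, not real data. Participant ppA is part of two referrals.
rows = [
    {"referral_id": "r1", "participant_id": "ppA"},
    {"referral_id": "r1", "participant_id": "ppB"},
    {"referral_id": "r2", "participant_id": "ppA"},
]
print(participants_by_referral(rows))
```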