100kGP Release V1 (11/10/2017)¶
Document history and control¶
The controlled copy of this document is maintained in the Genomics England internal document management system. Any copies of this document held outside of that system, in whatever format (for example, paper, email attachment), are considered to have passed out of control and should be checked for currency and validity. This document is uncontrolled when printed.
Version history¶
Version | Date | Description |
---|---|---|
0.1 | 06/10/2017 | Initial draft of release note |
1.0 | 11/10/2017 | Final version that incorporates feedback |
Purpose¶
This document provides a description of the Main Programme Data Release v1 dated 11/10/2017.
This is our first formal release of data into the Research Environment. We intend to release data on a roughly quarterly basis, although initial releases may be more frequent. In successive releases, we will incorporate new content, enhance existing content and enable more effective use of both existing and new data.
These data will be manifested within the current version of the Genomics England Research Environment, accessed via the AWS virtual desktop interface and subject to all Genomics England data protection and privacy principles.
Release Overview¶
This release provides 19,865 genomes and associated clinical data. Of these, 17,338 are rare disease and 2,527 are cancer genomes.
- Genomic data are manifested in file shares.
- Clinical data are manifested in LabKey.
Some genomic data are currently called against GRCh37 and some against GRCh38, and there is also a mixture of Illumina’s alignment pipelines v2 and v4, reflecting the versions that were applicable at the time of sequencing. The versions for each genome are identified in the genome table. We intend to re-call all our genomes on a consistent basis in the future.
Audience¶
The intended audience for this document is researchers who have access to the Genomics England Research Environment.
Identifying this data release¶
The clinical data, secondary data, and tabulated bioinformatic data for this data release, and the paths to the applicable genome files, are found in the following LabKey folder:
main-programme/main-programme_v1_2017-10-11
Subsequent releases will be identified by an incremental increase in the version number and the date of data release.
Scope¶
In scope¶
Data that are in scope for this release:
- Cancer and rare disease data for main programme participants with current consent. These data include:
- genomic data
- clinical data for those participants for whom we are providing genomic data
- Hospital Episode Statistics (HES) data for the above subset: HES Accident and Emergency, HES Admitted Patient Care and HES Outpatient Care.
Out of scope¶
Data that are out of scope for this release:
- Clinical and genomic data for participants who have withdrawn from the 100,000 Genomes Project.
- Participant data from the pilot phases of the project (i.e. not the main programme).
- Interpretation request data and tiered variants (i.e. gene panels, interpretation settings and results of the interpretation pipeline).
- Formal pedigree data on rare disease participants, although the pedigree can be unambiguously derived from the relationship data provided for the large majority of families.
- Data on participants for whom we do not yet hold sequence data.
- Sources of secondary data other than HES.
Quality Notes¶
- BAM and VCF genomic data files are as they were delivered to us by our sequencing provider. These have all passed an initial QC check based on sequencing quality and coverage. They have not, however, all undergone our in-house genetic checks, so we cannot guarantee the absence of discrepancies between the genetic data and the reported data.
- Because of the availability of data at the time of release, some families lack a proband. These families without probands will also lack a diagnosis unless there is a second affected individual in the family. The missing data will be made available in a future release.
- Clinical data and HES data have been provided as submitted and have undergone limited validation.
- We do not yet have a complete HPO dataset for all participants in this release. The missing data will be made available in a future release.
Change Summary¶
The table below summarises the changes in each release:
Data release | Description |
---|---|
main-programme_v1_2017-10-11 | This data release represents the baseline for subsequent releases. |
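As a minimal sketch of how the release identifier could be read programmatically, the snippet below splits a release name into its version number and release date. The pattern is an assumption inferred from the single release name listed in this document, and `parse_release` is a hypothetical helper, not part of any Genomics England tooling.

```python
import re
from datetime import date

# Assumed naming pattern, inferred from the one release name in this document.
RELEASE_RE = re.compile(r"^main-programme_v(\d+)_(\d{4})-(\d{2})-(\d{2})$")


def parse_release(name):
    """Split a release name into (version number, release date)."""
    match = RELEASE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised release name: {name!r}")
    version, year, month, day = (int(part) for part in match.groups())
    return version, date(year, month, day)


version, released = parse_release("main-programme_v1_2017-10-11")
```

Sorting releases by the parsed tuple rather than by the raw string avoids misordering once version numbers reach two digits.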
Data release description¶
Below is a description of the LabKey tables and their associated data fields.
Common¶
genome¶
Field | Description |
---|---|
assembly | The genome assembly against which the genome was called. GRCh37 is represented as 37 and GRCh38 as 38 |
participant_id | GEL participant ID |
platekey | The well identifier for the sequence, a composite of plate_id and well_id |
delivery_id | The ID number of the delivered sequence |
delivery_date | The date the sequence was delivered |
delivery_version | The Illumina pipeline version |
path | The path to the sequence folder on the genomes drive |
plate_id | The ID number of the original plate used for sequencing |
well_id | The well of the sequence on the plate |
laboratory_sample_id | The ID number of the sample used for sequencing |
clinic_sample_type | The type of sample sequenced. Possible values: DNA Blood Germline; DNA FF Tumour; DNA FFPE Tumour; DNA; DNA Fibroblast; DNA Saliva; DNA FF Germline |
type | The type of sample. Possible values: rare disease; cancer germline; cancer tumour; unknown |
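Because this release mixes assemblies and pipeline versions, it can help to group genomes by build before analysing them together. The sketch below assumes the genome table has been exported to a list of dictionaries keyed by the field names above. The participant IDs and the `delivery_version` strings (`V2`, `V4`) are invented for illustration, so check the actual values in LabKey.

```python
from collections import Counter

# Hypothetical rows standing in for an export of the genome table.
# IDs and delivery_version strings are invented, not real release data.
genome_rows = [
    {"participant_id": "P1", "assembly": 37, "delivery_version": "V2"},
    {"participant_id": "P2", "assembly": 38, "delivery_version": "V4"},
    {"participant_id": "P3", "assembly": 38, "delivery_version": "V4"},
]


def count_by_build(rows):
    """Count genomes per (assembly, pipeline version) combination."""
    return Counter(
        (str(row["assembly"]), row["delivery_version"]) for row in rows
    )


def select_build(rows, assembly, delivery_version):
    """Keep only rows for one assembly and one pipeline version."""
    return [
        row
        for row in rows
        if str(row["assembly"]) == str(assembly)
        and row["delivery_version"] == delivery_version
    ]


build_counts = count_by_build(genome_rows)
grch38_v4 = select_build(genome_rows, 38, "V4")
```

The assembly is compared as a string so that the filter works whether the export stores it as a number or as text.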
registration¶
Field | Description |
---|---|
participant_id | GEL participant ID |
study_type | Whether the participant is Rare Disease or Cancer |
sex_at_birth | Sex of the participant at birth. 1 = Male; 2 = Female; 9 = Unknown |
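The coded `sex_at_birth` values can be translated into labels with a small lookup. This is a minimal sketch using only the three codes listed above; `decode_sex_at_birth` is a hypothetical helper, and any other code raises an error so that it is not silently mislabelled.

```python
# Codes as documented in the registration table above.
SEX_AT_BIRTH = {1: "Male", 2: "Female", 9: "Unknown"}


def decode_sex_at_birth(code):
    """Translate a sex_at_birth code (int or numeric string) to its label."""
    return SEX_AT_BIRTH[int(code)]


labels = [decode_sex_at_birth(code) for code in (1, "2", 9)]
```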
GECIP_domain¶
Field | Description |
---|---|
participant_id | GEL participant ID |
family_id | The family ID |
domain | GECIP domain |
Rare disease¶
Disorder¶
Field | Description |
---|---|
disease_group | The disease group as described by the clinician |
disease_group_normalised | The disease group, standardised using the Genomics England naming convention. Note: This may change over time. |
disease_subgroup | The disease subgroup as described by the clinician |
disease_subgroup_normalised | The disease subgroup, standardised using the Genomics England naming convention. Note: This may change over time. |
specific_disease | The specific disease as described by the clinician |
specific_disease_normalised | The specific disease, standardised using the Genomics England naming convention. Note: This may change over time. |
participant_id | GEL participant ID |
Note: please refer to the GECIP Confluence page for the rare disease data model.
participant_relationship¶
Field | Description |
---|---|
participant_id | GEL participant ID |
participant_type_id | Whether the participant is a proband or a relative |
relative_biological_relationship_to_proband_id | The relationship between the participant and the proband, values only present where participant_type_id = relative |
family_id | The family ID |
Note: please refer to the GECIP Confluence page for the rare disease data model.
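Since formal pedigrees are not part of this release and some families lack a proband, a first step when working with this table is to group participants by family and flag the families with no proband. The sketch below assumes the table has been exported to a list of dictionaries. The IDs, the relationship labels and the exact spelling of the `participant_type_id` values are invented for illustration.

```python
from collections import defaultdict

REL_FIELD = "relative_biological_relationship_to_proband_id"

# Hypothetical rows standing in for an export of participant_relationship.
relationship_rows = [
    {"participant_id": "P1", "family_id": "F1",
     "participant_type_id": "Proband", REL_FIELD: None},
    {"participant_id": "P2", "family_id": "F1",
     "participant_type_id": "Relative", REL_FIELD: "Mother"},
    {"participant_id": "P3", "family_id": "F2",
     "participant_type_id": "Relative", REL_FIELD: "Father"},
]


def group_families(rows):
    """Map each family_id to its proband and its relatives' relationships."""
    families = defaultdict(lambda: {"proband": None, "relatives": {}})
    for row in rows:
        family = families[row["family_id"]]
        if row["participant_type_id"].lower() == "proband":
            family["proband"] = row["participant_id"]
        else:
            family["relatives"][row["participant_id"]] = row[REL_FIELD]
    return dict(families)


def families_without_proband(families):
    """Return the sorted IDs of families that have no proband record."""
    return sorted(
        family_id
        for family_id, family in families.items()
        if family["proband"] is None
    )


families = group_families(relationship_rows)
missing_proband = families_without_proband(families)
```

Families returned by `families_without_proband` are the ones for which relationships to the proband cannot yet be anchored to a sequenced individual in this release.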
hpo¶
Field | Description |
---|---|
hpo_term | The accompanying HPO term description |
hpo_code | The HPO term identifier |
participant_id | GEL participant ID |
hpo_term_presence | Whether the HPO term is present or absent. Possible values: yes; no; unknown |
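Because `hpo_term_presence` distinguishes present, absent and unknown terms, a phenotype summary should usually keep only the rows marked `yes`. The sketch below assumes the hpo table has been exported to a list of dictionaries. The participant IDs and rows are invented, and the `HP:` prefixed code format is an assumption about how `hpo_code` is stored.

```python
from collections import defaultdict

# Hypothetical rows standing in for an export of the hpo table.
hpo_rows = [
    {"participant_id": "P1", "hpo_code": "HP:0001250",
     "hpo_term": "Seizure", "hpo_term_presence": "yes"},
    {"participant_id": "P1", "hpo_code": "HP:0001263",
     "hpo_term": "Global developmental delay", "hpo_term_presence": "no"},
    {"participant_id": "P2", "hpo_code": "HP:0000365",
     "hpo_term": "Hearing impairment", "hpo_term_presence": "unknown"},
]


def present_terms(rows):
    """Map each participant to the sorted HPO codes recorded as present."""
    terms = defaultdict(set)
    for row in rows:
        if row["hpo_term_presence"].lower() == "yes":
            terms[row["participant_id"]].add(row["hpo_code"])
    return {pid: sorted(codes) for pid, codes in terms.items()}


present = present_terms(hpo_rows)
```

Participants with no term marked `yes` do not appear in the result, which is worth remembering given that the HPO dataset in this release is incomplete.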
Cancer¶
tumour_type¶
Field | Description |
---|---|
participant_id | GEL participant ID |
disease_type_id | The cancer type of the tumour sample submitted to Genomics England |
tumour_subtype | The subtype of the cancer in question |
Note: please refer to the GECIP Confluence page for the cancer data model.
Medical history - HES data¶
HES data tables provided are Admitted Patient Care, Outpatient Care, and Accident and Emergency.
Contact and Support¶
For all queries relating to this data release, please contact the Genomics England Service Desk through the Service Desk portal. The Service Desk is supported by dedicated GECIP team members for all relevant questions.
However, there is an expectation that GECIP domains are self-organised and self-managing. We will be providing a GECIP chat service that will operate within the Research Environment so that users can collaborate to resolve problems as a community. We will monitor GECIP chat ourselves and, if we identify solutions to problems, we will share them with all users simultaneously.