2017)¶

Document history and control¶

The controlled copy of this document is maintained in the Genomics England internal document management system. Any copies of this document held outside of that system, in whatever format (for example, paper, email attachment), are considered to have passed out of control and should be checked for currency and validity. This document is uncontrolled when printed.

Version history¶

Version	Date	Description
0.1	06/10/2017	Initial draft of release note
1.0	11/10/2017	Final version that incorporates feedback

Purpose¶

This document provides a description of the Main Programme Data Release v1 dated 11/10/2017.

This is our first formal release of data into the Research Environment. We intend to release data on a roughly quarterly basis, although initial releases may be more frequent. In successive releases, we will incorporate new content, enhance existing content and enable more effective use of both existing and new data.

These data will be manifested within the current version of Genomics England Research Environment, accessed via the AWS virtual desktop interface and subject to all Genomics England data protection and privacy principles.

Release Overview¶

This release provides 19,865 genomes and associated clinical data. Of these, 17,338 are rare disease and 2,527 are cancer genomes.

Genomic data are manifested in file shares.
Clinical data are manifested in LabKey.

Some genomic data are currently called against GRCh37 and some against GRCh38. There is also a mixture of Illumina’s alignment pipelines v2 and v4, reflecting the versions that were applicable at the time of sequencing. The versions for each genome are identified in the genome table. We intend to re-call all our genomes on a consistent basis in the future.

Audience¶

The intended audience for this document is researchers who have access to the Genomics England Research Environment.

Identifying this data release¶

The clinical data, secondary data, and tabulated bioinformatic data for this data release, and the paths to the applicable genome files, are found in the following LabKey folder:

main-programme /main-programme_v1_2017-10-11

Subsequent releases will be identified by an incremental increase in the version number and the date of data release.
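As a minimal sketch, a release identifier of this form could be split into its version number and release date as follows. The pattern is an assumption inferred from the single identifier shown in this document (main-programme_v1_2017-10-11), and `parse_release_name` is a hypothetical helper, not part of any Genomics England tooling.

```python
import re
from datetime import date

# Assumed pattern: <programme>_v<version>_<YYYY-MM-DD>, inferred from the one
# release identifier shown in this release note.
RELEASE_PATTERN = re.compile(
    r"^(?P<programme>.+)_v(?P<version>\d+)_(?P<date>\d{4}-\d{2}-\d{2})$"
)


def parse_release_name(name):
    """Hypothetical helper: split a release folder name into its parts."""
    match = RELEASE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognised release name: {name!r}")
    return {
        "programme": match.group("programme"),
        "version": int(match.group("version")),
        "date": date.fromisoformat(match.group("date")),
    }


release = parse_release_name("main-programme_v1_2017-10-11")
print(release["programme"], release["version"], release["date"])
```

Parsing the version as an integer and the date as a `date` object would let later releases be sorted or compared directly, if they keep the same naming scheme.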

Scope¶

In scope¶

Data that are in scope for this release:

Cancer and rare disease data for main programme participants with current consent. These data include:
genomic data
clinical data for those participants for whom we are providing genomic data
Hospital Episode Statistics (HES) data for the same participants: HES Accident and Emergency, HES Admitted Patient Care and HES Outpatient Care.

Out of scope¶

Data that are out of scope for this release:

Clinical and genomic data for participants that have withdrawn from the 100,000 Genomes Project.
Participant data from the pilot phases of the project (ie not main programme).
Interpretation request data and tiered variants (ie gene panels, interpretation settings and results of the interpretation pipeline).
Formal pedigree data on rare disease participants. However, the pedigree can be unambiguously derived from the relationship data provided for the large majority of families.
Data on participants for whom we do not yet hold sequence data.
Sources of secondary data other than HES.

Quality Notes¶

BAM and VCF genomic data files are provided as delivered to us by our sequencing provider. All have passed an initial QC check based on sequencing quality and coverage. However, not all have undergone our in-house genetic checks, so we cannot guarantee that there are no discrepancies between the genetic data and the reported data.
Because of the availability of data at the time of release, some families lack a proband. These families without probands will also lack a diagnosis unless there is a second affected individual in the family. The missing data will be made available in a future release.
Clinical data and HES data have been provided as submitted and have undergone limited validation.
We do not yet have a complete HPO dataset for all participants in this release. The missing data will be made available in a future release.

Change Summary¶

The table below summarises the changes in each release:

Data release	Description
main-programme_v1_2017-10-11	This data release represents the baseline for subsequent releases.

Data release description¶

Below is a description of the LabKey tables and their associated data fields.

Common¶

genome¶

Field	Description
assembly	The genome assembly version against which the genome was called. GRCh37 is represented as 37 and GRCh38 as 38
participant_id	GEL participant ID
platekey	The well identifier for the sequence, a composite of plate_id and well_id
delivery_id	The ID number of the delivered sequence
delivery_date	The date the sequence was delivered
delivery_version	The Illumina pipeline version
path	The path to the sequence folder on the genomes drive
plate_id	The ID number of the original plate used for sequencing
well_id	The well of the sequence on the plate
laboratory_sample_id	The ID number of the sample used for sequencing
clinic_sample_type	The type of sample sequenced. Possible values: DNA Blood Germline; DNA FF Tumour; DNA FFPE Tumour; DNA; DNA Fibroblast; DNA Saliva; DNA FF Germline
type	The type of sample. Possible values: rare disease; cancer germline; cancer tumour; unknown
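
Because assemblies and pipeline versions are mixed in this release, it may help to tally and filter genomes before analysis. The sketch below assumes the genome table has been exported as tab-separated text. The rows are invented placeholders (the participant IDs, `delivery_version` values and paths are not real data), and only the column names and the 37/38 encoding of `assembly` are taken from the table above.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: IDs, pipeline version strings and paths are
# invented placeholders, not values from the release.
GENOME_TSV = (
    "participant_id\tassembly\tdelivery_version\ttype\tpath\n"
    "P001\t37\tV2\trare disease\t/example/path/a\n"
    "P002\t38\tV4\trare disease\t/example/path/b\n"
    "P003\t38\tV4\tcancer tumour\t/example/path/c\n"
)

rows = list(csv.DictReader(io.StringIO(GENOME_TSV), delimiter="\t"))

# Count genomes per (assembly, pipeline version) combination.
counts = Counter((row["assembly"], row["delivery_version"]) for row in rows)

# Keep only the paths of genomes called against GRCh38 (encoded as 38).
grch38_paths = [row["path"] for row in rows if row["assembly"] == "38"]

for (assembly, version), n in sorted(counts.items()):
    print(f"assembly={assembly} delivery_version={version} genomes={n}")
print(grch38_paths)
```

Restricting an analysis to a single assembly in this way avoids mixing coordinates from GRCh37 and GRCh38.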

registration¶

Field	Description
participant_id	GEL participant ID
study_type	Whether the participant is Rare Disease or Cancer
sex_at_birth	Sex of the participant at birth. 1 = Male; 2 = Female; 9 = Unknown
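
The numeric codes in `sex_at_birth` can be mapped to labels with a small lookup, sketched below. `decode_sex_at_birth` is a hypothetical helper, and it assumes the codes arrive as integers or numeric strings. Only the three codes listed above are recognised.

```python
# Codes as documented for the sex_at_birth field.
SEX_AT_BIRTH = {1: "Male", 2: "Female", 9: "Unknown"}


def decode_sex_at_birth(code):
    """Hypothetical helper: map a sex_at_birth code to its label."""
    return SEX_AT_BIRTH.get(int(code), "Unrecognised code")


print(decode_sex_at_birth(1))
print(decode_sex_at_birth("2"))
```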

GECIP_domain¶

Field	Description
participant_id	GEL participant ID
family_id	The family ID
domain	GECIP domain

Rare disease¶

Disorder¶

Field	Description
disease_group	The disease group as described by the clinician
disease_group_normalised	The disease group, standardised using the Genomics England naming convention. Note: This may change over time
disease_subgroup	The disease subgroup as described by the clinician
disease_subgroup_normalised	The disease subgroup, standardised using the Genomics England naming convention. Note: This may change over time
specific_disease	The specific disease as described by the clinician
specific_disease_normalised	The specific disease, standardised using the Genomics England naming convention. Note: This may change over time
participant_id	GEL participant ID

Note: please refer to the GECIP Confluence page for the rare disease data model.

participant_relationship¶

Field	Description
participant_id	GEL participant ID
participant_type_id	Whether the participant is a proband, or a relative
relative_biological_relationship_to_proband_id	The relationship between the participant and the proband, values only present where participant_type_id = relative
family_id	The family ID

Note: please refer to the GECIP Confluence page for the rare disease data model.
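
The quality notes state that some families in this release lack a proband. One way to find them, sketched here under stated assumptions, is to group the participant_relationship rows by `family_id` and report the families with no proband row. The rows below are invented placeholders, and the literal values "proband" and "relative" are assumed from the field description rather than confirmed. `families_without_proband` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative rows only. The values "proband" and "relative" are assumed
# from the field description of participant_type_id.
relationships = [
    {"participant_id": "P001", "participant_type_id": "proband", "family_id": "F1"},
    {"participant_id": "P002", "participant_type_id": "relative", "family_id": "F1"},
    {"participant_id": "P003", "participant_type_id": "relative", "family_id": "F2"},
]


def families_without_proband(rows):
    """Hypothetical helper: return family IDs that have no proband row."""
    types_by_family = defaultdict(set)
    for row in rows:
        participant_type = row["participant_type_id"].strip().lower()
        types_by_family[row["family_id"]].add(participant_type)
    return sorted(
        family_id
        for family_id, types in types_by_family.items()
        if "proband" not in types
    )


print(families_without_proband(relationships))
```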

hpo¶

Field	Description
hpo_term	The accompanying HPO term description
hpo_code	The HPO term identifier
participant_id	GEL participant ID
hpo_term_presence	Whether the HPO term is present or absent. Possible values: yes; no; unknown
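
Because `hpo_term_presence` records absent and unknown terms as well as present ones, a phenotype query would usually filter on it first. The sketch below collects the HPO codes marked as present for each participant. The rows are invented placeholders, the lower-case spelling of the presence values is an assumption, and `present_terms` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative rows only; participant IDs are placeholders.
hpo_rows = [
    {"participant_id": "P001", "hpo_code": "HP:0001250",
     "hpo_term": "Seizure", "hpo_term_presence": "yes"},
    {"participant_id": "P001", "hpo_code": "HP:0000252",
     "hpo_term": "Microcephaly", "hpo_term_presence": "no"},
    {"participant_id": "P002", "hpo_code": "HP:0001250",
     "hpo_term": "Seizure", "hpo_term_presence": "unknown"},
]


def present_terms(rows):
    """Hypothetical helper: map each participant to HPO codes marked present."""
    present = defaultdict(list)
    for row in rows:
        # Only "yes" counts as present; "no" and "unknown" are skipped.
        if row["hpo_term_presence"].strip().lower() == "yes":
            present[row["participant_id"]].append(row["hpo_code"])
    return dict(present)


print(present_terms(hpo_rows))
```

Participants with only absent or unknown terms do not appear in the result, which matters here because the HPO dataset is not yet complete for every participant in this release.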

Cancer¶

tumour_type¶

Field	Description
participant_id	GEL participant ID
disease_type_id	The cancer type of the tumour sample submitted to Genomics England
tumour_subtype	The subtype of the cancer in question

Note: please refer to the GECIP Confluence page for the cancer data model.

Medical history - HES data¶

HES data tables provided are Admitted Patient Care, Outpatient Care, and Accident and Emergency.

Contact and Support¶

For all queries relating to this data release, please contact the Genomics England Service Desk through the Service Desk portal. The Service Desk is supported by dedicated GECIP team members for all relevant questions.

However, we expect GECIP domains to be self-organised and self-managing. We will provide a GECIP chat service within the Research Environment so that users can collaborate to resolve problems as a community. We will monitor GECIP chat ourselves, and if we identify solutions to problems we will share them with all users at the same time.