Release V5.1 (20/11/2018)¶
Purpose¶
This document provides a description of the Main Programme Data Release v5.1 dated 20.11.2018.
This is an incremental release of the fifth formal release of Main Programme data into the Research Environment. Genomics England will be releasing data on roughly a quarterly basis. Each successive release will incorporate new content, enhance existing content, and enable more effective use of the existing and new data.
This data will be manifested within the current version of Genomics England Research Environment, accessed via the AWS virtual desktop interface and subject to all Genomics England data protection and privacy principles.
Release Overview¶
This release provides clinical data for 85,061 participants, and 80,883 genomes from 70,128 of these participants. Of these genomes, 61,578 are rare disease genomes (from 60,893 participants) and 19,305 are cancer genomes (from 9,236 participants) [1].
Participants¶
Type | Count |
---|---|
Rare Disease | 68,003 |
Cancer | 17,058 |
Total | 85,061 |
Genomes¶
Type | Genomes count | Participant count |
---|---|---|
Cancer Germline | 9,532 | 9,140 |
Cancer Tumour | 9,773 | 9,065 |
Cancer Total | 19,305 | 9,236 |
Rare Disease | 61,578 | 60,893 |
Genomes Total | 80,883 | 70,128 |
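The totals in the two tables above can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the published counts; the variable names are illustrative. Genome counts are additive, whereas participant counts are not, because one participant can contribute more than one genome (for example a germline and a tumour genome).

```python
# Counts copied from the Participants and Genomes tables of this release.
participants = {"Rare Disease": 68_003, "Cancer": 17_058, "Total": 85_061}
genomes = {
    "Cancer Germline": 9_532,
    "Cancer Tumour": 9_773,
    "Cancer Total": 19_305,
    "Rare Disease": 61_578,
    "Genomes Total": 80_883,
}
genome_participants = {
    "Cancer Total": 9_236,
    "Rare Disease": 60_893,
    "Genomes Total": 70_128,
}

# Participant and genome counts should add up to their stated totals.
participants_ok = (
    participants["Rare Disease"] + participants["Cancer"] == participants["Total"]
)
cancer_genomes_ok = (
    genomes["Cancer Germline"] + genomes["Cancer Tumour"] == genomes["Cancer Total"]
)
all_genomes_ok = (
    genomes["Cancer Total"] + genomes["Rare Disease"] == genomes["Genomes Total"]
)

# Participants with genomes are counted once in the overall total, so the
# per-programme counts may exceed it by the number counted in both programmes.
overlap = (
    genome_participants["Cancer Total"]
    + genome_participants["Rare Disease"]
    - genome_participants["Genomes Total"]
)

print(participants_ok, cancer_genomes_ok, all_genomes_ok, overlap)
```

The non-zero `overlap` would be consistent with a participant appearing under both programmes, although the release notes do not state this explicitly.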
- Genomic data are manifested in file shares.
- Clinical data and secondary health data (“medical history”) are manifested in LabKey.
The clinical data provided in the release comprise a broader set of variables than in the November 2018 Main Programme v5 data release. This release seeks to include all variables that contain (or may contain in future) meaningful data whilst not compromising participant privacy.
Some genomic data are currently aligned against the reference genome version GRCh37 and some against version GRCh38. The alignments were also made using different versions of Illumina’s alignment pipeline (V2 and V4), reflecting the versions that were applicable at the time of sequencing. All new genomic data added in the current data release (since July 2018) are aligned against the reference genome version GRCh38, using alignment pipeline V4. The versions for each genome are identified in the Sequencing Report table. We intend to provide consistently realigned and recalled versions of all our genomes in the future.
Audience¶
The intended audience for this document is researchers who have access to the Genomics England Research Environment. This does not include taught students on the MSc Genomic Medicine, who have access to a small subset of Main Programme data.
Identifying this data release¶
The clinical data, secondary data, and tabulated bioinformatic data for this data release, and the paths to the applicable genome files, are found in the following LabKey folder:
main-programme /main-programme_v5.1_2018-11-20
Subsequent releases will be identified by an incremental increase in the version number and the date of data release.
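As a minimal sketch of how the version number and release date could be read back out of a release folder name, the snippet below parses the folder name given above. It assumes that other releases follow the same `main-programme_v<version>_<YYYY-MM-DD>` pattern, which this document does not guarantee; the function name is hypothetical.

```python
import re
from datetime import date

# Pattern inferred from the single folder name given in this document;
# treating it as the general naming scheme is an assumption.
RELEASE_FOLDER = re.compile(
    r"^main-programme_v(\d+)(?:\.(\d+))?_(\d{4})-(\d{2})-(\d{2})$"
)


def parse_release_folder(name):
    """Return ((major, minor), release_date) for a release folder name."""
    match = RELEASE_FOLDER.match(name)
    if match is None:
        raise ValueError(f"Unrecognised release folder name: {name!r}")
    major, minor, year, month, day = match.groups()
    version = (int(major), int(minor or 0))
    return version, date(int(year), int(month), int(day))


version, released = parse_release_folder("main-programme_v5.1_2018-11-20")
print(version, released.isoformat())
```

Returning the version as a tuple of integers lets releases be compared and sorted numerically rather than as strings.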
The main genome sequence files are found in the User’s AWS Home Drive, organised by date. Some of the included genomic data produced by the Genomics England Bioinformatics pipeline (such as rare disease de novo variants or tiering, structural and copy-number variant reports for cancer genomes or internal allele frequency VCFs for cancer genomes) are found in the Genomics England Data Resources (see Section 7.6).
Scope¶
In scope¶
Data that are in scope for this release:
- Cancer and rare disease data for the main programme participants with current consent. These data include:
  - Genomic data for participants when available
  - Whole genome sequencing (WGS) family-based quality control for rare disease, reporting sex checks and pedigree checks
  - Outputs of the Genomics England Bioinformatics rare diseases interpretation pipeline
    - Tiering data
    - GMC outcome data ("exit questionnaire data")
    - Interpretation request data
    - Multi-sample VCF for interpreted genomes
  - Outputs of the Genomics England Bioinformatics cancer interpretation pipeline
    - Gold standard cancer genomes which have been through interpretation and passed quality checks
    - Tumour signature and mutational burden data
    - Annotation and tiering of small variants
    - Tiering, structural and copy number variant report
    - Cancer PCA stats
  - Primary clinical data, including formal pedigree data on rare disease participants where it is available; and
  - Secondary datasets (medical history), including:
    - Hospital Episode Statistics (HES), including HES Admitted Patient Care, HES Adult Critical Care episodes, HES Accident and Emergency and HES Outpatient care.
    - Peripheral datasets, including Diagnostic Imaging Data (DID), Patient Reported Outcome Measures (PROMs), Mental Health Minimum Dataset (MHMDS) and death statistics (Cohort Event Notification and cause of death report)
Out of scope¶
Data that are out of scope for this release:
- Clinical and genomic data for participants that have withdrawn from the 100,000 Genomes Project.
- Participant data from the pilot phases of the project (i.e. not main programme).
- Sources of secondary data other than HES, DID, PROMs, MHMDS and ONS / CEN.
Quality Notes¶
- BAM and VCF genomic data files are as they have been delivered to us by our sequencing provider. These have all passed an initial QC check based on sequencing quality and coverage. They have, however, not all undergone our full in-house genetic checks and we therefore cannot guarantee against genetic versus reported sex and family relationship discrepancies.
- For Rare Disease genomes, it should be noted that all tiered genomes have passed through Genomics England in-house QCs and that all tiered genomes come from the pool of genomes that have had family checks applied to them, as a first step towards Genomics England tiering.
- For Cancer genomes, it should be noted that all gold standard genomes that have been through Genomics England interpretation and passed quality checks are found in the cancer quick view table cancer_analysis.
- Some rare disease families lack a proband due to the availability of data at the time of release. These families without probands will also lack a diagnosis unless there is a second affected individual in the family. The missing data will be made available in a future release.
- Clinical data and secondary data have been provided as submitted and have undergone limited validation.
- Human Phenotype Ontology (HPO) terms may be missing or incomplete for some participants. This will be updated in future releases.
- Formal pedigree data are only available for a subset of rare disease participants. This will be updated in future releases. Each participant’s relationship to their family’s proband is available for all cases; this can be used to determine family relationships instead of formal pedigree data.
- WGS family selection quality checks are provided for rare disease genomes on GRCh38, reporting abnormalities of sex chromosomes, family relatedness, Mendelian inconsistencies and reported vs genetic sex summary checks (only sex checks are unpacked into individual data fields).
Conditions of Use¶
Participants identified as TracerX in the field normalised_consent_form in the participant table must not be used by commercial organisations.
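The condition above can be applied as a simple filter before any analysis carried out by a commercial organisation. The following is only a sketch: the field name `normalised_consent_form` comes from this document, but the `participant_id` column, the example values, and the case-insensitive substring match on "TracerX" are assumptions that should be checked against the Data Dictionary.

```python
def usable_by_commercial(participant_rows):
    """Drop rows whose consent form identifies the participant as TracerX.

    Matching on the substring "tracerx" is an assumption about how the
    value is stored in the participant table.
    """
    kept = []
    for row in participant_rows:
        consent = (row.get("normalised_consent_form") or "").lower()
        if "tracerx" not in consent:
            kept.append(row)
    return kept


# Hypothetical rows for illustration only; these are not real data.
example_rows = [
    {"participant_id": "P1", "normalised_consent_form": "Cancer consent form"},
    {"participant_id": "P2", "normalised_consent_form": "TracerX consent form"},
    {"participant_id": "P3", "normalised_consent_form": None},
]

allowed = usable_by_commercial(example_rows)
print([row["participant_id"] for row in allowed])
```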
Data Release Description¶
The Genomics England data are organised into data views (displayed within LabKey as tables) categorised into Quick View, Common, Rare Disease and Cancer.
The Data Dictionary that describes the table structure and provides data definitions for this release can be found here.
Quick View¶
Data views that bring together data from several LabKey tables for convenient access:
Name of Table / Data View | Description |
---|---|
rare_disease_analysis | Data for all rare disease participants including: sex, ethnicity, disease recruited for and relationship to proband; latest genome build, QC status of latest genome, path to latest genomes and whether tiering data are available; as well as family selection quality checks for rare disease genomes on GRCh38, reporting abnormalities of the sex chromosomes, family relatedness, Mendelian inconsistencies and reported vs genetic sex summary checks. Please note that only sex checks are unpacked into individual data fields; a final status is shown in the “genetic vs reported results” column. |
cancer_analysis | Data for all cancer participants whose genomes have been through Genomics England bioinformatics interpretation and passed quality checks, including: sex, ethnicity, disease recruited for and diagnosis; tumour ID, build of latest genome, QC status of latest genome and path to latest genomes; as well as file paths to the genomes. This table includes information derived from laboratory_sample and cancer_participant_tumour. Tumour Mutational Burden: The table includes the relative proportions of the different mutational signatures demonstrated by the tumour. Analysis of large sequencing datasets (10,952 exomes and 1,048 whole-genomes from 40 distinct tumour types) has allowed patterns of relative contextual frequencies of different SNVs to be grouped into specific mutational signatures. Using mathematical methods (decomposition by non-negative least squares) the contribution of each of these signatures to the overall mutation burden observed in a tumour can be derived. Further details of the 30 different mutational signatures used for this analysis, their prevalence in different tumour types and proposed aetiology can be found at the Sanger Institute Website. Cancer PCA QC Statistics: The cancer analysis pipeline employs a sequencing quality control check which selects several important statistics associated with the sequencing returned by the sequencing provider, and uses them to check whether or not the sample in question is an outlier with respect to previous samples that have been run through the pipeline. It is, in effect, a safety net that can spot issues that have occurred at the tissue collection stage (i.e. at the GMC (Genomic Medicine Centre)) or at the library preparation step (i.e. at the sequencing provider), both of which may impact upon the final genomic analysis returned to the clinician. |
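To illustrate what "decomposition by non-negative least squares" means for the mutational signature proportions described in the cancer_analysis row, here is a toy sketch. It is not the Genomics England pipeline: it uses three made-up signatures over four mutation contexts (the real analysis uses 30 signatures), and it solves the non-negative least squares problem with plain projected gradient descent rather than a dedicated solver, purely to keep the example dependency-free.

```python
def nnls_projected_gradient(signatures, observed, iterations=2000):
    """Minimise ||S x - m||^2 subject to x >= 0 by projected gradient descent.

    signatures: one list per signature, giving that signature's probability
                for each mutation context.
    observed:   mutation counts per context for one tumour.
    """
    n_sig = len(signatures)
    n_ctx = len(observed)
    # Safe step size: 1 / trace(S^T S), which never exceeds
    # 1 / (largest eigenvalue of S^T S).
    step = 1.0 / sum(value * value for sig in signatures for value in sig)
    exposures = [0.0] * n_sig
    for _ in range(iterations):
        fitted = [
            sum(exposures[k] * signatures[k][c] for k in range(n_sig))
            for c in range(n_ctx)
        ]
        residual = [fitted[c] - observed[c] for c in range(n_ctx)]
        gradient = [
            sum(signatures[k][c] * residual[c] for c in range(n_ctx))
            for k in range(n_sig)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        exposures = [
            max(0.0, exposures[k] - step * gradient[k]) for k in range(n_sig)
        ]
    return exposures


# Hypothetical signatures (each sums to 1 over four contexts).
toy_signatures = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
# Counts built as 70 mutations from signature 1 plus 30 from signature 2.
toy_counts = [52, 28, 10, 10]

exposures = nnls_projected_gradient(toy_signatures, toy_counts)
total = sum(exposures)
proportions = [value / total for value in exposures]
print([round(p, 3) for p in proportions])
```

The `proportions` list plays the role of the relative signature contributions reported in the table; the non-negativity constraint is what prevents a signature from being assigned a negative share of the mutation burden.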
Common¶
Data views that are common to both the rare disease and the cancer domains. These data pertain to sample handling, genome sequencing, and participant data.
Data Relating to Participants:
Name of Table / Data View | Description |
---|---|
participant | Data on each individual participant in the 100,000 Genomes Project, e.g. personal information (such as relatives or self-reported ethnicity); points of contact with the Project (e.g. handling Genomic Medicine Centre or Trust); and a record of the status of their clinical review. |
sequencing_report | For each participant in the 100,000 Genomes Project, this table contains data describing the sequencing of their genome(s) and associated output, as well as the sample type that the sequence is from. |
domain_assignment | For each participant in the 100,000 Genomes Project, this table contains: data describing the disease type to which they were recruited; the disease panel applied to their genome; the GECIP domain to which their genome has been assigned for the purposes of administering the GECIP publication moratorium; as well as the end date of the GECIP moratorium associated with their genome(s). |
genome_file_paths_and_types | Data that specifies the genomic files and their folder locations for a given participant. |
Data Relating to Samples:
Name of Table / Data View | Description |
---|---|
clinic_sample | Data describing the taking and handling of participant samples at the Genomic Medicine Centres, i.e. in the clinic, as well as the type of samples obtained. Because of the complexities of handling and managing tumour tissue samples in a clinical setting, there are many fields that are cancer-specific. |
clinic_sample_quality_check_result | Data describing the quality control of obtaining and handling participant samples at the Genomic Medicine Centres, i.e. in the clinic. |
laboratory_sample | Data describing the handling of samples at the biorepository and in preparation for sequencing, as well as the type of sample. |
Rare Diseases¶
Rare Disease data are presented at the level of Rare Disease families (families of probands), Rare Disease pedigrees, and participants. Participants are individuals who have consented to be part of the project with the expectation that a sample of their DNA will be obtained and their genome sequenced. Pedigree members are extended members of the proband’s family; they include participants as well as relatives for whom small amounts of de-identified data are recorded to allow a full picture of the proband’s extended family. This additional information is extracted from the proband’s medical record.
All Rare Disease table names are prefixed with “rare_diseases_”.
Data at the Level of Rare Disease Families:
Name of Table / Data View | Description |
---|---|
rare_diseases_family | Data describing the families of rare disease probands participating in the 100,000 Genomes Project. It includes the family group type, the status of the family’s pre-interpretation clinical review and the settings that were chosen for the interpretation pipeline at the clinical review. |
rare_diseases_pedigree | Data describing the Rare Disease participants, linking pedigrees to probands and their family members. |
rare_diseases_pedigree_member | Data describing the Rare Disease pedigree members, similar to the data about each individual participant in the COMMON data view. It includes some additional data, such as the age of onset of predominant clinical features; data on links to other family members; as well as data collected only for Phenotypes. |
Data at the Level of Rare Disease Participants:
The data presented in these tables provide information on disease progression and pertinent medical history:
Name of Table / Data View | Description |
---|---|
rare_diseases_participant_disease | Data describing the rare disease participants' disease type/subtype assigned to them upon enrolment, and the date of diagnosis. |
rare_diseases_participant_phenotype | Data describing the Rare Disease participants’ phenotypes. For each Rare Disease participant in the 100,000 Genomes Project, there are data about whether a phenotypic abnormality as defined by an HPO term is present and what the HPO term is, as well as the age of onset, the severity of manifestation, the spatial pattern in the body and whether it is progressive or not. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_gen_measurement | For Rare Disease participants in the 100,000 Genomes Project, this table contains general measurements relevant to the disease, alongside the date that the measurements were taken on. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_early_childhood_observation | For Rare Disease participants in the 100,000 Genomes Project, this table contains measurements and milestones provided by the GMCs, related to childhood development. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_imaging | For Rare Disease participants in the 100,000 Genomes Project, this table contains various data and measurements from past scans, alongside the date of the scans. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_invest_genetic | For Rare Disease participants in the 100,000 Genomes Project, this table contains information on any genetic tests carried out. Data characterising the genetic investigation is recorded alongside records of the sample tissue source and the type of testing laboratory. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_invest_genetic_test_result | For Rare Disease participants in the 100,000 Genomes Project, this table contains the results of any genetic tests carried out. Following on from the rare_diseases_invest_genetic table, a summary of the results is presented and contextualised by testing method and scope. Please note that these data are only available for a subset of the rare disease participants. |
rare_diseases_invest_blood_laboratory_test_report | For Rare Disease participants in the 100,000 Genomes Project, this table contains the results of any blood tests carried out. Over 400 blood values are recorded alongside type and technique of testing and the status of the participating patient in the care pathway. Please note that these data are only available for a subset of the rare disease participants. |
Data output from the Genomics England interpretation pipeline
Name of Table / Data View | Description |
---|---|
panels_applied | For each participant of the 100,000 Genomes Project, this table contains the name and version of the panel(s) that was applied to his or her genome. |
tiering | |
tiered_variants_frequency | This table contains the frequencies of each tiered variant for every Project participant for whom we provide tiered variants. |
gmc_exit_questionnaire | Data reporting back from the Genomic Medicine Centres, for variants reported to them by Genomics England, to what extent a family’s presenting case can be explained by the combined variants reported to them (including any segregation testing performed); confidence in the identification and pathogenicity of each variant; and the clinical validity of each variant or variant pair in general and clinical utility in a specific case (only the most recent update will be shown and only one questionnaire per report). |
Cancer¶
Cancer data are presented for either the patient level cancer diagnosis or “disease type” or the tumour specific sample details of participants in the Cancer arm of the 100,000 Genomes Project.
Data Relating to Cancer Participants:
Name of Table / Data View | Description |
---|---|
cancer_participant_disease | For each cancer participant in the 100,000 Genomes Project, this table includes data about their cancer disease type and subtype. |
cancer_participant_tumour | For each cancer participant’s tumour in the 100,000 Genomes Project, this table contains data that characterises the tumour, e.g. staging and grading; morphology and location; recurrence at time of enrolment; and the basis of diagnosis. |
cancer_participant_tumour_metastatic_site | For each cancer participant in the 100,000 Genomes Project, this table contains the site of their metastatic disease in the body (if applicable) at diagnosis. |
cancer_care_plan | For a proportion of cancer participants in the 100,000 Genomes Project, this table contains information from their NHS cancer care plan on their treatment and care intent, in particular outcomes of MDT meetings and coded connected data (e.g. diagnoses from scans). |
cancer_surgery | For a proportion of cancer participants in the 100,000 Genomes Project, this table contains details of the surgical procedures performed, as well as the specific location of the intervention. |
cancer_risk_factor_general | For a proportion of cancer participants in the 100,000 Genomes Project, this table contains data on general cancer risk factors, namely smoking status, height, weight and alcohol consumption. This table was compiled with input from GECIP members. |
cancer_risk_factor_cancer_specific | For a proportion of cancer participants in the 100,000 Genomes Project, this table contains data on specific risk factors related to particular cancer types. This table was compiled with input from GECIP members. |
cancer_invest_imaging | For a proportion of cancer participants in the 100,000 Genomes Project, this table contains: coded data on imaging investigations characterising the scan, its modality, anatomical site and outcome; as well as the outcome of the imaging report in free text form. |
Data derived from or relating to tumour samples:
Name of Table / Data View | Description |
---|---|
cancer_invest_sample_pathology | For a subset of cancer participants in the 100,000 Genomes Project, this table contains full pathology reports and other related data on and from their tumour samples around diagnosis and characterisation of the cancer. Please note that much of this information is also found in the clinic_sample and cancer_participant_tumour tables. |
cancer_specific_pathology | For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains pathology data specific to that participant’s cancer type. This may provide additional data to the cancer_invest_sample_pathology and cancer_participant_tumour tables. |
cancer_systemic_anti_cancer_therapy | For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains details of the regimen and intent of the patients’ chemotherapy. |
cancer_invest_circulating_tumour_marker | For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains biomarker measurements specific to particular cancer types. |
Secondary Data from NHS England¶
These are data that cover patient activity around hospital admissions and follow NHS England models and data dictionary. The Data Dictionary that describes the table structure and provides data definitions for this release can be found here.
Hospital Episode Statistics Datasets¶
Name of Table / Data View | Description |
---|---|
APC (Admitted Patient Care finished consultant episodes) | This table contains (alongside participant IDs and demographic information) information on finished consultant episodes during a patient’s hospital admission spell (with multiple episodes per patient), including episodes and spells of admitted care; administrative data on admission and discharge (e.g. period of care, source and destination); diagnosis and procedure codes; data describing maternity-related episodes; as well as details on the trusts / organisations and the practitioners / consultants involved. |
AC (Adult Critical Care episodes) | This table, which is linked to APC, contains (alongside participant IDs) information on patients’ admission to critical care, including administrative data on admission and discharge (as for APC); as well as the number of support days per system affected and the level of critical care per number of days. |
A & E (Accident and Emergency Department episodes) | This table contains (alongside participant IDs and demographic information) information on patients’ admission to accident and emergency departments, including administrative data on admission and discharge (as for APC); diagnostic, treatment and investigation codes; as well as trust / consultant details on the providers and referrers for treatment and residence; and invoicing fields such as duration and conclusion of the admission. |
OP (Outpatient appointments) | This table includes (alongside participant IDs and demographic information) administrative information on the outpatient appointments; data on attendance outcomes; diagnostic and procedure codes; as well as details on the trusts / organisations and the practitioners / consultants involved. |
Peripheral Datasets¶
Name of Table / Data View | Description |
---|---|
DID (Diagnostic Imaging Data) | This table includes data that constitute records of diagnostic imaging ordered for patients within a hospital, including (alongside participant IDs and demographic information) imaging codes and information on the exact imaging modalities, as well as details on the providers / commissioners involved. |
MHMD_v4_Record | Mental Health Minimum Dataset. This table contains information on mental health care spells per patient and provider. Please note that this data is available until 31.4.2014. |
MHMD_v4_Event | This table contains information on mental health care episodes within a single care spell. Please note that this data is available until 31.4.2014. |
MHMD_v4_Episode | This table contains information on mental health events on a day within a care spell. Please note that this data is available until 31.4.2014. |
PROMS | Patient Reported Outcome Measures. This table includes data from patient questionnaires completed before and after groin hernia operations, hip replacements, knee replacements and varicose vein operations. Please note that new PROMS data has not been received for this release. |
CEN | Cohort Event Notification. This table contains records of events that may be death or cancer registration. Death events may be notified in advance of linked ONS cause of death reports. |
ONS (Cause of death report) | This table contains information from cohort event notification, including whether death has occurred due to cancer and some detail on the cause of death in ICD-10 codes. |
Genomics England Data Resources¶
Genomics England data resources are available in the following locations:
From the AWS desktop:
~/gel_data_resources/
From the high performance compute (HPC) cluster:
/gel_data_resources/
The data resources available here are:
Tiering data for rare disease: Tiering data are available for rare disease participants who have been through the Genomics England interpretation platform. These data provide information on the pathogenicity of variants that have been identified in the proband’s genome. Tiering data for rare disease probands can also be found in the designated LabKey table outlined above.
GMC exit questionnaires for rare disease: Outcomes questionnaire for interpreted genomes generated by Genomics England and Clinical Interpretation Providers.
Interpretation request data for rare disease: The following information can be found within the interpretation request JSON file: Family Pedigree and Other Family History, Analysis Panels and versions, Specific Disorder, Tiered Variants and Tiering version, HPO terms, Workspace (NHS GMC or LDP site code), Gene Panel Coverage, Disease Penetrance, Variant Classification.
Multi-sample VCFs for interpreted rare disease genomes: Variant call files by family using Platypus variant caller software.
Tiering, structural, and copy-number variant reports for Cancer: Annotated in JSON format. The file paths are available in the Quick View titled cancer_analysis.
Internal allele frequency VCFs for cancer genomes: Provided as VCFs
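Because the data resources are mounted at different paths on the AWS desktop and on the HPC cluster, a script that runs in both places needs to work out which location is present. The sketch below tries the two paths listed above in order; the helper name `find_data_resources` is hypothetical and nothing is assumed about the sub-folder layout.

```python
from pathlib import Path

# The two locations given in this document: AWS desktop first, then HPC.
CANDIDATE_ROOTS = ("~/gel_data_resources/", "/gel_data_resources/")


def find_data_resources(candidates=CANDIDATE_ROOTS):
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        if path.is_dir():
            return path
    return None


resources_root = find_data_resources()
if resources_root is None:
    print("gel_data_resources not found; are you inside the Research Environment?")
else:
    print(f"Using data resources at {resources_root}")
```

Returning `None` rather than raising keeps the decision about how to handle a missing mount with the calling script.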
Contact and Support¶
For all queries relating to this data release please contact the Genomics England Service Desk portal: Service Desk (accessible from outside the Research Environment). The Service Desk is supported by dedicated Genomics England staff for all relevant questions.
[1] This excludes 31 TracerX genomes from 16 participants (refer to 6.4 for further information).