100kGP Release V5 (31/10/2018)

Data dictionary

Purpose

This document provides a description of the Main Programme Data Release v5 dated 31.10.2018.

This is the fifth formal release of Main Programme data into the Research Environment. Genomics England will be releasing data on roughly a quarterly basis. Each successive release will incorporate new content, enhance existing content, and enable more effective use of the existing and new data.

This data will be manifested within the current version of Genomics England Research Environment, accessed via the AWS virtual desktop interface and subject to all Genomics England data protection and privacy principles.

Release Overview

This release provides clinical data for 85,070 participants, and 71,860 genomes from 62,487 of these participants. Of these genomes, 54,456 are rare disease genomes (from 54,138 participants) and 17,404 are cancer genomes (from 8,349 participants) [1].

Participants

Type | Count
Rare Disease | 68,008
Cancer | 17,062
Total | 85,070

Genomes

Type | Genomes count | Participant count
Cancer Germline | 8,614 | 8,272
Cancer Tumour | 8,790 | 8,222
Cancer Total | 17,404 | 8,349
Rare Disease | 54,456 | 54,138
Genomes Total | 71,860 | 62,487

  • Genomic data are manifested in file shares.
  • Clinical data and secondary health data (“medical history”) are manifested in LabKey.
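
As a quick sanity check, the totals quoted in this overview can be recomputed from their component rows. The sketch below copies the figures from this release note; the variable names are illustrative only and do not correspond to any LabKey field.

```python
# Counts copied from the Release Overview of this document (v5, 31/10/2018).
participants = {"Rare Disease": 68_008, "Cancer": 17_062}
genomes = {"Cancer Germline": 8_614, "Cancer Tumour": 8_790, "Rare Disease": 54_456}
sequenced_participants = {"Cancer": 8_349, "Rare Disease": 54_138}

total_participants = sum(participants.values())
cancer_genomes = genomes["Cancer Germline"] + genomes["Cancer Tumour"]
total_genomes = sum(genomes.values())
total_sequenced = sum(sequenced_participants.values())

print(total_participants, cancer_genomes, total_genomes, total_sequenced)
# prints: 85070 17404 71860 62487
```

Note that the cancer participant count (8,349) is not the sum of the germline and tumour participant counts, because most cancer participants contribute both a germline and a tumour genome.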

The clinical data provided in this release comprise a broader set of variables than in the previous Main Programme data release. This release seeks to include all variables that contain (or may contain in future) meaningful data whilst not compromising participant privacy.

Some genomic data are currently aligned against the reference genome version GRCh37 and some against version GRCh38. The alignments were also made using different versions of Illumina’s alignment pipeline (V2 and V4), reflecting the versions that were applicable at the time of sequencing. All new genomic data added in the current data release (since July 2018) are aligned against the reference genome version GRCh38, using alignment pipeline V4. The versions for each genome are identified in the Sequencing Report table. We intend to provide consistently realigned and recalled versions of all our genomes in the future.
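
Because genomes in this release are split across two reference builds, analyses will usually need to separate them before comparing variants. The following is a minimal sketch, assuming the Sequencing Report table has been exported to a list of records; the field names `participant_id`, `genome_build` and `pipeline_version`, and the example rows, are hypothetical and should be checked against the Data Dictionary.

```python
from collections import Counter

# Hypothetical rows standing in for an export of the Sequencing Report table.
# Field names and values are assumptions made for illustration.
rows = [
    {"participant_id": "P001", "genome_build": "GRCh37", "pipeline_version": "V2"},
    {"participant_id": "P002", "genome_build": "GRCh38", "pipeline_version": "V4"},
    {"participant_id": "P003", "genome_build": "GRCh38", "pipeline_version": "V4"},
]

# Count genomes per reference build.
genomes_per_build = Counter(row["genome_build"] for row in rows)

# Keep only genomes aligned against GRCh38.
grch38_participants = [
    row["participant_id"] for row in rows if row["genome_build"] == "GRCh38"
]

print(dict(genomes_per_build))
print(grch38_participants)
```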

Audience

The intended audience for this document is researchers who have access to the Genomics England Research Environment. This does not include taught students on the MSc Genomic Medicine, who have access to a small subset of Main Programme data.

Identifying this data release

The clinical data, secondary data, and tabulated bioinformatic data for this data release, and the paths to the applicable genome files, are found in the following LabKey folder:

main-programme/main-programme_v5_2018-10-31

Subsequent releases will be identified by an incremental increase in the version number and the date of data release.
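
The folder name above combines a version number and a release date. The sketch below builds and parses names of that shape; the pattern is inferred from the single folder name given in this document and is an assumption, so future releases may not follow it exactly. The function names are hypothetical.

```python
import re
from datetime import date
from typing import Tuple

# Pattern inferred from "main-programme_v5_2018-10-31" (an assumption).
FOLDER_PATTERN = re.compile(r"^main-programme_v(\d+)_(\d{4})-(\d{2})-(\d{2})$")


def release_folder(version: int, released: date) -> str:
    """Build a release folder name from a version number and a release date."""
    return f"main-programme_v{version}_{released.isoformat()}"


def parse_release_folder(name: str) -> Tuple[int, date]:
    """Split a release folder name into its version number and release date."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised release folder name: {name!r}")
    version, year, month, day = (int(group) for group in match.groups())
    return version, date(year, month, day)


print(release_folder(5, date(2018, 10, 31)))
print(parse_release_folder("main-programme_v5_2018-10-31"))
```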

The main genome sequence files are found in the User’s AWS Home Drive, organised by date. Some of the included genomic data produced by the Genomics England Bioinformatics pipeline (such as rare disease de novo variants or tiering, structural and copy-number variant reports for cancer genomes or internal allele frequency VCFs for cancer genomes) are found in the Genomics England Data Resources (see Section 8.5).

Scope

In scope

Data that are in scope for this release:

  • Cancer and rare disease data for the main programme participants with current consent. These data include:
  • Genomic data for participants when available
  • WGS family-based quality control for rare disease, reporting sex checks and pedigree checks
  • Outputs of the Genomics England Bioinformatics rare diseases interpretation pipeline
    • Tiering data - rare disease
    • GMC outcome data ("exit questionnaire data") - rare disease
    • Interpretation request data - rare disease
    • Multi-sample VCF for interpreted genomes - rare disease
  • Outputs of the Genomics England Bioinformatics cancer interpretation pipeline
    • Gold standard cancer genomes which have been through interpretation and passed quality checks
    • Tumour signature and mutational burden data - cancer
    • Annotation and tiering of small variants - cancer
    • Tiering, structural and copy number variant report
    • Cancer PCA stats
    • Internal allele frequency VCF - cancer
  • Primary clinical data, including formal pedigree data on rare disease participants where it is available; and
  • Secondary datasets (medical history), including:
    • Hospital Episode Statistics (HES), including HES Admitted Patient Care, HES Adult Critical Care episodes, HES Accident and Emergency and HES Outpatient care.
    • Diagnostic Imaging Dataset (DID)
    • Patient Reported Outcome Measures (PROMs)
    • Mental Health Services Data Set (MHSDS)
    • Office for National Statistics (ONS)

Out of scope

Data that are out of scope for this release:

  • Clinical and genomic data for participants that have withdrawn from the 100,000 Genomes Project.
  • Participant data from the pilot phases of the project (i.e. not main programme).
  • Secondary data updates post January 2018
  • Sources of secondary data other than HES, DID, PROMs, MHSDS and ONS / CEN.

Quality Notes

  • BAM and VCF genomic data files are as they have been delivered to us by our sequencing provider. These have all passed an initial QC check based on sequencing quality and coverage. They have, however, not all undergone our full in-house genetic checks and we therefore cannot guarantee against genetic versus reported sex and family relationship discrepancies.
  • For Rare Disease genomes, you should note that all tiered genomes have passed through Genomics England in-house QCs and that all tiered genomes come from the pool of genomes that have had family checks applied to them, as a first step towards Genomics England tiering.
  • For Cancer genomes, you should note that all gold standard genomes that have been through Genomics England interpretation and passed quality checks are found in the cancer quick view table cancer_analysis.
  • Some rare disease families lack a proband due to the availability of data at the time of release. These families without probands will also lack a diagnosis unless there is a second affected individual in the family. The missing data will be made available in a future release.
  • Clinical data and secondary data have been provided as submitted and have undergone limited validation.
  • Human Phenotype Ontology (HPO) terms may be missing or incomplete for some participants. This will be updated in future releases.
  • Formal pedigree data are only available for a subset of rare disease participants. This will be updated in future releases. Each participant’s relationship to their family’s proband is available for all cases; this can be used to determine family relationships instead of formal pedigree data.
  • WGS family selection quality checks are provided for rare disease genomes on GRCh38, reporting abnormalities of sex chromosomes, family relatedness, Mendelian inconsistencies and reported vs genetic sex summary checks (only sex checks are unpacked into individual data fields).
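
Two of the notes above can be combined in practice: each participant’s relationship to the proband can stand in for formal pedigree data, and families without a proband can be flagged before analysis. The sketch below assumes participant records exported as dictionaries; the field names `family_id`, `participant_id` and `relationship_to_proband`, and the value "Proband", are hypothetical placeholders rather than the actual LabKey column names.

```python
from collections import defaultdict

# Hypothetical participant records; field names and values are assumptions.
records = [
    {"family_id": "F1", "participant_id": "P1", "relationship_to_proband": "Proband"},
    {"family_id": "F1", "participant_id": "P2", "relationship_to_proband": "Mother"},
    {"family_id": "F1", "participant_id": "P3", "relationship_to_proband": "Father"},
    {"family_id": "F2", "participant_id": "P4", "relationship_to_proband": "Sibling"},
]

# Group members by family, keeping each member's relationship to the proband.
families = defaultdict(dict)
for record in records:
    families[record["family_id"]][record["participant_id"]] = record[
        "relationship_to_proband"
    ]

# Flag families that have no proband in this release.
families_without_proband = sorted(
    family_id
    for family_id, members in families.items()
    if "Proband" not in members.values()
)

print(dict(families))
print(families_without_proband)
```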

Conditions of Use

Participants identified as TracerX in the field normalised_consent_form in the participant table must not be used by commercial organisations.
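
A commercial user could apply this condition as a filter before any analysis. The sketch below is illustrative only: the field name normalised_consent_form comes from this document, but the example values and the case-insensitive substring match are assumptions, and the actual coding of the field should be confirmed in the Data Dictionary.

```python
def exclude_tracerx(participants):
    """Return only participants whose consent form is not marked as TracerX.

    The match is a case-insensitive substring test, which is an assumption
    about how the value is recorded in normalised_consent_form.
    """
    return [
        p
        for p in participants
        if "tracerx" not in (p.get("normalised_consent_form") or "").lower()
    ]


# Hypothetical rows standing in for an export of the participant table.
participants = [
    {"participant_id": "P001", "normalised_consent_form": "Cancer consent v2"},
    {"participant_id": "P002", "normalised_consent_form": "TracerX"},
    {"participant_id": "P003", "normalised_consent_form": None},
]

permitted = exclude_tracerx(participants)
print([p["participant_id"] for p in permitted])
```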

Data Release Description

The Genomics England data are organised into data views (displayed within LabKey as tables) categorised into Quick View, Common, Rare Disease and Cancer.

The Data Dictionary that describes the table structure and provides data definitions for this release can be found here.

Quick View

Data views that bring together data from several LabKey tables for convenient access:

Name of Table / Data View Description
rare_disease_analysis Data for all rare disease participants including: sex, ethnicity, disease recruited for and relationship to proband; latest genome build, QC status of latest genome, path to latest genomes and whether tiering data are available; as well as family selection quality checks for rare disease genomes on GRCh38, reporting abnormalities of the sex chromosomes, family relatedness, Mendelian inconsistencies and reported vs genetic sex summary checks. Please note that only sex checks are unpacked into individual data fields; a final status is shown in the “genetic vs reported results” column.
cancer_analysis Data for all cancer participants whose genomes have been through Genomics England bioinformatics interpretation and passed quality checks, including: sex, ethnicity, disease recruited for and diagnosis; tumour ID, build of latest genome, QC status of latest genome and path to latest genomes; as well as file paths to the genomes. This table includes information derived from laboratory_sample and cancer_participant_tumour.
Tumour Mutational Burden
The table includes the relative proportions of the different mutational signatures demonstrated by the tumour. Analysis of large sequencing datasets (10,952 exomes and 1,048 whole-genomes from 40 distinct tumour types) has allowed patterns of relative contextual frequencies of different SNVs to be grouped into specific mutational signatures. Using mathematical methods (decomposition by non-negative least squares) the contribution of each of these signatures to the overall mutation burden observed in a tumour can be derived. Further details of the 30 different mutational signatures used for this analysis, their prevalence in different tumour types and proposed aetiology can be found at the Sanger Institute Website.
Cancer PCA QC Statistics
The cancer analysis pipeline employs a sequencing quality control check which selects several important statistics associated with the sequencing returned by the sequencing provider, and uses them to check whether or not the sample in question is an outlier with respect to previous samples that have been run through the pipeline. It is, in effect, a safety net that can spot issues that have occurred at the tissue collection stage (i.e. at the GMC (Genomic Medicine Centre)) or at the library preparation step (i.e. at the sequencing provider), both of which may impact upon the final genomic analysis returned to the clinician.
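
To illustrate the non-negative least squares decomposition mentioned for the mutational signature data above, the sketch below fits a toy tumour profile as a non-negative mixture of two made-up signatures. It is not the Genomics England pipeline: it uses a simple projected gradient descent in place of a dedicated NNLS solver (in practice a library routine such as scipy.optimize.nnls would normally be used), and the signatures, contexts and counts are invented for the example.

```python
def nnls_projected_gradient(signatures, observed, step=1.0, iterations=500):
    """Minimise 0.5 * ||S x - m||^2 subject to x >= 0.

    signatures: list of signature profiles, each a list of context frequencies.
    observed:   list of observed mutation counts per context.
    The step size must be below 2 / (largest eigenvalue of S^T S); 1.0 is
    valid for the toy data below but is not a general-purpose choice.
    """
    n_sig = len(signatures)
    n_ctx = len(observed)
    x = [0.0] * n_sig
    for _ in range(iterations):
        fitted = [
            sum(x[k] * signatures[k][c] for k in range(n_sig)) for c in range(n_ctx)
        ]
        residual = [fitted[c] - observed[c] for c in range(n_ctx)]
        gradient = [
            sum(signatures[k][c] * residual[c] for c in range(n_ctx))
            for k in range(n_sig)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[k] - step * gradient[k]) for k in range(n_sig)]
    return x


# Two invented signatures over four mutation contexts (each sums to 1).
signatures = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Toy tumour built as 30 mutations from signature 1 plus 10 from signature 2.
observed = [22.0, 4.0, 4.0, 10.0]

exposures = nnls_projected_gradient(signatures, observed)
proportions = [value / sum(exposures) for value in exposures]
print([round(value, 3) for value in exposures])
print([round(value, 3) for value in proportions])
```

The relative proportions are obtained by dividing each fitted contribution by their sum; here they recover the 75% / 25% split used to construct the toy profile.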

Common

Data views that are common to both the rare disease and the cancer domains. This data pertains to sample handling, genome sequencing, and participant data.

Data Relating to Participants:

Name of Table / Data View Description
participant Data on each individual participant in the 100,000 Genomes Project, e.g. personal information (such as relatives or self-reported ethnicity); points of contact with the Project (e.g. handling Genomic Medicine Centre or Trust); and a record of the status of their clinical review.
sequencing_report For each participant in the 100,000 Genomes Project, this table contains data describing the sequencing of their genome(s) and associated output, as well as the sample type that the sequence is from.
domain_assignment For each participant in the 100,000 Genomes Project, this table contains: data describing the disease type to which they were recruited; the disease panel applied to their genome; the GECIP domain to which their genome has been assigned for the purposes of administering the GECIP publication moratorium; as well as the end date of the GECIP moratorium associated with their genome(s).
genome_file_paths_and_types Data that specifies the genomic files and their folder locations for a given participant.

Data Relating to Samples:

Name of Table / Data View Description
clinic_sample Data describing the taking and handling of participant samples at the Genomic Medicine Centres, i.e. in the clinic, as well as the type of samples obtained. Because of the complexities of handling and managing tumour tissue samples in a clinical setting, there are many fields that are cancer-specific.
clinic_sample_quality_check_result Data describing the quality control of obtaining and handling participant samples at the Genomic Medicine Centres, i.e. in the clinic.
laboratory_sample Data describing the handling of samples at the biorepository and in preparation for sequencing, as well as the type of sample.

Rare Diseases

Rare Disease data are presented at the level of Rare Disease families (families of probands), Rare Disease pedigrees, and participants. Participants are individuals who have consented to be part of the project with the expectation that a sample of their DNA will be obtained and their genome sequenced. Pedigree members are extended members of the proband’s family. This includes participants as well as small amounts of deidentified data recorded to allow a full picture of the proband’s extended family. This additional information is extracted from the proband’s medical record.

All Rare Disease table names are prefixed with “rare_diseases_”.

Data at the Level of Rare Disease Families:

Name of Table / Data View Description
rare_diseases_family Data describing the families of rare disease probands participating in the 100,000 Genomes Project. It includes the family group type, the status of the family’s pre-interpretation clinical review and the settings that were chosen for the interpretation pipeline at the clinical review.
rare_diseases_pedigree Data describing the Rare Disease participants, linking pedigrees to probands and their family members.
rare_diseases_pedigree_member Data describing the Rare Disease pedigree members, similar to the data about each individual participant in the COMMON data view. It includes some additional data, such as the age of onset of predominant clinical features; data on links to other family members; as well as data collected only for Phenotypes.

Data at the Level of Rare Disease Participants:

The data presented in these tables provides information on disease progression and pertinent medical history:

Name of Table / Data View Description
rare_diseases_participant_disease Data describing the rare disease participants' disease type/subtype assigned to them upon enrolment, and the date of diagnosis.
rare_diseases_participant_phenotype Data describing the Rare Disease participants’ phenotypes. For each Rare Disease participant in the 100,000 Genomes Project, there are data about whether a phenotypic abnormality as defined by an HPO term is present and what the HPO term is, as well as the age of onset, the severity of manifestation, the spatial pattern in the body and whether it is progressive or not. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_gen_measurement For Rare Disease participants in the 100,000 Genomes Project, this table contains general measurements relevant to the disease, alongside the date that the measurements were taken on. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_early_childhood_observation For Rare Disease participants in the 100,000 Genomes Project, this table contains measurements and milestones provided by the GMCs, related to childhood development. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_imaging For Rare Disease participants in the 100,000 Genomes Project, this table contains various data and measurements from past scans, alongside the date of the scans. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_invest_genetic For Rare Disease participants in the 100,000 Genomes Project, this table contains information on any genetic tests carried out. Data characterising the genetic investigation is recorded alongside records of the sample tissue source and the type of testing laboratory. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_invest_genetic_test_result For Rare Disease participants in the 100,000 Genomes Project, this table contains the results of any genetic tests carried out. Following on from the rare_diseases_invest_genetic table, a summary of the results is presented and contextualised by testing method and scope. Please note that these data are only available for a subset of the rare disease participants.
rare_diseases_invest_blood_laboratory_test_report For Rare Disease participants in the 100,000 Genomes Project, this table contains the results of any blood tests carried out. Over 400 blood values are recorded alongside type and technique of testing and the status of the participating patient in the care pathway. Please note that these data are only available for a subset of the rare disease participants.

Data output from the Genomics England interpretation pipeline

Name of Table / Data View Description
panels_applied For each participant of the 100,000 Genomes Project, this table contains the name and version of the panel(s) that were applied to their genome.
tiering
tiered_variants_frequency This table contains the frequencies of each tiered variant for every Project participant for whom we provide tiered variants.
gmc_exit_questionnaire Data reported back by the Genomic Medicine Centres on the variants reported to them by Genomics England: the extent to which a family’s presenting case can be explained by the combined variants reported (including any segregation testing performed); confidence in the identification and pathogenicity of each variant; the clinical validity of each variant or variant pair in general; and its clinical utility in the specific case. Only the most recent update is shown, and only one questionnaire per report.

Cancer

Cancer data are presented either at the patient level (the cancer diagnosis, or “disease type”) or at the level of tumour-specific sample details for participants in the Cancer arm of the 100,000 Genomes Project.

Data Relating to Cancer Participants:

Name of Table / Data View Description
cancer_participant_disease For each cancer participant in the 100,000 Genomes Project, this table includes data about their cancer disease type and subtype.
cancer_participant_tumour For each cancer participant’s tumour in the 100,000 Genomes Project, this table contains data that characterises the tumour, e.g. staging and grading; morphology and location; recurrence at time of enrolment; and the basis of diagnosis.
cancer_participant_tumour_metastatic_site For each cancer participant in the 100,000 Genomes Project, this table contains the site of their metastatic disease in the body (if applicable) at diagnosis.
cancer_care_plan For a proportion of cancer participants in the 100,000 Genomes Project, this table contains information from their NHS cancer care plan on their treatment and care intent, in particular outcomes of MDT meetings and coded connected data (e.g. diagnoses from scans).
cancer_surgery For a proportion of cancer participants in the 100,000 Genomes Project, this table contains details of the surgical procedures performed, as well as the specific location of the intervention.
cancer_risk_factor_general For a proportion of cancer participants in the 100,000 Genomes Project, this table contains data on general cancer risk factors, namely smoking status, height, weight and alcohol consumption. This table was compiled with input from GECIP members.
cancer_risk_factor_cancer_specific For a proportion of cancer participants in the 100,000 Genomes Project, this table contains data on specific risk factors related to particular cancer types. This table was compiled with input from GECIP members.
cancer_invest_imaging For a proportion of cancer participants in the 100,000 Genomes Project, this table contains: coded data on imaging investigations characterising the scan, its modality, anatomical site and outcome; as well as the outcome of the imaging report in free text form.

Data derived from or relating to tumour samples:

Name of Table / Data View Description
cancer_invest_sample_pathology For a subset of cancer participants in the 100,000 Genomes Project, this table contains full pathology reports and other related data on and from their tumour samples around diagnosis and characterisation of the cancer. Please note that much of this information is also found in the clinic_sample and cancer_participant_tumour tables.
cancer_specific_pathology For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains pathology data specific to that participant’s cancer type. This may provide additional data to the cancer_invest_sample_pathology and cancer_participant_tumour tables.
cancer_systemic_anti_cancer_therapy For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains details of the regimen and intent of the patients’ chemotherapy.
cancer_invest_circulating_tumour_marker For a subset of tumours from cancer participants in the 100,000 Genomes Project, this table contains biomarker measurements specific to particular cancer types.

Genomics England Data Resources

Genomics England data resources are available in the following locations:

From the AWS desktop:

~/gel_data_resources/

From the high performance compute (HPC) cluster:

/gel_data_resources/
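
Scripts that run both on the AWS desktop and on the HPC cluster may need to cope with the two locations listed above. The sketch below picks whichever location exists; the helper name is hypothetical, and the assumption that exactly these two directories are the only candidates comes from this document alone.

```python
from pathlib import Path


def data_resources_root(candidates=None):
    """Return the first existing data resources directory, or None.

    By default the two locations named in this document are tried:
    ~/gel_data_resources/ (AWS desktop) and /gel_data_resources/ (HPC).
    """
    if candidates is None:
        candidates = [Path.home() / "gel_data_resources", Path("/gel_data_resources")]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


root = data_resources_root()
print(root if root is not None else "data resources directory not found")
```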

The data resources available here are:

Tiering data for rare disease: Tiering data are available for rare disease participants who have been through the Genomics England interpretation platform. These data provide information on the pathogenicity of variants that have been identified in the proband’s genome. Tiering data for rare disease probands can also be found in the designated LabKey table outlined above.

GMC exit questionnaires for rare disease: Outcomes questionnaire for interpreted genomes generated by Genomics England and Clinical Interpretation Providers.

Interpretation request data for rare disease: The following information can be found within the interpretation request JSON file: Family Pedigree and Other Family History, Analysis Panels and versions, Specific Disorder, Tiered Variants and Tiering version, HPO terms, Workspace (NHS GMC or LDP site code), Gene Panel Coverage, Disease Penetrance, Variant Classification.

Tiering, structural, and copy-number variant reports for Cancer: Annotated in JSON format. The file paths are available in the Quick View titled cancer_analysis.

Contact and Support

For all queries relating to this data release please contact the Genomics England Service Desk portal: Service Desk (accessible from outside the Research Environment). The Service Desk is supported by dedicated Genomics England staff for all relevant questions.


  1. This excludes 31 TracerX genomes from 16 participants (refer to 6.4 for further information).