Release v19 (31/10/2024)¶
This document provides a description of the 100k Genomes Project (previously known as Main Programme) Data Release v19 dated 31st October 2024.
Each progressive release incorporates new content, enhances existing content, and enables more effective use of the data in the National Genomics Research Library (NGRL). You should use the most recent data release for your research project.
This data are presented within the Genomics England Research Environment, accessed via the AWS virtual desktop interface and subject to all Genomics England data protection and privacy principles.
Please see the Research Environment User Guide for detailed documentation on how to use and query the Genomics England dataset.
Genomes in this release¶
Data Release Version 19 provides clinical data for 90,173 participants, and 106,292 genomes (uniquely sequenced samples) from 88,505 of these participants. We hold germline genomes from 72,884 rare disease programme participants1 and germline and somatic genomes from 15,621 cancer programme participants from the 100k Genomes Project.
Table containing genome counts by sample, and delivery types.2
Programme | Sample Type | Delivery Type | Genomes | Participants |
---|---|---|---|---|
Cancer | Germline | V2 (GRCh37) | 265 | 265 |
Somatic | V2 (GRCh37) | 294 | 265 | |
Cancer | Germline | V4 (GRCh38) | 15,771 | 15,421 |
Somatic | V4 (GRCh38) | 17,200 | 15,613 | |
Cancer | Germline | D2 (GRCh38) | 15,825 | 15,068 |
Somatic | D2 (GRCh38) | 16,117 | 15,069 | |
Cancer | Germline | Combined | 31,861 | 15,617 |
Somatic | Combined | 33,611 | 15,615 | |
Cancer | Total | Combined | 65,4723 | 15,621 |
Rare Disease | Germline | V2 (GRCh37) | 10,336 | 10,249 |
Germline | V4 (GRCh38) | 65,598 | 64,909 | |
Rare Disease | Total | Combined | 75,934 | 72,884 |
Total | Combined | Combined | 141,4064 | 88,505 |
The above table contains genomes that have been processed using different alignment methods and therefore can be counted multiple times. The table below provides the number of uniquely sequenced samples.
Programme | Sample Type | Genomes count | Participant count |
---|---|---|---|
Cancer | Germline | 15,900 | 15,617 |
Tumour | 17,003 | 15,615 | |
Cancer | Total | 32,748 | 15,621 |
Rare Disease | Germline | 73,527 | 72,884 |
Total | 106,275 | 88,505 |
Clinical data in this release¶
The Genomics England 100k Genomes Project clinical data are organised into tables found in LabKey. You can find details of these tables and their contents in our clinical data documentation, and data dictionary.
Activity period coverage for the longitudinal secondary data tables¶
Source | Category | Dataset | Start | End |
---|---|---|---|---|
NHSE | Hospital Episode Statistics | op | 01/04/2003 | 31/01/2024 |
NHSE | Hospital Episode Statistics | apc | 13/09/1992 | 31/01/2024 |
NHSE | Hospital Episode Statistics | ae | 01/04/2007 | 31/03/2020 |
NHSE | Hospital Episode Statistics | ecds | 05/04/2017 | 04/08/2022 |
NHSE | Hospital Episode Statistics | cc | 01/04/2008 | 31/01/2024 |
NHSE | Other | covid_test_results | 16/03/2020 | 05/01/2022 |
NHSE | Other | cancer_register_nhsd | 07/09/1971 | 02/06/2023 |
NHSE | Other | did | 19/05/2000 | 30/11/2023 |
NHSE | Mental Health | mhmd | 01/04/2011 | 31/03/2015 |
NHSE | Mental Health | mhldds | 01/04/2014 | 30/11/2015 |
NHSE | Mental Health | mhsds | 01/04/2016 | 01/03/2019 |
NHSE | Office of National Statistics Mortality | mortality | 27/06/1995 | 25/02/2024 |
NCRAS | NCRAS | sact_uncurated | 09/04/2008 | 30/12/2022 |
NCRAS | NCRAS | sact | 06/04/2012 | 30/08/2022 |
NCRAS | NCRAS | rtds | 15/04/2009 | 28/02/2022 |
NCRAS | NCRAS | av_treatment | 08/02/1985 | 22/08/2022 |
NCRAS | NCRAS | av_tumour | 01/01/1985 | 31/12/2019 |
NCRAS | NCRAS | cwt | 05/01/2009 | 31/12/2018 |
NCRAS | NCRAS | lucada_2013 | 25/02/2005 | 02/01/2015 |
NCRAS | NCRAS | lucada_2014 | 23/03/2012 | 08/12/2014 |
NCRAS | NCRAS | ncras_did | 01/03/2012 | 26/03/2019 |
Change summary¶
Bug Fixes¶
A few tables have been updated to fix data issues:
sequencing_report
andgenome_file_paths_and_types
: Around 21k genomes had inaccurate delivery date values that could cause issues when trying to select the latest delivery for a sample. Now all delivery dates are within 7 days of the date that appears in the file paths.rare_disease_interpreted
: Around 1,500 participants had innaccessible paths in the 'Alignment File Path' column and they have now been updated. The date of birth of a participant has also been updated to match that in theparticipant
table.transcriptome_file_paths_and_types
:*SJ.out.tab
files have been removed from the table because they cotain incorrect information in the intron motif column.
Data updates¶
Two tables have had a data refresh:
- NHSE Hospital Episode Statistics: There is now HES data available until January 2024.
cancer_ont_cohorts
: A new columnlr_merged_germline_methylation_path
has been added with the paths of germline BAM files with methylation tags. This release also includes new tumour methylation BAMs for a total of 83 tumour and 73 germline methylation BAMs. In addition, the columnmethylation_guppy_version
has been removed since it's always the same asguppy_version
.
How to find release 19 data¶
LabKey¶
The clinical data, secondary data, and tabulated bioinformatic data for this data release, and the paths to the applicable genome files, are found in the following LabKey folder:
main_programme/main-programme_v19_2024-10-31
Flatfiles¶
Genomics England data resources are available in the following locations:
From the AWS desktop:
~/gel_data_resources/
From the high performance compute (HPC) cluster:
/gel_data_resources/
The data resources available here are:
Data | Format | Associated LabKey tables |
---|---|---|
Genome alignments | BAM or CRAM | genome_file_paths_and_types |
Variant calls | VCF | genome_file_paths_and_types |
Cancer tiering reports | JSON | cancer_tier_and_domain_variants |
Structural and copy number variant reports | JSON | n/a |
Aggregated gVCF dataset (aggV2) | aggregated VCF | aggregate_gvcf_sample_stats |
Scope¶
In scope¶
Data that are in scope for this release:
- Cancer and rare disease data for the main programme participants with current consent. These data include:
- Genomic data for participants when available
- Whole genome sequencing (WGS) family-based quality control for rare disease, reporting sex checks and pedigree checks
- Outputs of the Genomics England Bioinformatics Research Services
- An aggregated Illumina gVCF for germline genomes (including genomes up to late 2019). Researchers are responsible for only using the participants who are consented for research in the latest data release. Please see the documentation here: Aggregated Variant Calls (aggV2)
- Principal components for germline genomes
- Inferred ancestry assigned to samples based on genomic data. (See also aggV2)
- The list of samples for this aggregate for the latest data release can be found in the
aggregate_gvcf_sample_stats
table in LabKey (n=78,128). - Phased data for participants in our aggV2 dataset. The data contains over 342 million phased small variants (SNPs and short indels) across chromosomes 1-22 of aggV2. Detailed documentation can be found here: Aggv2 Phased Data (Provided by University of Oxford). This data was provided by Sinan Shi from the University of Oxford.
- A somatic aggregate containing 16,341 somatic VCF files (including genomes up to early 2021) from the 100,000 Genomes Project which we made available as a multi-sample VCF dataset (somAgg). Researchers are responsible for only using the participants who are consented for research in the latest data release. More information can be found here: Somatic Aggregated Variant Call (somAgg v0.2 ALPHA version). This is an early stage release and feedback is very welcome.
- Annotated single nucleotide variants and small indels (≤50bp) from quality controlled tumour whole genomes.
- Genome-wide de novo variant dataset for 13,917 trios from 12,577 families from the rare disease programme. This was built for main programme data release v9. Researchers are responsible for only using the participants who are consented for research in the latest data release. Please see the documentation here: The de novo variant research dataset for the 100,000 Genomes Project
- Polygenic Risk Score values have been made available for 12 complex traits, for ~40k participants from the aggV2 dataset. Detailed documentation can be found here: Polygenic Risk Scores (Provided by Genomics PLC). This data was provided by Genomics PLC.
- A collection of 16,369 reports produced by the bioinformatics cancer interpretation pipeline aggregated in the
cancer_tier_and_domain_variants
table. - Long read sequencing data provided by internal teams as part of collaborative pilot projects. These are now been placed under a separate section in LabKey called 'Long Read Sequencing'.
- Oxford Nanopore Technologies (ONT)
lrs_laboratory_sample
andlrs_sequencing_data
: <100 Rare Disease programme participants.cancer_ont_cohorts
: 101 Cancer programme participants from various cancer disease types. Starting in V18, we also include BAM files with methylation sites called.
- PacBio
rare_disease_pacbio_pilot
: <100 Rare Disease programme participants
- Oxford Nanopore Technologies (ONT)
- Outputs of the Genomics England Bioinformatics rare diseases interpretation pipeline
- Tiering data – rare disease
- Exomiser results for interpreted genomes – rare disease
- GMC outcome data ("exit questionnaire data") – rare disease - up until 03/06/2024.
- Platypus VCFs used for the genomic interpretation of 33,964 families (3,430 GRCh37 and 30,536 GRCh38) are available via the
rare_disease_interpreted
table. Platypus VCFs are provided unannotated.
- Outputs of the Genomics England Bioinformatics cancer interpretation pipeline
- 'Gold standard' cancer genomes which have been through interpretation and passed quality checks
- Tumour signature and mutational burden data
- Annotation and tiering of small variants
- Tiering, structural and copy number variant report
- Cancer Principal Component Analysis (PCA). For more information on these metrics please see the following document: Cancer Analysis Technical Information Document.
- Submitted Diagnostic Discovery data provided by the Research Community. These are potential diagnoses that were not identified as causal in the initial analysis by GEL, but were identified by Research Environment users and submitted to GEL as part of a process called 'Diagnostic Discovery'. If you would like to be involved in this yourself please contact service desk. These findings are in addition to the GMC exit questionnaire data, and may remain listed as "unsolved cases" therein.
- Clinical data collected upon enrollment, including formal pedigree data on rare disease participants where it is available
- Secondary datasets (medical history), these are available at varying levels of completeness and include:
- Hospital Episode Statistics (HES), including HES Accident and Emergency, HES Admitted Patient Care, and HES Outpatient Care.
- Emergency Care Data Set (ECDS).
- Diagnostic Imaging Dataset (DID).
- Mental Health Minimum Dataset (MHMDS).
- Mental Health Learning Disabilities Dataset (MHLDDS).
- Mental Health Services Dataset (MHSDS).
- Office for National Statistics - Death details data, Cancer Registration (MORTALITY, CANCER_REGISTRY).
- Systemic Anti-Cancer Therapy Dataset (SACT).
- Systemic Anti-Cancer Therapy Dataset - UNCURATED (SACT_UNCURATED).
- National Radiotherapy Dataset (RTDS).
- Cancer Registration (AV) tables.
- Cancer waiting times (CWT).
- Lung Cancer Data Audit (LUCADA).
- National Cancer Registration and Analysis Service Diagnostic Imaging Dataset (NCRAS_DID).
- COVID Test Results data (covid_test_results). This was previously included in the frequent release section.
- Sample datasets describing:
- Handling and quality control of DNA samples at the Genomic Medicine Centres, the biorepository and the sequencer.
- Omics samples stored at the biorepository.
- Orthogonal standard-of-care test data collected from GMCs for a subset of cancer patients
- Transcriptomics (RNA-sequencing) Pilot data provided by internal teams as part of collaborative pilot projects. These are now been placed under two tables in LabKey called 'transcriptome_file_paths_and_types' and 'rnaseq_qc_metrics'
- The GEL Transcriptomics Pilot comprises RNA-sequencing of a subset (>5,000) of rare disease probands from the 100,000 Genomes Project who did not receive a genetic diagnosis through the Genomics England Interpretation Pipeline. We prioritised probands who were found to carry variants of unknown significance.
Out of scope¶
Additional time is required to update the applications/tools that are available in the RE to the current data release, e.g. IVA, Participant Explorer. Please refer to the Application Data Versions page for the data release version used in the RE products and services.
Data out of scope for this release:
- Clinical and genomic data for participants that have withdrawn from the 100,000 Genomes Project or were otherwise ineligible.
- Participant data from the pilot phases of the project.
- Clinical data for participants on expired child consent collected after their 16th birthday (for more details see Clinical and phenotype data : Secondary data - Participant Consent).
- Data relating to the NHS Genomic Medicine Service (GMS) . Genomic and clinical data on NHS GMS participants is currently released separately to the main_programme release. For more information on the NHS GMS data releases please see: NHS GMS data release notes
Quality notes¶
- BAM and VCF genomic data files are as they have been delivered to us by our sequencing provider (Illumina). These have all passed an initial QC check based on sequencing quality and coverage. They have, however, not all undergone our full in-house quality checks and they are therefore subject to potential discrepancies or inaccuracies. Such checks include, but are not limited to, discrepancies in genetic versus reported sex and in family relationships.
- As participants undergo the in-house checks and pass through the Genomics England interpretation pipeline, any inaccuracies we identify will be rectified in subsequent releases.
- Any samples that have been affected prior to this release (e.g. sample swaps or samples that have been retracted as part of the in-house QC process) are listed in Section 10 below.
- You are encouraged to work on the subset of samples that have already passed our internal QC checks; these can be found below for rare disease and cancer genomes, respectively.
- For Rare Disease genomes, you should note that all tiered genomes have passed through Genomics England in-house QCs and that all tiered genomes come from the pool of genomes that have had family checks applied to them, as a first step towards Genomics England tiering. For rare disease interpretation including tiering, small variants are called using the Platypus variant caller. Please see the Rare Disease Results Guide on our Further reading and documentation page for more information.
- Different QC filtering has been applied to the Illumina VCF files and the Platypus VCFs that are used for tiering in the rare disease programme. There may therefore, be tiered variants that have been filtered out of the Illumina VCF files, and, conversely, variants present in the Illumina VCF file that have been filtered out of the platypus VCFs.
- Some rare disease families lack a proband.
- Human Phenotype Ontology (HPO) terms may be missing or incomplete for some participants.
- Each participant's relationship to their family's proband is available in the
rare_diseases_pedigree_member
table and can be used to determine family relationships, especially for cases without formal pedigree data. Pedigree data are only available for a subset of rare disease participants. - WGS family selection quality checks are provided for rare disease genomes on GRCh38, reporting abnormalities of sex chromosomes and reported vs genetic sex summary checks (computed from family relatedness, Mendelian inconsistencies, and sex chromosome checks). Full details on why a family has failed a reported vs genetic sex check can be requested via the Service Desk.
- For Cancer genomes, you should note that all 'gold standard genomes' that have been through Genomics England interpretation and passed quality checks are found in the cancer quick view table cancer_analysis. We strongly recommend using the data from this table for all cancer analyses.
- Clinical data and secondary data have been provided as submitted and have undergone limited validation and cleaning
- sact_uncurated is the table with the raw data from NCRAS which feeds into their curation process producing the SACT table, which remains the gold standard. A major point to raise is that neither of the SACT tables contain tumour IDs, thus you must match this dataset to other NCRAS registries by adjusting for date. A lot of familiar data fields remain in their raw non-standardised form (sex, treatmentintent, clinicaltrialindicator). Pending feedback, these fields can be normalised in subsequent updates.
Terms of use for specific cohorts¶
Participants with a participant ID that commences with 125 or 226 were recruited through the Scottish Genomes Partnership Research Programme. These are under the governance of a separate but linked consent and protocol to the 100,000 genomes project. Only the removal of summary level statistics is permitted. Airlock approval will not be granted for the removal of record level data associated with these participants.
Cohort metadata¶
Within the data release, there is genomic data and clinical data for participants that are part of non-NHS research cohorts that have been sequenced by Illumina and analysed via the Genomics England pipeline.
These research cohorts can be distinguished via their clinic ID as each has been given their own unique code. These clinic IDs are primarily located in the participant table filtering either the registered_at_ldp_ods_code or registered_at_ldp_bioinformatics_ods_code for the respective clinic ID. If any genomic or clinical data from the research cohorts is used in your analysis and subsequent publication, reference to the cohort organisation in the first column of the below table will need to be made.
Non-NHS Cohort Name | Clinic ID | Rare Disease/ Cancer | Description | Constraints | Requirements of Use | Number of genomes (in v19 data release) | Opportunities for further research |
---|---|---|---|---|---|---|---|
Breast Cancer Now | BCN | Cancer | The Breast Cancer Now Tissue Bank (BCNTB) is a multi-centre tissue bank established to fill the gap in the Triple Negative breast cancer (TNBC) research community. It systematically collects high quality tissues and data under an established ethical framework. Full clinico-pathological and follow-up data is due to be made available with ongoing longitudinal data collection. This cohort is curated group of 81 treatment naïve TNBC patients. Additional tissue for many is available through the BCNTB for further matched ‘omic analysis. |
Consistent with Genomics England acceptable uses | Any publication referencing the Sequence Data generated, needs to ensure reference is made to the contribution of the Provider to the generation of the Sequence Data | 81 | Potential to remove identifiers for the purpose of requesting access to Breast Cancer Now biobank samples. |
Genomics England CLL pilot study | CLL | Cancer | The original Chronic lymphocytic leukaemia (CLL) Genomics England Pilot aimed to develop the protocols and analytical methods required to perform whole genome sequencing (WGS) at scale for patients with CLL recruited into national clinical trials as a prelude to the Genomics England main programme. This cohort is a small subset of the pilot to allow for the provision of validation data. | Consistent with Genomics England acceptable uses | Any publication referencing the Sequence Data generated, needs to ensure reference is made to the contribution of the Provider to the generation of the Sequence Data | 4 | |
UKALL2003 trial | ALL | Cancer | The aim of this project is to explore the genomic landscape of patients with acute lymphoblastic leukaemia at initial presentation in order identify mutations that could explain their poor response and potentially be future biomarkers. The objective was to perform whole genome sequencing and targeted screening for mismatch repair deficiency on a large well annotated cohort of patients with ALL treated on the UKALL2003 trial. This will generate, for the first time, a comprehensive genomic landscape of chemo-resistant acute lymphoblastic leukaemia. | Consistent with Genomics England acceptable uses | Any publication referencing the Sequence Data generated, needs to ensure reference is made to the contribution of the Provider to the generation of the Sequence Data | 67 | |
NIHR Bioresource | NB3 | Rare Disease | The NIHR BioResource is comprised of volunteers from around the country who have given their consent to taking of a biological sample, and they are willing to be approached to participate in research studies and trials on the basis of their genotype, and or phenotype. This cohort consists of rare disease participants who consented to WGS as part of the 100,000 Genomes Project. | Consistent with Genomics England acceptable uses, | Any publication referencing the Sequence Data generated, needs to ensure reference is made to the contribution of the Provider to the generation of the Sequence Data | 309 | Anyone who wishes to be granted permission to contact any of the NIHR BioResource participants should follow the process of applying to the NIHR BioResource. The steps to be made can be found on the NIHR BioResource website |
Contact and support¶
For all queries relating to this data release please contact the Genomics England Service Desk portal: Service Desk (accessible from outside the Research Environment). The Service Desk is supported by dedicated Genomics England staff for all relevant questions.