Long-read sequencing pilot project¶
Genomics England is piloting the use of long-read Oxford Nanopore Technologies (ONT) sequencing for genomic characterisation of cancers. ONT sequencing offers many potential advantages over short-read sequencing, including superior characterisation of structural variants and detection of methylated DNA bases.
During phase 1 of the long-reads pilot project, a subset of samples from cancer patients from the 100,000 Genomes Project were re-sequenced with ONT. Some of these sequences have been made available to the community on the GEL trusted research environment. This document outlines the long-read data available and how it was generated.
Sequencing protocol¶
DNA from a subset of 100,000 Genomes Project participants was depleted of low molecular weight DNA (<10 kb) before library preparation. Libraries for ONT sequencing were prepared with the protocol indicated in the library_prep field of the cancer_ont_cohorts table in LabKey. Data were acquired with the PromethION Beta for 42 to 60 hours in high-accuracy mode.
Long-read cancer cohorts¶
Cohort name | Cancer type(s) | Regular BAMs (tumour/normal) | Methyl BAMs (tumour/normal) |
---|---|---|---|
cohort_CML | Chronic Myeloid Leukemia | 24/23 | 23/23 |
cohort_TALL | Acute Lymphoblastic Leukemia | 7/7 | 7/7 |
cohort_TJ | Paediatric brain tumours | 9/0 | 9/0 |
cohort_TNBC | Triple Negative Breast Cancer | 43/34 | 30/30 |
cohort_neuroendocrine | Neuroendocrine tumours | 9/8 | 9/8 |
cohort_neuroendocrine2 | Neuroendocrine tumours | 10/5 | 5/5 |
Long-read data files available¶
There are two types of long-read data available for samples in the 100,000 Genomes Project cohorts:
- Bam files. These are regular bam files and can be found in the relevant cohort directory. For example, the regular bam file for the normal sample from participant 215001566 can be found at /gel_data_resources/LRS_cohort_genomes/cohort_CML/215001566_normal.bam.
- Methyl bam files. These bam files contain additional information on the methylation status of CpG motifs, and can be found within the methyl_BAMs directory within the relevant cohort directory. For example, the methyl bam file for the normal sample from participant 215001566 can be found at /gel_data_resources/LRS_cohort_genomes/cohort_CML/methyl_BAMs/215001566_normal.bam.
All data were generated with R9 chemistry.
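The directory layout described above can be captured in a small helper. This is only a sketch: the function name is hypothetical, and the `_tumour.bam` suffix is an assumption by analogy with the `_normal.bam` examples shown in this document, so check the cohort directory or the cancer_ont_cohorts table for the actual file paths.

```python
from pathlib import PurePosixPath

# Root directory taken from the example paths in this document.
LRS_ROOT = PurePosixPath("/gel_data_resources/LRS_cohort_genomes")


def long_read_bam_path(cohort, participant_id, sample_type, methyl=False):
    """Build the expected BAM path for one sample (hypothetical helper).

    Assumption: tumour files are named <participant_id>_tumour.bam, mirroring
    the <participant_id>_normal.bam examples given in the documentation.
    """
    if sample_type not in ("tumour", "normal"):
        raise ValueError("sample_type must be 'tumour' or 'normal'")
    directory = LRS_ROOT / cohort
    if methyl:
        # Methyl BAMs sit in a methyl_BAMs subdirectory of the cohort directory.
        directory = directory / "methyl_BAMs"
    return directory / f"{participant_id}_{sample_type}.bam"


print(long_read_bam_path("cohort_CML", 215001566, "normal"))
print(long_read_bam_path("cohort_CML", 215001566, "normal", methyl=True))
```

The file paths recorded in the LabKey table remain the authoritative source; a helper like this is mainly useful for quick existence checks.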
Bioinformatics tools used to generate files¶
Basecalling was performed with guppy. The version used varies and is recorded in the cancer_ont_cohorts LabKey table.
In general, two flow cells were run for each tumour sample and one for the normal sample. Each flow cell was basecalled individually and the resulting bam files were merged with samtools.
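For orientation, a per-flow-cell merge of this kind might look like the sketch below. It only illustrates the general shape of a `samtools merge` call (output file first, then inputs); the file names and the helper function are hypothetical, and the exact options used in the pilot are not stated here.

```python
import shlex


def samtools_merge_command(output_bam, flowcell_bams):
    """Return an argument list for merging per-flow-cell BAMs (illustrative)."""
    if len(flowcell_bams) < 2:
        raise ValueError("merging needs at least two input BAM files")
    # samtools merge takes the output file first, followed by the inputs.
    return ["samtools", "merge", output_bam, *flowcell_bams]


# Hypothetical file names for a tumour sample run on two flow cells.
cmd = samtools_merge_command(
    "sample_tumour.bam", ["flowcell_1.bam", "flowcell_2.bam"]
)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute the merge where samtools is installed.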
ONT cancer cohorts table¶
The specific tool versions and the most important basecalling and alignment parameters have been included, alongside the file paths of the modified BAMs, in the cancer_ont_cohorts LabKey table. Note that this table has a single row per participant, so file paths and parameters specific to the methyl BAMs have the prefix "Methylation". The column guppy_version refers to the version installed on the PromethION during sequencing, while basecall_version refers to the version used for basecalling after sequencing.
Note that statistics such as the number of reads and aligned base pairs were calculated from the regular BAMs and may differ slightly for the methyl BAMs.
Quality control checks performed¶
Samples that failed the following criteria were excluded:
- Flow cell N50 > 5,000 bp.
- Flow cell total aligned base pairs > 47,000,000,000 (roughly corresponds to 15X coverage).
In addition, regular BAMs but not methyl BAMs were required to meet the following criterion:
- Two passing flow cells per tumour sample.
In addition, methyl BAMs but not regular BAMs were required to meet the following criterion:
- One passing flow cell for the normal sample.
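The flow cell thresholds above can be expressed as a short check. This is a minimal sketch rather than the pipeline's actual implementation: the function names are hypothetical, N50 is computed with the usual definition (the read length at which reads of that length or longer contain at least half of all bases), and the 3.1 Gb genome size used to relate aligned bases to coverage is an assumption.

```python
MIN_N50_BP = 5_000               # N50 threshold from the QC criteria
MIN_ALIGNED_BP = 47_000_000_000  # aligned base pair threshold from the QC criteria
GENOME_SIZE_BP = 3_100_000_000   # assumption: approximate human genome size


def n50(read_lengths):
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no reads


def flow_cell_passes(n50_bp, aligned_bp):
    """Apply both flow cell criteria (strict inequalities, as documented)."""
    return n50_bp > MIN_N50_BP and aligned_bp > MIN_ALIGNED_BP


def approx_coverage(aligned_bp, genome_size_bp=GENOME_SIZE_BP):
    """Rough mean coverage implied by a number of aligned base pairs."""
    return aligned_bp / genome_size_bp


print(flow_cell_passes(n50_bp=12_000, aligned_bp=60_000_000_000))
print(round(approx_coverage(MIN_ALIGNED_BP), 1))
```

Under the assumed genome size, 47,000,000,000 aligned base pairs works out to roughly 15X, consistent with the figure quoted in the criteria.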
Differences between regular and methyl bam files available in RE¶
For maximum backwards compatibility, we will maintain both versions of the bam files in the TRE. However, there are differences in the specific files available and in some cases how they were generated.
- Tumour samples for participants lacking a long-read normal sequence do not have a methyl bam (except for cohort_TJ).
- The guppy version and parameters used are different and can be found in the cancer_ont_cohorts LabKey table.
- In some cases, additional flow cells were run and these have been included in the methyl but not regular bam file.
Methylation data¶
ONT sequencing can detect epigenetic modifications of DNA such as 5-methylcytosine. Nanopore sequencing works by drawing a DNA molecule through a tiny pore embedded in a membrane. A current is applied across the membrane and by measuring small changes to the current, the DNA sequence is established. Modified DNA bases are sufficiently different to their unmodified counterparts that they also introduce characteristic changes to the electric signal and this can be used to determine the modification status of the base.
Modified bam files with cytosine methylation (5mC only) data are provided for each cohort. These files were generated with guppy 6.3.8 using the dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg config. With this config, guppy generates a probability of methylation for each cytosine in a CpG context, using ONT's Remora methylation model. The methylation data is stored in the MM and ML tags, which give the position of the relevant bases and the probability of methylation respectively.
The settings used to generate methyl bam files have not been tested for applications other than methylation analysis. For other applications (e.g. structural variant calling), the regular bam files provided should be used.
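As a rough illustration of how the MM and ML tags fit together, the sketch below decodes 5mC calls from tag values that have already been read from a BAM record. It assumes the SAM tags convention in which MM lists, for each call, how many candidate cytosines to skip since the previous call, and ML holds one integer from 0 to 255 per call, with value N standing for the probability interval N/256 to (N+1)/256. The function name, the example read, and the restriction to forward-strand reads are assumptions made for illustration. Reverse-strand reads need extra handling, which a library such as pysam can take care of.

```python
def decode_5mc(seq, mm, ml):
    """Return (read_index, probability) pairs for 5mC calls.

    Illustrative only. Assumes a forward-strand read and MM entries that each
    describe a single modification type (for example "C+m,0,1;").
    """
    calls = []
    ml_offset = 0
    # Positions of every cytosine in the read, in order.
    c_positions = [i for i, base in enumerate(seq) if base == "C"]
    for entry in mm.rstrip(";").split(";"):
        if not entry:
            continue
        header, *deltas = entry.split(",")
        skips = [int(d) for d in deltas]
        # ML values are consumed in the same order as the MM entries.
        probs = ml[ml_offset:ml_offset + len(skips)]
        ml_offset += len(skips)
        # Keep only 5mC on the C base; tolerate an optional "?" or "." flag.
        if header.rstrip("?.") != "C+m":
            continue
        c_index = -1
        for skip, value in zip(skips, probs):
            c_index += skip + 1  # skip `skip` cytosines, then take the next one
            # Use the midpoint of the interval N/256 .. (N+1)/256.
            calls.append((c_positions[c_index], (value + 0.5) / 256))
    return calls


# Hypothetical read: cytosines at indices 1, 4, 5 and 7; calls at indices 1 and 5.
for position, probability in decode_5mc("ACGTCCGC", "C+m,0,1;", [255, 0]):
    print(position, round(probability, 3))
```

In practice a threshold on the probability (chosen by the analyst) is applied before a site is counted as methylated or unmethylated.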