Long-read genomic data from ONT

Pilot long-read ONT data are available in the Research Environment (RE). This page describes the sequencing protocol and the analytical pipeline, and summarises the data available within the RE.

Genomics England, in partnership with the Sanger Institute, has been assessing the advantages of long-read sequencing technologies over Illumina short-read whole genome sequencing. The primary objective is to identify structural variants (SVs), repeat expansions and contractions, and epigenetic modifications that cannot be accurately detected with short-read sequencing.

The dataset consists of human genomes from a subset of 100,000 Genomes Project participants assembled with ultra-long reads. The genomic data deposited in the Research Environment were generated with the Oxford Nanopore Technologies (ONT) PromethION (Beta) and comprise the full output of the long-read analytical pipeline 1.0.

Sequencing protocol

Germline DNA from a subset of 100,000 Genomes Project participants was depleted of low molecular weight DNA (<10 kb) before library preparation. Libraries for ONT sequencing were prepared with the protocol indicated in the library_prep field of the ‘LRS_sample’ table in LabKey. Data were acquired with the PromethION Beta for 42-60 hours in high-accuracy mode. Full details of the protocol can be found here:

v1_protocol_ONT_LSK109.pdf

Bioinformatics pipeline

PromethION SV calling pipeline GRCh38.docx

File structure

ONT samples

The ONT samples are structured as follows:

run_id/sequencing_output_id/  
  aligned_minimap/  
  fast5_fail/  
  fast5_pass/  
  fastq_fail/  
  fastq_pass/

Files:
- Fast5: These files contain the raw output from the ONT sequencer in HDF5 format. Each file contains the data for up to 4,000 reads.
- Fastq: This is the output of the ONT basecaller Guppy, containing the sequence and base-quality scores of each read.
- Bam: The BAM file contains all pass-filter reads and information on their alignment to GRCh38.
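
As a quick check that a sequencing output folder matches the layout above, the following minimal sketch counts the files in each expected subfolder. The function name `summarise_output` and the constant `EXPECTED_SUBFOLDERS` are illustrative and not part of any Genomics England tooling; only the subfolder names are taken from the listing above.

```python
from pathlib import Path

# Subfolder names as listed in the file structure above
EXPECTED_SUBFOLDERS = (
    "aligned_minimap",
    "fast5_fail",
    "fast5_pass",
    "fastq_fail",
    "fastq_pass",
)


def summarise_output(output_dir):
    """Count the files found in each expected subfolder of one sequencing output.

    Returns a dict mapping subfolder name to a file count, or to None when
    the subfolder does not exist.
    """
    output_dir = Path(output_dir)
    summary = {}
    for name in EXPECTED_SUBFOLDERS:
        folder = output_dir / name
        if folder.is_dir():
            # Count regular files at any depth below the subfolder
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary
```

Calling `summarise_output` on a `run_id/sequencing_output_id/` directory returns one entry per subfolder, which makes missing or empty folders easy to spot before starting an analysis.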

Methylation data

ONT sequencing can detect epigenetic modifications of DNA such as 5-methylcytosine. Nanopore sequencing works by drawing a DNA molecule through a tiny pore embedded in a membrane. A current is applied across the membrane and by measuring small changes to the current, the DNA sequence is established. Modified DNA bases are sufficiently different to their unmodified counterparts that they also introduce characteristic changes to the electric signal and this can be used to determine the modification status of the base.

Modified BAM files with cytosine methylation (5mC only) data are provided for the Cancer TJ cohort. They can be found alongside the regular BAM files, in a folder called methyl_bams. These files were generated with Guppy 6.3.8 using the dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg config. With this config, Guppy generates a probability of methylation for each cytosine in a CpG context, using ONT's Remora methylation model. The methylation data are stored in the MM and ML tags, which give the positions of the relevant bases and the probability of methylation, respectively.
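
To illustrate how the MM and ML tags encode methylation calls, here is a minimal sketch of a decoder for a single read. It assumes the tag layout described in the SAM optional-fields specification: MM lists, for each modification, how many unmodified candidate bases to skip before each call, and ML holds one 0-255 value per call, where value N represents the probability range N/256 to (N+1)/256. The function name `decode_5mc` is hypothetical, and the sketch assumes the read sequence is given in its original sequenced orientation and that each MM entry carries a single modification code (as expected for 5mC-only data).

```python
def decode_5mc(seq, mm_tag, ml_tag):
    """Return (read_position, probability) pairs for 5mC calls on one read.

    seq    : read sequence in the orientation it was sequenced
    mm_tag : MM tag value as a string, e.g. "C+m,0,1;"
    ml_tag : ML tag values as a sequence of integers in 0-255
    """
    calls = []
    ml_offset = 0  # ML values are concatenated in the order of the MM entries
    for entry in filter(None, mm_tag.split(";")):
        header, *deltas = entry.split(",")
        deltas = [int(d) for d in deltas]
        # Header is base, strand, code, optionally followed by '?' or '.'
        base, strand, code = header[0], header[1], header[2:].rstrip("?.")
        raw_probs = ml_tag[ml_offset:ml_offset + len(deltas)]
        ml_offset += len(deltas)  # simplification: one code per entry
        if (base, strand, code) != ("C", "+", "m"):
            continue
        c_positions = [i for i, b in enumerate(seq) if b == "C"]
        idx = -1
        for skip, raw in zip(deltas, raw_probs):
            idx += skip + 1  # skip `skip` cytosines, then take the next one
            # Use the midpoint of the probability bin that `raw` represents
            calls.append((c_positions[idx], (raw + 0.5) / 256))
    return calls


# Toy read: cytosines at 0-based positions 1, 4, 7 and 8
example = decode_5mc("ACGTCGACCG", "C+m,0,1;", [255, 0])
```

In this toy example the first call lands on the cytosine at position 1 and the second on the cytosine at position 7, because one cytosine (position 4) is skipped in between. For reads aligned to the reverse strand, the sequence stored in the BAM record is reverse-complemented relative to the sequenced orientation, so it would need to be converted back before decoding.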

The specific tool versions and the most important parameters used have been included, alongside the file paths of the modified BAMs, in the cancer_ont_cohorts LabKey table. The table below describes the relevant columns:

Column name	Description
`methylation_guppy_version`	Guppy toolkit version used for modified BAMs
`methylation_basecall_version`	Guppy basecalling version used for modified BAMs
`methylation_basecall_model`	Model used for guppy basecalling for modified BAMs
`methylation_basecall_filter_threshold`	Basecalling filter threshold value used for modified BAMs
`methylation_minimap2_version`	Minimap2 version used for read alignment for modified BAMs
`lr_merged_tumour_methylation_path`	Path of the modified BAM file with methylation tags for the tumour sample
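
As an illustration of how these columns might be used, the sketch below collects the modified BAM paths and Guppy versions from rows of the table. It assumes the rows have already been retrieved as dictionaries keyed by column name (for example from a CSV export read with `csv.DictReader`). The function name `methylation_bams` and the example path are hypothetical placeholders, not real records.

```python
import csv
import io


def methylation_bams(rows):
    """Return (path, guppy_version) for every row with a modified BAM path."""
    found = []
    for row in rows:
        path = (row.get("lr_merged_tumour_methylation_path") or "").strip()
        if path:  # rows without a modified BAM are skipped
            found.append((path, row.get("methylation_guppy_version")))
    return found


# Hypothetical two-row export; the values are placeholders only
example_csv = io.StringIO(
    "lr_merged_tumour_methylation_path,methylation_guppy_version\n"
    "/hypothetical/path/sample1.bam,6.3.8\n"
    ",\n"
)
bams = methylation_bams(csv.DictReader(example_csv))
```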

The settings used to generate these files have not been tested for applications other than methylation analysis. For other applications (e.g. structural variant calling), the regular BAM files provided should be used.