Genomic data¶
For every sample we have the full output of whole genome sequencing (WGS) and analysis, carried out by Illumina, including alignments and called variants.
For a more in-depth description, you can consult the Whole Genome Sequencing Service Informatics and Cancer Analysis Services Guide from Illumina.
Different stages of the project used different sequencing and aligning technologies.
See how our genomic data was produced
Genomic data structure¶
Folders¶
There is a folder for each sample containing the genomic data for that sample. These data are an exact copy of those generated by the Illumina sequencing pipeline.
The sample folders are found in the following folder structure:
genomic
: The main folder, accessible from both the desktop and HPC.

- by_date
    - Folders named by date of delivery in the form yyyy-mm-dd
        - Delivery folders for each genome, in the form HX00123456
            - Folders of genomic data, named for the sequencing platekey of the sample, which is a combination of plate barcode and well coordinate; for example: LP12345678-DNA_A01
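As a minimal sketch of how a platekey breaks down, the Python below splits one at its last underscore. The split point is an assumption inferred from the example platekey (plate barcode `LP12345678-DNA`, well `A01`), and `split_platekey` is a hypothetical helper, not part of any provided tooling.

```python
from typing import NamedTuple


class Platekey(NamedTuple):
    plate_barcode: str
    well: str


def split_platekey(platekey: str) -> Platekey:
    """Split a platekey into plate barcode and well coordinate.

    Assumption: the well coordinate is everything after the last underscore.
    """
    plate_barcode, sep, well = platekey.rpartition("_")
    if not sep or not plate_barcode or not well:
        raise ValueError(f"not a platekey: {platekey!r}")
    return Platekey(plate_barcode, well)


print(split_platekey("LP12345678-DNA_A01"))
# Platekey(plate_barcode='LP12345678-DNA', well='A01')
```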
Below is an example of the genomic file structure for the sample LP3000987-DNA_A01:
Genomes folder | By date folder | Delivery ID folder | Platekey folder |
---|---|---|---|
~/genomes | ~/genomes/2018-04-25 | ~/genomes/2018-04-25/HX98765432 | ~/genomes/2018-04-25/HX98765432/LP3000987-DNA_A01 |
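If you already have a sample folder path (for example, one taken from LabKey), the sketch below shows one way to read the delivery date, delivery ID and platekey back out of it. It assumes the last three path components follow the layout in the table above; `describe_sample_path` is a hypothetical helper name.

```python
from datetime import date
from pathlib import PurePosixPath


def describe_sample_path(path: str) -> dict:
    """Return the delivery date, delivery ID and platekey from a sample folder path.

    Assumption: the path ends in <yyyy-mm-dd>/<delivery ID>/<platekey>.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"path too short: {path!r}")
    delivery_date, delivery_id, platekey = parts[-3:]
    # Raises ValueError if the folder name is not a valid yyyy-mm-dd date.
    date.fromisoformat(delivery_date)
    return {
        "delivery_date": delivery_date,
        "delivery_id": delivery_id,
        "platekey": platekey,
    }


info = describe_sample_path("~/genomes/2018-04-25/HX98765432/LP3000987-DNA_A01")
print(info["platekey"])  # LP3000987-DNA_A01
```

The function only inspects the path string; it does not touch the file system, so it is safe to use on paths you do not have permission to open.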
The same file structure is also accessible in S3 buckets using the CloudOS application. This is the only way to access genomic data for COVID-19 participants.
You can find the location of a participant's genome files using the latest LabKey genome_file_types_and_paths table or the latest version of Participant Explorer. All the participants you find by these methods are currently consented for use.
Do not try to find these files by traversing or browsing directories: you may come across genomes of participants who have since withdrawn consent, and any request to export these data via Airlock will be rejected.
In the desktop and terminal interface, you will be able to see all genome delivery folders.
Note that this does not mean you will have access to all of these folders.
You will only have access to the genome folders that you have been given permissions to based on your credentials.
Files¶
The Illumina Whole Genome Sequencing Service and Cancer Analysis Service perform a series of processes with the following software packages.
All consented participants¶
Software | Description |
---|---|
Isaac | Aligns reads to the reference genome, and trims and flags duplicates in the raw sequence. |
Starling | A germline small variant caller; it generates small variant (SNV and small indel, ≤ 50 bp) analysis calls. |
Manta | A germline and somatic structural variant caller; it generates structural variant (SV) analysis calls. |
Canvas | A germline and somatic copy number variant caller; it generates copy number (CNV) and loss of heterozygosity (LOH) analysis calls. |
Strelka | Joint tumour/normal small-variant caller. |
The tools used for alignment and variant calling vary, particularly between 100K and NHS GMS data releases. For details, see the latest release information for 100K and NHS GMS.
A subset of consented participants¶
Software | Description |
---|---|
ExpansionHunter | A tool which looks for repeat expansions at several positions of interest. |
HLATyper | A tool to generate likely HLA types for the sample. |
ROHcaller | Identifies runs of homozygosity (ROHs) from whole-genome SNV variant call sets and predicts the most likely relationships of the sequenced individual's parents. |
The tools and versions used for each Illumina genome delivery depend on the sample (germline/tumour) and on the Illumina pipeline version.
You can look at the header of the BAMs/VCFs to identify the Illumina pipeline and tool version used for each delivery.
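As a rough illustration of checking a VCF header, the sketch below collects the `##` meta-information lines from a compressed VCF using only the Python standard library. `vcf_header_lines` is a hypothetical helper; which header lines carry the pipeline and tool versions is not specified here, so the function simply returns all of them for inspection. BAM headers are binary and need a BAM-aware tool instead (for example `samtools view -H`).

```python
import gzip


def vcf_header_lines(path: str) -> list:
    """Return the '##' meta-information lines of a gzip- or bgzip-compressed VCF."""
    header = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Meta-information lines come first; stop at the column header or data.
            if not line.startswith("##"):
                break
            header.append(line.rstrip("\n"))
    return header


# Usage (the path is a placeholder; take real paths from LabKey):
# for line in vcf_header_lines("/path/to/sample.vcf.gz"):
#     print(line)
```

Reading stops at the first non-header line, so only the start of the file is decompressed even for a whole-genome VCF.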
All genomes have been aligned against either GRCh37 or GRCh38. For each data release, the reference genome is indicated in LabKey along with the path to the folder where the genome is stored.
Genomic data contents¶
An example file structure for LP12345678-DNA_A01 is shown below.
The key files are:
- ./Assembly/[Platekey].bam - Archival BAM file for sample
- ./Assembly/[Platekey].bam.bai - Index for the BAM file
- ./Variations/[Platekey].vcf.gz - Single nucleotide variant (SNV) and small insertion/deletion (1 bp–50 bp) calls in VCF format.
- ./Variations/[Platekey].genome.vcf.gz - Genome *.VCF file containing SNVs, indels, and reference covered regions
- ./Variations/[Platekey].SV.vcf.gz - Large Structural Variation calls (51 bp–10 kb) and copy number calls (10 kb+) in *.VCF format.
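The naming pattern in the list above can be expressed as a small lookup. This is a sketch only: it assumes the sample folder is named for its platekey, as described earlier, and `key_files` and the dictionary labels are hypothetical names chosen for illustration.

```python
from pathlib import PurePosixPath

# Relative locations of the key files, taken from the list above.
KEY_FILE_TEMPLATES = {
    "bam": "Assembly/{platekey}.bam",
    "bam_index": "Assembly/{platekey}.bam.bai",
    "small_variants_vcf": "Variations/{platekey}.vcf.gz",
    "genome_vcf": "Variations/{platekey}.genome.vcf.gz",
    "sv_vcf": "Variations/{platekey}.SV.vcf.gz",
}


def key_files(sample_dir: str) -> dict:
    """Map each key file label to its expected path inside a sample folder."""
    sample = PurePosixPath(sample_dir)
    platekey = sample.name  # assumption: the folder name is the platekey
    return {
        label: sample / template.format(platekey=platekey)
        for label, template in KEY_FILE_TEMPLATES.items()
    }


for label, path in key_files("~/genomes/2018-04-25/HX98765432/LP3000987-DNA_A01").items():
    print(label, path)
```

The paths are built as strings only; whether a given file exists depends on the delivery and on your permissions.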
Note that the genotyping folders from early genome deliveries contain the results of the sample's run on the Infinium platform, an initial run done to confirm the sample identity and make sure that it is of high quality.
Quality metrics and other data such as coverage information can be found in the Metrics folder for each genome.
File types explained¶
BAM files¶
The BAM file contains all pass filter reads input into the analysis pipeline for a sample and includes aligned, duplicate and unaligned reads. It adheres to the SAM format specification wherever possible.
All VCF files are compressed and indexed using tabix; each index appears alongside its VCF as the corresponding *.tbi file.
gVCF files¶
Human genome sequencing applications require sequencing information for both variant and nonvariant positions, yet there is no common exchange format for such data. gVCF addresses this issue.
gVCF is a set of conventions applied to the standard variant call format. These conventions allow representation of genotype, annotation and additional information across all sites in the genome, in a reasonably compact format (typically about 1/50 the size of the BAM file used for variant calling).
gVCF is equally appropriate for representing and compressing targeted sequencing results. Compression is achieved by joining contiguous nonvariant regions with similar properties into single 'block' VCF records. To maximise the utility of gVCF, especially for high stringency applications, the properties of the compressed block are conservative: block properties such as depth and genotype quality reflect the minimum of any site in the block. The gVCF file is also a valid VCF v4.1 file and can be indexed and used with existing tools such as tabix and IGV.
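To make the 'block' records concrete, the sketch below works out how many reference positions a single gVCF record covers. It assumes that a nonvariant block stores its last position in an `END` key in the INFO column, which is the usual gVCF convention; check this against the header of your own files. The function names and the example record are invented for illustration.

```python
def record_span(line: str) -> tuple:
    """Return (chrom, start, end) for one tab-separated gVCF data line.

    Assumption: nonvariant blocks carry END=<last position> in the INFO column.
    Records without END are treated as spanning the length of the REF allele.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, info = fields[0], int(fields[1]), fields[3], fields[7]
    end = pos + len(ref) - 1
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key == "END":
            end = int(value)
    return chrom, pos, end


def sites_covered(line: str) -> int:
    """Number of reference positions represented by the record."""
    _, start, end = record_span(line)
    return end - start + 1


# An invented block record covering positions 100 to 149.
block = "chr1\t100\t.\tA\t.\t.\tPASS\tEND=149\tGT:DP\t0/0:20"
print(record_span(block))    # ('chr1', 100, 149)
print(sites_covered(block))  # 50
```

Because block depth and genotype quality are the minimum over the block, every one of the positions counted here has at least the depth and quality reported on the record.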