Why do some participants have multiple genomic data on the same reference assembly?
Question
I have noticed that some individuals have multiple genome alignments and VCF files on the same genome build (e.g. one on GRCh37 and three on GRCh38). Why is this the case, and is there an easy way to filter these out so that I analyse just one alignment per individual?
Answer
In the Research Environment, there are some participants who have multiple genome delivery IDs on the same reference assembly. Genome delivery IDs are assigned by Illumina and are related to an analysis run carried out in their system. Delivery IDs are unique and correspond to only one analysis run.
When a sample is re-delivered:
- The delivery ID will be exactly the same if the original analysis has been re-delivered, e.g. because a BAM was not transferred successfully and was truncated
- The delivery ID will be different if Illumina re-run the analysis and therefore regenerate the sample data
A re-run of the Illumina analysis is often caused by one or more of the earlier genome deliveries not meeting the contractual or quality requirements. These requirements include:
- Alignment File Quality Checks:
The BAM file contains the sequence read pairs for the sample mapped to the reference genome and is the source of all secondary data generated for downstream analysis, such as somatic variants, SNVs, InDels and structural rearrangements. It is therefore important to check that this file is valid and adheres to the SAM format. We check this by running a third-party software tool called ValidateSamFile from the Picard Toolkit. If the BAM is found to be invalid, the pipeline stops processing the sample and a fail status is assigned to the run (Validate BAM Picard).
Next, coverage distribution statistics are generated for the BAM file using samtools. These statistics are produced twice: once with base call and mapping quality filtering applied (Filtered Bamstats, Q30 Bamstats) and once with no filtering (Unfiltered Bamstats). The following base call and mapping quality filters are applied in the Filtered Bamstats step:
- exclude duplicates and secondary mappings
- exclude reads mapping outside autosomal non-N regions
- read mapping quality > 10
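As an illustration only, the three filters above could be expressed as a per-read predicate like the one below. This is a minimal sketch, not the pipeline's actual code: the function name and the chromosome naming are assumptions, the flag values are the standard SAM flag bits for secondary and duplicate reads, and the "non-N regions" part of the second filter is left out because it needs the reference sequence.

```python
# Standard SAM flag bits (from the SAM specification).
FLAG_SECONDARY = 0x100  # secondary alignment
FLAG_DUPLICATE = 0x400  # PCR or optical duplicate

# Assumption: autosomes are named "1".."22" or "chr1".."chr22".
AUTOSOMES = {str(n) for n in range(1, 23)} | {f"chr{n}" for n in range(1, 23)}


def passes_filtered_bamstats(flag, mapq, chrom):
    """Return True if a read would survive the three filters listed above.

    Hypothetical helper: the non-N region check is omitted here.
    """
    if flag & (FLAG_SECONDARY | FLAG_DUPLICATE):
        return False  # exclude duplicates and secondary mappings
    if chrom not in AUTOSOMES:
        return False  # exclude reads mapping outside the autosomes
    return mapq > 10  # read mapping quality > 10
```

For example, a properly paired primary read (flag 99) with mapping quality 60 on chr1 passes, while the same read marked as a duplicate, or one mapped to chrX, does not.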
Samtools generates many metrics that can be used to evaluate the quality of the sequence alignment; these include: the number of mapped reads, average insert size, the number of read pairs mapped to different chromosomes (used to calculate the % chimeric DNA metric) and the number of duplicate reads.
The key statistics that are generated and automatically checked (by components Generate Q30 Metrics Bamstats, Generate Filtered Metrics Bamstats) and result in QC failure if they do not pass the threshold are:
Metric Catalog Key | Metric Description | Threshold |
---|---|---|
perc_bases_ge_15x_mapQ_ge11 | bases in the genome covered by a read depth of at least 15X | ≥95% |
GbQ30NoDupsNoClip | mapped bases with a base call quality of ≥30 in gigabases [Gb] | 85Gb for germline samples; 220Gb for somatic samples |
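
To make the table concrete, here is a rough sketch of how the two thresholds might be checked for one sample. The function and variable names are hypothetical, and treating the gigabase figures as minimum values (at least 85 Gb or 220 Gb) is an assumption, since the table lists them without a comparison operator.

```python
# Thresholds taken from the table above.
MIN_PERC_BASES_15X = 95.0
# Assumption: the Gb values are minimum requirements.
MIN_GB_Q30 = {"germline": 85.0, "somatic": 220.0}


def failed_metrics(metrics, sample_type):
    """Return the metric keys that fall below their threshold.

    `metrics` maps metric catalog keys to numeric values;
    `sample_type` is "germline" or "somatic".
    """
    failures = []
    if metrics["perc_bases_ge_15x_mapQ_ge11"] < MIN_PERC_BASES_15X:
        failures.append("perc_bases_ge_15x_mapQ_ge11")
    if metrics["GbQ30NoDupsNoClip"] < MIN_GB_Q30[sample_type]:
        failures.append("GbQ30NoDupsNoClip")
    return failures
```

An empty list means both checks passed; any entry in the list corresponds to a QC failure for that metric.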
- Variant File Quality Checks:
The final part of the intake QC pipeline calculates statistics for the sample BAM and VCF files using the raw output generated by samtools and bcftools. VCF (variant call format) files containing variants are checked to ensure they are well formatted (VCF QC) and adhere to the VCF 4.1 specification. This is done using bcftools and gnuplot. Statistics generated by bcftools include the number of SNPs in the VCF, the number of InDels and the number of multi-allelic SNP sites. The full list of stats generated for the VCF is given in Appendix 3. Any uncompressed VCF files are compressed with bgzip and indexed with tabix (both part of the samtools/HTSlib project).
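For readers who want to reproduce similar steps on their own files, the sketch below builds the corresponding command lines. It is not the intake pipeline itself: the helper name and file names are hypothetical, and only the basic invocations `bgzip <file>`, `tabix -p vcf <file>` and `bcftools stats <file>` are used.

```python
from pathlib import Path


def vcf_prepare_commands(vcf_path):
    """Build the commands to compress, index and summarise a VCF.

    Hypothetical helper: returns argument lists and does not run them.
    """
    path = Path(vcf_path)
    commands = []
    if path.suffix != ".gz":
        # bgzip replaces the input with <name>.gz
        commands.append(["bgzip", str(path)])
        path = path.with_name(path.name + ".gz")
        # index the newly compressed file as a VCF
        commands.append(["tabix", "-p", "vcf", str(path)])
    # bcftools stats writes its summary statistics to standard output
    commands.append(["bcftools", "stats", str(path)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a system where the tools are installed.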
Once genome deliveries have passed intake QC, they are subjected to further data quality checks to determine whether the samples are appropriate for downstream analysis and to prepare the data for interpretation. Again, if a genome delivery fails at any one of these stages, a re-run of the Illumina pipeline will be requested and a subsequent genome delivery sent.
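On the filtering part of the question, one possible approach is to keep a single delivery per participant and reference assembly. The sketch below keeps the most recent one, on the assumption that later deliveries are the re-runs requested after earlier ones failed a check. The field names (`participant_id`, `assembly`, `delivery_id`, `delivery_date`) and the example rows are hypothetical and would need to be mapped to the actual tables in the Research Environment.

```python
from datetime import date


def latest_delivery_per_participant(records):
    """Keep one record per (participant, assembly): the latest delivery."""
    chosen = {}
    for rec in records:
        key = (rec["participant_id"], rec["assembly"])
        current = chosen.get(key)
        if current is None or rec["delivery_date"] > current["delivery_date"]:
            chosen[key] = rec
    return list(chosen.values())


# Made-up example rows for illustration only.
records = [
    {"participant_id": "P1", "assembly": "GRCh37",
     "delivery_id": "D001", "delivery_date": date(2016, 3, 1)},
    {"participant_id": "P1", "assembly": "GRCh38",
     "delivery_id": "D002", "delivery_date": date(2017, 5, 10)},
    {"participant_id": "P1", "assembly": "GRCh38",
     "delivery_id": "D003", "delivery_date": date(2018, 1, 20)},
]
selected = latest_delivery_per_participant(records)
```

Here `selected` holds one GRCh37 record and one GRCh38 record for the participant. Whether the latest delivery is the right one to keep for a given analysis is a judgment call that this sketch does not settle.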
Last updated
This page was last updated on the 12 Apr 2019.