Skip to content

AggV3 sample quality metrics

We provide a set of sample-level quality control (QC) metrics for the samples in aggV3, including coverage metrics and those generated during alignment and variant calling. Sample quality metrics are provided as a table that can be accessed via CloudOS, where the table is available in an S3 bucket at s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/sampleqc_resources/sample_qc_aggregated.csv.

The distributions of key quality metrics across all samples are available.

Future updates to sample quality metrics

For the initial release, samples have not been excluded based on their quality metrics. You may choose to apply your own sample QC criteria using the metrics provided, depending on your analysis needs. In future releases, we plan to conduct a quality assessment and provide sample-level flags to indicate those we consider lower quality.

DRAGEN 3.7.8 QC metrics

Many of the metrics were generated by DRAGEN 3.7.8 during sample re-alignment and are available in various .csv files alongside other sample-level outputs from the pipeline. For convenience, we have aggregated a select set of these metrics for aggV3 samples into a single table. To facilitate programmatic use, we standardised the naming conventions of the metrics originally produced by DRAGEN.

Please note that because variant call metrics refer to those from DRAGEN 3.7.8, they may not be identical to those observed in the aggregate data which uses ML-recalibrated DRAGEN-called variants as input, but should act as a relative proxy.

The table below lists each metric name as it is in the provided table, its original name from the DRAGEN pipeline and the source .csv file.

Original DRAGEN name Standardised name DRAGEN output file
Total input reads total_input_reads PLATEKEY.mapping_metrics.csv
Mapped reads mapped_reads PLATEKEY.mapping_metrics.csv
Total alignments total_alignments PLATEKEY.mapping_metrics.csv
Supplementary (chimeric) alignments supplementary_chimeric_alignments PLATEKEY.mapping_metrics.csv
Insert length: median insert_length_median PLATEKEY.mapping_metrics.csv
Insert length: mean insert_length_mean PLATEKEY.mapping_metrics.csv
Insert length: standard deviation insert_length_standard_deviation PLATEKEY.mapping_metrics.csv
Unmapped reads unmapped_reads PLATEKEY.mapping_metrics.csv
Number of duplicate marked reads number_of_duplicate_marked_reads PLATEKEY.mapping_metrics.csv
Number of unique reads (excl. duplicate marked reads) number_of_unique_reads_excl_duplicate_marked_reads PLATEKEY.mapping_metrics.csv
Number of unique & mapped reads (excl. duplicate marked) number_of_unique_mapped_reads_excl_duplicate_marked_reads PLATEKEY.mapping_metrics.csv
Singleton reads (itself mapped; mate unmapped) singleton_reads_itself_mapped_mate_unmapped PLATEKEY.mapping_metrics.csv
Paired reads (itself & mate mapped) paired_reads_itself_mate_mapped PLATEKEY.mapping_metrics.csv
Properly paired reads properly_paired_reads PLATEKEY.mapping_metrics.csv
Not properly paired reads (discordant) not_properly_paired_reads_discordant PLATEKEY.mapping_metrics.csv
Paired reads mapped to different chromosomes paired_reads_mapped_to_different_chromosomes PLATEKEY.mapping_metrics.csv
Provided sex chromosome provided_sex_chromosome PLATEKEY.mapping_metrics.csv
Supplementary (chimeric) alignments % supplementary_chimeric_alignments_percentage PLATEKEY.mapping_metrics.csv
Mapped reads % mapped_reads_percentage PLATEKEY.mapping_metrics.csv
Unmapped reads % unmapped_reads_percentage PLATEKEY.mapping_metrics.csv
Number of duplicate marked reads % number_of_duplicate_marked_reads_percentage PLATEKEY.mapping_metrics.csv
Number of unique reads (excl. duplicate marked reads) % number_of_unique_reads_excl_duplicate_marked_reads_percentage PLATEKEY.mapping_metrics.csv
Number of unique & mapped reads (excl. duplicate marked) % number_of_unique_mapped_reads_excl_duplicate_marked_reads_percentage PLATEKEY.mapping_metrics.csv
Singleton reads (itself mapped; mate unmapped) % singleton_reads_itself_mapped_mate_unmapped_percentage PLATEKEY.mapping_metrics.csv
Paired reads (itself & mate mapped) % paired_reads_itself_mate_mapped_percentage PLATEKEY.mapping_metrics.csv
Properly paired reads % properly_paired_reads_percentage PLATEKEY.mapping_metrics.csv
Not properly paired reads (discordant) % not_properly_paired_reads_discordant_percentage PLATEKEY.mapping_metrics.csv
Paired reads mapped to different chromosomes % paired_reads_mapped_to_different_chromosomes_percentage PLATEKEY.mapping_metrics.csv
PCT of genome with coverage [15x: inf) pct_of_genome_with_coverage_15x_inf PLATEKEY.vc_metrics.csv
Average chr X coverage over genome average_chr_x_coverage_over_genome PLATEKEY.vc_metrics.csv
Average chr Y coverage over genome average_chr_y_coverage_over_genome PLATEKEY.vc_metrics.csv
Average mitochondrial coverage over genome average_mitochondrial_coverage_over_genome PLATEKEY.vc_metrics.csv
Average autosomal coverage over genome average_autosomal_coverage_over_genome PLATEKEY.vc_metrics.csv
Median autosomal coverage over genome median_autosomal_coverage_over_genome PLATEKEY.vc_metrics.csv
SNPs snps PLATEKEY.wgs_coverage_metrics.csv
Het/Hom ratio het_hom_ratio PLATEKEY.wgs_coverage_metrics.csv
Total total PLATEKEY.wgs_coverage_metrics.csv
Biallelic biallelic PLATEKEY.wgs_coverage_metrics.csv
Multiallelic multiallelic PLATEKEY.wgs_coverage_metrics.csv
SNP Transitions snp_transitions PLATEKEY.wgs_coverage_metrics.csv
SNP Transversions snp_transversions PLATEKEY.wgs_coverage_metrics.csv
Ti/Tv ratio ti_tv_ratio PLATEKEY.wgs_coverage_metrics.csv
Heterozygous heterozygous PLATEKEY.wgs_coverage_metrics.csv
Homozygous homozygous PLATEKEY.wgs_coverage_metrics.csv
Insertions (Hom) insertions_hom PLATEKEY.wgs_coverage_metrics.csv
Insertions (Het) insertions_het PLATEKEY.wgs_coverage_metrics.csv
Deletions (Hom) deletions_hom PLATEKEY.wgs_coverage_metrics.csv
Deletions (Het) deletions_het PLATEKEY.wgs_coverage_metrics.csv
Indels (Het) indels_het PLATEKEY.wgs_coverage_metrics.csv
SNPs % snps_percentage PLATEKEY.wgs_coverage_metrics.csv
Biallelic % biallelic_percentage PLATEKEY.wgs_coverage_metrics.csv
Multiallelic % multiallelic_percentage PLATEKEY.wgs_coverage_metrics.csv
Insertions (Hom) % insertions_hom_percentage PLATEKEY.wgs_coverage_metrics.csv
Insertions (Het) % insertions_het_percentage PLATEKEY.wgs_coverage_metrics.csv
Deletions (Hom) % deletions_hom_percentage PLATEKEY.wgs_coverage_metrics.csv
Deletions (Het) % deletions_het_percentage PLATEKEY.wgs_coverage_metrics.csv
Indels (Het) % indels_het_percentage PLATEKEY.wgs_coverage_metrics.csv
Ploidy estimation dragen_karyotypic_sex_estimation PLATEKEY.dragen_karyotypic_sex_estimation_metrics.csv
Autosomal median coverage autosomal_median_coverage PLATEKEY.ploidy_estimation_metrics.csv
X median coverage x_median_coverage PLATEKEY.ploidy_estimation_metrics.csv
Y median coverage y_median_coverage PLATEKEY.ploidy_estimation_metrics.csv

DNA contamination

In addition to the QC metrics generated by DRAGEN, we calculated a measure of DNA contamination using a method described by Lu et al. called CHARR (Contamination from Homozygous Alternate Reference reads). CHARR estimates contamination by examining biallelic homozygous alternate variants. The presence of unexpected reference reads at these sites is interpreted as a signal of potential contamination. This signal is quantified - accounting for population allele frequencies - by assigning per-variant contamination values and averaging them to produce a sample-level estimate.

To compute CHARR scores, we used allele frequencies from aggV2, restricted to biallelic SNPs that had not undergone MNP-to-SNP decomposition. The aggV2 allele frequency file used for calculating CHARR is available at s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/sampleqc_resources/gel_af_filtered.vcf.gz.

For each sample, variants meeting the following criteria were extracted from their DRAGEN gVCF:

  • Autosomal biallelic SNP
  • 20 ≤ DP ≤ 100
  • GQ ≥ 20
  • AN from aggV2 (AN_whole_cohort) > 20,000
  • 0.15 ≤ AF from aggV2 (AF_whole_cohort) ≤ 0.85
  • Homozygous alternate genotype

Variant-level values were then calculated using the formula:

\[ ABref_{variant}=\frac{ADref_{variant}}{ADref_{variant} + ADalt_{variant}} \]

where \(ABref_{variant}\) is the reference allele balance of a variant, \(ADref_{variant}\) is the number of reads supporting the reference allele, and \(ADalt_{variant}\) is the number of reads supporting the alternate allele.

The final CHARR score was calculated by dividing \(ABref_{variant}\) by the respective allele frequency \(AF_{variant}\) and getting an average across all variants.

\[ CHARR_{sample}=\text{mean(}\frac{ABref_{variant}}{AF_{variant}}\text{)} \]

For each sample, we report the CHARR score as charr_score and the number of variants used for its calculation (charr_total_variants) in the table of quality metrics. We also provide a metric named charr_total_na_variants which quantifies the number of variants in a sample for which a variant-level CHARR score could not be calculated.

Comparison of CHARR with VerifyBamID

In their original publication, the authors demonstrated that CHARR strongly correlates with Freemix from VerifyBamId. To validate our implementation of CHARR, we calculated contamination estimates for 500 samples in aggV2 using two sets of variant calls: one generated by the NSV4 pipeline and another by DRAGEN 3.7.8. We then compared both sets of CHARR estimates to their Freemix values. Overall, we observed a strong correlation between the two measures, with CHARR estimates tending to be slightly higher when calculated based on DRAGEN-called variants. This inflation is likely due to the increased number of variants called by the DRAGEN pipeline.

The figure below illustrates this correlation (left panel) and shows the corresponding number of variants used to calculate the CHARR score (right panel).