Aggv3 site QC
For all variants on autosomes and chrX, we calculated a number of quality control (QC) metrics. All metrics are reported in the INFO column of a separate siteQC VCF file corresponding to each shard and subshard. The first iteration of the siteQC VCFs does not include a FILTER column based on these QC metrics, but this will be updated in the future. We do not remove any variants from these files.
The biallelic VCF files contain a FILTER field provided by Illumina. The sites can fail on two conditions:
- LowGTR: variants with genotyping rate < 0.9.
- LowMLSQ: variants with Machine Learning Site Quality (MLSQ) < 0.1. The MLSQ takes into consideration genotyping rate and genotyping quality from all samples. This filtering is applied only to variants passing the LowGTR filter.
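
The order in which the two conditions are evaluated can be sketched as follows. This is only an illustration of the rule described above, not Illumina's implementation: the function name, the argument names and the `PASS` label (the usual VCF convention for sites that fail no filter) are assumptions.

```python
def illumina_site_filter(genotyping_rate: float, mlsq: float,
                         gtr_min: float = 0.9, mlsq_min: float = 0.1) -> str:
    """Return a FILTER label for one site (hypothetical helper)."""
    # LowGTR is checked first: genotyping rate below 0.9 fails the site.
    if genotyping_rate < gtr_min:
        return "LowGTR"
    # LowMLSQ is only evaluated for sites that passed the LowGTR check.
    if mlsq < mlsq_min:
        return "LowMLSQ"
    # Assumed label for sites that fail neither condition.
    return "PASS"


print(illumina_site_filter(0.85, 0.05))
print(illumina_site_filter(0.95, 0.05))
print(illumina_site_filter(0.95, 0.50))
```

Because the MLSQ check is skipped for sites that already failed on genotyping rate, a site in this sketch carries at most one of the two labels.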
A file containing paths to the siteQC VCFs is available in BED format at s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/site_qc/2026-01-06/siteqc_shards.bed. For guidance on working with these VCFs, please refer to our code book.
Site QC of autosomes
To calculate the site QC metrics, we first decomposed the multiallelic VCF (see Processing of Multiallelic VCFs) and then calculated the following per-variant quality metrics. They can be found in the INFO field of the site QC VCF files.
| Metric TAG | Description | Further description |
|---|---|---|
| MEDIAN_DP | Median DP calculated across all samples. | |
| MEDIAN_GQ | Median GQ calculated across samples with complete genotypes. | The median GQ was calculated from GTs in which fully missing genotypes were filtered out. Values are capped at 99. The filtering method was equivalent to: bcftools query $infile -e "GT=\\".\\"" [...] |
| MISSINGNESS_RATE | Fraction of samples with fully missing genotypes. | Missingness is calculated as: missing_genotypes/(complete_genotypes+missing_genotypes). Missing genotypes: bcftools query $infile -e "GT=\\".\\"" [...]. Complete genotypes: bcftools query $infile -e "GT!=\\".\\"" [...] |
| AB_RATIO | For each heterozygous genotype, a binomial test is performed on reads supporting the called alleles (ref/alt or alt1/alt2). AB_RATIO is the proportion of heterozygotes passing the test (p>0.01) out of all heterozygotes. | The AB ratio is a measure of the evidence supporting whether a heterozygous call is correct or not. This is achieved by testing the distribution of reads supporting ref and alt alleles for each genotype. We apply a binomial test with a strict threshold of p-value >0.01. The ratio is number of heterozygous calls passing this test divided by the total number of heterozygous calls for a variant. For heterozygotes containing a reference allele, LAD[0] and LAD[1] are used for the binomial test, where LAD[0] refers to the REF, and LAD[1] refers to the ALT allele. For heterozygotes where there is no reference allele (e.g., with a 1/3 genotype in the multiallelic VCF), we calculate the binomial test on LAD[1] and LAD[2]. This ensures that AB ratio is calculated using the correct numbers even if the sites are represented as 0/1 and 1/0 in the biallelic VCF (the LAD retains 3 values). |
| AN | Total number of alleles in called genotypes calculated across all samples. | These values are all calculated using the BCFtools plugin fill-tags. |
| AC_Hom | Allele counts in homozygous genotypes calculated across all samples. | As above. |
| AC_Het | Allele counts in heterozygous genotypes calculated across all samples. | As above. |
| AC_Hemi | Allele counts in hemizygous genotypes calculated across all samples. | As above. |
| AF | Allele frequency calculated across all samples. | As above. |
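
The MISSINGNESS_RATE, MEDIAN_GQ and AB_RATIO definitions in the table can be illustrated with a minimal Python sketch. It is not the production pipeline. In particular, it assumes a two-sided exact binomial test against an expected allele fraction of 0.5 (the table only says "binomial test"), and all function names and input structures are hypothetical.

```python
import re
from math import comb
from statistics import median


def is_fully_missing(gt: str) -> bool:
    """True for genotypes such as '.' or './.', where every allele is missing."""
    return all(allele == "." for allele in re.split(r"[/|]", gt))


def missingness_rate(gts: list[str]) -> float:
    """missing_genotypes / (complete_genotypes + missing_genotypes)."""
    missing = sum(is_fully_missing(gt) for gt in gts)
    return missing / len(gts)


def median_gq(gts: list[str], gqs: list[int]) -> float:
    """Median GQ over samples whose genotype is not fully missing, capped at 99."""
    kept = [min(gq, 99) for gt, gq in zip(gts, gqs) if not is_fully_missing(gt)]
    return median(kept)


def binom_two_sided_p(reads_a: int, reads_b: int) -> float:
    """Exact two-sided binomial p-value, assuming an expected fraction of 0.5."""
    n = reads_a + reads_b
    if n == 0:
        return 1.0  # assumption: no reads means no evidence of imbalance
    pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = pmf[reads_a]
    # Sum the probability of every outcome at most as likely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


def ab_ratio(het_read_counts: list[tuple[int, int]], p_threshold: float = 0.01):
    """Fraction of heterozygous calls with p > p_threshold; None if there are none."""
    if not het_read_counts:
        return None
    passing = sum(binom_two_sided_p(a, b) > p_threshold for a, b in het_read_counts)
    return passing / len(het_read_counts)


gts = ["0/1", "./.", "0/0", "1/1"]
gqs = [30, 0, 120, 50]
print(missingness_rate(gts), median_gq(gts, gqs))
# Per-heterozygote read counts for the two called alleles, e.g. (LAD[0], LAD[1]).
print(ab_ratio([(10, 10), (20, 2), (12, 8)]))
```

In this sketch the caller is responsible for choosing which two LAD entries to pass for each heterozygote: LAD[0] and LAD[1] when a reference allele is present, LAD[1] and LAD[2] otherwise.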
Site QC of chrX
Sex chromosome QC was handled in a similar manner to autosomal QC; however, some metrics were calculated separately for XX and XY samples. The table below provides a summary of the sex-specific metrics and their names. Sex chromosome karyotype was determined on the basis of DRAGEN 3.7.8 ploidy estimation.
| Metric type | Metric | Metric TAG |
|---|---|---|
| Metrics calculated on all samples | AN, AC, AC_Hemi, AC_Het, AC_Hom | AN, AC, AC_Hemi, AC_Het, AC_Hom / unchanged |
| Metrics calculated on XX and XY samples separately | AN, AC, AC_Hemi, AC_Het, AC_Hom, MISSINGNESS_RATE, MEDIAN_DP, MEDIAN_GQ | e.g. AN_XX or MEDIAN_DP_XY |
| Metrics calculated only on XX samples | AB_RATIO | AB_RATIO_XX |
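
The tag naming scheme in the table can be illustrated with a small sketch that computes AN for all samples and for each karyotype group, then appends the `_XX` or `_XY` suffix. This is a simplified stand-in for the fill-tags based calculation, not a reimplementation of it. The sample IDs, the karyotype mapping and the haploid genotype representation for XY samples are all assumptions made for the example.

```python
import re


def allele_number(gts) -> int:
    """Count non-missing alleles across a collection of GT strings."""
    return sum(
        1
        for gt in gts
        for allele in re.split(r"[/|]", gt)
        if allele != "."
    )


def an_by_karyotype(gt_by_sample: dict, karyotype_by_sample: dict) -> dict:
    """Return AN over all samples plus AN_XX and AN_XY (hypothetical helper)."""
    tags = {"AN": allele_number(gt_by_sample.values())}
    for karyotype in ("XX", "XY"):
        subset = [
            gt
            for sample, gt in gt_by_sample.items()
            if karyotype_by_sample.get(sample) == karyotype
        ]
        tags[f"AN_{karyotype}"] = allele_number(subset)
    return tags


# Hypothetical samples; XY samples are shown with haploid calls for illustration.
gts = {"s1": "0/1", "s2": "1", "s3": "./.", "s4": "0"}
karyotypes = {"s1": "XX", "s2": "XY", "s3": "XX", "s4": "XY"}
print(an_by_karyotype(gts, karyotypes))
```

The same pattern (compute on the subset, then suffix the tag) would apply to the other sex-specific metrics listed in the table, such as MEDIAN_DP_XY or MISSINGNESS_RATE_XX.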