AggV3 Principal Components, genetically inferred ancestry and relatedness
We calculated pairwise genetic relatedness among samples, inferred genetic ancestry using the 1000 Genomes Project's reference populations, and generated universal and super-population-specific Principal Components (PCs), using the binary PLINK files derived from the AggV3 multi-sample VCFs. Results from these analyses can be found in CloudOS in the following directory (in File Explorer it is placed under GEL Germline DRAGEN 3.7.8):
s3://357851407625-germline-aggregate-v3-supporting-data/population-structure-and-relatedness_2026_03_27/
High confidence independent SNPs
The current release of this data is based on 63,523 high-confidence, independent SNPs (referred to as HQ SNPs) generated during the AggV2 release. The detailed description of the selection process of these SNPs is available in the AggV2 documentation, but briefly, these SNPs meet the following criteria:
- Autosomal and bi-allelic
- Common (MAF > 5%) in AggV2 and 1000 Genomes Project Phase 3 data
- Missingness < 1%
- Median GQ ≥ 30
- Median Depth ≥ 30
- AB Ratio ≥ 0.9
- Completeness ≥ 0.9
- Exclude variants in complex regions, as defined in the 'high LD exclusion regions' file
- Remove all SNPs where the ref/alt combination was AT or GC (A/T, T/A, G/C, C/G), to avoid ambiguous allele swaps
- LD prune using PLINK v1.9 with an r² threshold of 0.1 and a 500 kb window
- Remove all SNPs which are out of Hardy-Weinberg Equilibrium (HWE) in any of the AFR, EAS, EUR or SAS super-populations, with a p-value cutoff of pHWE < 1e-5
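As an illustration of the strand-ambiguity criterion in the list above, the following is a minimal Python sketch, not the pipeline's actual code. The function names and the toy variant tuples are hypothetical.

```python
# Complementary base pairs. A SNP whose two alleles are complements of each
# other reads the same on both strands, so a strand flip cannot be detected.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref: str, alt: str) -> bool:
    """Return True for A/T, T/A, G/C and C/G SNPs."""
    return COMPLEMENT.get(ref.upper()) == alt.upper()


def drop_ambiguous(snps):
    """Keep only (snp_id, ref, alt) tuples that are not strand-ambiguous."""
    return [snp for snp in snps if not is_strand_ambiguous(snp[1], snp[2])]


# Hypothetical toy input, not real AggV3 variants.
example = [
    ("snp_a", "A", "T"),
    ("snp_b", "A", "G"),
    ("snp_c", "C", "G"),
    ("snp_d", "T", "C"),
]
kept = drop_ambiguous(example)
print([snp[0] for snp in kept])
```

Only the A/G and T/C SNPs survive, because their alleles remain distinguishable whichever strand is reported.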
The AggV2 HQ SNPs were subset from AggV3's PLINK files, and additionally filtered for:
- Missingness < 1%
- Median GQ ≥ 30
- AB Ratio ≥ 0.9
- Not being covered by DUST (low complexity regions)
Note: 5,390 of the 63,523 AggV2 HQ SNPs were identified as multiallelic in AggV3. Several of them failed DUST filtering, and 4,509 of the remaining sites had very low allele counts for the secondary allele (AC < 5). We therefore retained the main alleles (i.e. the ones included in the AggV2 HQ SNP list) of these newly multiallelic sites, provided they satisfied the other filtering conditions, to preserve as many of the original HQ SNPs as possible.
This resulted in a set of 58,763 HQ SNPs. The PLINK2 file set containing the genotypes of all AggV3 samples at all HQ SNPs can be found here:
| File name | File description |
|---|---|
| merge_plink_files.pgen/pvar/psam | PLINK2 file set containing the genotypes at HQ SNPs of all AggV3 samples. The list of HQ SNPs is available in the .pvar file. |
Genetic relatedness inference
Using these SNPs, we generated a pairwise kinship matrix using the PLINK2 implementation of the KING-Robust algorithm.
Samples were then partitioned into related (up to and including third-degree relationships) and unrelated lists using the PLINK2 --king-cutoff relationship-pruning algorithm, with a threshold of 0.0442.
| File name | File description |
|---|---|
| make_king_table.kin0 | Kinship coefficients in KING table for pairwise relationships with kinship coefficient ≥ 0.0442. |
| assign_relatedness.king.cutoff.in.id | List of unrelated individuals. |
| assign_relatedness.king.cutoff.out.id | List of related individuals. |
| make_king_triangle.king.bin | Full kinship matrix of all AggV3 pairs in PLINK triangular binary format. |
| make_king_triangle.king.id | Sample IDs accompanying the KING triangle. |
| {FILE_PREFIX}.log | PLINK2 run log recording parameters used, sample/variant counts after filtering, and any warnings or errors encountered during execution. E.g. assign_relatedness.log |
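The 0.0442 threshold corresponds to the lower bound that KING uses for third-degree relationships. The sketch below shows one way to label pairs from a KING table by degree. It assumes a tab-separated layout with `IID1`, `IID2` and `KINSHIP` columns; the toy rows and function names are hypothetical, and the header of the released .kin0 file should be checked before reuse.

```python
import csv
import io

# Rounded KING kinship boundaries (approximately 2**-1.5, 2**-2.5,
# 2**-3.5 and 2**-4.5).
DUPLICATE = 0.354
FIRST_DEGREE = 0.177
SECOND_DEGREE = 0.0884
THIRD_DEGREE = 0.0442  # the cutoff used in this release


def relationship_degree(kinship: float) -> str:
    """Map a KING kinship coefficient to a relationship category."""
    if kinship > DUPLICATE:
        return "duplicate/MZ twin"
    if kinship >= FIRST_DEGREE:
        return "first degree"
    if kinship >= SECOND_DEGREE:
        return "second degree"
    if kinship >= THIRD_DEGREE:
        return "third degree"
    return "unrelated"


def read_kin0(handle):
    """Yield (iid1, iid2, kinship) from a tab-separated KING table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # The first header field starts with '#', so strip it from the keys.
        row = {key.lstrip("#"): value for key, value in row.items()}
        yield row["IID1"], row["IID2"], float(row["KINSHIP"])


# Hypothetical toy table, not real AggV3 data.
toy = (
    "#IID1\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n"
    "S1\tS2\t58000\t0.12\t0.0001\t0.2500\n"
    "S3\tS4\t58000\t0.10\t0.0100\t0.0600\n"
)
pairs = [(a, b, relationship_degree(k)) for a, b, k in read_kin0(io.StringIO(toy))]
print(pairs)
```

To read the real file, the `io.StringIO(toy)` handle would be replaced with an open file handle for make_king_table.kin0.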
Genetic ancestry inference
Using PLINK2 binary files from AggV3, we have inferred genetic ancestries using 1000 Genomes Project's data. We used the five broad super-populations, and for more granular results, the 26 populations. Below are the reference tables for populations and super-populations:
| Code | Description |
|---|---|
| AFR | African |
| AMR | Admixed American |
| EAS | East Asian |
| EUR | European |
| SAS | South Asian |
Granular population table
| Population | Population Code | Super Population |
|---|---|---|
| Han Chinese in Beijing, China | CHB | EAS |
| Japanese in Tokyo, Japan | JPT | EAS |
| Southern Han Chinese | CHS | EAS |
| Chinese Dai in Xishuangbanna, China | CDX | EAS |
| Kinh in Ho Chi Minh City, Vietnam | KHV | EAS |
| Utah Residents (CEPH) with Northern and Western European Ancestry | CEU | EUR |
| Toscani in Italia | TSI | EUR |
| Finnish in Finland | FIN | EUR |
| British in England and Scotland | GBR | EUR |
| Iberian Population in Spain | IBS | EUR |
| Yoruba in Ibadan, Nigeria | YRI | AFR |
| Luhya in Webuye, Kenya | LWK | AFR |
| Gambian in Western Divisions in the Gambia | GWD | AFR |
| Mende in Sierra Leone | MSL | AFR |
| Esan in Nigeria | ESN | AFR |
| Americans of African Ancestry in SW USA | ASW | AFR |
| African Caribbeans in Barbados | ACB | AFR |
| Mexican Ancestry from Los Angeles USA | MXL | AMR |
| Puerto Ricans from Puerto Rico | PUR | AMR |
| Colombians from Medellin, Colombia | CLM | AMR |
| Peruvians from Lima, Peru | PEL | AMR |
| Gujarati Indian from Houston, Texas | GIH | SAS |
| Punjabi from Lahore, Pakistan | PJL | SAS |
| Bengali from Bangladesh | BEB | SAS |
| Sri Lankan Tamil from the UK | STU | SAS |
| Indian Telugu from the UK | ITU | SAS |
In this release, we used the samples and their population labels from 1000 Genomes Project in a process nearly identical to the one described in AggV2. It is based on PCA, which generates eigenvectors representing the genetic variation among participants as a continuous, multidimensional distribution, in which those with ancestors from the same geographical area often cluster together. Projecting the AggV3 samples onto the 1000 Genomes Project PC space allows us to estimate their genetic ancestries. The process is briefly outlined below:
- We took all unrelated samples from the 1000 Genomes Project.
- We subset the 1KGP SNPs to just our 58,763 AggV2-derived HQ SNPs (as described above).
- We calculated the first 20 PCs using PLINK2.
- We projected the AggV3 data onto the 1000 Genomes Project PC loadings.
- We trained a random forest model to predict ancestries based on:
    a. The first 10 PCs for super-populations, or 20 PCs for more granular populations
    b. Ntrees set to 500
    c. 1000 Genomes Project super-population and population labels
- Samples with a prediction probability higher than 0.8 were assigned to the predicted ancestry; otherwise they were left as "unassigned".
| File name | File description |
|---|---|
| assign_ancestry.predanc | A TSV file containing the sample_id and ancestry probabilities for each superpopulation (columns 2–6) and population (columns 7–32). The ancestry column indicates the superpopulation for which the sample has a probability > 0.8. |
| {ANCESTRY}.txt | Sample lists for each superpopulation (e.g. EUR.txt). |
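The 0.8 assignment rule can be expressed in a few lines. This is a minimal sketch of the thresholding step only, not of the random forest itself. The function name and the example probabilities are hypothetical.

```python
def assign_ancestry(probabilities: dict, threshold: float = 0.8) -> str:
    """Return the most probable label if it exceeds the threshold."""
    label, best = max(probabilities.items(), key=lambda item: item[1])
    # Strictly greater than the threshold, matching "probability > 0.8".
    return label if best > threshold else "unassigned"


# Hypothetical per-sample super-population probabilities.
confident = {"AFR": 0.02, "AMR": 0.03, "EAS": 0.01, "EUR": 0.90, "SAS": 0.04}
admixed = {"AFR": 0.05, "AMR": 0.05, "EAS": 0.00, "EUR": 0.55, "SAS": 0.35}

print(assign_ancestry(confident))
print(assign_ancestry(admixed))
```

A sample whose highest probability does not exceed the threshold stays "unassigned" even though it still has a most likely label, which is why the plots distinguish assigned ancestry from a "best-guess" ancestry.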
Additional ancestry-related resources:
| File name | File description |
|---|---|
| project_aggv3_samples_onto_1kgp_pcs.sscore | Projected PC scores for all AggV3 samples in the 1kGP3 PC space. Each row is a sample; columns are #FID, IID, ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM, and PC1–PC20. |
| calculate_1kgp_pcs.acount | Allele counts for each HQ SNP from unrelated 1kGP samples, generated using --freq counts. Contains chromosome, reference/alternate alleles, and allele counts used as input frequency data for PC projection. |
| calculate_1kgp_pcs.eigenvec | Principal component scores (eigenvectors) for each unrelated 1kGP individual. Each row is a sample; columns are PC1–PCn as specified by --pca. |
| calculate_1kgp_pcs.eigenvec.allele | Allele-specific eigenvector weights for each variant contributing to each PC, produced using the allele-wts modifier. Contains chromosome, ref, and alt columns followed by per-PC loading weights. Used as --score input to project additional samples. |
| calculate_1kgp_pcs.eigenval | Eigenvalues corresponding to each principal component. |
| {FILE_PREFIX}.log | PLINK2 run log recording parameters used, sample and variant counts after filtering, and any warnings or errors encountered during execution. |
Ancestry summary stats
| Population | N | % |
|---|---|---|
| AFR | 361 | 32.61 |
| AMR | 624 | 0.45 |
| EAS | 1199 | 0.87 |
| EUR | 10956 | 79.16 |
| SAS | 1273 | 9.20 |
| unassigned | 1067 | 27.71 |

Principal Components
We ran PCA as implemented in PLINK2 using the 58,763 HQ SNPs. We calculated 50 PCs on 102,341 unrelated individuals. Due to the high number of samples, we used the approx modifier of the --pca command. We then projected all 138,406 individuals (related and unrelated) onto these PCs using PLINK2's --score functionality (which writes the .sscore files), with variance standardisation and without mean imputation of missing genotypes.
We also provide the first 50 PCs calculated on samples from each super-population, using the inferred genetic ancestries described above. We ran PCA as described above on unrelated sets of individuals from each super-population, and then projected all individuals onto these PCs.
| File name | File description |
|---|---|
| project_aggv3_samples_onto_unrelated_aggv3_pcs_{ANCESTRY}.sscore | Projected PC scores for each sample in the ancestry‑filtered set. ALL refers to all AggV3 samples. Each row is a sample; columns are #FID, IID, ALLELE_CT, NAMED_ALLELE_DOSAGE_SUM, and PC1–PC50. |
| calculate_unrelated_aggv3_pcs_{ANCESTRY}.acount | Allele counts for each variant in the ancestry‑filtered sample subset, generated using --freq counts. Contains chromosome, reference/alternate alleles, and allele counts used as input frequency data for PC projection. |
| calculate_unrelated_aggv3_pcs_{ANCESTRY}.eigenvec | Principal component scores (eigenvectors) for each unrelated individual in the ancestry‑filtered set. Each row is a sample; columns are PC1–PCn as specified by --pca. |
| calculate_unrelated_aggv3_pcs_{ANCESTRY}.eigenvec.allele | Allele‑specific eigenvector weights for each variant contributing to each PC, produced using the allele-wts modifier. Contains chromosome, ref, and alt columns followed by per‑PC loading weights. Used as --score input to project additional samples. |
| calculate_unrelated_aggv3_pcs_{ANCESTRY}.eigenval | Eigenvalues corresponding to each principal component. |
| {FILE_PREFIX}.log | PLINK2 run log recording parameters used, sample and variant counts after filtering, and any warnings or errors encountered during execution. |
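Conceptually, projecting a sample onto a PC is a weighted sum of its standardised genotypes, using the per-variant weights and the allele frequencies of the reference (unrelated) set. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions, and PLINK2's own scaling conventions may differ. All names and numbers are hypothetical.

```python
import math


def standardise(genotype: int, alt_freq: float) -> float:
    """Centre and scale an alternate-allele count (0, 1 or 2).

    A monomorphic variant (alt_freq of 0 or 1) has a standard deviation
    of zero and raises ZeroDivisionError here.
    """
    sd = math.sqrt(2 * alt_freq * (1 - alt_freq))
    return (genotype - 2 * alt_freq) / sd


def project(genotypes, alt_freqs, weights) -> float:
    """Average weight * standardised genotype over non-missing variants."""
    total, used = 0.0, 0
    for genotype, alt_freq, weight in zip(genotypes, alt_freqs, weights):
        if genotype is None:  # missing genotype: skipped, not mean-imputed
            continue
        total += weight * standardise(genotype, alt_freq)
        used += 1
    return total / used if used else float("nan")


# Hypothetical sample with one missing genotype.
score = project([0, 1, 2, None], [0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0])
print(score)
```

Skipping missing genotypes rather than replacing them with the mean keeps each score based only on observed data, at the cost of scores being averaged over slightly different variant sets per sample.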
Note on monomorphic SNPs
During the PCA projection step, we found that 154 SNPs in the EAS population were monomorphic in the unrelated set.
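Because the variance-standardise option used in PLINK2 divides each SNP's genotypes by their standard deviation, which is zero for a monomorphic SNP, we removed these SNPs from the EAS projection step so that the projection could complete.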
Monomorphic SNPs were identified by filtering the .acount files per superpopulation for sites where AF = 0 or AF = 1.
We acknowledge that Hardy–Weinberg Equilibrium filtering should have removed these sites and accept this as a caveat of using the HQ SNPs derived in AggV2.
We intend to correct this in future releases.
The list of monomorphic SNPs is provided in the main output directory:
| File name | File description |
|---|---|
| identify_monomoprhic_snps_{ANCESTRY}.txt | A list of monomorphic SNPs. Files are empty for all populations except EAS. |
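The AF = 0 or AF = 1 filter can be reproduced from an allele-count file along the following lines. This sketch assumes a tab-separated layout with `ID`, `ALT_CTS` and `OBS_CT` columns and bi-allelic sites; the toy rows and the function name are hypothetical, and the header of the released .acount files should be checked before reuse.

```python
import csv
import io


def monomorphic_ids(handle):
    """Return IDs of variants whose alternate allele frequency is 0 or 1."""
    ids = []
    for row in csv.DictReader(handle, delimiter="\t"):
        alt_count = int(row["ALT_CTS"])
        observed = int(row["OBS_CT"])
        if observed == 0:
            continue  # no observed alleles, frequency undefined
        # Comparing integer counts avoids floating-point equality checks.
        if alt_count == 0 or alt_count == observed:
            ids.append(row["ID"])
    return ids


# Hypothetical toy table, not real AggV3 data.
toy = (
    "#CHROM\tID\tREF\tALT\tALT_CTS\tOBS_CT\n"
    "1\tsnp1\tA\tG\t0\t200\n"
    "1\tsnp2\tC\tT\t57\t200\n"
    "2\tsnp3\tG\tA\t200\t200\n"
)
print(monomorphic_ids(io.StringIO(toy)))
```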
AggV3 PC plots
Below we show the first 4 PCs, derived from PCA on the AggV3 samples. The graphs show samples coloured by their inferred genetic ancestry (using a threshold of T=0.8), and 'best-guess' ancestry based on their inferred super-population. Additionally, we provide the percentage of variance explained by each principal component, estimated by dividing the eigenvalues by the total number of samples used to calculate the PCs.
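The variance-explained estimate described above can be sketched as follows. Dividing by the sample count is reasonable if the variance-standardised relationship matrix has a trace roughly equal to the number of samples, so that all eigenvalues together sum to approximately that number; this is an assumption of the sketch. The function names and eigenvalues are hypothetical.

```python
def read_eigenvalues(text: str):
    """Parse eigenvalues listed one per line."""
    return [float(token) for token in text.split()]


def percent_variance_explained(eigenvalues, n_samples: int):
    """Estimate the percentage of variance explained by each PC."""
    return [100.0 * value / n_samples for value in eigenvalues]


# Hypothetical eigenvalues for a PCA computed on 1,000 samples.
toy_eigenval = "50.0\n20.0\n"
percentages = percent_variance_explained(read_eigenvalues(toy_eigenval), 1000)
print(percentages)
```

For the released files, `n_samples` would be the number of unrelated individuals used to compute the PCs, not the number of projected samples.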