AggV2 ancestry inference¶
Using the multi-sample VCFs from aggV2, we have estimated probabilities of genetic ancestry for five broad super-populations, calculated Principal Components (PCs) for participants in aggV2, and calculated pairwise relatedness amongst samples. Alternatively, we have also calculated more fine-grained mappings to 15 worldwide reference populations for AggV2.
We estimated broad genetic ancestry using super-populations from the 1000 genomes project phase 3 (1KGP3) as the truth, by generating PCs for 1KGP3 samples and projecting all aggV2 participants onto these. The five broad super-populations are:
Code | Description |
---|---|
afr |
African |
amr |
Admixed American |
eas |
East Asian |
eur |
European |
sas |
South Asian |
Ancestry inference¶
We used the 1KGP3 to infer ancestry as follows:
- We took all unrelated samples from the 1KGP3
- We subsetted to just our 188382 HQ SNPs
- Further filtered for MAF > 0.05 in 1KGP3 (as well as in our data)
- We calculated the first 20 PCs using GCTA
- We projected the AggV2 data onto the 1KGP3 PC loadings
- We trained a random forest model to predict ancestries based on
- First eight 1KGP3 PCs
- set Ntrees = 300
- Train and predict on 1KGP3 amr, afr, eas, eur and sas super-populations
Model performance¶
Below we show the summary data for the random forest model fit. The out-of-bag (OOB) error rate and confusion matrix show very high performance in the prediction of 1KGP3 super-populations.
Random Forest ancestry model fit
Call:
randomForest(x = rfdat[, pcs1_8], y = SuperPopLabels, ntree = 400, keep.inbag = T)
Type of random forest: classification
Number of trees: 400
No. of variables tried at each split: 2
OOB estimate of error rate: 0.24%
Confusion matrix:
AFR AMR EAS EUR SAS class.error
AFR 638 2 0 0 0 0.00312500
AMR 3 342 0 1 0 0.01156069
EAS 0 0 498 0 0 0.00000000
EUR 0 0 0 499 0 0.00000000
SAS 0 0 0 0 480 0.00000000
The probabilities for each individual is found at:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/ancestry/MAF5_superPop_predicted_ancestries.tsv
If you are interested in more fine-grained population structure, we provide a set of ancestry predictions based sub-population ancestries from the 1KGP3. The steps to calculate are as above and differ only for steps 3 and 6.
3 - MAF filter of >0.01 for 1KGP3 and aggV2 data
6 - We trained a random forest model to predict ancestries based on 1KGP3 sub-populations
These data are available at:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/ancestry/MAF1_subpops/MAF1_subPop_predicted_ancestries.tsv
Ancestry summary stats¶
Below is a summary table for the number of individuals (and as a percent of the cohort) assigned with a probability of >0.8 for any one ancestry.
Population | N | % |
---|---|---|
afr |
2002 | 2.56 |
amr |
238 | 0.3 |
eas |
518 | 0.66 |
eur |
62349 | 79.7 |
sas |
7195 | 9.2 |
unassigned | 5893 | 7.54 |
PCs with 1KG samples and projected aggV2 samples, coloured by predicted ancestry¶
Below we show the first six PCs, which were used for the ancestry inference of the aggV2 samples. The plots to the left show all samples (in grey), with the 1KGP3 samples plotted in different colours by super-population. The plots to the right show all samples (in grey), with the aggV2 samples plotted in different colours by predicted super-population (using a threshold of T = 0.8). 1KG samples are represented by crosses, and aggV2 samples by solid circles.
The following plot focuses on EUR and EAS sub-populations from 1KGP3. 1KG samples are represented by crosses, and aggV2 samples by solid circles. PCs for all 1KGP3 and aggV2 samples are included, in grey. In addition:
Left: 1KGP3 samples in different colours by super-population
Middle: 1KGP3 samples in different colours by EAS sub-populations, with aggV2 predicted EAS plotted on top
Right: 1KGP3 samples in different colours by NFE and FIN populations, with aggV2 predicted EUR samples plotted in darkblue.