Genetic similarity to worldwide populations (ancestry) in the UK Biobank¶
We calculate genome-wide genetic similarity to worldwide populations (also known as "genetic ancestry", "inferred ancestry", "ethnicity" or "race"; these terms all have slightly different meanings and our preferred term is "genetic similarity"). This can help you to assign participants to groups for further studies. These data are available in flatfiles in the RE. We also provide details on how these similarity data were calculated.
Genetic similarity to worldwide populations was calculated for the 100kGP release 17 and NHS GMS release 2. They were compared to reference groups curated using an extended dataset including the UK Biobank following the methodology outlined in Prive et al 2022 (a). These data are provided as part of the Genomics England Diverse Data initiative.
Use cases¶
It can be useful to know genetic similarity to worldwide populations for:
- Genome Wide Association Studies (GWAS) where independently analysing genetically homogeneous groups of participants may be a methodological requirement.
- Comparative or meta-analyses across groups of participants, labelling these participants with where in the world people genetically similar to them are from.
File locations¶
You can find the results of these analyses in the RE at:
File and descriptions | File path |
---|---|
100,000 Genomes Project (100kGP release 17) | /gel_data_resources/gel_diverse_data/100k/310523/100k-main_programme_v17_ukbb_worldwide_populations_310523.csv.gz |
NHS Genomic Medicine service (release 2) | /gel_data_resources/gel_diverse_data/nhs-gms/310523/nhs-gms_v2_ukbb_worldwide_populations_310523.csv.gz |
Column descriptors¶
44 total columns, 93,813 individuals
Field (snake_case) | Enumerations/Data Type | Description |
---|---|---|
Participant ID (participant_id) | participantId, xs:string | Participant Identifier (supplied by Genomics England) |
Platekey (platekey) | varchar | Concatenation of Plate ID and Well ID - unique identifier for a processed well |
Programme (programme) | varchar | 100,000 Genomes Project or NHS Genomic Medicine Service |
Genome Build (genome_build) | varchar | Reference Genome Build (GRCh37 or GRCh38) |
Approximate FST to reference* (approximate_fst_to_reference_*) | float | Squared euclidean distance to Reference PC centres transformed to approximate FST (metric of genetic differentiation). Calculated after projecting individual genotypes and Reference allele frequencies onto PC1 to PC16 constructed using a curated subset of genomes from the UK Biobank and the 1000 Genomes Project. |
Proportional similarity to reference (_proportional_similarity_to_reference_*) | float | Coefficient of relative genetic similarity to Reference modelled as a convex combination (non-negative and sum to 1) of coefficients from all reference groups. Calculated after projecting individual genotypes and Reference allele frequencies onto PC1 to PC16 constructed using a curated subset of genomes from the UK Biobank and the 1000 Genomes Project. |
* One column for each of the 21 reference groups described below.
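As a quick sanity check on the description above, the proportional similarity coefficients for one participant should be non-negative and sum to 1. The following is a minimal sketch on invented values. It assumes (this is not confirmed by the table above) that the column headers in the file start with "Proportional similarity to ", mirroring the "Approximate FST to " headers used in the walkthrough further down.

```r
# Toy data: three participants, three reference groups (all values invented).
toy <- data.frame(
  `Participant ID` = c("p1", "p2", "p3"),
  `Proportional similarity to United Kingdom` = c(0.75, 0.10, 0.00),
  `Proportional similarity to Africa West`    = c(0.25, 0.00, 0.40),
  `Proportional similarity to Pakistan`       = c(0.00, 0.90, 0.60),
  check.names = FALSE
)

# Assumed header prefix; adjust it to match the real file.
prop_cols <- grep("^Proportional similarity to ", names(toy), value = TRUE)
prop_matrix <- as.matrix(toy[, prop_cols])

# Convex combination: every coefficient >= 0 and each row sums to 1
# (within a small numerical tolerance).
is_convex <- rowSums(prop_matrix < 0) == 0 &
  abs(rowSums(prop_matrix) - 1) < 1e-6
```

Any participant for which `is_convex` is `FALSE` would be worth inspecting before using the coefficients.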
Reference Groups¶
A short description of the 21 reference groups and their respective continental populations.
Reference population | Super population | Super population code | Founder population? |
---|---|---|---|
United Kingdom | Europe | EUR | No |
Ireland | Europe | EUR | No |
Scandinavia | Europe | EUR | No |
Europe South East | Europe | EUR | No |
Europe North East | Europe | EUR | No |
Europe South West | Europe | EUR | No |
Italy | Europe | EUR | No |
Finland | Europe | EUR | Yes |
Ashkenazi | Europe | EUR | Yes |
Africa North | North Africa | NAF | No |
Middle East | Middle East | MID | Yes |
Pakistan | South Asia | SAS | No |
Sri Lanka | South Asia | SAS | No |
Bangladesh | South Asia | SAS | No |
Asia East | East Asia | EAS | No |
Japan | East Asia | EAS | No |
Philippines | East Asia | EAS | No |
South America | South America | AMR | No |
Africa East | East Africa | EAF | No |
Africa South | Africa Niger-Congo * | ANC | No |
Africa West | Africa Niger-Congo * | ANC | No |
* Despite being commonplace in genomics research, we do not recommend using a single "African" or AFR super-population for group assignment. If you wish to use information in this table to align group assignments with gnomAD or the 1,000 Genomes Project, which do use this terminology, we recommend using the "Africa Niger-Congo" or ANC super-population category, which includes both the Africa West and Africa South reference populations described in Prive et al 2022 (a).
For more information on Africa genomics, diversity, and identity we recommend these reviews by Yere et al 2022 and Pereira et al 2021.
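If you want to work with the super-population codes programmatically, one option is a named lookup vector. The sketch below simply transcribes the table above into R; the variable name `super_pop` is illustrative and not part of any released file.

```r
# Super-population code for each of the 21 reference populations,
# transcribed from the Reference Groups table.
super_pop <- c(
  "United Kingdom" = "EUR", "Ireland" = "EUR", "Scandinavia" = "EUR",
  "Europe South East" = "EUR", "Europe North East" = "EUR",
  "Europe South West" = "EUR", "Italy" = "EUR",
  "Finland" = "EUR", "Ashkenazi" = "EUR",
  "Africa North" = "NAF", "Middle East" = "MID",
  "Pakistan" = "SAS", "Sri Lanka" = "SAS", "Bangladesh" = "SAS",
  "Asia East" = "EAS", "Japan" = "EAS", "Philippines" = "EAS",
  "South America" = "AMR", "Africa East" = "EAF",
  "Africa South" = "ANC", "Africa West" = "ANC"
)

# Look up the code for one or more reference populations.
codes <- unname(super_pop[c("Japan", "Africa West")])

# Number of reference populations in each super-population.
groups_per_super_pop <- table(super_pop)
```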
How do I use this table for group assignment?¶
You can use these tables to identify groups of participants genetically similar to some larger super-population, such as South Asia, or to analyse a smaller subset of participants, such as those genetically similar to Ashkenazi Jews. We strongly recommend you read the original methodological paper to understand how best to use this table.
Here is a short walkthrough using R:
First, we read the table into memory and select the columns containing information on approximate FST:
library(tidyverse)
gms_table <- read_csv("/gel_data_resources/gel_diverse_data/nhs-gms/310523/nhs-gms_v2_ukbb_worldwide_populations_310523.csv.gz")
fst_matrix <- gms_table %>%
  select(
    starts_with("Approximate FST to")
  ) %>%
  rename_all(
    ~stringr::str_replace(., "^Approximate FST to ", "")
  ) %>%
  as.matrix()
Let's say we want to identify participants genetically similar to European groups, akin to those described in gnomAD (Non-Finnish European), and assign all other participants to super populations.
To do this, we simply group together multiple sub-populations into super-populations.
# First select sub-population names from columns
group <- colnames(fst_matrix)
# Next define each super-population using vectors of sub-populations
group[group %in% c("Scandinavia", "United Kingdom", "Ireland", "Italy", "Europe South West", "Europe South East", "Europe North East")] <- "Non-Finnish Europe"
group[group %in% c("Africa West", "Africa South")] <- "Africa Niger-Congo"
group[group %in% c("Pakistan", "Sri Lanka", "Bangladesh")] <- "South Asia"
group[group %in% c("Asia East", "Japan", "Philippines")] <- "East Asia"
Now we can define a threshold of approximate FST between a participant and a particular reference group to assign the participant to the group. You should consider the size of your threshold, and the corresponding stringency, for your studies.
In this example the threshold is set to FST = 0.002, as was used in Prive et al 2022 (b). For GWAS, you might want a more lenient threshold like FST = 0.005, as used in the bigsnpr manual.
In many instances, the FST threshold may be lenient enough that an individual could reasonably be assigned to multiple groups. In such instances, we assign the individual to the group with the minimum FST. Individuals who cannot be assigned to any group at the given threshold are given NA.
Now putting this all together...
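# Threshold on approximate FST used for assignment (0.002, as in Prive et al 2022 (b))
threshold <- 0.002
# For each participant, find the closest reference group and keep it only if
# its approximate FST is below the threshold; otherwise assign NA
cluster <- group[apply(fst_matrix, 1, function(x) {
  ind <- which.min(x)
  if (isTRUE(x[ind] < threshold)) ind else NA
})]
assignments <- data.frame(
  participant_id = gms_table$`Participant ID`,
  group_assignment = cluster
)
assignments %>%
  count(group_assignment)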
# group_assignment n
# 1 Africa East 5
# 2 Africa Niger-Congo 82
# 3 Africa North 23
# 4 Ashkenazi 38
# 5 East Asia 31
# 6 Middle East 118
# 7 Non-Finnish Europe 3221
# 8 South America 3
# 9 South Asia 526
# 10 <NA> 325
And now you should have a table of participants labelled according to the reference groups defined above.
For example, here we can see that there are 3,221 participants in the NHS Genomic Medicine Service release 2 that are genetically similar enough to at least one Non-Finnish European (NFE) reference group in the UK Biobank at FST < 0.002 to be confidently labelled as such.
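To see how the choice of threshold changes the assignments, the logic of the walkthrough can be wrapped in a small function and run at more than one threshold. This is a sketch in base R on an invented FST matrix. The function name `assign_groups` and all toy values are illustrative, and the function assumes every row has at least one non-missing FST value.

```r
# Assign each row (participant) to the group of its closest reference
# population, or NA if even the closest one is not below the threshold.
assign_groups <- function(fst, groups, threshold) {
  nearest <- apply(fst, 1, which.min)
  nearest_fst <- fst[cbind(seq_len(nrow(fst)), nearest)]
  ifelse(nearest_fst < threshold, groups[nearest], NA_character_)
}

# Invented approximate FST values: four participants, three reference groups.
toy_fst <- matrix(
  c(0.0010, 0.0300, 0.0500,
    0.0040, 0.0200, 0.0600,
    0.0400, 0.0015, 0.0700,
    0.0200, 0.0300, 0.0250),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("United Kingdom", "Pakistan", "Japan"))
)
# Group label for each column, in column order.
toy_groups <- c("Non-Finnish Europe", "South Asia", "East Asia")

strict  <- assign_groups(toy_fst, toy_groups, threshold = 0.002)
lenient <- assign_groups(toy_fst, toy_groups, threshold = 0.005)
```

In this toy example the second participant is unassigned at 0.002 but assigned to Non-Finnish Europe at 0.005, while the fourth participant is unassigned at both thresholds.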
Differences to other genetic similarity and ancestry data in the RE¶
Participants in aggV2 have been classified into groups using a random forest classifier trained on data from the 1,000 Genomes Project. The differences between these data are:
Difference | Genetic similarity | AggV2 ancestry |
---|---|---|
Participants | All participants in 100kGP release 17 and NHS GMS release 2 | Participants in release 10 whose genomes were aligned to GRCh38 |
Method | Prive et al 2022 (a) | random forest classifier |
Compared to | an extended dataset including the UK Biobank | 1,000 Genomes Project |
The benefits of using the genetic similarity data tables over the aggV2 ancestry inference include:
- a greater diversity of worldwide populations (e.g. Middle East or East Africa)
- coefficients of proportional genetic similarity (e.g. 75% United Kingdom, 25% Africa West)
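One possible way to summarise the proportional coefficients is to add them up within each super-population. The sketch below does this in base R on invented values. The matrix `toy_prop` and the vector `col_super` are illustrative, and summing within super-populations is a convenience for exploration, not a step taken from the published method.

```r
# Invented proportional similarity coefficients: two participants,
# four reference groups. Each row sums to 1.
toy_prop <- matrix(
  c(0.75, 0.00, 0.25, 0.00,
    0.10, 0.20, 0.00, 0.70),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("p1", "p2"),
                  c("United Kingdom", "Ireland", "Africa West", "Pakistan"))
)
# Super-population code for each column, in column order.
col_super <- c("EUR", "EUR", "ANC", "SAS")

# rowsum() adds up rows by group, so transpose to sum columns instead.
super_prop <- t(rowsum(t(toy_prop), group = col_super))

# Super-population with the largest summed coefficient for each participant.
top_super <- colnames(super_prop)[max.col(super_prop, ties.method = "first")]
```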
How was this table created?¶
This table was created using the methodologies outlined in Prive et al 2022 (a) and further described in the documentation for the R package bigsnpr. Please cite Florian Prive's work if using this methodology in your own analyses.
Public File Paths¶
File and descriptions | File path |
---|---|
UK Biobank and 1,000 Genomes reference PC loadings (GRCh38) | /public_data_resources/ukbb_prive/uk_biobank_1kg_grch38_projection.csv.gz |
UK Biobank and 1,000 Genomes reference allele frequencies (GRCh38) | /public_data_resources/ukbb_prive/uk_biobank_1kg_grch38_freqs.csv.gz |
UK Biobank and 1,000 Genomes reference PC loadings (GRCh37) | /public_data_resources/ukbb_prive/uk_biobank_1kg_grch37_projection.csv.gz |
UK Biobank and 1,000 Genomes reference allele frequencies (GRCh37) | /public_data_resources/ukbb_prive/uk_biobank_1kg_grch37_freqs.csv.gz |
Methodology¶
60,825 high-quality (HQ) SNPs aligned to GRCh38 were extracted from aggV2 and aggCOVID v5. These HQ SNPs were lifted over to GRCh37 coordinates using UCSC LiftOver when necessary. For each participant or rare disease family, genotypes at these HQ SNP coordinates were extracted from VCF files using PLINK 2.0, and genotypes at sites that were non-variable in an individual were set as homozygous for the reference allele using bcftools 1.12.
assign_pipeline.sh
For each participant, their genetic similarity to worldwide populations in the UK Biobank was estimated using the ukbb-assigner tool.
HQ sites were intersected with 5,816,590 loci aligned to GRCh37 detailed in Prive et al 2022 (a) (converted to GRCh38 coordinates using UCSC LiftOver when necessary), resulting in 55,706 sites used for inference.
The big_prodMat() function from the bigsnpr R package was used to project participant genotypes and reference group allele frequencies at these 55,706 sites onto the top 16 linkage-disequilibrium scaled principal components (PC1-PC16) calculated using a selection of individuals from the UK Biobank and the 1,000 Genomes Project as calculated in Prive et al 2022 (a).
ukbb-assigner.R
The SNP intersection and projection takes approximately 10 seconds.
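Conceptually, the projection step described above is a matrix product between genotypes (or reference allele frequencies) and PC loadings. The following base R sketch uses invented numbers and tiny dimensions. Here `%*%` and `crossprod()` stand in for big_prodMat(), and any scaling or correction applied in the real pipeline is deliberately left out.

```r
set.seed(1)
# Toy sizes; the real analysis used 55,706 sites and 16 PCs.
n_ind <- 4
n_snp <- 10
n_pc  <- 3

# Invented genotype dosages (0, 1 or 2): one row per participant.
genotypes <- matrix(sample(0:2, n_ind * n_snp, replace = TRUE), nrow = n_ind)
# Invented PC loadings: one row per SNP, one column per PC.
loadings <- matrix(rnorm(n_snp * n_pc), nrow = n_snp)
# Invented allele frequencies for two reference groups: one row per SNP.
ref_freqs <- matrix(runif(n_snp * 2), nrow = n_snp,
                    dimnames = list(NULL, c("Group A", "Group B")))

# Participants projected into PC space (participants x PCs).
ind_proj <- genotypes %*% loadings
# Reference group centres in the same PC space (groups x PCs).
ref_centres <- crossprod(ref_freqs, loadings)
```

Both results share the same PC columns, which is what allows distances between participants and reference groups to be computed in the next step.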
Squared Euclidean distances between each participant and the 21 curated reference groups in PC space were converted into approximate FST (Prive et al 2022 (b)).
ukbb-assigner.R
Obtaining approximate FST to reference groups after SNP intersection and projection takes approximately 0.5 seconds.
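The distance step can be sketched in base R as follows. The PC coordinates are invented, and `fst_scale` is a hypothetical placeholder for the factor that converts squared distance into approximate FST. The real conversion depends on the reference PC space and should be taken from Prive et al 2022 (b) or the bigsnpr documentation.

```r
# Invented PC coordinates for two participants.
ind_proj <- matrix(c(0, 0,
                     3, 4),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("p1", "p2"), c("PC1", "PC2")))
# Invented PC centres for two reference groups.
ref_centres <- matrix(c(0, 1,
                        3, 4),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(c("Group A", "Group B"), c("PC1", "PC2")))

# Squared Euclidean distance from every participant (rows)
# to every reference centre (columns).
sq_dist <- apply(ref_centres, 1, function(centre) {
  rowSums(sweep(ind_proj, 2, centre, "-")^2)
})

# Hypothetical conversion factor, NOT the published value.
fst_scale <- 0.001
approx_fst <- sq_dist * fst_scale
```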
Coefficients of proportional genetic similarity to each of these 21 reference groups were calculated using an extension of the Summix method also described in Prive et al 2022 (a).
ukbb-assigner.R
Obtaining proportional genetic similarity to reference groups using the Summix method after SNP intersection and projection takes approximately 10 seconds.
Don't you mean ancestry groups?¶
You may be familiar with other terms such as genetic ancestry, inferred ancestry, ethnicity or race. Here, we prefer instead to use the more precise term of genetic similarity.
We prefer this term because it doesn't imply anything about where a person or their family come from; such inferences can lead to misunderstandings or misinterpretations. A person could have genetic similarity to, for example, Middle East labelled reference genomes in the UK Biobank (to some threshold value, e.g. approximate FST < 0.002), but it would be wrong to say they are of Middle Eastern ancestry or Middle Eastern descent if they and their family are from Europe.
Please do not use this information to describe an individual's personal identity, and do not over-interpret results. For more information, we recommend reading discussions such as those authored by Graham Coop.
What if the genomes I wish to analyse are not included here?¶
Currently, consented participants from the 100,000 Genomes Project (100kGP) release 17 and the NHS GMS release 2 are included in this table. However, a small subset of consented participants from the NHS GMS release 2 may not be included, owing to one or more individuals within their family withdrawing consent.
We plan on including these genomes at a later date. In the meantime, please see the Methodology section above for guidance on how to generate this information for any participant.
Contact¶
For any additional help or advice on genetic similarity assignment, participant genetic ancestry, or utilising this updated methodology please raise a ticket and tag Sam Tallman (samuel.tallman@genomicsengland.co.uk).