
Summary statistics across genetically-inferred ancestry groups for 100,000 Genomes Project participants

Here we provide summary statistics, stratified by genetically inferred ancestry groups, for the AggV2 dataset. AggV2 is an aggregate of gVCFs comprising high-quality germline genomes from genetically diverse individuals, derived from release 10 of the 100,000 Genomes Project (100kGP) dataset. Summary statistics include basic demographic data, information on socio-economic status, population structure, sample sources, sequencing quality, genotype and disease status. We provide these data as a reference and to help you account for potential confounders related to ancestral diversity when analysing data from AggV2.

Participants

This analysis covers 76,849 individuals recruited to the 100kGP who are included in AggV2. We excluded participants who withdrew from the study after recruitment (N = 9) and participants with indeterminate sex or mismatched self-reported and genotype sex (N = 1,284), leaving 76,849 participants.
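
The sex-based exclusion can be sketched as below, using the chromosome combinations listed under Definition of phenotypes. This is only an illustrative sketch: the function name and the string encodings of self-reported sex and karyotype are assumptions, not the fields used in the 100kGP tables.

```python
from typing import Optional

# Chromosome combinations taken from the "Sex" row of "Definition of phenotypes".
MALE_KARYOTYPES = {"XY", "XXY", "XYY"}
FEMALE_KARYOTYPES = {"XX", "X0", "XXX"}


def classify_sex(self_reported: str, karyotype: str) -> Optional[str]:
    """Return "male" or "female", or None when the participant would be excluded.

    The input encodings ("male"/"female", "XY", ...) are hypothetical.
    """
    reported = self_reported.strip().lower()
    karyo = karyotype.strip().upper()
    if reported == "male" and karyo in MALE_KARYOTYPES:
        return "male"
    if reported == "female" and karyo in FEMALE_KARYOTYPES:
        return "female"
    # Indeterminate self-report, or self-report and genotype disagree: excluded.
    return None
```

A participant for whom the function returns None would fall into the excluded group described above.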

Ancestry group classification

Participants were classified into five super-population groups based on principal component (PC) scores. Participants who didn’t meet the criteria for classification into a super-population group were classified as “unassigned” (see Definition of phenotypes). All analysed participants had genomes aligned to GRCh38.
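
As a minimal sketch of the assignment rule, assuming each participant already has a per-group probability derived from the PC scores (how those probabilities are computed is not covered here), the 0.8 threshold given under Definition of phenotypes could be applied as follows. The function and variable names are hypothetical.

```python
from typing import Mapping

SUPER_POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")
PROBABILITY_THRESHOLD = 0.8  # threshold stated in "Definition of phenotypes"


def assign_ancestry(probabilities: Mapping[str, float]) -> str:
    """Return the group whose probability exceeds the threshold, else "Unassigned".

    `probabilities` is an assumed input format: group label -> probability.
    """
    best = max(SUPER_POPULATIONS, key=lambda pop: probabilities.get(pop, 0.0))
    if probabilities.get(best, 0.0) > PROBABILITY_THRESHOLD:
        return best
    return "Unassigned"
```

For example, a participant with probabilities of 0.6 for EUR and 0.4 for SAS would be placed in the unassigned group, because no single group exceeds 0.8.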

The following tables include summary statistics based on genetically inferred ancestry groups. These are genetic summary statistics that don’t consistently correspond to observable phenotypes. This, together with the fact that all extracted phenotypes have more than five participants (sum of the rows), ensures compliance with the current Airlock policy.

Figure 1. Distribution of key variables across genetically inferred ancestry groups.

Phenotypes with fewer than five patients are not displayed.

Characteristics of all participants across populations

Phenotypes AFR (N = 1,967; 2.6%) AMR (N = 238; 0.3%) EAS (N = 505; 0.7%) EUR (N = 61,279; 79.7%) SAS (N = 7,077; 9.2%) Unassigned (N = 5,783; 7.5%)
Age, median (IQR; min-max) 45 (25; 1-98) 39 (30; 1-78) 44 (21; 0-91) 45 (32; 0-100) 37 (29; 0-89) 35 (36; 0-96)
Sex (M / F, %) 46.1 / 53.9 42.9 / 57.1 40.6 / 59.4 46.5 / 53.5 49.7 / 50.3 50.4 / 49.6
Socio-economic status
Deprivation (%)
Lowest 20.1 9.3 10.7 8.8 24.3 15.5
Lower middle 61 51.4 47.2 38.2 50.8 48.1
Upper middle 15.8 27.9 32.6 41.8 20.8 29.7
Highest 3.1 11.4 9.4 11.3 4.1 6.8
Index of Multiple Deprivation (Rank) 32.4 (21.8) 20.3 (23.9) 20.9 (24.4) 16.0 (18.9) 29.6 (27.1) 23.8 (25.7)
Population structure
Self-reported ethnicity (%)
Asian 0.3 0.0 69.3 0.0 78.5 11.2
Black 73.5 0.4 0.0 0.0 0.4 6.0
White 2.9 40.3 2.4 84.1 0.9 36.6
Mixed 4.6 13.9 2.0 0.4 1.1 17.2
Not known 0.5 1.3 0.4 0.9 0.6 0.4
Not stated 16.8 16.8 14.3 14.3 17.0 18.0
Other 1.4 27.3 11.7 0.4 1.6 10.7
Sample source (%)
Blood 99.1 99.2 98.0 98.7 99.4 99.0
Fibroblast 0.1 0.0 0.2 0.1 0.1 0.1
Saliva 0.7 0.4 0.8 1.0 0.4 0.6
Tissue 0.2 0.4 1.0 0.1 0.1 0.3
Sequencing quality
Callability (%) 95.21 (0.64) 95.21 (0.69) 95.18 (0.67) 95.28 (0.66) 95.30 (0.66) 95.31 (0.66)
Array concordance (%) 99.95 (0.0001) 99.96 (0.0001) 99.96 (0.003) 99.96 (0.01) 99.96 (0.0001) 99.96 (0.0001)
Contamination 0.0034 (0.002) 0.0032 (0.002) 0.0034 (0.002) 0.0030 (0.002) 0.0032 (0.001) 0.0031 (0.001)
Sequencing coverage
All chromosomes 39.28 (7.71) 38.80 (6.87) 39.06 (7.84) 39.09 (8.22) 39.46 (8.48) 39.11 (8.11)
Autosomal chromosomes 39.72 (7.9) 39.45 (7.26) 39.45 (8.33) 39.58 (8.37) 39.97 (8.57) 39.63 (8.21)
Mapping Quality (%)
Percent MAPQ 10 reads 89.19 (1.45) 89.37 (1.43) 89.31 (1.33) 89.34 (1.33) 89.29 (1.31) 89.30 (1.35)
Autosome Coverage at 15X 96.03 (0.35) 96.05 (0.34) 96.03 (0.37) 96.07 (0.37) 96.08 (0.38) 96.06 (0.36)
Average base call quality 36.7 (1.0) 36.6 (0.9) 36.6 (1.0) 36.7 (0.9) 36.7 (1.0) 36.7 (1.0)
Percent aligned reads (%) 93.7 (1.50) 93.7 (1.41) 93.7 (1.39) 93.7 (1.37) 93.7 (1.30) 93.7 (1.36)
Percent Q30 bases (%) 84.19 (3.51) 84.09 (3.05) 83.96 (3.42) 84.12 (3.48) 84.03 (3.39) 84.01 (3.50)
Sample error rate 0.0100 (0.0019) 0.0098 (0.0018) 0.0099 (0.002) 0.0097 (0.0018) 0.0097 (0.0019) 0.0097 (0.0019)
Mismatch rate (%) 0.76 (0.17) 0.76 (0.17) 0.75 (0.18) 0.75 (0.18) 0.75 (0.17) 0.76 (0.17)
Fragment length median 482 (32) 485 (34) 482 (31) 482 (33) 483 (33) 483 (34)
Genotype
Number of variants
Indel (per 10000) 118.1 (3.26) 100 (3.72) 98.3 (2.39) 97.5 (2.38) 99.1 (2.91) 99.8 (4.92)
SNV (per 10000) 473.6 (3.26) 400.8 (3.72) 393.8 (2.39) 391.6 (2.38) 398.3 (2.91) 398.6 (4.92)
Indel + SNV (per 10000) 592 (9.8) 500.8 (15.3) 492 (5.2) 489.2 (5.2) 497.7 (10.0) 498.1 (19.5)
Transition/transversion ratio 2.065 (0.003) 2.062 (0.003) 2.056 (0.003) 2.062 (0.003) 2.060 (0.004) 2.062 (0.004)
Indel Het/Hom ratio 2.92 (0.12) 2.29 (0.16) 1.84 (0.05) 2.14 (0.04) 2.17 (0.25) 2.24 (0.27)
Deletion Het/Hom ratio 3.20 (0.13) 2.46 (0.19) 1.97 (0.06) 2.29 (0.05) 2.32 (0.27) 2.40 (0.30)
Insertion Het/Hom ratio 2.67 (0.11) 2.14 (0.14) 1.74 (0.05) 2.01 (0.04) 2.03 (0.24) 2.10 (0.25)
SNP Het/Hom ratio 2.01 (0.09) 1.68 (0.12) 1.35 (0.03) 1.58 (0.03) 1.59 (0.17) 1.65 (0.17)

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry; EUR, European ancestry; SAS, South Asian ancestry.

Median (IQR) for continuous variables. Median (IQR, min-max) for age. Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only: AFR (N = 1,949; 2.6%), AMR (N = 236; 0.3%), EAS (N = 495; 0.7%), EUR (N = 60,493; 79.7%), SAS (N = 7,034; 9.3%), Unassigned (N = 5,724; 7.5%).
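
To illustrate how a "median (IQR)" table cell can be produced from per-participant values, here is a small sketch using only the Python standard library. The quartile method ("inclusive") is an assumption, since the method used for the tables above is not stated, and the function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def median_iqr(values: Sequence[float]) -> Tuple[float, float]:
    """Return (median, interquartile range) for a sequence of values."""
    # Assumption: quartiles are computed with the "inclusive" method.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


def format_median_iqr(values: Sequence[float], digits: int = 2) -> str:
    """Format values in the "median (IQR)" style used in the tables."""
    median, iqr = median_iqr(values)
    return f"{median:.{digits}f} ({iqr:.{digits}f})"
```

Different quartile conventions can give slightly different IQRs for small groups, which is worth keeping in mind when comparing against the smaller ancestry groups.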

Characteristics across rare disease participants by ancestry

Phenotypes AFR (N = 1,485; 2.4%) AMR (N = 203; 0.3%) EAS (N = 382; 0.6%) EUR (N = 48,295; 77.8%) SAS (N = 6,531; 10.5%) Unassigned (N = 5,160; 8.3%)
Age, median (IQR; min-max) 40 (27; 1-98) 38 (34; 1-78) 41 (20; 0-91) 39 (28; 0-100) 36 (29; 0-89) 33 (35; 0-96)
Sex (M / F, %) 46.8 / 53.2 44.8 / 55.2 44.8 / 55.2 47.1 / 52.9 50.5 / 49.5 51.4 / 48.6
Socio-economic status
Deprivation (%)
Lowest 21.0 9.8 10.9 9.1 24.8 15.7
Lower middle 59.5 48.4 46.9 38.4 51.1 48.5
Upper middle 16.1 32.0 31.8 41.3 20.3 29.3
Highest 3.4 9.8 10.5 11.2 3.8 6.4
Index of Multiple Deprivation (Rank) 32.5 (22.2) 19.9 (24.3) 20.2 (24.7) 16.3 (19.2) 30.0 (27.1) 24.1 (26.1)
Population structure
Ethnicity (%)
Asian 0.2 0.0 74.1 0.0 79.5 11.8
Black 75.3 0.5 0.0 0.0 0.4 6.2
White 2.5 42.4 2.1 85.4 0.8 35.0
Mixed 4.9 13.8 1.8 0.4 1.1 18.1
Not stated 15.9 17.7 12.6 13.8 17.0 17.8
Other 1.2 25.6 9.4 0.4 1.3 11.1
Consanguinity (%)
No 48.0 46.3 40.6 42.6 21.8 42.5
Possible 0.3 0.0 0.3 0.1 0.9 0.5
Yes 0.5 2.0 1.0 0.4 16.2 6.6
Unknown 51.2 51.7 58.1 57.0 61.1 50.4
Family structure (%)
Singleton 29.0 19.2 29.6 19.5 13.8 18.9
Duo 20.8 13.3 9.7 13.8 11.0 14.2
Trio 35.4 56.7 47.6 49.1 50.6 47.7
Other 14.9 10.8 13.1 17.6 24.6 19.1
Sample source (%)
Blood 99.8 100.0 99.7 99.6 99.7 99.5
Fibroblast 0.1 0.0 0.0 0.0 0.1 0.1
Saliva 0.1 0.0 0.0 0.4 0.2 0.3
Tissue 0.1 0.0 0.3 0.0 0.0 0.0
Sequencing quality
Callability (%) 95.2 (0.64) 95.22 (0.65) 95.2 (0.68) 95.27 (0.66) 95.3 (0.65) 95.31 (0.66)
Array concordance (%) 99.95 (0.0001) 99.96 (0.0001) 99.96 (0.01) 99.96 (0.01) 99.96 (0.0001) 99.96 (0.0001)
Contamination 0.0035 (0.001) 0.0031 (0.001) 0.0035 (0.001) 0.003 (0.001) 0.0032 (0.001) 0.0031 (0.001)
Sequencing coverage
All chromosomes 39.5 (7.6) 39.1 (7.2) 39.3 (7.9) 39.2 (8.0) 39.5 (8.4) 39.1 (8.0)
Autosomal chromosomes 40.0 (7.8) 39.6 (7.0) 39.9 (8.3) 39.7 (8.2) 40.1 (8.5) 39.7 (8.1)
Mapping Quality (%)
Percent MAPQ 10 reads 89.18 (1.37) 89.42 (1.37) 89.27 (1.26) 89.34 (1.25) 89.29 (1.29) 89.30 (1.31)
Autosome Coverage at 15X 96.03 (0.35) 96.06 (0.34) 96.03 (0.365) 96.07 (0.37) 96.08 (0.37) 96.06 (0.36)
Average base call quality 36.7 (0.9) 36.7 (0.9) 36.6 (1.0) 36.7 (0.9) 36.7 (1.0) 36.7 (1.0)
Percent aligned reads (%) 93.7 (1.35) 93.7 (1.28) 93.6 (1.31) 93.7 (1.27) 93.7 (1.28) 93.7 (1.29)
Percent Q30 bases (%) 84.12 (3.28) 84.14 (2.87) 83.93 (3.26) 84.09 (3.25) 84.00 (3.36) 83.99 (3.39)
Sample error rate (%) 0.0100 (0.0019) 0.0097 (0.0017) 0.0099 (0.0019) 0.0097 (0.0018) 0.0097 (0.0018) 0.0097 (0.0018)
Mismatch rate (%) 0.76 (0.17) 0.76 (0.16) 0.75 (0.17) 0.75 (0.16) 0.75 (0.17) 0.75 (0.17)
Fragment length median 483 (33) 485 (39) 481 (35) 483 (34) 483 (33) 483 (34)
Genotype
Number of variants
TIER 1 0 (0) 0 (0) 0 (1) 0 (0) 0 (0) 0 (0)
TIER 2 1 (4) 1 (2) 1 (2) 0 (2) 1 (3) 1 (3)
TIER 3 243 (379) 95 (281) 240 (345) 118 (226) 73 (329) 131 (282)
Indel (per 10000) 118.1 (3.3) 100.0 (3.6) 98.2 (2.2) 97.5 (2.3) 99 (3.0) 99.7 (4.9)
SNV (per 10000) 473.7 (3.3) 401.3 (3.6) 393.5 (2.2) 391.6 (2.3) 398.1 (3.0) 398.7 (4.9)
Indel + SNV (per 10000) 592.2 (9.8) 501.2 (13.9) 491.7 (4.9) 489.1 (5.2) 497.4 (10.4) 498.2 (19.6)
Transition/transversion ratio 2.065 (0.003) 2.062 (0.0035) 2.056 (0.003) 2.061 (0.003) 2.060 (0.004) 2.062 (0.004)
Indel Het/Hom ratio 2.91 (0.12) 2.30 (0.15) 1.84 (0.05) 2.14 (0.04) 2.16 (0.27) 2.24 (0.27)
SNV Het/Hom ratio 2.00 (0.09) 1.68 (0.11) 1.35 (0.03) 1.58 (0.03) 1.59 (0.17) 1.65 (0.18)
Penetrance
(Incomplete / complete, %)
21.3 / 78.7 15.3 / 84.7 18.3 / 81.7 24.5 / 75.5 21.9 / 78.1 22.5 / 77.5
Rare disease type distribution
(% within ancestry group)
Cardiovascular disorders 11.7 4.9 7.9 11.9 7.3 7.3
Ciliopathies 0.3 0.0 1.1 0.9 1.4 0.9
Dermatological disorders 2.4 0.0 0.5 0.9 2.3 1.3
Dysmorphic and congenital abnormality syndromes 1.3 0.0 1.6 1.6 1.5 2.2
Endocrine disorders 2.0 3.9 3.2 2.2 2.6 2.7
Gastroenterological disorders 0.1 0.0 0.0 0.3 0.4 0.4
Growth disorders 0.4 0.0 0.0 0.6 0.3 0.6
Haematological and immunological disorders 2.0 2.9 2.6 2.3 1.6 2.2
Haematological disorders 0.1 0.0 0.5 0.6 0.3 0.3
Hearing and ear disorders 1.6 1.9 3.7 1.9 3.4 3.2
Metabolic disorders 2.0 1.0 0.5 1.8 3.5 2.2
Neurology and neurodevelopmental disorders 33.4 43.7 31.7 40.8 39.0 42.0
Ophthalmological disorders 14.8 11.7 12.7 7.5 13.4 9.4
Psychiatric disorders 0.5 0.0 0.5 0.1 0.0 0.2
Renal and urinary tract disorders 18.1 12.6 24.3 9.7 10.2 12.5
Respiratory disorders 0.3 1.0 0.0 1.1 0.4 0.8
Rheumatological disorders 0.5 1.0 0.5 0.8 0.4 0.9
Skeletal disorders 1.8 4.9 0.0 2.5 2.1 2.5
Tumour syndromes 2.8 3.9 3.2 5.4 1.6 2.3
Ultra-rare disorders 2.5 5.8 3.7 5.4 6.1 4.6
Multi 1.5 1.0 1.6 1.6 1.8 1.4
Other disorders 0.0 0.0 0.0 0.2 0.1 0.0
Outcome
Case solved (Proband, %)
Yes 10.0 8.9 11.3 7.9 9.7 10.5
No 3.6 3.9 3.4 2.4 2.5 2.8
Unknown 86.4 87.2 85.3 89.8 87.8 86.7

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry; EUR, European ancestry; SAS, South Asian ancestry.

Median (IQR) for the continuous variables. Median (IQR, min-max) for age. Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only; AFR (N = 1,482; 2.4%), AMR (N = 203; 0.3%), EAS (N = 381; 0.6%), EUR (N = 48,085; 77.8%), SAS (N = 6,511; 10.5%), Unassigned (N = 5,136; 8.3%).

Characteristics across cancer participants by ancestry

Phenotypes AFR (N = 482; 3.3%) AMR (N = 35; 0.2%) EAS (N = 123; 0.8%) EUR (N = 12,984; 87.8%) SAS (N = 546; 3.7%) Unassigned (N = 623; 4.2%)
Age, median (IQR; min-max) 59 (18; 1-98) 48 (10; 1-78) 56 (20; 0-91) 67 (18; 0-100) 57 (22; 0-89) 60 (24; 0-96)
Sex (M / F, %) 43.8 / 56.2 31.4 / 68.6 27.6 / 72.4 44.1 / 55.9 41.0 / 59.0 42.2 / 57.8
Socio-economic status
Deprivation (%)
Lowest 16.7 5.6 10.3 7.2 18.0 12.8
Lower middle 66.7 72.2 48.5 37.2 46.7 44.3
Upper middle 14.7 0.0 35.3 44.0 27.8 33.2
Highest 2.0 22.2 5.9 11.7 7.6 9.7
Index of Multiple Deprivation (Rank) 32.0 (18.7) 22.8 (14.0) 21.7 (24.7) 15.2 (17.6) 24.6 (26.4) 20.9 (22.9)
Population structure
Ethnicity (%)
Asian 0.6 0.0 54.5 0.0 66.8 5.6
Black 67.8 0.0 0.0 0.0 0.5 4.8
White 4.1 28.6 3.3 79.2 2.2 49.8
Mixed 3.7 14.3 2.4 0.1 1.5 9.5
Not known 2.1 8.6 1.6 4.2 7.1 3.5
Not stated 19.7 11.4 19.5 16.0 16.8 19.6
Other 1.9 37.1 18.7 0.4 4.9 7.2
Consanguinity (%)
Unknown 100 100 100 100 100 100
Family structure (%)
Other 100 100 100 100 100 100
Sample source (%)
Blood 96.9 94.3 92.7 95.6 95.8 94.4
Fibroblast 0.0 0.0 0.8 0.3 0.2 0.2
Germline 0.0 0.0 0.0 0.2 0.2 0.2
Saliva 2.5 2.9 3.3 3.4 2.2 3
Tissue 0.6 2.9 3.3 0.5 1.6 2.2
Sequencing quality††
Callability (%) 95.23 (0.66) 94.98 (0.65) 95.10 (0.63) 95.28 (0.68) 95.20 (0.67) 95.22 (0.67)
Array concordance (%) 99.95 (0.0001) 99.96 (0.01) 99.96 (0.0001) 99.96 (0.01) 99.96 (0.0001) 99.96 (0.01)
Contamination 0.0029 (0.004) 0.0032 (0.004) 0.0024 (0.004) 0.0026 (0.003) 0.003 (0.004) 0.0025 (0.003)
Sequencing coverage
All chromosomes 38.79 (8.38) 37.45 (3.73) 38.31 (8.67) 38.65 (9.96) 38.8 (10.19) 38.79 (8.38)
Autosomal chromosomes 39.06 (8.52) 37.62 (4.48) 38.785 (8.47) 39.1 (9.99) 39.37 (10.62) 39.06 (8.52)
Mapping Quality (%)
Percent MAPQ 10 reads 89.27 (1.782) 88.92 (1.585) 89.48 (1.676) 89.34 (1.761) 89.27 (1.565) 89.38 (1.822)
Autosome Coverage at 15X 96.04 (0.36) 95.93 (0.34) 96.035 (0.387) 96.07 (0.4) 96.07 (0.41) 96.075 (0.392)
Average base call quality 36.8 (1.25) 36.4 (1.1) 36.7 (1.175) 36.7 (1.2) 36.8 (1.1) 36.7 (1.2)
Percent aligned reads (%) 93.83 (2.115) 93.49 (1.84) 93.84 (1.92) 93.7 (1.96) 93.6 (1.625) 93.765 (1.965)
Percent Q30 bases (%) 84.61 (4.835) 83.35 (3.92) 84.165 (4.255) 84.28 (4.78) 84.35 (4.17) 84.15 (4.585)
Sample error rate (%) 0.0098 (0.002) 0.0102 (0.002) 0.0096 (0.002) 0.0095 (0.002) 0.0095 (0.002) 0.0096 (0.002)
Mismatch rate (%) 0.77 (0.185) 0.79 (0.2) 0.745 (0.2) 0.75 (0.2) 0.74 (0.165) 0.76 (0.202)
Fragment length median 481 (28) 483 (16) 486 (23.75) 481 (32) 482 (29) 481 (33)
Genotype
Number of variants
TIER 1* 0 (0) 0 (1) 0 (1) 0 (1) 0 (1) 0 (1)
TIER 3* 7 (4) 3 (2) 3 (3) 2 (2) 3 (2) 3 (2)
Domain 1** 2 (3) 2 (2) 2 (3) 3 (4) 2 (3) 2 (3)
Domain 2** 4 (5) 4 (6) 4 (5) 5 (6) 4 (4) 4 (5)
Domain 3** 103 (94) 80 (50.5) 102 (100) 107 (120) 90 (91) 92 (98)
Indel (per 10000) * 117.9 (3.3) 99.9 (4.4) 98.4 (3.2) 97.7 (2.9) 99.9 (2.8) 100.0 (5.5)
SNV (per 10000) * 473.4 (7.6) 399.8 (15.2) 394.6 (3.4) 391.7 (3.8) 400.6 (4.6) 398.2 (13.6)
Indel + SNV (per 10000) * 591.8 (10.2) 499.6 (16.6) 493.2 (5.7) 489.6 (5.2) 500.6 (6.4) 498.1 (18.9)
Transition/transversion ratio* 2.065 (0.003) 2.061 (0.004) 2.057 (0.004) 2.062 (0.004) 2.060 (0.004) 2.063 (0.004)
Indel Het/Hom ratio* 2.93 (0.13) 2.25 (0.29) 1.85 (0.05) 2.14 (0.05) 2.20 (0.08) 2.25 (0.26)
SNP Het/Hom ratio* 2.02 (0.09) 1.66 (0.19) 1.35 (0.03) 1.58 (0.03) 1.62 (0.06) 1.64 (0.16)
Cancer type distribution
Glioma 0.8 0.0 4.1 4.1 4.0 4.3
Bladder 2.3 0.0 0.8 2.8 2.0 2.3
Breast 29.9 34.3 30.9 18.9 23.3 21.5
Colorectal 13.1 2.9 17.1 18 17.1 13.8
Endometrial 7.1 5.7 5.7 5.4 7.9 4.7
Haematological 4.2 5.7 8.9 5.3 6.4 5.9
Hepatopancreatobiliary 1.0 0.0 0.8 2.3 0.6 1.1
Lung 5.2 0.0 7.3 10.8 4 9.3
Oral 0.6 0.0 0.8 1.7 3.1 1.0
Ovarian 2.7 5.7 4.1 4.1 5.1 5.1
Prostate 17.0 2.9 2.4 3.3 2.2 4.0
Renal 6.4 22.9 4.1 9.3 9.9 8.0
Sarcoma 6.2 11.4 9.8 7.3 10.5 10.9
Other 3.3 8.6 2.4 6.5 3.9 7.9
Multiple cancer 0.0 0.0 0.8 0.1 0.0 0.0

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry; EUR, European ancestry; SAS, South Asian ancestry.

These data represent a snapshot of the 100kGP and may not be fully representative of the general population. A previous analysis investigated how representative the 100kGP cancer programme is in terms of cancer rates for different ethnicities in England (recorded by Public Health England).

†† Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only; AFR (N = 467; 3.3%), AMR (N = 33; 0.2%), EAS (N = 114; 0.8%), EUR (N = 12,408; 87.8%), SAS (N = 523; 3.7%), Unassigned (N = 588; 4.2%). Bonferroni corrected P-value threshold for sequencing quality-related metrics = 0.0036 (0.05/14).
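
The Bonferroni threshold quoted above is the nominal significance level divided by the number of tests. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Return the per-test P-value threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# 0.05 across the 14 sequencing quality-related metrics.
threshold = bonferroni_threshold(0.05, 14)
print(f"{threshold:.4f}")  # prints 0.0036
```

A metric would be considered to differ between ancestry groups only if its P-value falls below this threshold.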

*Germline variants. **Somatic variants. Median (IQR) for the continuous variables. Median (IQR, min-max) for age

Definition of phenotypes

Phenotypes Definition
Ancestral group Individuals were classified into five genetically inferred ancestry groups (AFR / AMR / EAS / EUR / SAS) based on probabilities (> 0.8) derived from PC scores (PC 1-8). Detailed methods for ancestry inference can be found here. Individuals who didn’t meet the criteria for ancestral classification were grouped into an “Unassigned” group.
Age Year in which the DNA was sequenced minus year of birth.
Sex Sex (male, female) was defined using a combination of self-report and genotype data. Males were defined as individuals who identify themselves as male and have "XY", "XXY", or "XYY" chromosomes. Females were defined as individuals who identify themselves as female and have "XX", "X0", or "XXX" chromosomes. Participants with mismatched self-reported and genotype sex traits were treated as missing and excluded from the analysis. Patients who reported themselves to be indeterminate sex were also excluded.
Socio-economic status
Deprivation group Individuals were categorised into four different deprivation groups based on Index of Multiple Deprivation (IMD) decile as follows:
Lowest = Most deprived 10%;
Lower middle = More deprived 10-50%;
Upper middle = Less deprived 10-50%;
Highest = Least deprived 10%
Index of Multiple Deprivation IMD index (rank) at the time of registration. A higher index indicates more deprivation.
Population structure
Ethnicity Individuals were grouped into “Asian”, “Black”, “White”, “Mixed”, “Others”, “Unknown” and “Not stated” based on a self-reported questionnaire for ethnicity (the 16+1 ethnic data categories defined in the 2001 census).
Consanguinity* This indicates a consanguineous relationship (No / Possible / Yes / Unknown). Consanguinity was defined based on runs of homozygosity from the whole genome SNV variant call set. Missingness in this variable was treated as "Unknown".
Family structure* Type of family enrolled in the study (Singleton, Duo, Trio, Other). “Other” includes duos and trios with relatives other than the biological mother or father, and families with more than three members.
Sample source From “aggregate_gvcf_sample_stats” table.
Sample source Type of sample: Blood, Fibroblast, Saliva, Tissue or Germline
Sequencing quality From “aggregate_gvcf_sample_stats” table. Germline variants only.
Flow cell version Version of the flow cell used for the sequencing process.
Callability Callability is defined as the fraction of non-N reference positions having a passing genotype call.
Array concordance Concordance rate is the proportion of matching genotype array calls to all non-missing variant sequencing calls.
Contamination The proportion of a sample that is contaminated with sequence from other humans, calculated using VerifyBamID (FREEMIX parameter) [1]. Only participants with a contamination rate of less than 3% were included in the AggV2 data after quality control.
Sequencing coverage Mean sequencing coverage across all chromosomes and across autosomes only. Coverage is defined as the total number of aligned bases divided by the genome size.
Mapping Quality Percentage of reads with a map quality score (MAPQ) >=10 as a proportion of total pass-filter (PF) reads. Mean coverage of MAPQ >= 10 reads at 15x.
Average base call quality Average quality of the base calls. A ratio of the sum of base qualities to total length (Scaled to Phred).
Percent aligned reads Percentage of reads aligned to the reference genome.
Percent Q30 bases The percentage of bases with a base quality ≥ 30.
Sample error rate Sequencing error rate calculated using Samtools. The error rate is the ratio of mismatches to mapped bases (from the CIGAR string): N mismatches (from the NM auxiliary tag) / N aligned bases.
Mismatch rate The average percentage of mismatches across reads 1 and 2 over all cycles.
Fragment length median Median length of sequenced fragments. The fragment length is calculated based on the locations at which a read pair aligns to the reference.
Genotype
Number of tiered and domain variants The total number of variants in each tier (1, 2, and 3 for rare disease patients; 1 and 3 for cancer patients) or in each domain (1, 2, and 3).
Indel Total number of indels (germline)
SNV Total number of SNVs (germline)
Indel + SNV Indel + SNV (germline)
Transition/transversion ratio Transition to transversion ratio (germline)
Indel Het/Hom ratio Heterozygote / homozygote ratio for indels (germline)
Outcome
Penetrance* Defined by the referring clinician at the genetic test ordering stage. “Complete” indicates that the condition is thought to be penetrant, or the pedigree shows potential penetrance. “Incomplete” indicates that penetrance could be complete or incomplete. Singletons (proband-only) are classified as “complete”.
Rare disease type distribution* Percentage distribution of patients diagnosed with each type of rare disease within each ancestry group.
Cancer type distribution** Percentage distribution of patients diagnosed with each type of cancer within each ancestry group.
Case solved* Percentage of participants (probands) whose cases were partially / fully explained by any of the tiered variants.
Yes = The variants explained the cases (yes / partially)
No = No
Unknown = marked as NA or no data recorded for case resolution.

*Rare disease patients only. **Cancer patients only. Blood samples only.
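
Several of the definitions in the table above reduce to simple ratios or rules. The sketch below restates a few of them in code, purely as an illustration. All function names are hypothetical, and the deprivation mapping assumes that IMD decile 1 is the most deprived 10% and decile 10 the least deprived 10%, which the table does not state explicitly.

```python
def _ratio(numerator: float, denominator: float) -> float:
    """Divide, rejecting non-positive denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def mean_coverage(aligned_bases: int, genome_size: int) -> float:
    """Sequencing coverage: total aligned bases divided by genome size."""
    return _ratio(aligned_bases, genome_size)


def sample_error_rate(n_mismatches: int, n_aligned_bases: int) -> float:
    """Sample error rate: mismatches divided by aligned bases."""
    return _ratio(n_mismatches, n_aligned_bases)


def callability(passing_positions: int, non_n_positions: int) -> float:
    """Fraction of non-N reference positions with a passing genotype call."""
    return _ratio(passing_positions, non_n_positions)


def het_hom_ratio(n_het: int, n_hom: int) -> float:
    """Heterozygote to homozygote ratio for a variant class."""
    return _ratio(n_het, n_hom)


def deprivation_group(imd_decile: int) -> str:
    """Map an IMD decile to the four deprivation groups.

    Assumption: decile 1 = most deprived 10%, decile 10 = least deprived 10%.
    """
    if not 1 <= imd_decile <= 10:
        raise ValueError("imd_decile must be between 1 and 10")
    if imd_decile == 1:
        return "Lowest"
    if imd_decile <= 5:
        return "Lower middle"
    if imd_decile <= 9:
        return "Upper middle"
    return "Highest"
```

Multiplying the callability or error-rate fractions by 100 gives the percentage form used in the tables.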

References

  1. Zhang, F. et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res 30, 185–194 (2020).