Summary statistics across genetically-inferred ancestry groups for 100,000 Genomes Project participants¶

Here we provide summary statistics, stratified by genetically-inferred ancestry groups, for the AggV2 dataset. AggV2 is an aggregate of gVCFs comprising high-quality germline genomes from genetically diverse individuals derived from release 10 of the 100,000 Genomes Project (100kGP) dataset. Summary statistics include basic demographic data, information on socio-economic status, population structure, sample sources, sequencing quality, genotype and disease status. We provide these data as a reference and to help guide you analysing data from AggV2 where there is a need to take potential confounders related to ancestral diversity into account.

Participants¶

This analysis covers 76,849 individuals recruited to the 100kGP that are included in AggV2. We excluded participants who have withdrawn from the study after recruitment (N = 9); with indeterminate sex or mismatched self-reported and genotype sex traits (N = 1,284) from our analysis, leaving 76,849 participants.

Ancestry group classification¶

Participants were classified into five super-population groups based on PC scores. Participants who didn’t meet the criteria for classification into super-population groups were classified as “unassigned” (See Definition of phenotypes). All analysed participants had genomes aligned to GRCh38.

The following tables include summary statistics based on genetically inferred ancestry groups. These are genetic summary statistics that don’t consistently correspond to observable phenotypes. This, together with the fact that all extracted phenotypes have more than five participants (sum of the rows), ensures compliance with the current Airlock policy.

Figure 1. Distribution of key variables across genetically inferred ancestry groups.

Phenotypes with fewer than five patients are not displayed.

Characteristics of all participants across populations¶

Phenotypes^‡	AFR (N = 1967; 2.6%)	AMR (N = 238; 0.3%)	EAS (N = 505; 0.7%)	EUR (N = 61,279; 79.7%)	SAS (N = 7,077; 9.2%)	Unassigned (N = 5,783; 7.5%)
Age (Min – Max)	45 (25; 1-98)	39 (30; 1-78)	44 (21; 0-91)	45 (32; 0-100)	37 (29; 0-89)	35 (36; 0-96)
Sex (M / F, %)	46.1 / 53.9	42.9 / 57.1	40.6 / 59.4	46.5 / 53/5	49.7 / 50.3	50.4 / 49.6

Socio-economic status
Deprivation (%)
Lowest	20.1	9.3	10.7	8.8	24.3	15.5
Lower middle	61	51.4	47.2	38.2	50.8	48.1
Upper middle	15.8	27.9	32.6	41.8	20.8	29.7
Highest	3.1	11.4	9.4	11.3	4.1	6.8
Index of Multiple Deprivation (Rank)	32.4 (21.8)	20.3 (23.9)	20.9 (24.4)	16.0 (18.9)	29.6 (27.1)	23.8 (25.7)

Population structure
Self-reported ethnicity (%)
Asian	0.3	0.0	69.3	0.0	78.5	11.2
Black	73.5	0.4	0.0	0.0	0.4	6.0
White	2.9	40.3	2.4	84.1	0.9	36.6
Mixed	4.6	13.9	2.0	0.4	1.1	17.2
Not known	0.5	1.3	0.4	0.9	0.6	0.4
Not stated	16.8	16.8	14.3	14.3	17.0	18.0
Other	1.4	27.3	11.7	0.4	1.6	10.7

Sample source (%)
Blood	99.1	99.2	98.0	98.7	99.4	99.0
Fibroblast	0.1	0.0	0.2	0.1	0.1	0.1
Saliva	0.7	0.4	0.8	1.0	0.4	0.6
Tissue	0.2	0.4	1.0	0.1	0.1	0.3

Sequencing quality^†
Callability (%)	95.21 (0.64)	95.21 (0.69)	95.18 (0.67)	95.28 (0.66)	95.30 (0.66)	95.31 (0.66)
Array concordance (%)	99.95 (0.0001)	99.96 (0.0001)	99.96 (0.003)	99.96 (0.01)	99.96 (0.0001)	99.96 (0.0001)
Contamination	0.0034 (0.002)	0.0032 (0.002)	0.0034 (0.002)	0.0030 (0.002)	0.0032 (0.001)	0.0031 (0.001)
Sequencing coverage
All chromosomes	39.28 (7.71)	38.80 (6.87)	39.06 (7.84)	39.09 (8.22)	39.46 (8.48)	39.11 (8.11)
Autosomal chromosomes	39.72 (7.9)	39.45 (7.26)	39.45 (8.33)	39.58 (8.37)	39.97 (8.57)	39.63 (8.21)
Mapping Quality (%)
Percent MAPQ 10 read	89.19 (1.45)	89.37 (1.43)	89.31 (1.33)	89.34 (1.33)	89.29 (1.31)	89.30 (1.35)
Autosome Coverage at 15X	96.03 (0.35)	96.05 (0.34)	96.03 (0.37)	96.07 (0.37)	96.08 (0.38)	96.06 (0.36)
Average base call quality	36.7 (1.0)	36.6 (0.9)	36.6 (1.0)	36.7 (0.9)	36.7 (1.0)	36.7 (1.0)
Percent aligned reads (%)	93.7 (1.50)	93.7 (1.41)	93.7 (1.39)	93.7 (1.37)	93.7 (1.30)	93.7 (1.36)
Percent Q30 bases (%)	84.19 (3.51)	84.09 (3.05)	83.96 (3.42)	84.12 (3.48)	84.03 (3.39)	84.01 (3.50)
Sample error rate	0.0100 (0.0019)	0.0098 (0.0018)	0.0099 (0.002)	0.0097 (0.0018)	0.0097 (0.0019)	0.0097 (0.0019)
Mismatch rate (%)	0.76 (0.17)	0.76 (0.17)	0.75 (0.18)	0.75 (0.18)	0.75 (0.17)	0.76 (0.17)
Fragment length median	482 (32)	485 (34)	482 (31)	482 (33)	483 (33)	483 (34)

Genotype
Number of variants
Indel (per 10000)	118.1 (3.26)	100 (3.72)	98.3 (2.39)	97.5 (2.38)	99.1 (2.91)	99.8 (4.92)
SNV (per 10000)	473.6 (3.26)	400.8 (3.72)	393.8 (2.39)	391.6 (2.38)	398.3 (2.91)	398.6 (4.92)
Indel + SNV (per 10000)	592 (9.8)	500.8 (15.3)	492 (5.2)	489.2 (5.2)	497.7 (10.0)	498.1 (19.5)
Transition/transversion ratio	2.065 (0.003)	2.062 (0.003)	2.056 (0.003)	2.062 (0.003)	2.060 (0.004)	2.062 (0.004)
Indel Het/Hom ratio	2.92 (0.12)	2.29 (0.16)	1.84 (0.05)	2.14 (0.04)	2.17 (0.25)	2.24 (0.27)
Deletion Het/Hom ratio	3.20 (0.13)	2.46 (0.19)	1.97 (0.06)	2.29 (0.05)	2.32 (0.27)	2.40 (0.30)
Insertion Het/Hom ratio	2.67 (0.11)	2.14 (0.14)	1.74 (0.05)	2.01 (0.04)	2.03 (0.24)	2.10 (0.25)
SNP Het/Hom ratio	2.01 (0.09)	1.68 (0.12)	1.35 (0.03)	1.58 (0.03)	1.59 (0.17)	1.65 (0.17)

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry, EUR; European ancestry; SAS, South Asian ancestry.

^‡Median (IQR) for continuous variables. Median (IQR, min-max) for age. ^†Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only; AFR (N = 1,949; 2.6%), AMR (N = 236; 0.3EAS (N = 495; 0.7%), EUR (N = 60,493; 79.7%), SAS (N = 7,034; 9.3%), Unassigned (N = 5,724; 7.5%).

Characteristics across rare disease participants by ancestry¶

Phenotypes^‡	AFR (N = 1485; 2.4%)	AMR (N = 203; 0.3%)	EAS (N = 382; 0.6%)	EUR (N = 48,295; 77.8%)	SAS (N = 6,531; 10.5%)	Unassigned (N = 5,160; 8.3%)
Age (Min – Max)	40 (27; 1-98)	38 (34; 1-78)	41 (20; 0-91)	39 (28;0-100)	36 (29;0-89)	33 (35; 0-96)
Sex (M / F, %)	46.8 / 53.2	44.8 / 55.2	44.8 / 55.2	47.1 / 52.9	50.5 / 49.5	51.4 / 48.6

Socio-economic status
Deprivation (%)
Lowest	21.0	9.8	10.9	9.1	24.8	15.7
Lower middle	59.5	48.4	46.9	38.4	51.1	48.5
Upper middle	16.1	32.0	31.8	41.3	20.3	29.3
Highest	3.4	9.8	10.5	11.2	3.8	6.4
Index of Multiple Deprivation (Rank)	32.5 (22.2)	19.9 (24.3)	20.2 (24.7)	16.3 (19.2)	30.0 (27.1)	24.1 (26.1)

Population structure
Ethnicity (%)
Asian	0.2	0.0	74.1	0.0	79.5	11.8
Black	75.3	0.5	0.0	0.0	0.4	6.2
White	2.5	42.4	2.1	85.4	0.8	35.0
Mixed	4.9	13.8	1.8	0.4	1.1	18.1
Not stated	15.9	17.7	12.6	13.8	17.0	17.8
Other	1.2	25.6	9.4	0.4	1.3	11.1
Consanguinity (%)
No	48.0	46.3	40.6	42.6	21.8	42.5
Possible	0.3	0.0	0.3	0.1	0.9	0.5
Yes	0.5	2.0	1.0	0.4	16.2	6.6
Unknown	51.2	51.7	58.1	57.0	61.1	50.4
Family structure (%)
Singleton	29.0	19.2	29.6	19.5	13.8	18.9
Duo	20.8	13.3	9.7	13.8	11.0	14.2
Trio	35.4	56.7	47.6	49.1	50.6	47.7
Other	14.9	10.8	13.1	17.6	24.6	19.1

Sample source (%)
Blood	99.8	100.0	99.7	99.6	99.7	99.5
Fibroblast	0.1	0.0	0.0	0.0	0.1	0.1
Saliva	0.1	0.0	0.0	0.4	0.2	0.3
Tissue	0.1	0.0	0.3	0.0	0.0	0.0

Sequencing quality^†
Callability (%)	95.2 (0.64)	95.22 (0.65)	95.2 (0.68)	95.27 (0.66)	95.3 (0.65)	95.31 (0.66)
Array concordance (%)	99.95 (0.0001)	99.96 (0.0001)	99.96 (0.01)	99.96 (0.01)	99.96 (0.0001)	99.96 (0.0001)
Contamination	0.0035 (0.001)	0.0031 (0.001)	0.0035 (0.001)	0.003 (0.001)	0.0032 (0.001)	0.0031 (0.001)
Sequencing coverage
All chromosomes	39.5 (7.6)	39.1 (7.2)	39.3 (7.9)	39.2 (8.0)	39.5 (8.4)	39.1 (8.0)
Autosomal chromosomes	40.0 (7.8)	39.6 (7.0)	39.9 (8.3)	39.7 (8.2)	40.1 (8.5)	39.7 (8.1)
Mapping Quality (%)
Percent MAPQ 10 reads	89.18 (1.37)	89.42 (1.37)	89.27 (1.26)	89.34 (1.25)	89.29 (1.29)	89.30 (1.31)
Autosome Coverage at 15X	96.03 (0.35)	96.06 (0.34)	96.03 (0.365)	96.07 (0.37)	96.08 (0.37)	96.06 (0.36)
Average base call quality	36.7 (0.9)	36.7 (0.9)	36.6 (1.0)	36.7 (0.9)	36.7 (1.0)	36.7 (1.0)
Percent aligned reads (%)	93.7 (1.35)	93.7 (1.28)	93.6 (1.31)	93.7 (1.27)	93.7 (1.28)	93.7 (1.29)
Percent Q30 bases (%)	84.12 (3.28)	84.14 (2.87)	83.93 (3.26)	84.09 (3.25)	84.00 (3.36)	83.99 (3.39)
Sample error rate (%)	0.0100 (0.0019)	0.0097 (0.0017)	0.0099 (0.0019)	0.0097 (0.0018)	0.0097 (0.0018)	0.0097 (0.0018)
Mismatch rate (%)	0.76 (0.17)	0.76 (0.16)	0.75 (0.17)	0.75 (0.16)	0.75 (0.17)	0.75 (0.17)
Fragment length median	483 (33)	485 (39)	481 (35)	483 (34)	483 (33)	483 (34)

Genotype
Number of variants
TIER 1	0 (0)	0 (0)	0 (1)	0 (0)	0 (0)	0 (0)
TIER 2	1 (4)	1 (2)	1 (2)	0 (2)	1 (3)	1 (3)
TIER 3	243 (379)	95 (281)	240 (345)	118 (226)	73 (329)	131 (282)
Indel (per 10000)	118.1 (3.3)	100.0 (3.6)	98.2 (2.2)	97.5 (2.3)	99 (3.0)	99.7 (4.9)
SNV (per 10000)	473.7 (3.3)	401.3 (3.6)	393.5 (2.2)	391.6 (2.3)	398.1 (3.0)	398.7 (4.9)
Indel + SNV (per 10000)	592.2 (9.8)	501.2 (13.9)	491.7 (4.9)	489.1 (5.2)	497.4 (10.4)	498.2 (19.6)
Transition/transversion ratio	2.065 (0.003)	2.062 (0.0035)	2.056 (0.003)	2.061 (0.003)	2.060 (0.004)	2.062 (0.004)
Indel Het/Hom ratio	2.91 (0.12)	2.30 (0.15)	1.84 (0.05)	2.14 (0.04)	2.16 (0.27)	2.24 (0.27)
SNV Het/Hom ratio	2.00 (0.09)	1.68 (0.11)	1.35 (0.03)	1.58 (0.03)	1.59 (0.17)	1.65 (0.18)
Penetrance (Incomplete / complete, %)	21.3 / 78.7	15.3 / 84.7	18.3 / 81.7	24.5 / 75.5	21.9 / 78.1	22.5 / 77.5

Rare disease type distribution (% within ancestry group)
Cardiovascular disorders	11.7	4.9	7.9	11.9	7.3	7.3
Ciliopathies	0.3	0.0	1.1	0.9	1.4	0.9
Dermatological disorders	2.4	0.0	0.5	0.9	2.3	1.3
Dysmorphic and congenital abnormality syndromes	1.3	0.0	1.6	1.6	1.5	2.2
Endocrine disorders	2.0	3.9	3.2	2.2	2.6	2.7
Gastroenterological disorders	0.1	0.0	0.0	0.3	0.4	0.4
Growth disorders	0.4	0.0	0.0	0.6	0.3	0.6
Haematological and immunological disorders	2.0	2.9	2.6	2.3	1.6	2.2
Haematological disorders	0.1	0.0	0.5	0.6	0.3	0.3
Hearing and ear disorders	1.6	1.9	3.7	1.9	3.4	3.2
Metabolic disorders	2.0	1.0	0.5	1.8	3.5	2.2
Neurology and neurodevelopmental disorders	33.4	43.7	31.7	40.8	39.0	42.0
Ophthalmological disorders	14.8	11.7	12.7	7.5	13.4	9.4
Psychiatric disorders	0.5	0.0	0.5	0.1	0.0	0.2
Renal and urinary tract disorders	18.1	12.6	24.3	9.7	10.2	12.5
Respiratory disorders	0.3	1.0	0.0	1.1	0.4	0.8
Rheumatological disorders	0.5	1.0	0.5	0.8	0.4	0.9
Skeletal disorders	1.8	4.9	0.0	2.5	2.1	2.5
Tumour syndromes	2.8	3.9	3.2	5.4	1.6	2.3
Ultra-rare disorders	2.5	5.8	3.7	5.4	6.1	4.6
Multi	1.5	1.0	1.6	1.6	1.8	1.4
Other disorders	0.0	0.0	0.0	0.2	0.1	0.0

Outcome
Case solved (Proband, %)
Yes	10.0	8.9	11.3	7.9	9.7	10.5
No	3.6	3.9	3.4	2.4	2.5	2.8
Unknown	86.4	87.2	85.3	89.8	87.8	86.7

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry, EUR; European ancestry; SAS, South Asian ancestry.

^‡Median (IQR) for the continuous variables. Median (IQR, min-max) for age. ^† Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only; AFR (N = 1,482; 2.4%), AMR (N = 203; 0.3%), EAS (N = 381; 0.6%), EUR (N = 48,085; 77.8%), SAS (N = 6,511; 10.5%), Unassigned (N = 5,136; 8.3%).

Characteristics across cancer participants by ancestry^†¶

Phenotypes	AFR (N = 482; 3.3%)	AMR (N = 35; 0.2%)	EAS (N = 123; 0.8%)	EUR (N = 12,984; 87.8%)	SAS (N = 546; 3.7%)	Unassigned (N = 623; 4.2%)
Age (Min – Max)	59 (18; 1 - 98)	48 (10; 1 - 78)	56 (20; 0 - 91)	67 (18; 0 - 100)	57 (22; 0 - 89)	60 (24; 0 - 96)
Sex (M / F, %)	43.8 / 56.2	31.4 / 68.6	27.6 / 72.4	44.1 / 55.9	41.0 / 59.0	42.2 / 57.8

Socio-economic status
Deprivation (%)
Lowest	16.7	5.6	10.3	7.2	18.0	12.8
Lower middle	66.7	72.2	48.5	37.2	46.7	44.3
Upper middle	14.7	0.0	35.3	44.0	27.8	33.2
Highest	2.0	22.2	5.9	11.7	7.6	9.7
Index of Multiple Deprivation (Rank)	32.0 (18.7)	22.8 (14.0)	21.7 (24.7)	15.2 (17.6)	24.6 (26.4)	20.9 (22.9)

Population structure
Ethnicity (%)
Asian	0.6	0.0	54.5	0.0	66.8	5.6
Black	67.8	0.0	0.0	0.0	0.5	4.8
White	4.1	28.6	3.3	79.2	2.2	49.8
Mixed	3.7	14.3	2.4	0.1	1.5	9.5
Not known	2.1	8.6	1.6	4.2	7.1	3.5
Not stated	19.7	11.4	19.5	16.0	16.8	19.6
Other	1.9	37.1	18.7	0.4	4.9	7.2
Consanguinity (%)
Unknown	100	100	100	100	100	100
Family structure (%)
Other	100	100	100	100	100	100

Sample source (%)
Blood	96.9	94.3	92.7	95.6	95.8	94.4
Fibroblast	0.0	0.0	0.8	0.3	0.2	0.2
Germline	0.0	0.0	0.0	0.2	0.2	0.2
Saliva	2.5	2.9	3.3	3.4	2.2	3
Tissue	0.6	2.9	3.3	0.5	1.6	2.2

Sequencing quality^††
Callability (%)	95.23 (0.66)	94.98 (0.65)	95.10 (0.63)	95.28 (0.68)	95.20 (0.67)	95.22 (0.67)
Array concordance (%)	99.95 (0.0001)	99.96 (0.01)	99.96 (0.0001)	99.96 (0.01)	99.96 (0.0001)	99.96 (0.01)
Contamination	0.0029 (0.004)	0.0032 (0.004)	0.0024 (0.004)	0.0026 (0.003)	0.003 (0.004)	0.0025 (0.003)
Sequencing coverage
All chromosomes	38.79 (8.38)	37.45 (3.73)	38.31 (8.67)	38.65 (9.96)	38.8 (10.19)	38.79 (8.38)
Autosomal chromosomes	39.06 (8.52)	37.62 (4.48)	38.785 (8.47)	39.1 (9.99)	39.37 (10.62)	39.06 (8.52)
Mapping Quality (%)
Percent MAPQ 10 reads	89.27122917 (1.782)	88.917761 (1.585)	89.47939067 (1.676)	89.34424062 (1.761)	89.27198333 (1.565)	89.37981845 (1.822)
Autosome Coverage at 15X	96.04 (0.36)	95.93 (0.34)	96.035 (0.387)	96.07 (0.4)	96.07 (0.41)	96.075 (0.392)
Average base call quality	36.8 (1.25)	36.4 (1.1)	36.7 (1.175)	36.7 (1.2)	36.8 (1.1)	36.7 (1.2)
Percent aligned reads (%)	93.83 (2.115)	93.49 (1.84)	93.84 (1.92)	93.7 (1.96)	93.6 (1.625)	93.765 (1.965)
Percent Q30 bases (%)	84.61 (4.835)	83.35 (3.92)	84.165 (4.255)	84.28 (4.78)	84.35 (4.17)	84.15 (4.585)
Sample error rate (%)	0.0098 (0.002)	0.0102 (0.002)	0.0096 (0.002)	0.0095 (0.002)	0.0095 (0.002)	0.0096 (0.002)
Mismatch rate (%)	0.77 (0.185)	0.79 (0.2)	0.745 (0.2)	0.75 (0.2)	0.74 (0.165)	0.76 (0.202)
Fragment length median	481 (28)	483 (16)	486 (23.75)	481 (32)	482 (29)	481 (33)

Genotype
Number of variants
TIER 1^*	0 (0)	0 (1)	0 (1)	0 (1)	0 (1)	0 (1)
TIER 3^*	7 (4)	3 (2)	3 (3)	2 (2)	3 (2)	3 (2)
Domain 1^**	2 (3)	2 (2)	2 (3)	3 (4)	2 (3)	2 (3)
Domain 2^**	4 (5)	4 (6)	4 (5)	5 (6)	4 (4)	4 (5)
Domain 3^**	103 (94)	80 (50.5)	102 (100)	107 (120)	90 (91)	92 (98)
Indel (per 10000) ^*	117.9 (3.3)	99.9 (4.4)	98.4 (3.2)	97.7 (2.9)	99.9 (2.8)	100.0 (5.5)
SNV (per 10000) ^*	473.4 (7.6)	399.8 (15.2)	394.6 (3.4)	391.7 (3.8)	400.6 (4.6)	398.2 (13.6)
Indel + SNV (per 10000) ^*	591.8 (10.2)	499.6 (16.6)	493.2 (5.7)	489.6 (5.2)	500.6 (6.4)	498.1 (18.9)
Transition/transversion ratio^*	2.065 (0.003)	2.061 (0.004)	2.057 (0.004)	2.062 (0.004)	2.060 (0.004)	2.063 (0.004)
Indel Het/Hom ratio^*	2.93 (0.13)	2.25 (0.29)	1.85 (0.05)	2.14 (0.05)	2.20 (0.08)	2.25 (0.26)
SNP Het/Hom ratio^*	2.02 (0.09)	1.66 (0.19)	1.35 (0.03)	1.58 (0.03)	1.62 (0.06)	1.64 (0.16)

Cancer type distribution
Glioma	0.8	0.0	4.1	4.1	4.0	4.3
Bladder	2.3	0.0	0.8	2.8	2.0	2.3
Breast	29.9	34.3	30.9	18.9	23.3	21.5
Colorectal	13.1	2.9	17.1	18	17.1	13.8
Endometrial	7.1	5.7	5.7	5.4	7.9	4.7
Haematological	4.2	5.7	8.9	5.3	6.4	5.9
Hepatopancreatobiliary	1.0	0.0	0.8	2.3	0.6	1.1
Lung	5.2	0.0	7.3	10.8	4	9.3
Oral	0.6	0.0	0.8	1.7	3.1	1.0
Ovarian	2.7	5.7	4.1	4.1	5.1	5.1
Prostate	17.0	2.9	2.4	3.3	2.2	4.0
Renal	6.4	22.9	4.1	9.3	9.9	8.0
Sarcoma	6.2	11.4	9.8	7.3	10.5	10.9
Other	3.3	8.6	2.4	6.5	3.9	7.9
Multiple cancer	0.0	0.0	0.8	0.1	0.0	0.0

AFR, African ancestry; AMR, Admixed American ancestry; EAS, East Asian ancestry, EUR; European ancestry; SAS, South Asian ancestry.

^† These data represent a snapshot of the 100kGP and may not be fully representative of the general population. A previous analysis investigated how representative the 100kGP cancer programme is in terms of cancer rates for different ethnicities in England (recorded by Public Health England).

^†† Sample sizes for sequencing quality metrics are different from those stated in the top panel as these cover blood samples only; AFR (N = 467; 3.3%), AMR (N = 33; 0.2%), EAS (N = 114; 0.8%), EUR (N = 12,408; 87.8%), SAS (N = 523; 3.7%), Unassigned (N = 588; 4.2%). Bonferroni corrected P-value threshold for sequencing quality-related metrics = 0.0036 (0.05/14).

*Germline variants. **Somatic variants. ^‡Median (IQR) for the continuous variables. Median (IQR, min-max) for age

Definition of phenotypes¶

Phenotypes	Definition
Ancestral group	Individuals were classified into five genetically inferred ancestry groups (AFR / AMR / EAS / EUR / SAS) based on probabilities (> 0.8) derived from PC scores (PC 1-8). Detailed methods for ancestry inference can be found here. Individuals who didn’t meet the criteria for ancestral classification were grouped into an “Unassigned” group.
Age	= Year that the DNA is sequenced – Year of birth
Sex	Sex (male, female) was defined using a combination of self-report and genotype data. Males were defined as individuals who identify themselves as male and have "XY", "XXY", or "XYY" chromosomes. Females were defined as individuals who identify themselves as female and have "XX", "X0", or "XXX" chromosomes. Participants with mismatched self-reported and genotype sex traits were treated as missing and excluded from the analysis. Patients who reported themselves to be indeterminate sex were also excluded.

Socio-economic status
Deprivation group	Individuals were categorised into four different deprivation groups based on Index of Multiple Deprivation (IMD) decile as follows: Lowest = Most deprived 10%; Lower middle = More deprived 10-50%; Upper middle = Less deprived 10-50%; Highest = Least deprived 10%
Index of Multiple Deprivation	MDI index (rank) at the time of registration. Higher index indicates more deprivation.

Population structure
Ethnicity	Individuals were grouped into “Asian”, “Black”, “White”, “Mixed”, “Others”, “Unknown” and “Not stated” based on a self-reported questionnaire for ethnicity (the 16+1 ethnic data categories defined in the 2001 census).
Consanguinity^*	This indicates a consanguineous relationship (No / Possible / Yes / Unknown). Consanguinity was defined based on runs of homozygosity from the whole genome SNV variant call set. Missingness in this variable was treated as "Unknown".
Family structure^*	Type of family enrolled in the study (Singleton, Trio, Duo, Other). “Other” includes duos and trios with relatives other than their biological mother; or father or families with more than three members.

Sample source	From “aggregate_gvcf_sample_stats” table.
Sample source	Type of sample: Blood, Fibroblast, Saliva, Tissue or Germline

Sequencing quality^†	From “aggregate_gvcf_sample_stats” table. Germline variants only.
Flow cell version	Version of the flow cell used for the sequencing process.
Callability	Callability is defined as the fraction of non-N reference positions having a passing genotype call.
Array concordance	Concordance rate is the proportion of matching genotype array calls to all non-missing variant sequencing calls.
Contamination	The proportion of a sample that is contaminated with sequence from other humans is calculated using VerifyBAMID (with the parameter FREEMIX). Only participants with a contamination rate of less than 3% were included in the AggV2 data after quality control.
Sequencing coverage	Mean sequencing coverage across all chromosomes and across autosomes only. Coverage is defined as the total number of aligned bases divided by the genome size.
Mapping Quality	Percentage of reads with a map quality score (MAPQ) >=10 as a proportion of total pass-filter (PF) reads. Mean coverage of MAPQ >= 10 reads at 15x.
Average base call quality	Average quality of the base calls. A ratio of the sum of base qualities to total length (Scaled to Phred).
Percent aligned reads	Percentage of reads aligned to the reference genome.
Percent Q30 bases	The total number of bases with a base quality ≥ 30.
Sample error rate	Sequencing error rate calculated using Samtools. Error rate refers to ratio of mismatches to bases mapped (cigar) = N mismatches (from the NM auxiliary tag) / N aligned bases.
Mismatch rate	The average percentage of mismatches across reads 1 and 2 over all cycles.
Fragment length median	Median length of sequenced fragments. The fragment length is calculated based on the locations at which a read pair aligns to the reference.

Genotype
Number of tiered and domain variants	The total number of variants in each tier (1, 2, and 3 for rare disease patients; 1 and 3 for cancer patients) or in each domain (1, 2, and 3).
Indel	Total number of indels (germline)
SNV	Total number of SNVs (germline)
Indel + SNV	Indel + SNV (germline)
Transition/transversion ratio	Transition to transversion ratio (germline)
Indel Het/Hom ratio	Heterozygote / homozygote ratio for indels (germline)

Outcome
Penetrance^*	Defined by the referring clinician at the genetic test ordering stage. “Complete” indicates that the condition is thought to be penetrant, or the pedigree shows potential penetrance. “Incomplete” indicates that penetrance could be complete or incomplete. Singletons (proband-only) are classified as “complete”.
Rare disease type distribution^*	Percentage distribution of patients diagnosed with each type of rare disease within each ancestry group.
Cancer type distribution^**	Percentage distribution of patients diagnosed with each type of cancer within each ancestry group.
Case solved^*	Percentage of participants (probands) whose cases were partially / fully explained by any of the tiered variants. Yes = The variants explained the cases (yes / partially) No = No Unknown = marked as NA or no data recorded for case resolution.

^*Rare disease patients only. ^**Cancer patients only. ^† Blood samples only.

References¶

Zhang, F. et al. Ancestry-agnostic estimation of DNA sample contamination from sequence reads. Genome Res 30, 185–194 (2020).

Summary statistics across genetically-inferred ancestry groups for 100,000 Genomes Project participants¶

Participants¶

Ancestry group classification¶

Characteristics of all participants across populations¶

Characteristics across rare disease participants by ancestry¶

Characteristics across cancer participants by ancestry†¶

Definition of phenotypes¶

References¶

Characteristics across cancer participants by ancestry^†¶