Aggregated Variant Calls (AggV3)

AggV3 is a set of multi-sample VCFs that brings together short variants from the germline genomes of 100kGP, NHS GMS and Covid-19 participants. AggV3 was prepared with Illumina DRAGEN's Iterative GVCF Genotyper from genomes aligned using the DRAGEN 3.7.8 pipeline. Because of the size of the data, it is split across multiple VCFs known as "shards" and "subshards", each covering a segment of the genome.

AggV3 contains data from participants who have since withdrawn consent for research. You must not use their samples in any new analyses. It is extremely important to remove these samples from your analyses and to use only samples included in the latest data release.

The most recent list of samples for consented participants can be found in an S3 bucket within CloudOS (s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/samples/consented_individuals/2026-01-23/aggv3_consented_samples.txt). When working within interactive sessions, you will need to mount this file to your session before you can use it. For batch analysis, you can provide the file as a parameter by clicking the button next to the paramValue textbox and navigating to the file within the File Explorer interface.

As AggV3 is a cross-programme dataset, you may need to update the list of consented individuals yourself at a later stage. For 100,000 Genomes Project and NHS GMS samples, refer to the latest data release and filter the participant table for Consenting in the programme_consent_status column. For Covid-19 participants, use the list of samples that are part of the latest available release.
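The filtering step described above can be sketched in Python. This is a minimal illustration only: it assumes the participant table has been exported as a tab-separated file, and the `participant_id` column name and the `Withdrawn` status value are hypothetical placeholders. Only the `programme_consent_status` column and the `Consenting` value come from this page. Mapping participants to the sample IDs used in the VCF headers is a separate step that is not shown here.

```python
import csv
import io


def consenting_ids(table_file, id_column="participant_id"):
    """Return the IDs of rows whose programme_consent_status is 'Consenting'.

    table_file: an open text file containing a tab-separated table with a header row.
    id_column:  name of the identifier column (hypothetical default).
    """
    reader = csv.DictReader(table_file, delimiter="\t")
    return [
        row[id_column]
        for row in reader
        if row["programme_consent_status"].strip() == "Consenting"
    ]


# Tiny made-up table standing in for an exported participant table.
example = io.StringIO(
    "participant_id\tprogramme_consent_status\n"
    "P001\tConsenting\n"
    "P002\tWithdrawn\n"
    "P003\tConsenting\n"
)
print(consenting_ids(example))
```

With a real export you would pass an open file handle instead of the in-memory example.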

To filter the aggregate to these samples, all bcftools commands should include the flag -S <path_to_consented_participants_list>.
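As a rough sketch of what this looks like in practice, the helper below assembles a `bcftools view` command that always carries the `-S` flag. The function name and all file names are hypothetical; substitute the path where you mounted the consented sample list and the subshard you are working with. The command is only built and printed here, not executed.

```python
import shlex


def bcftools_view_cmd(vcf_path, samples_file, out_path, region=None):
    """Build a bcftools view command restricted to the samples in samples_file."""
    # -S restricts the output to the sample names listed one per line in the file.
    cmd = ["bcftools", "view", "-S", samples_file]
    if region:
        # -r limits the output to a region; it requires an indexed input VCF.
        cmd += ["-r", region]
    # -Oz writes bgzip-compressed VCF to the path given by -o.
    cmd += ["-Oz", "-o", out_path, vcf_path]
    return cmd


# Hypothetical file names, for illustration only.
cmd = bcftools_view_cmd(
    "subshard.vcf.gz",
    "aggv3_consented_samples.txt",
    "consented.vcf.gz",
    region="chr1:100000-200000",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where bcftools is installed.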

Submit a ticket to the Genomics England Service Desk if you are unsure how to filter the dataset for any other use.

What data is in AggV3?

AggV3 was created using a new set of variant calls on samples aligned to GRCh38 using the Illumina DRAGEN 3.7.8 pipeline. These comprise all available realigned samples from the 100,000 Genomes Project release 19 (including those which were previously only aligned to GRCh37), NHS GMS release 4 and Covid-19 release 7, plus seven samples from Genome in a Bottle (GIAB), summarised below:

Source	Data release	Number of participants
100kGP	19	86,770
NHS GMS	4	30,425
Covid-19	7	21,204
GIAB		7
Total		138,406

For 100kGP and NHS GMS, this includes all available realigned and consented germline genomes for the vast majority of rare disease probands, their family members and cancer participants. For Covid-19, this includes all available realigned and consented germline genomes for participants who are part of the mild and severe Covid-19 cohorts.

How many variant sites are there in AggV3?

Chromosome	Variant sites	Alleles	SNVs	Small insertions and deletions
chr1	52,135,521	64,412,505	55,715,547	8,696,958
chr2	55,095,029	66,871,297	58,257,976	8,613,321
chr3	45,277,939	54,140,780	47,268,995	6,871,785
chr4	43,541,570	51,924,248	45,194,377	6,729,871
chr5	40,512,468	48,196,252	41,949,891	6,246,361
chr6	37,774,188	44,660,368	38,552,985	6,107,383
chr7	37,437,632	45,837,979	39,706,602	6,131,377
chr8	34,891,828	41,938,958	36,805,125	5,133,833
chr9	29,854,428	37,352,835	32,704,293	4,648,542
chr10	31,134,770	38,177,494	33,153,036	5,024,458
chr11	32,601,056	41,384,156	36,451,508	4,932,648
chr12	31,043,643	39,260,200	34,050,700	5,209,500
chr13	22,816,461	28,581,436	24,753,579	3,827,857
chr14	19,698,103	23,133,814	19,933,320	3,200,494
chr15	19,487,709	23,699,783	20,547,505	3,152,278
chr16	21,047,357	26,046,028	22,690,983	3,355,045
chr17	19,004,842	23,394,860	19,938,680	3,456,180
chr18	20,133,492	26,836,347	23,827,328	3,009,019
chr19	13,480,883	16,523,836	13,744,960	2,778,876
chr20	15,292,330	19,318,002	16,762,801	2,555,201
chr21	8,621,191	10,517,589	8,943,165	1,574,424
chr22	9,079,278	11,190,690	9,497,913	1,692,777
chrX	27,905,402	33,537,294	29,235,211	4,302,083
chrY	6,160,144	8,054,475	7,347,408	707,067
chrM	16,503	55,951	36,899	19,052
chrZ (3,341 alt contigs)	31,265,862	52,784,150	46,321,867	6,462,283
autosomes	639,961,718	783,399,457	680,451,269	102,948,188
autosomes + chrX	667,867,120	816,936,751	709,686,480	107,250,271
autosomes + chrX + chrY	674,027,264	824,991,226	717,033,888	107,957,338
autosomes + chrX + chrY + chrM	674,043,767	825,047,177	717,070,787	107,976,390
autosomes + chrX + chrY + chrM + chrZ	705,309,629	877,831,327	763,392,654	114,438,673

These counts include non-PASS variants.

How are the VCFs split up?

To make the data more manageable, we have split the VCFs up by genomic region. They are first split into 102 shards, each around 30 Mbp, which are then further split into subshards of up to 216,753 variant sites, giving a total of 3,166 subshards.

We provide a shard lookup tool in this documentation to help you find the relevant shard for your research, as well as BED files of the shards so that you can look up the correct shard as part of your analysis.
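A lookup against a shard BED file could be sketched as follows. This is an illustration under stated assumptions: it assumes a standard four-column BED layout (chromosome, 0-based start, end, shard name), and the shard names and coordinates in the example are made up rather than taken from the real AggV3 BED files.

```python
import io


def load_shards(bed_file):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    shards = []
    for line in bed_file:
        line = line.strip()
        # Skip blank lines and common BED header or comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        shards.append((chrom, int(start), int(end), name))
    return shards


def shards_for_region(shards, chrom, start, end):
    """Names of shards overlapping the 1-based inclusive region chrom:start-end."""
    # BED intervals are 0-based and half-open, so convert the query to match.
    q_start, q_end = start - 1, end
    return [
        name
        for c, s, e, name in shards
        if c == chrom and s < q_end and q_start < e
    ]


# Made-up BED content standing in for a real shard BED file.
example_bed = io.StringIO(
    "chr1\t0\t30000000\tshard_001\n"
    "chr1\t30000000\t60000000\tshard_002\n"
    "chr2\t0\t30000000\tshard_010\n"
)
shards = load_shards(example_bed)
# A region spanning the boundary between the two chr1 example shards.
print(shards_for_region(shards, "chr1", 29_999_990, 30_000_010))
```

A region that spans a shard boundary returns more than one shard, so downstream code should be prepared to read from several files.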

How can I access AggV3?

AggV3 is only available on CloudOS. You can analyse the data in Interactive sessions and Batch analyses on the CloudOS platform. We provide a Code Book with example code for accessing genomic and functional analysis data.

To get access to CloudOS, please get in touch with the Service Desk.

Where can I learn more?

You can get more detail on: