Aggregated Variant Calls (AggV3)
AggV3 is a set of multi-sample VCFs that brings together short variants in germline genomes from 100kGP, NHS GMS and Covid-19 participants. AggV3 was prepared with Illumina DRAGEN's Iterative GVCF Genotyper from genomes aligned using the DRAGEN 3.7.8 pipeline. Because of the size of the data, it is split across multiple VCFs, each representing a segment of the genome; these segments are known as "shards" and "subshards".
AggV3 contains information on participants who have since withdrawn consent from research. You must not use their data in any new analyses. It is extremely important to remove these samples from your analyses and only use samples included in the latest data release.
The latest list of samples for consented participants is held in an S3 bucket within CloudOS (s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/samples/consented_individuals/2026-01-23/aggv3_consented_samples.txt). When working in interactive sessions, you need to mount this file to your session before you can use it. For batch analyses, you can provide the file as a parameter by clicking the button next to the paramValue textbox and navigating to the file in the File Explorer interface.
As AggV3 is a cross-programme dataset, you may need to update the list of consented individuals yourself at a later stage. For the 100,000 Genomes Project and NHS GMS samples, refer to the latest data release and filter the participant table for Consenting in the programme_consent_status column. For the Covid-19 participants, use the list of samples that are part of the latest available release.
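As a minimal sketch of the filtering step for 100kGP and NHS GMS, assuming the participant table has been exported to a CSV file, the consenting rows could be selected as shown here. The column name programme_consent_status and the value Consenting come from the text above; the participant_id column, the CSV layout, the function name and the example rows are assumptions for illustration only.

```python
import csv
import io


def consenting_participants(table, id_column="participant_id"):
    """Return IDs of rows whose programme_consent_status is 'Consenting'.

    `table` is any iterable of CSV lines (an open file works).
    `id_column` is an assumed column name; adjust it to your export.
    """
    reader = csv.DictReader(table)
    return [
        row[id_column]
        for row in reader
        if row["programme_consent_status"].strip() == "Consenting"
    ]


# Tiny made-up table standing in for an exported participant table.
example = io.StringIO(
    "participant_id,programme_consent_status\n"
    "P001,Consenting\n"
    "P002,Withdrawn\n"
    "P003,Consenting\n"
)
print(consenting_participants(example))
```

If the VCFs are keyed by sample ID rather than participant ID, mapping the resulting participants to their sample IDs is a separate step that is not shown here.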
To filter the aggregate to these samples, all bcftools commands should include the flag -S <path_to_consented_participants_list>.
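A minimal sketch of how the -S flag might be attached to a bcftools view call from Python. The file paths are placeholders and the helper name bcftools_view_command is hypothetical; the only bcftools options used are -S (samples file), -r (regions), -O (output type) and -o (output file).

```python
import shlex


def bcftools_view_command(vcf, consented_samples, region=None, output=None):
    """Build a `bcftools view` command restricted to consented samples."""
    cmd = ["bcftools", "view", "-S", consented_samples]
    if region is not None:
        cmd += ["-r", region]  # region queries need an indexed VCF
    if output is not None:
        cmd += ["-O", "z", "-o", output]  # write a bgzip-compressed VCF
    cmd.append(vcf)
    return cmd


# Placeholder paths: substitute the mounted sample list and a real subshard.
cmd = bcftools_view_command(
    "subshard.vcf.gz",
    "aggv3_consented_samples.txt",
    region="chr1:100000-200000",
    output="consented_subset.vcf.gz",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) in a session where bcftools is installed. Building the command in one helper makes it harder to forget the -S flag in any individual call.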
Submit a ticket to the Genomics England Service Desk if you are unsure how to filter the dataset for any other use.
What data is in AggV3?
AggV3 was created using a new set of variant calls on samples aligned to GRCh38 using the Illumina DRAGEN 3.7.8 pipeline. These comprise all available realigned samples from the 100,000 Genomes Project release 19 (including those which were previously only aligned to GRCh37), NHS GMS release 4 and Covid-19 release 7, plus seven samples from Genome in a Bottle (GIAB), summarised below:
| Source | Data release | Number of participants |
|---|---|---|
| 100kGP | 19 | 86,770 |
| NHS GMS | 4 | 30,425 |
| Covid-19 | 7 | 21,204 |
| GIAB | | 7 |
| Total | | 138,406 |
For 100kGP and NHS GMS this includes all available realigned and consented germline genomes for the vast majority of rare disease probands, their family members and cancer participants. For Covid-19, this includes all of the available realigned and consented germline genomes for participants that are part of the mild and severe Covid-19 cohorts.
How many variant sites are there in AggV3?
| Chromosome | Variant sites | Alleles | SNVs | Small insertions and deletions |
|---|---|---|---|---|
| chr1 | 52,135,521 | 64,412,505 | 55,715,547 | 8,696,958 |
| chr2 | 55,095,029 | 66,871,297 | 58,257,976 | 8,613,321 |
| chr3 | 45,277,939 | 54,140,780 | 47,268,995 | 6,871,785 |
| chr4 | 43,541,570 | 51,924,248 | 45,194,377 | 6,729,871 |
| chr5 | 40,512,468 | 48,196,252 | 41,949,891 | 6,246,361 |
| chr6 | 37,774,188 | 44,660,368 | 38,552,985 | 6,107,383 |
| chr7 | 37,437,632 | 45,837,979 | 39,706,602 | 6,131,377 |
| chr8 | 34,891,828 | 41,938,958 | 36,805,125 | 5,133,833 |
| chr9 | 29,854,428 | 37,352,835 | 32,704,293 | 4,648,542 |
| chr10 | 31,134,770 | 38,177,494 | 33,153,036 | 5,024,458 |
| chr11 | 32,601,056 | 41,384,156 | 36,451,508 | 4,932,648 |
| chr12 | 31,043,643 | 39,260,200 | 34,050,700 | 5,209,500 |
| chr13 | 22,816,461 | 28,581,436 | 24,753,579 | 3,827,857 |
| chr14 | 19,698,103 | 23,133,814 | 19,933,320 | 3,200,494 |
| chr15 | 19,487,709 | 23,699,783 | 20,547,505 | 3,152,278 |
| chr16 | 21,047,357 | 26,046,028 | 22,690,983 | 3,355,045 |
| chr17 | 19,004,842 | 23,394,860 | 19,938,680 | 3,456,180 |
| chr18 | 20,133,492 | 26,836,347 | 23,827,328 | 3,009,019 |
| chr19 | 13,480,883 | 16,523,836 | 13,744,960 | 2,778,876 |
| chr20 | 15,292,330 | 19,318,002 | 16,762,801 | 2,555,201 |
| chr21 | 8,621,191 | 10,517,589 | 8,943,165 | 1,574,424 |
| chr22 | 9,079,278 | 11,190,690 | 9,497,913 | 1,692,777 |
| chrX | 27,905,402 | 33,537,294 | 29,235,211 | 4,302,083 |
| chrY | 6,160,144 | 8,054,475 | 7,347,408 | 707,067 |
| chrM | 16,503 | 55,951 | 36,899 | 19,052 |
| chrZ (3341 alt contigs) | 31,265,862 | 52,784,150 | 46,321,867 | 6,462,283 |
| autosomes | 639,961,718 | 783,399,457 | 680,451,269 | 102,948,188 |
| autosomes + chrX | 667,867,120 | 816,936,751 | 709,686,480 | 107,250,271 |
| autosomes + chrX + chrY | 674,027,264 | 824,991,226 | 717,033,888 | 107,957,338 |
| autosomes + chrX + chrY + chrM | 674,043,767 | 825,047,177 | 717,070,787 | 107,976,390 |
| autosomes + chrX + chrY + chrM + chrZ | 705,309,629 | 877,831,327 | 763,392,654 | 114,438,673 |
These counts include non-PASS variants.
How are the VCFs split up?
To make the data more manageable, we have split the VCFs up by genomic region. They are first split into 102 shards of around 30 Mbp each, which are then split into subshards of up to 216,753 variant sites, giving a total of 3,166 subshards.
We provide a shard lookup tool in this documentation to help you find the relevant shard for your research, as well as BED files of the shards so that you can look up the correct shard as part of your analysis.
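To illustrate a BED-based lookup, here is a minimal sketch that finds which shards overlap a region of interest. It assumes a four-column BED layout (chromosome, 0-based start, exclusive end, shard name); the actual column layout of the AggV3 shard BED files may differ, and the shard names and coordinates here are invented for the example.

```python
def load_shards(bed_lines):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    shards = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.split("\t")[:4]
        shards.append((chrom, int(start), int(end), name))
    return shards


def overlapping_shards(shards, chrom, start, end):
    """Return names of shards overlapping the half-open interval [start, end)."""
    return [
        name
        for s_chrom, s_start, s_end, name in shards
        if s_chrom == chrom and s_start < end and start < s_end
    ]


# Invented example shards of 30 Mbp each; not real AggV3 coordinates.
example_bed = [
    "chr1\t0\t30000000\tshard_example_1",
    "chr1\t30000000\t60000000\tshard_example_2",
    "chr2\t0\t30000000\tshard_example_3",
]
shards = load_shards(example_bed)
print(overlapping_shards(shards, "chr1", 29999000, 30001000))
```

A region that straddles a shard boundary, as in the example call, returns more than one shard, so an analysis of that region would need to read each of them.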
How can I access AggV3?
AggV3 is only available on CloudOS. You will be able to analyse the data in Interactive sessions and Batch analyses in the CloudOS platform. We provide a Code Book with example code for accessing genomic and functional analysis data.
To get access to CloudOS, please get in touch with the Service Desk.
Where can I learn more?
You can get more detail on: