AggV2 allele frequencies¶
Allele frequencies (AFs) are calculated for all variant sites in aggV2, both for the whole cohort and for various subsets of samples. They are provided in indexed flatfiles and within the functionally annotated VCFs.
We used samples from the 100kGP data release 12, listed at
AFs are provided as indexed flatfiles located at:
or included within the VEP 109 functional annotation data at:
AFs are calculated as AF = AC / AN, where AC is the allele count and AN is the allele number.
Because multiallelic variants are stored in a decomposed representation, take care when deriving the true allele number. For half-missing genotypes resulting from multiallelic decomposition (for example ./1), both alleles are counted as alternate alleles (assuming they have depth > 0) when computing the allele number. See "Calculating AFs" for more details.
For more information on how to query the VCF data, please see our aggV2 code book.
We provide multiple sets of AFs, using differing subsets of samples based on the total sample set of aggV2:
| Set | N samples | Sample description | Description |
|---|---|---|---|
| Whole cohort | 77,844 | All samples | This is your go-to column to identify whether a variant is present at all in the cohort. |
| Unrelated cohort | 55,502 | All samples from the unrelated set | All samples contained within the unrelated cohort as defined here. |
| Unrelated by super population | 55,502 | Unrelated samples assigned to the afr, eas, eur, or sas super-populations with a threshold >= 0.8 | These allele frequencies are calculated using genetically inferred ancestries derived as described here. AFs for unassigned and amr samples were not calculated (hence the difference between the total number of samples in the unrelated set and the sum of samples per super population). |
| Unrelated by super population | 1,558 | afr | as above |
| Unrelated by super population | 424 | eas | as above |
| Unrelated by super population | 45,458 | eur | as above |
| Unrelated by super population | 4,450 | sas | as above |
| Cancer | 15,146 | All cancer cohort samples | Allele frequencies for all cancer samples contained within aggV2. |
| Cancer unrelated | 14,841 | Unrelated samples from the cancer cohort | This set includes only unrelated germline samples from the cancer cohort. The cancer cohort is largely unrelated, hence the small difference in sample number between this and the cancer set. |
| RD | 62,698 | The whole RD cohort | This set includes all samples found in the rare disease cohort within aggV2. |
| RD probands | 29,602 | Probands recruited to the rare disease cohort | This set includes all probands from the rare disease cohort. Cryptic relationships may exist, as related individuals have not been filtered out. |
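The super-population rows in the table depend on a thresholded ancestry assignment. As a rough illustration only, the sketch below assumes each sample has a per-super-population score between 0 and 1 (the `scores` dictionary and the function name are hypothetical and not part of aggV2), keeps a sample when its top score is at least 0.8, and leaves amr and unassigned samples out, as the table describes.

```python
from typing import Dict, Optional

# Super-populations for which AFs are reported, per the table above.
REPORTED = ("afr", "eas", "eur", "sas")
THRESHOLD = 0.8  # threshold quoted in the table


def assign_super_population(scores: Dict[str, float]) -> Optional[str]:
    """Return the assigned super-population, or None if the sample is
    unassigned or falls in a group for which AFs are not reported.

    `scores` is a hypothetical mapping of super-population label to an
    inferred ancestry score; the real input format is an assumption.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if scores[best] >= THRESHOLD and best in REPORTED:
        return best
    return None


print(assign_super_population({"eur": 0.95, "sas": 0.05}))
print(assign_super_population({"eur": 0.60, "sas": 0.40}))
print(assign_super_population({"amr": 0.90, "eur": 0.10}))
```

Samples for which this returns `None` would contribute to the unrelated cohort total but to none of the per-super-population counts.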
Sample filtering was conducted only as described above. No further filtering was applied on sample features, such as whether the sample originated from blood or saliva. For details on sample QC and filtering for aggV2, please see here.
Allele frequency data are provided only for the autosomes and chromosome X.
Format and location of files¶
Files are provided as flat text files, with the first six columns comprising the
OLD_MULTIALLELIC lines from the functional annotation files. These are followed by AFs and associated information for each of the sets described above.
These are provided at file_path_to_AFs, as well as within the functional annotation files provided at
The AF files contain the constituent values for an AF for each set. For example, for the afr super population, columns provided include the
Calculating AFs¶
The allele frequency is calculated by dividing the allele count (AC) by the allele number (AN) per row. However, decomposition of SNPs for multiallelic variants means that BCFtools and other tools are unable to properly ascertain the AC and AN. Therefore we implement a method that accounts for these variant representations.
The approach is as follows:
- Filter to the required samples
- Filter out all genotypes (GT) with a depth of 0 (including alt alleles)
- Count the total number of 1s per row (this is the AC)
- Count the occurrences of all ".", 1s and 0s (this is the AN)
- Divide the AC by the AN
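The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the input layout (a dictionary of sample ID to a GT string and a DP value), the function names, and the genotypes in the example row are all assumptions made for the sketch.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical input: sample ID -> (GT string such as "0/1" or "./1", DP or None)
Row = Dict[str, Tuple[str, Optional[int]]]


def count_alleles(row: Row, samples: Iterable[str]) -> Tuple[int, int]:
    """Return (AC, AN) for one decomposed variant row."""
    ac = 0
    an = 0
    for sample in samples:  # step 1: restrict to the required samples
        if sample not in row:
            continue
        gt, dp = row[sample]
        # Step 2: genotypes with a depth of 0 (or no depth) are treated as missing.
        if dp is None or dp <= 0:
            continue
        alleles = re.split(r"[/|]", gt)
        # Step 3: every "1" is an alternate allele for this row.
        ac += alleles.count("1")
        # Step 4: ".", "1" and "0" all count towards the allele number.
        an += sum(1 for allele in alleles if allele in (".", "0", "1"))
    return ac, an


def allele_frequency(row: Row, samples: Iterable[str]) -> Optional[float]:
    """Step 5: divide AC by AN; returns None when no alleles were counted."""
    ac, an = count_alleles(row, samples)
    return ac / an if an else None


# Made-up genotypes for illustration only.
example_row: Row = {
    "S1": ("0/0", 30),
    "S2": ("0/1", 25),
    "S3": ("./1", 18),  # half-missing after decomposition: both alleles count in AN
    "S4": ("1/1", 0),   # called with zero depth: ignored
    "S5": ("./.", 0),   # missing: ignored
}
ac, an = count_alleles(example_row, ["S1", "S2", "S3", "S4", "S5"])
print(ac, an, allele_frequency(example_row, list(example_row)))
```

In this made-up row, the half-missing genotype of S3 adds two alleles to the AN, while the zero-depth call for S4 adds nothing to either the AC or the AN.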
We round the allele frequencies to three decimal places if AF >= 0.01, and report them in scientific notation to three decimal places if AF < 0.01. The AC and AN used for each set are provided within the data if you require greater precision.
Note: variants may be called with 0 depth; this is intended behaviour of the variant caller. We choose to treat such variants as missing for this process.
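As a small sketch of this rounding rule, the function below formats an AF from its AC and AN using Python's standard format specifiers. The function name is hypothetical, the exact string layout in the released files (for example how the exponent is written, or how an AF of zero or an AN of zero is represented) is not specified here, and returning `None` for an AN of zero is an assumption.

```python
from typing import Optional


def format_af(ac: int, an: int) -> Optional[str]:
    """Format AC / AN following the rounding rule described above."""
    if an == 0:
        return None  # assumption: no AF can be reported without any alleles
    af = ac / an
    if af >= 0.01:
        return f"{af:.3f}"  # fixed notation, three decimal places
    return f"{af:.3e}"      # scientific notation, three decimal places


print(format_af(1, 3))
print(format_af(1, 100))
print(format_af(1, 1000))
```

When more precision is needed, recomputing `ac / an` directly from the provided AC and AN columns avoids the rounding entirely.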
In the synthetic data example below, the AC, AN and AF columns are provided by the BCFtools plugin "fill-tags". The manual_AF column shows the result calculated using the method above.
In the example above, we see that treating each "." with a DP > 0 as an allele results in the correct total number of alleles being estimated. We also see that ALT alleles with a DP of 0 do not contribute to the allele count or allele number, and are therefore treated the same as missing genotypes.
Comparisons to gnomAD data¶
Below we show comparisons of our unrelated, per-super-population AF data with gnomAD. Data below were assembled by taking all variants from 15 random autosomal chunks, and then subsetting gnomAD v3.1.1 data (path: /public_data_resources/gnomad/v3.1.1/vcf) to the same regions.
We show this across two super populations, comparing eur with non-Finnish European (NFE) from gnomAD, and sas across both cohorts. The number of variants plotted in each graph is denoted by N in the top left-hand corner of the plot. NAs refer to variants that are found in one of the subgroups (e.g. in GEL eur) but not in the comparison set, or vice versa. When looking at pass variants (pass in both aggV2 and gnomAD), an NA means that the variant does exist in both datasets, just not in the specific subset being examined (e.g. it may exist in afr but not in nfe within the gnomAD data).
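To make the notion of NAs concrete, the sketch below pairs AFs from two subsets by variant and counts how many variants have a value in only one of them. It is an illustration under assumptions: the variant key layout, the function names and the AF values are invented for the example and do not come from aggV2 or gnomAD.

```python
from typing import Dict, List, Optional, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt), an assumed key
AFTable = Dict[VariantKey, float]
Pair = Tuple[VariantKey, Optional[float], Optional[float]]


def pair_afs(agg_af: AFTable, gnomad_af: AFTable) -> List[Pair]:
    """Outer-join two AF tables; a missing value on either side becomes None."""
    keys = sorted(set(agg_af) | set(gnomad_af))
    return [(key, agg_af.get(key), gnomad_af.get(key)) for key in keys]


def summarise(pairs: List[Pair]) -> Dict[str, int]:
    """Count variants seen in both subsets and those that are NA on one side."""
    return {
        "in_both": sum(1 for _, a, g in pairs if a is not None and g is not None),
        "na_in_gnomad": sum(1 for _, a, g in pairs if a is not None and g is None),
        "na_in_aggv2": sum(1 for _, a, g in pairs if a is None and g is not None),
    }


# Invented values for illustration only.
agg_eur: AFTable = {("chr1", 1000, "A", "G"): 0.12, ("chr1", 2000, "C", "T"): 0.004}
gnomad_nfe: AFTable = {("chr1", 1000, "A", "G"): 0.11, ("chr1", 3000, "G", "A"): 0.25}

pairs = pair_afs(agg_eur, gnomad_nfe)
summary = summarise(pairs)
print(summary)
```

Only the variants counted under `in_both` can be placed on an AF-versus-AF scatter; whether the N shown on each plot includes the NA variants is not stated here, so the sketch reports the counts separately.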
All variants (SNPs and InDels, pass and fail)¶
All pass variants (SNPs and InDels, pass in both aggV2 and gnomAD)¶
Help and support¶
Please reach out via the Genomics England Service Desk for any issues related to the aggV2 aggregation or companion datasets, including "aggV2" in the title/description of your inquiry.