aggV2 phased data (provided by University of Oxford)¶
As of Main Programme Data Release 15, we provide phased data for aggV2, which was produced and kindly provided by Sinan Shi from the University of Oxford. Please see the attached document for the full list of contributors to this project. The phased dataset was generated as part of project RR91 and was used as a reference panel for the imputation of the UK Biobank dataset.
Using the phased dataset¶
This phased dataset contains information on a subset of participants who have since been withdrawn from research. Use of their data in any new analyses is not permitted. It is therefore extremely important to remove these samples from your analyses and ensure that you are only using samples included in the latest data release.
The list of samples for the consented participants can be found in the 'aggregate_gvcf_sample_stats' table in LabKey, for the latest data release.
For the main programme version 15 data release, the list of consented samples is given in the file main_programme_v15_samples.txt, located in the folder /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/
To filter the aggregate to these samples, all bcftools commands should include the flag -S /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/main_programme_v15_samples.txt
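As an illustration of the flag above, the following minimal Python sketch assembles a `bcftools view` command that restricts a VCF to the consented samples. The input and output file names are hypothetical placeholders (the real per-chromosome file names are not given here), and the optional region string assumes `chr`-prefixed contig names. The `-S`, `-r`, `-Oz` and `-o` options are standard `bcftools view` options.

```python
import shlex

# Path quoted in the text above for the v15 consented-sample list.
SAMPLES_FILE = (
    "/gel_data_resources/main_programme/aggregation/"
    "aggregate_gVCF_strelka/aggV2/docs/main_programme_v15_samples.txt"
)


def bcftools_view_cmd(in_vcf, out_vcf, samples_file=SAMPLES_FILE, region=None):
    """Build a `bcftools view` command restricted to the consented samples."""
    cmd = ["bcftools", "view", "-S", samples_file]
    if region is not None:
        # Optional region restriction; requires an indexed input file.
        cmd += ["-r", region]
    # Write bgzip-compressed VCF to out_vcf, reading from in_vcf.
    cmd += ["-Oz", "-o", out_vcf, in_vcf]
    return cmd


# Hypothetical file names, for illustration only.
cmd = bcftools_view_cmd("phased_chr21.vcf.gz", "phased_chr21.consented.vcf.gz")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where bcftools is available. Note that bcftools reports an error if the sample file lists samples that are absent from the VCF, so you may need to intersect the list with the VCF header first.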
Submit a ticket to the Genomics England Service Desk if you are unsure how to filter the dataset for any other use.
The data is provided in two directories within
||multi-sample VCF files containing phased genotypes for all samples in aggV2. Files provided per chromosome (with index files).|
||VCF files containing allele frequencies for the final, phased reference panel in the INFO field (INFO/AF). AC and AN are also available. Files provided per chromosome (with index files).|
The data provided consists of 78,195 individuals, consistent with the aggV2 data. However, for any new analysis, please remember to include only participants from the latest Data Release (see above banner). The data contains over 342 million small variants (SNPs and short indels) across chromosomes 1-22 of the aggV2 dataset.
The following variant quality control filters were applied in two batches. The filters in the second column (batch 1) were applied across all sites. The more lenient filters in the third column (batch 2) then allowed for the inclusion of ~1M additional common variants. The two sets were phased separately and then merged into the final reference panel.
|Metric||Filter applied - all sites (batch 1)||Filter applied - sites with AF>0.001 (batch 2)|
|Genotype quality (GQ) and depth (DP)||individual genotypes with GQ<15 or DP<10 were masked as missing||Same as for sites with AF<=0.001|
|Missingness||After application of the GQ and DP filter above, all sites that had a missing rate >5% were removed. Half-missing genotypes were considered as missing for this filter.||After application of the GQ and DP filter above, all sites that had a missing rate >25% were removed. Half-missing genotypes were considered as missing for this filter.|
|Allele Balance (for het calls)||ABhet was calculated for each het genotype as AD_REF/(AD_REF+AD_ALT), where AD_REF and AD_ALT are the allele depths (AD) for the REF and ALT alleles respectively. Het calls with 0.25<ABhet<0.75 were marked as PASS. Finally, sites where <75% of the het calls were PASS were removed.||Same as for sites with AF<=0.001|
|Mendelian inconsistencies||Sites with allele frequency <0.001 were removed if they had more than 3 Mendelian inconsistencies among all duo and trio families; sites with allele frequency >=0.001 were removed if they had more than 7.||Sites with more than 250 Mendelian inconsistencies among all duo and trio families were removed.|
|Deviations from Hardy-Weinberg Equilibrium (HWE)||Sites where the HWE p-value was < 1e-05 in self-reported white British individuals were removed||Same as for sites with AF<=0.001|
|Comparison with gnomAD allele frequencies||Sites that showed discrepancies in the allele frequencies (AF) between the Genomics England and the gnomAD dataset (v3.1.1), as indicated by a Fisher's exact test with a p-value of 1e-10, were removed. AFs were compared across the whole aggregate cohort, without accounting for ancestry and relatedness.||Sites that showed discrepancies in the allele frequencies (AF) between the Genomics England and the gnomAD dataset (v3.1.1), as indicated by a Fisher's exact test with a p-value of 1e-20, were removed. AFs were compared across the whole aggregate cohort, without accounting for ancestry and relatedness.|
|Unrelated singletons||Singletons that did not occur in families were removed|
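
To make the genotype-level filters in the table above more concrete, here is a minimal Python sketch of the GQ/DP masking, missingness and allele balance rules for a single site. It is an illustration under stated assumptions, not the pipeline used to build the panel: the data layout (tuples of genotype, GQ and DP, and pairs of allele depths) and all function names are hypothetical, and the thresholds are taken from the batch 1 column.

```python
def is_missing(gt, gq, dp, min_gq=15, min_dp=10):
    """True if a call counts as missing for the missingness filter."""
    if gt[0] is None or gt[1] is None:
        return True  # fully missing or half-missing genotype
    return gq < min_gq or dp < min_dp  # masked by the GQ/DP filter


def missing_rate(calls):
    """calls: list of (genotype, GQ, DP) tuples for one site."""
    return sum(is_missing(*c) for c in calls) / len(calls)


def ab_het(ad_ref, ad_alt):
    """Allele balance of one het call; assumes AD_REF + AD_ALT > 0."""
    return ad_ref / (ad_ref + ad_alt)


def het_pass_fraction(het_depths, lo=0.25, hi=0.75):
    """het_depths: list of (AD_REF, AD_ALT) for the het calls at one site."""
    passed = sum(lo < ab_het(r, a) < hi for r, a in het_depths)
    return passed / len(het_depths)


def keep_site(calls, het_depths, max_missing=0.05, min_het_pass=0.75):
    """Apply the missingness and allele balance filters to one site."""
    if missing_rate(calls) > max_missing:
        return False
    if het_depths and het_pass_fraction(het_depths) < min_het_pass:
        return False
    return True


# Toy site: the third call is half-missing and the fourth has GQ < 15.
calls = [((0, 1), 30, 20), ((0, 0), 40, 25), ((None, 1), 50, 30), ((1, 1), 10, 30)]
het_depths = [(10, 10), (3, 17), (12, 8), (9, 11)]
print(missing_rate(calls), het_pass_fraction(het_depths))
```

Passing `max_missing=0.25` would correspond to the more lenient missingness threshold listed for batch 2.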
Please see the attached accompanying document for more information on the filter applied for Mendelian inconsistencies and for a comparison between the variants in the phased dataset and those in the HRC and TOPMed reference panels.
Haplotype phasing was carried out using SHAPEIT4.2.2, employing a multi-stage strategy. In the first stage, makeScaffold was used to determine the phase of as many genotypes in each duo and trio as possible. The vast majority of genotypes were phased using this process; the small number of genotypes whose phase was ambiguous due to heterozygosity or missingness patterns were phased using SHAPEIT4.2.2.
To phase the unrelated samples, a phased scaffold of common variants was created, and then the remaining variants were phased onto this scaffold. The scaffold contained phased common variants with a minor allele frequency of 0.01, using the phased related samples as the reference panel. The remaining rarer variants were then phased onto the scaffold in chunks containing around 300,000 sites with 30,000 sites on each side as buffer. The phased duo/trio dataset was used as a reference panel in this step. The chunks were merged and concatenated using bcftools.
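The chunking scheme described above can be sketched as follows. This is a simplified Python illustration, assuming chunks are defined by site index with a fixed core of 300,000 sites and a 30,000-site buffer on each side; the function name and the exact boundary handling are assumptions, and the authors' actual chunk boundaries may differ.

```python
def make_chunks(n_sites, core=300_000, buffer=30_000):
    """Split n_sites into core windows padded by a buffer on each side.

    Returns a list of (buf_start, core_start, core_end, buf_end) tuples
    using half-open, 0-based site indices. Buffers are clipped at the
    ends of the chromosome.
    """
    chunks = []
    for core_start in range(0, n_sites, core):
        core_end = min(core_start + core, n_sites)
        buf_start = max(0, core_start - buffer)
        buf_end = min(n_sites, core_end + buffer)
        chunks.append((buf_start, core_start, core_end, buf_end))
    return chunks


# Toy chromosome with 700,000 sites.
for chunk in make_chunks(700_000):
    print(chunk)
```

Each chunk would be phased over its buffered range, with the overlapping buffer regions giving adjacent chunks shared sites that can be used to keep haplotypes consistent when the chunks are joined.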
For more information on the process, please see the attached document.
In order to assess the quality of the phased reference panel, the authors phased the parents of mother-father-child trios from the 1000 Genomes Project using the Genomics England phased dataset as reference. Phasing accuracy was measured using the switch error rate: the number of switches required to turn the inferred haplotype phase into the true one, divided by the number of possible switches (the number of heterozygous calls minus 1). The accuracy check was carried out on 589 trio families from diverse ethnic backgrounds. The mean switch error rate was 0.18%, 0.33%, 0.31% and 0.73% for individuals of European, African, South Asian and East Asian ancestry, respectively. More information can be found in the attached document.
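The switch error rate can be illustrated with a short Python sketch. It assumes each individual's phase is encoded as the allele (0 or 1) carried on the first haplotype at each heterozygous site, in genomic order; this encoding and the function name are illustrative assumptions rather than the authors' implementation.

```python
def switch_error_rate(true_h1, inferred_h1):
    """Switch error rate between a true and an inferred phasing.

    Both arguments list, for each het site in order, the allele (0 or 1)
    on the first haplotype. A switch is counted wherever the agreement
    between the two phasings changes from one het site to the next.
    """
    if len(true_h1) != len(inferred_h1):
        raise ValueError("phasings must cover the same het sites")
    n_het = len(true_h1)
    if n_het < 2:
        return None  # no pair of adjacent het sites, so no possible switch
    agree = [t == p for t, p in zip(true_h1, inferred_h1)]
    switches = sum(a != b for a, b in zip(agree, agree[1:]))
    return switches / (n_het - 1)


# One switch after the second het site, out of four possible switches.
print(switch_error_rate([0, 1, 0, 1, 1], [0, 1, 1, 0, 0]))
```

Because only relative phase matters, an inferred phasing that swaps the two haplotypes at every site yields a switch error rate of zero under this definition.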
Link to publication¶
To be added once available.
Help and support¶
Please reach out via the Genomics England Service Desk for any queries concerning the phased data of aggV2. We will be able to relay these questions to our colleagues at the University of Oxford or answer these ourselves depending on the type of query.