Polygenic risk scores (provided by Genomics PLC)¶
We provide polygenic risk scores (PRS) for a subset of our 100,000 Genomes Project participants, generated by Genomics PLC.
About Genomics PLC¶
Genomics PLC are a registered Genomics England Discovery Forum member and use large-scale genetic information to develop innovative precision healthcare tools, and to bring new understanding to drug discovery. In order to achieve this, Genomics PLC are currently running their proprietary algorithms on Genomics England datasets. The PRS made available were generated as part of project RR424.
The data provided comprise polygenic risk scores for twelve traits across ~40,000 individuals. All individuals are part of the aggV2 dataset.
The following traits are included:
|Trait|Abbreviation|
|---|---|
|Epithelial ovarian cancer|EOC|
|Coronary artery disease|CAD|
|Primary open angle glaucoma|POAG|
|Type 2 diabetes|T2D|
Data included and location¶
The PRS values are located in LabKey, in the table genomicsplc_prs_values. This table can be found within the Main Programme folders under the heading "Research Community Provided Data".
The data used as input for the generation of the provided polygenic risk scores can be found as a multi-sample VCF in the following directory:
Genotype data preparation (as provided by Genomics PLC)¶
For PRS validation purposes, the WGS genotypes (~722M variants) were filtered to a variant base list used for PRS model generation, which includes 18,421,839 variants.
Genetic variants used to generate PRS weights were required to meet all of the following criteria:
- an INFO score > 0.8 in UKB
- no significant difference in allele frequency (AF) between UKB and gnomAD (p > 1e-12) or between UKB and 1KG (p > 1e-10)
- an absolute MAF difference between gnomAD and UKB of less than 0.2 per superpopulation
- a Hardy-Weinberg equilibrium (HWE) P > 1e-10 per major ancestry group
- a definitive one-to-one mapping between builds 37 and 38.
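The inclusion criteria above can be read as a single predicate over per-variant QC statistics. The following is a minimal sketch, not the implementation used by Genomics PLC: the record layout, every field name, and the example variants are hypothetical, and only the thresholds are taken from the list above. It assumes each record already summarises one variant within one superpopulation or ancestry group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantQC:
    """Hypothetical QC summary for one variant in one ancestry group."""
    variant_id: str
    info_ukb: float           # INFO score in UKB
    p_af_gnomad: float        # p-value for AF difference, UKB vs gnomAD
    p_af_1kg: float           # p-value for AF difference, UKB vs 1KG
    maf_ukb: float            # minor allele frequency in UKB
    maf_gnomad: float         # minor allele frequency in gnomAD
    p_hwe: float              # Hardy-Weinberg equilibrium p-value
    one_to_one_b37_b38: bool  # unambiguous mapping between builds 37 and 38


def passes_inclusion(v: VariantQC) -> bool:
    """Return True only if the variant meets every inclusion criterion."""
    return (
        v.info_ukb > 0.8
        and v.p_af_gnomad > 1e-12
        and v.p_af_1kg > 1e-10
        and abs(v.maf_gnomad - v.maf_ukb) < 0.2
        and v.p_hwe > 1e-10
        and v.one_to_one_b37_b38
    )


# Made-up example records: the second fails the INFO score threshold.
good = VariantQC("var_a", 0.95, 0.3, 0.4, 0.21, 0.25, 0.5, True)
low_info = VariantQC("var_b", 0.70, 0.3, 0.4, 0.21, 0.25, 0.5, True)

kept = [v.variant_id for v in (good, low_info) if passes_inclusion(v)]
print(kept)  # ['var_a']
```

Expressing the criteria as one boolean function keeps each threshold visible in a single place, which makes it easier to compare against the written list.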
We then took the union of variants across the superpopulations. We excluded:
- the pseudoautosomal region (PAR)
- for single-study runs, any variants with MAF < 0.05 in the 1KG EUR subset
- for cross-ancestry runs, variants with MAF < 0.05 across all superpopulations included in the PRS generation, based on the appropriate 1KG superpopulations
- variants not present in the summary statistics with an INFO score > 0.8, assessed on a trait-by-trait basis.
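The union-then-exclude step described above can be sketched as follows for the single-study case. This is an illustrative sketch under stated assumptions, not the pipeline used by Genomics PLC: the input dictionaries, the `in_par` and `maf_1kg_eur` keys, and the variant identifiers are all hypothetical. Only the MAF < 0.05 and INFO > 0.8 thresholds come from the text, and the cross-ancestry MAF rule is not modelled here.

```python
def build_variant_set(passing_by_superpop, annotations, sumstats_info):
    """Take the union of passing variants, then apply the exclusions.

    passing_by_superpop: dict of superpopulation -> set of variant IDs
    annotations: dict of variant ID -> {"in_par": bool, "maf_1kg_eur": float}
    sumstats_info: dict of variant ID -> INFO score in the trait's
        summary statistics (variants absent from the dict are excluded)
    """
    union = set().union(*passing_by_superpop.values())
    kept = set()
    for vid in union:
        ann = annotations[vid]
        if ann["in_par"]:
            continue  # exclude the PAR region
        if ann["maf_1kg_eur"] < 0.05:
            continue  # single-study rule: MAF < 0.05 in 1KG EUR
        if sumstats_info.get(vid, 0.0) <= 0.8:
            continue  # missing from summary statistics, or INFO <= 0.8
        kept.add(vid)
    return kept


# Made-up example inputs.
passing = {"EUR": {"v1", "v2", "v3"}, "AFR": {"v3", "v4", "v5"}}
annotations = {
    "v1": {"in_par": False, "maf_1kg_eur": 0.20},
    "v2": {"in_par": True, "maf_1kg_eur": 0.30},   # in PAR
    "v3": {"in_par": False, "maf_1kg_eur": 0.01},  # too rare
    "v4": {"in_par": False, "maf_1kg_eur": 0.10},
    "v5": {"in_par": False, "maf_1kg_eur": 0.30},  # not in summary statistics
}
sumstats_info = {"v1": 0.90, "v2": 0.90, "v3": 0.90, "v4": 0.95}

result = build_variant_set(passing, annotations, sumstats_info)
print(sorted(result))  # ['v1', 'v4']
```

Because the summary-statistics check is applied per trait, a function like this would be called once per trait with that trait's `sumstats_info`.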
Because 317,092 variants in the WGS data were not present in our backbone, genotypes were phased and imputed to enable PRS validation using the vader workflow. The 1000G reference panel (v5a), lifted over from GRCh37 to GRCh38 using CrossMap, was used for both phasing and imputation.
Using consented data only¶
This dataset contains information on a subset of participants who may have been withdrawn from research since the first release. Use of their data in any new analyses is not permitted. It is therefore extremely important to remove these samples from your analyses and to ensure that you only use samples included in the latest data release.
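As a minimal sketch of this filtering step, the function below keeps only those PRS records whose participant appears in a set of IDs taken from the latest data release. The column names `participant_id` and `prs_t2d` and the example IDs are hypothetical; how you obtain the set of currently consented participants depends on the release tables available to you.

```python
def keep_consented(prs_rows, consented_ids):
    """Return only the PRS rows whose participant is still consented.

    prs_rows: iterable of dicts, each with a "participant_id" key
        (hypothetical column name)
    consented_ids: iterable of participant IDs in the latest data release
    """
    consented = set(consented_ids)
    return [row for row in prs_rows if row["participant_id"] in consented]


# Made-up example: P002 has been withdrawn since the first release.
prs_rows = [
    {"participant_id": "P001", "prs_t2d": 0.12},
    {"participant_id": "P002", "prs_t2d": -0.40},
    {"participant_id": "P003", "prs_t2d": 1.05},
]
latest_release_ids = ["P001", "P003", "P004"]

filtered = keep_consented(prs_rows, latest_release_ids)
print([row["participant_id"] for row in filtered])  # ['P001', 'P003']
```

Filtering by an explicit allow-list of consented IDs, rather than a list of withdrawn IDs, means any participant missing from the latest release is dropped by default.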
To facilitate the process of filtering for consent, the genomicsplc_prs_values LabKey table can be used to identify consented participants from the
The publication related to this study (Selzam et al. 2022) is not yet available. We will update the page once it is published.
Help and support¶
Please reach out via the Genomics England Service Desk for any queries concerning the PRS values. We will be able to relay these questions to our colleagues at Genomics PLC or answer these ourselves depending on the type of query.