
Polygenic risk scores (provided by Genomics PLC)

We provide polygenic risk scores (PRS) for a subset of our 100,000 Genomes Project participants, generated by Genomics PLC.

About Genomics PLC

Genomics PLC are a registered Genomics England Discovery Forum member and use large-scale genetic information to develop innovative precision healthcare tools, and to bring new understanding to drug discovery. In order to achieve this, Genomics PLC are currently running their proprietary algorithms on Genomics England datasets. The PRS made available were generated as part of project RR424.

Data description

The data provided comprise polygenic risk scores for twelve traits across ~40,000 individuals. All individuals are part of the aggV2 dataset.

Traits included

The following traits are included:

Trait                          Trait Code
Atrial Fibrillation            AF
Bowel Cancer                   CRC
Breast Cancer                  BC
Cardiovascular disease         CVD
Epithelial ovarian cancer      EOC
Hypertension                   HT
Ischaemic stroke               ISS
Coronary artery disease        CAD
Osteoporosis                   OP
Prostate cancer                PC
Primary open angle glaucoma    POAG
Type 2 diabetes                T2D

Data included and location

PRS values

The PRS values are located within LabKey, within the table genomicsplc_prs_values. This table can be found within the 100kGP folders under the heading of "Research Community Provided Data".
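A minimal sketch of reading this table from Python, assuming the `labkey` Python package is available in your environment. The server domain, release folder (container path), context path and the `lists` schema name are all assumptions that you must check against your own setup; none of them is stated on this page.

```python
TABLE_NAME = "genomicsplc_prs_values"  # table name as given on this page


def connect(domain, container_path, context_path="labkey"):
    """Build a LabKey client. All three arguments are assumptions you must supply."""
    # Imported lazily so the rest of the sketch can be read without the package.
    from labkey.api_wrapper import APIWrapper

    return APIWrapper(domain, container_path, context_path, use_ssl=True)


def fetch_prs_rows(api, schema_name="lists"):
    """Return the rows of the PRS table as a list of dicts.

    schema_name="lists" is an assumption; check the schema browser if it fails.
    """
    response = api.query.select_rows(schema_name=schema_name, query_name=TABLE_NAME)
    return response["rows"]
```

Typical use would be `rows = fetch_prs_rows(connect(domain, container_path))`, with both arguments taken from your own Research Environment configuration.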

Genotype data

The data used as input for the generation of the provided polygenic risk scores can be found as a multi-sample VCF in the following directory:

/gel_data_resources/main_programme/prs_values_genomicsplc/100k/20220526/genotypes/
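Before working with the genotypes it can be useful to list which samples the multi-sample VCF contains. The sketch below reads only the header, using the Python standard library. The function name is illustrative, and the file name inside the directory is not given on this page, so the path must be supplied by you.

```python
import gzip


def vcf_sample_ids(vcf_path):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    # bgzip-compressed VCFs are valid gzip streams, so gzip.open can read them.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached variant records without seeing a header line
    raise ValueError("no #CHROM header line found")
```

Because the function stops at the header line, it does not read the variant records and stays fast even for a very large file.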

Genotype data preparation (as provided by Genomics PLC)

For PRS validation purposes, the WGS genotypes (~722M variants) were filtered to a variant base list used for PRS model generation, which includes 18,421,839 variants.

Genetic variants used to generate PRS weights were required to have:

  • an INFO score > 0.8 in UKB
  • no significant difference in allele frequency (AF) between UKB and gnomAD (p > 1e-12) or between UKB and 1KG (p > 1e-10)
  • an absolute MAF difference between gnomAD and UKB of less than 0.2 per superpopulation
  • an HWE P > 1e-10 per major ancestry group
  • a definitive one-to-one mapping between builds 37 and 38.
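The inclusion criteria above can be read as a single per-variant predicate, evaluated within each superpopulation or ancestry group. The sketch below is illustrative only: the class and field names are hypothetical and do not reflect the schema or code used by Genomics PLC. Only the thresholds are taken from the list.

```python
from dataclasses import dataclass


@dataclass
class VariantQC:
    # All field names are hypothetical; they are not the provider's schema.
    info_ukb: float             # imputation INFO score in UKB
    p_af_diff_gnomad: float     # p-value for the AF difference, UKB vs gnomAD
    p_af_diff_1kg: float        # p-value for the AF difference, UKB vs 1KG
    abs_maf_diff_gnomad: float  # |MAF(gnomAD) - MAF(UKB)| in one superpopulation
    hwe_p: float                # HWE p-value in one major ancestry group
    one_to_one_b37_b38: bool    # unambiguous mapping between builds 37 and 38


def passes_weight_criteria(v):
    """True if the variant meets every inclusion threshold listed above."""
    return (
        v.info_ukb > 0.8
        and v.p_af_diff_gnomad > 1e-12
        and v.p_af_diff_1kg > 1e-10
        and v.abs_maf_diff_gnomad < 0.2
        and v.hwe_p > 1e-10
        and v.one_to_one_b37_b38
    )
```

All comparisons are strict, matching the ">" and "less than" wording of the criteria, so a variant with an INFO score of exactly 0.8 does not pass.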

We then took the union of variants across the superpopulations. We excluded:

  • indels
  • the PAR region
  • any variants with MAF < 0.05 in the 1KG EUR subset, for single-study runs
  • variants with MAF < 0.05 across all superpopulations included in the PRS generation, for cross-ancestry runs, based on the appropriate 1KG superpopulations
  • variants not present in the summary statistics with an INFO score > 0.8, on a trait-by-trait basis.
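The union step and the first four exclusions can be sketched as follows. This is an illustration under stated assumptions, not the provider's code: the function and argument names are hypothetical, an indel is approximated as any REF/ALT length change, and the cross-ancestry rule is read as "excluded only if MAF < 0.05 in every included superpopulation". The trait-specific summary-statistics filter is left out.

```python
MAF_THRESHOLD = 0.05  # from the exclusion list above


def union_of_variants(passing_by_superpop):
    """Union of the variant IDs that passed in each superpopulation."""
    return set().union(*passing_by_superpop.values())


def is_excluded(ref, alt, in_par, relevant_mafs):
    """Apply the indel, PAR and MAF exclusions to one variant.

    relevant_mafs: 1KG MAFs to test; [EUR MAF] for a single-study run, or one
    MAF per included superpopulation for a cross-ancestry run (assumed reading:
    excluded only if rare in every one of them).
    """
    is_indel = len(ref) != len(alt)  # simplification: length change means indel
    too_rare = all(maf < MAF_THRESHOLD for maf in relevant_mafs)
    return is_indel or in_par or too_rare
```

For example, `is_excluded("A", "G", False, [0.01, 0.2])` keeps a SNV that is rare in one superpopulation but common in another, whereas the same variant in a single-study run with `[0.01]` would be dropped.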

Because 317,092 variants in the WGS data were not present in our backbone, genotypes were phased and imputed to enable PRS validation using the vader workflow. The 1000G reference panel (v5a), lifted over from GRCh37 to GRCh38 using CrossMap, was used for both phasing and imputation.

Using consented data only

This dataset contains information on a subset of participants who may have been withdrawn from research since the first release. Use of their data in any new analyses is not permitted. It is therefore extremely important to remove these samples from your analyses and to ensure that you use only samples included in the latest data release.

To facilitate filtering for consent, the genomicsplc_prs_values LabKey table can be cross-referenced with the participant table of the latest data release to identify consented participants.
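A minimal sketch of that filtering step, assuming both tables have already been loaded as lists of dictionaries and share an identifier column. The column name `participant_id` and the function name are assumptions, as is the premise that the participant table of the latest release lists only participants who remain consented.

```python
def keep_consented(prs_rows, participant_rows, id_column="participant_id"):
    """Keep only PRS rows whose participant appears in the participant table.

    Returns the retained rows and the number of rows that were dropped.
    id_column is an assumed column name; adjust it to match the real tables.
    """
    consented_ids = {row[id_column] for row in participant_rows}
    kept = [row for row in prs_rows if row[id_column] in consented_ids]
    removed = len(prs_rows) - len(kept)
    return kept, removed
```

Reporting the number of removed rows makes it easy to record how many withdrawn participants were excluded from an analysis.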

The publication related to this study (Selzam et al. 2022) is not yet available. We will update the page once it is published.

Help and support

Please reach out via the Genomics England Service Desk for any queries concerning the PRS values. We will be able to relay these questions to our colleagues at Genomics PLC or answer these ourselves depending on the type of query.