Data available for the COVID-19 aggregation

aggCOVID_v4.2_aggV2 data is provided as a private s3 bucket that can only be accessed through the Genomics England Cloud Research Environment.

The s3 bucket folder containing the data is found under: GEL data resources > aggregations > covid-19 > aggCOVID_v4.2_aggV2

And the s3 path is: s3://512426816668-gel-data-resources/aggregations/covid-19/aggCOVID_v4.2_aggV2/

Resource Files

Genomic aggregate data

Genomic data for joint aggregate aggCOVID_v4.2_aggV2 is provided in bgen and pgen format within folder genomic/:

For bgen format we provide the data split into 1348 chunks and also in 23 chromosomes:

bgen/bgen_bychunk/aggCOVID_v4.2_aggV2_${chunk}_finalmerge_nodups.{bgen,bgen.bgi,sample}
bgen/bgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups.{bgen,bgen.bgi,sample}

For pgen format we provide the data split in 23 chromosomes:

pgen/pgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups.{pgen,pvar,psam}

The data split by chunk or split by chromosome contain identical information. The splitting is done to facilitate different types of analyses that may want to optimise maximum parallelism (e.g., GWAS using the by-chunk data) or analyses that require intact whole chromosome data (e.g., to calculate LD).
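As an illustration of the naming scheme above, the following minimal Python sketch expands the ${chr} pattern into concrete by-chromosome file names. The helper name genomic_files is hypothetical, the chromosome labels (chr1-chr22, chrX) are taken from the summary of folders at the end of this page, and the resulting relative paths should be checked against the bucket before use.

```python
# Hypothetical helper: expand the ${chr} naming pattern into file names.
PREFIX = "aggCOVID_v4.2_aggV2"

# 22 autosomes plus chrX gives the 23 chromosomes mentioned above.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX"]

# File extensions listed for each format in this section.
EXTENSIONS = {
    "bgen": ("bgen", "bgen.bgi", "sample"),
    "pgen": ("pgen", "pvar", "psam"),
}


def genomic_files(fmt, chrom):
    """Return the relative paths of the by-chromosome files for one chromosome."""
    if fmt not in EXTENSIONS:
        raise ValueError(f"unknown format: {fmt}")
    if chrom not in CHROMOSOMES:
        raise ValueError(f"unknown chromosome: {chrom}")
    folder = f"{fmt}/{fmt}_mergedbychr"
    stem = f"{PREFIX}_{chrom}_finalmerge_nodups"
    return [f"{folder}/{stem}.{ext}" for ext in EXTENSIONS[fmt]]


for path in genomic_files("pgen", "chr21"):
    print(path)
```

The by-chunk files follow the same pattern with ${chunk} in place of ${chr}; the chunk labels are not listed on this page, so they are not generated here.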

High-quality (HQ) independent SNPs

Genomic data for high-quality independent SNPs is provided in plink format for each aggregate separately within folder HQ_SNPs/:

aggCOVID_v4.2_aggV2_HQSNPs_common.{bed,fam,bim}

These files were used for the calculation of relatedness, PCs and ancestry and also for generating genetic relatedness matrices for the logistic mixed model methods from SAIGE.

Relatedness/Kinship

We provide lists with unrelated sets of individuals within directory relatedness/:

aggCOVID_v4.2_aggV2_unrelatedset.id
aggCOVID_v4.2_aggV2_relatedset.id
aggCOVID_v4.2_aggV2_optim_unrelatedset.id

Files without prefix optim do not take into account COVID-19 status when creating related and unrelated sets of individuals.

The file with prefix optim was used in the primary GWAS analysis. From related pairs that included a 100K participant, it preferentially retained the individual with severe or mild COVID-19.

Genetic Relatedness Matrices (GRM)

We provide dense and sparse GRMs calculated with GCTA within directory GRM/GCTA/:

aggCOVID_v4.2_aggV2_HQSNPs_common.grm.{id, bin, N.bin}
aggCOVID_v4.2_aggV2_HQSNPs_common_sparse0.05.grm.{id, sp}

Ancestry probabilities

We provide ancestry probabilities and the assigned ancestry for each participant within folder ancestry/ in file:

aggCOVID_v4.2_aggV2_predanc.txt

The column format is the following:

Ancestry file header
platekey AFR AMR EAS EUR SAS ancestry

Ancestry file header info
platekey: platekey ID of participant.
AFR, AMR, EAS, EUR, SAS: probability of assignment to a super-population ancestry (row-wise sums to 1).
ancestry: assigned ancestry (P >=0.8).
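The column description above can be checked with a short Python sketch that reads the file, verifies that each row of probabilities sums to 1, and applies the P >= 0.8 threshold. This is a sketch under assumptions: the file is assumed to be whitespace-delimited, the function name read_ancestry is hypothetical, and because this page does not state which label is used when no super-population reaches the threshold, the sketch returns None in that case. The example rows are made up.

```python
import io

POPS = ["AFR", "AMR", "EAS", "EUR", "SAS"]


def read_ancestry(handle, threshold=0.8):
    """Yield (platekey, probabilities, called_pop_or_None) for each row."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        row = dict(zip(header, fields))
        probs = {pop: float(row[pop]) for pop in POPS}
        # Row-wise probabilities should sum to 1 (allowing for rounding).
        if abs(sum(probs.values()) - 1.0) > 1e-3:
            raise ValueError(f"probabilities do not sum to 1 for {row['platekey']}")
        best = max(probs, key=probs.get)
        called = best if probs[best] >= threshold else None
        yield row["platekey"], probs, called


# Made-up rows for illustration; "unassigned" is a placeholder label.
example = io.StringIO(
    "platekey AFR AMR EAS EUR SAS ancestry\n"
    "SAMPLE_A 0.01 0.01 0.01 0.95 0.02 EUR\n"
    "SAMPLE_B 0.40 0.10 0.05 0.40 0.05 unassigned\n"
)
calls = {key: called for key, _, called in read_ancestry(example)}
print(calls)
```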

Principal components (PCs)

PCA is always performed on unrelated participants and then a projection of the rest of the individuals is calculated.

We provide PCs for all individuals of aggCOVID_v4.2_aggV2 and also population-specific PCs for individuals with assigned ancestry ${pop}=AFR, AMR, EUR, EAS, SAS

Files have suffix proj.eigenvec, are located in folder PCA/ and are named as follows: aggCOVID_v4.2_aggV2_${pop}.proj.eigenvec

GWAS

GWAS phenofiles

We provide the phenotype files with covariates used for the main GWAS analyses for pop=AFR, EAS, EUR, SAS within folder GWAS/phenofiles/:

aggCOVID_v4.2_aggV2_anc${pop}_sev_vs_mld_aggV2_unreloptim.pheno

The columns inside these files are:

GWAS phenofile columns
FID IID sev_vs_mld_aggV2 concordant_sex age age_sq age_sex ancestry cohort pc1_${pop} pc2_${pop} pc3_${pop} pc4_${pop} pc5_${pop} pc6_${pop} pc7_${pop} pc8_${pop} pc9_${pop} pc10_${pop} pc11_${pop} pc12_${pop} pc13_${pop} pc14_${pop} pc15_${pop} pc16_${pop} pc17_${pop} pc18_${pop} pc19_${pop} pc20_${pop}

Note that variable ${pop} = AFR or EAS or EUR or SAS
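Because the PC column names depend on ${pop}, it can help to generate the expected header programmatically, for example to validate a phenofile or to build a covariate list. The sketch below does this in Python; the function name phenofile_columns is hypothetical, and the column order simply follows the listing above.

```python
# Populations for which phenofiles are provided.
GWAS_POPS = ("AFR", "EAS", "EUR", "SAS")

# Columns that do not depend on the population.
FIXED_COLUMNS = [
    "FID", "IID", "sev_vs_mld_aggV2", "concordant_sex",
    "age", "age_sq", "age_sex", "ancestry", "cohort",
]


def phenofile_columns(pop, n_pcs=20):
    """Return the expected phenofile header for one population."""
    if pop not in GWAS_POPS:
        raise ValueError(f"no phenofile is listed for population {pop}")
    pcs = [f"pc{i}_{pop}" for i in range(1, n_pcs + 1)]
    return FIXED_COLUMNS + pcs


columns = phenofile_columns("EUR")
print(" ".join(columns))
```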

Control-control AF filter

We provide an allele frequency comparison for a smaller set of samples that were processed with both Genomics England pipeline 2.0 and Illumina NSV4, across all variants that segregate in both datasets.

We provide files by ancestry ${pop}=AFR, EAS, EUR, SAS within folder GWAS/control_control_AF_filter/:

aggCOVID_v4.2_aggV2_AFfilter_gel2_nsv4_${pop}.txt

control-control AF comparison header
varID AF_gel2 N_gel2 AF_nsv4 N_nsv4 AFreldiff
control-control AF comparison header info
varID: variant ID with format CHR:POS_REF_ALT
AF_gel2: Allele frequency in sub-sample processed with Genomics England pipeline 2.0
N_gel2: Sample size in sub-sample processed with Genomics England pipeline 2.0
AF_nsv4: Allele frequency in sub-sample processed with Illumina NSV4
N_nsv4: Sample size in sub-sample processed with Illumina NSV4
AFreldiff: relative allele frequency difference between platforms

GWAS summaries

We provide the summaries for the main GWAS that was run in the paper using SAIGE. This analysis used the COVID-19 critically ill patients as cases and 100K participant cohort and COVID-19 mild cohort as controls.

Per-population summaries

GWAS summaries are provided as tab-separated txt files within folder GWAS/summaries/ with name:

aggCOVID_v4.2_aggV2_${pop}_sev_vs_mld_aggV2_unreloptim_bimulti_AFcontrolcontrol0.01_allvariants_nodups.txt

for pop=AFR, SAS, EAS, EUR.

The columns inside these files are:

GWAS summaries file header
CHR POS REF ALT varID rsid BETA SE p.value p.value.NA Tstat varT varTstar N N_Cases N_Controls AF AF_Cases AF_Controls nallele_type AFreldiff P_miss P_hwe
GWAS summaries file header info
CHR: chromosome
POS: genome position
REF: reference hg38 allele (allele 1)
ALT: alternate allele (allele 2)
varID: variant ID with format CHR:POS_REF_ALT  
rsid: variant rsid
BETA: effect size of ALT/allele 2
SE: standard error of BETA
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Tstat: score statistic of ALT allele
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
N: total sample size
N_Cases: sample size of cases
N_Controls: sample size of controls
AF: allele frequency of ALT/allele 2
AF_Cases: allele frequency of ALT/allele 2 in cases
AF_Controls: allele frequency of ALT/allele 2 in controls
nallele_type: biallelic or multiallelic
AFreldiff: control-control relative allele frequency difference
P_miss: mid-P value from plink1.9 for differential missingness between cases and controls
P_hwe: mid-P Hardy-Weinberg equilibrium value from plink1.9 for unrelated controls
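As a hedged illustration of working with these columns, the Python sketch below parses the varID format and keeps rows whose p.value is below a threshold. The function names are hypothetical, the 5e-8 default is the conventional genome-wide significance level rather than a value taken from this page, and the example rows are made up (including the use of a chr prefix in CHR, which this page does not confirm).

```python
import csv
import io


def parse_var_id(var_id):
    """Split a CHR:POS_REF_ALT identifier into (chrom, pos, ref, alt)."""
    chrom, rest = var_id.split(":", 1)
    pos, ref, alt = rest.split("_")
    return chrom, int(pos), ref, alt


def significant_variants(handle, threshold=5e-8):
    """Return rows of a tab-separated summary file with p.value below threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["p.value"]) < threshold]


# Made-up rows with a subset of the columns, for illustration only.
example = io.StringIO(
    "CHR\tPOS\tREF\tALT\tvarID\tBETA\tSE\tp.value\n"
    "chr1\t12345\tA\tG\tchr1:12345_A_G\t0.30\t0.05\t2e-9\n"
    "chr2\t67890\tC\tT\tchr2:67890_C_T\t0.02\t0.04\t0.6\n"
)
hits = significant_variants(example)
print([parse_var_id(row["varID"]) for row in hits])
```

The sketch assumes every p.value entry is numeric; if the real files contain missing-value markers, those rows would need to be skipped before the conversion to float.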
Meta-analysis summaries

GWAS summary results from the METAL meta-analysis of the EUR, SAS, AFR and EAS summaries are provided within folder GWAS/summaries/ with name:

gwas.meta.genomicc.final.txt

GWAS meta-analysis summaries file header

CHR ID POS REF ALT BETA SE p.value N

GWAS meta-analysis summaries file header info
CHR: chromosome
ID: variant ID with format CHR:POS_REF_ALT
POS: genome position
REF: reference hg38 allele (allele 1)
ALT: alternate allele (allele 2)
BETA: meta-analysis effect size of ALT/allele 2 from METAL
SE: meta-analysis standard error of BETA from METAL
p.value: meta-analysis p value from METAL
N: total sample size

GWAS genomic data subset

We provide a subset of genomic data split by genetically inferred ancestry only for variants and individuals (post sample-QC, unrelated) included in the per-ancestry GWAS analysis.

Useful for downstream analyses that couple genomic data with GWAS summaries such as fine-mapping.

Files are provided in plink and vcf format for the population cohorts for which we ran GWAS (AFR, EAS, EUR, SAS) within folder GWAS/genomic_data_subset/:

plink/merged_bychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups_${pop}.{bed,fam,bim}
plink/whole_genome_merged/aggCOVID_v4.2_aggV2_finalmerge_nodups_${pop}.{bed,fam,bim}
vcf/merged_bychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups_${pop}.{vcf.gz,vcf.gz.tbi}

Finemapping

We provide susieR fine-mapping results for 17 regions of 3 Mb each in EUR and one region in SAS.

Summary files are provided as text and as R objects and we also provide .png figures with output from susieR within folder GWAS/finemapping/susieR_results/:

aggCOVID_v4.2_aggV2_EUR_sev_vs_mld_aggV2_${sentinel_variant}.{txt,Rdata,png}
aggCOVID_v4.2_aggV2_SAS_sev_vs_mld_aggV2_${sentinel_variant}.{txt,Rdata,png}
Finemapping column headers

CHR POS REF ALT varID BETA SE p.value AF.Cases AF.Controls nallele_type focal CS susie_PIP purity_min purity_median CS_coverage

Finemapping column headers info
CHR: chromosome
POS: genome position
REF: reference hg38 allele (allele 1)
ALT: alternate allele (allele 2)
varID: variant ID with format CHR:POS_REF_ALT  
rsid: variant rsid
BETA: effect size of ALT/allele 2
SE: standard error of BETA
p.value: p value (with SPA applied for binary traits)
focal: focal/sentinel variant for fine-mapping. This is the top variant in the window examined. ID for focal variant is provided as CHR:POS:REF:ALT
CS: Index for identified credible set in region, can be L1, L2, L3, ...
susie_PIP: Posterior inclusion probability for variant.
purity_min: Minimum absolute correlation coefficient between variants in credible set.
purity_median: Median absolute correlation coefficient between variants in credible set.
CS_coverage: Posterior inclusion probability for credible set.
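Two details of these columns are easy to trip over: the focal ID uses colons throughout (CHR:POS:REF:ALT) while varID uses CHR:POS_REF_ALT, and variants are grouped into credible sets by the CS index. The Python sketch below shows one way to handle both. The function names are hypothetical, rows are passed in as dictionaries so that no file delimiter is assumed, and the values treated as "not in any credible set" (an empty string or NA) are an assumption, as this page does not say how such variants are marked. The example rows are made up.

```python
from collections import defaultdict


def to_focal_id(var_id):
    """Rewrite CHR:POS_REF_ALT as CHR:POS:REF:ALT, the form used in 'focal'."""
    return var_id.replace("_", ":")


def credible_sets(rows, missing=("", "NA")):
    """Group variants by credible set index, highest PIP first."""
    grouped = defaultdict(list)
    for row in rows:
        if row["CS"] in missing:  # assumed marker for "not in a credible set"
            continue
        grouped[row["CS"]].append((row["varID"], float(row["susie_PIP"])))
    return {
        cs: sorted(members, key=lambda member: member[1], reverse=True)
        for cs, members in grouped.items()
    }


# Made-up rows for illustration only.
rows = [
    {"varID": "chr1:100_A_G", "CS": "L1", "susie_PIP": "0.35"},
    {"varID": "chr1:200_C_T", "CS": "L1", "susie_PIP": "0.60"},
    {"varID": "chr1:300_G_A", "CS": "NA", "susie_PIP": "0.01"},
]
sets = credible_sets(rows)
print(sets)
print(to_focal_id("chr1:100_A_G"))
```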

Post-GWAS analyses

TWAS summaries

Per-tissue TWAS

We provide TWAS results using GTEX v8 data for lung and whole blood within folder TWAS/summaries/.

Files:

twas_Lung.csv
twas_Whole_Blood.csv
TWAS column headers

gene,gene_name,zscore,effect_size,pvalue,var_g,pred_perf_r2,pred_perf_pval,pred_perf_qval,n_snps_used,n_snps_in_cov,n_snps_in_model

TWAS column headers info
gene: a gene's ID, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's ID for splicing model releases.
gene_name: gene name as listed by the Transcriptome Model, typically HUGO for a gene. It can also be an intron's id.
zscore: S-PrediXcan's association result for the gene.
effect_size: S-PrediXcan's association effect size for the gene. Can only be computed when beta from the GWAS is used.
pvalue: P-value of the aforementioned statistic.
var_g: variance of the gene expression, calculated as W' * G * W (where W is the vector of SNP weights in a gene's model, W' is its transpose, and G is the covariance matrix)
pred_perf_r2: (cross-validated) R2 of tissue model's correlation to gene's measured transcriptome (prediction performance). Not all model families have this (e.g. MASHR).
pred_perf_pval: pval of tissue model's correlation to gene's measured transcriptome (prediction performance). Not all model families have this (e.g. MASHR).
pred_perf_qval: qval of tissue model's correlation to gene's measured transcriptome (prediction performance). Not all model families have this (e.g. MASHR).
n_snps_used: number of snps from GWAS that got used in S-PrediXcan analysis
n_snps_in_cov: number of snps in the covariance matrix
n_snps_in_model: number of snps in the model
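The var_g definition above, W' * G * W, is a quadratic form. A small worked example in Python may make it concrete; the function name var_g and the two-SNP weights and covariance values below are made up for illustration and do not come from any real gene model.

```python
def var_g(weights, cov):
    """Compute W' * G * W for a weight vector W and covariance matrix G."""
    n = len(weights)
    if len(cov) != n or any(len(row) != n for row in cov):
        raise ValueError("covariance matrix must be n x n for n weights")
    return sum(
        weights[i] * cov[i][j] * weights[j]
        for i in range(n)
        for j in range(n)
    )


# Made-up model with two SNPs.
w = [0.5, -0.2]
g = [
    [0.25, 0.05],
    [0.05, 0.16],
]
# 0.5*0.25*0.5 + 2*(0.5*0.05*-0.2) + (-0.2)*0.16*(-0.2)
# = 0.0625 - 0.01 + 0.0064
print(var_g(w, g))
```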
Metatwas

We also provide TWAS results using GTEX v8 data with meta-analysis across all tissues.

Analysis is done per gene, with results within folder TWAS/summaries/ in file: metatwas.csv

And per intron with results in file: metatwas_sqtl_genomicc.csv

Both files have the following columns:

gene,gene_name,pvalue,n,n_indep,p_i_best,t_i_best,p_i_worst,t_i_worst,eigen_max,eigen_min,eigen_min_kept,z_min,z_max,z_mean,z_sd,tmi,status

Metatwas column headers info
gene: a gene's ID, as listed in the Tissue Transcriptome model. Ensembl ID for most gene model releases. Can also be an intron's ID for splicing model releases.
gene_name: gene name as listed by the Transcriptome Model, typically HUGO for a gene. It can also be an intron's id.
pvalue: significance p-value of S-MultiXcan association
n: number of "tissues" available for this gene
n_indep: number of independent components of variation kept among the tissues' predictions. (Synthetic independent tissues)
p_i_best: best p-value of single-tissue S-PrediXcan association.
t_i_best: name of best single-tissue S-PrediXcan association.
p_i_worst: worst p-value of single-tissue S-PrediXcan association.
t_i_worst: name of worst single-tissue S-PrediXcan association.
eigen_max: In the SVD decomposition of predicted expression correlation: eigenvalue (variance explained) of the top independent component
eigen_min: In the SVD decomposition of predicted expression correlation: eigenvalue (variance explained) of the last independent component
eigen_min_kept: In the SVD decomposition of predicted expression correlation: eigenvalue (variance explained) of the smallest independent component that was kept.
z_min: minimum z-score among single-tissue S-PrediXcan associations.
z_max: maximum z-score among single-tissue S-PrediXcan associations.
z_mean: mean z-score among single-tissue S-PrediXcan associations.
z_sd: standard deviation of the mean z-score among single-tissue S-PrediXcan associations.
tmi: trace of T * T', where T is correlation of predicted expression levels for different tissues multiplied by its SVD pseudo-inverse. It is an estimate for number of independent components of variation in predicted expression across tissues (typically close to n_indep)
status: If there was any error in the computation, it is stated here

Coloc summaries

We provide colocalisation summaries for GTEX v8 for lung and blood tissues and for eqtlgen blood data.

Files are within folder TWAS/coloc/:

coloc_eQTLGen.txt
coloc_gtexv8_eqtl_lung.txt
coloc_gtexv8_eqtl_whole_blood.txt
Colocalisation summaries column header

gene.tested ensembl.id PP.H3.5e-5 PP.H4.5e.5 PP.H3.1e-5 PP.H4.1e-5 colocalisation

HLA resources

HLA resources are provided within folder HLA/.

HIBAG HLA allele calls

We provide the HIBAG HLA allele call probabilities in a tab-separated file:

hibag_tsv_all_aggCOVID_v4.2_aggV2.tsv

We also provide an indexed VCF file in which a call is made at probability ≥0.5:

hibag_vcf_all_aggCOVID_v4.2_aggV2_probthres_0.5.vcf.gz
hibag_vcf_all_aggCOVID_v4.2_aggV2_probthres_0.5.vcf.gz.csi

HIBAG COVID-19 association summaries

We provide the association summaries using SAIGE for the HLA haplotypes for the main analysis of the paper.

This analysis used the COVID-19 critically ill patients as cases and 100K participant cohort + COVID-19 mild cohort as controls.

aggCOVID_v4.2_aggV2_sev_vs_mld_aggV2_EUR_unrelated_masked_manual_hibag_probthres_0.5_rerun.SAIGE.gwas.txt

The file has the following header columns:

HIBAG association summaries header

CHR POS SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls N.Cases N.Controls homN_Allele2_cases hetN_Allele2_cases homN_Allele2_ctrls hetN_Allele2_ctrls

HIBAG association summaries info
CHR: chromosome
POS: genome position
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
imputationInfo: imputation info. If not in dosage/genotype input file, will output 1
N: sample size
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)
varT: estimated variance of score statistic with sample relatedness incorporated
varTstar: variance of score statistic without sample relatedness incorporated
AF.Cases: allele frequency of allele 2 in cases
AF.Controls: allele frequency of allele 2 in controls
N.Cases: sample size of cases
N.Controls: sample size of controls
homN_Allele2_cases: counts of allele 2 homozygotes in cases
hetN_Allele2_cases: counts of allele 2 heterozygotes in cases
homN_Allele2_ctrls: counts of allele 2 homozygotes in controls
hetN_Allele2_ctrls: counts of allele 2 heterozygotes in controls

Summary of folders

Subfolder Content summary Subfolder/file Stats
genomic/biallelic Genomic data for all masked bi-allelic variants of the joint aggregate.
${chr}=chr1-chr22, chrX
bgen/bgen_bychunk/aggCOVID_v4.2_aggV2_${chunk}_finalmerge_nodups.{bgen,bgen.bgi,sample}
bgen/bgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups.{bgen,bgen.bgi,sample}
pgen/pgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups.{pgen,pvar,psam}
N_variants= 495,784,356
N_samples= 86,846
genomic/multiallelic/ Genomic data for masked multiallelic variants with MAF > 0.1% in both aggV2 and aggCOVID_v4.2.
${chr}=chr1-chr22, chrX
bgen/bgen_bychunk/aggCOVID_v4.2_aggV2_${chunk}_intersect_multiallelics.{bgen,bgen.bgi,sample}
bgen/bgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_intersect_multiallelics.{bgen,bgen.bgi,sample}
pgen/pgen_mergedbychr/aggCOVID_v4.2_aggV2_${chr}_intersect_multiallelics.{pgen,pvar,psam}
N_variants= 6,036,414
N_samples= 86,846
HQ_SNPs/ Genomic data for high-quality independent SNPs aggCOVID_v4.2_aggV2_HQSNPs_common.{bed,fam,bim} N_variants= 58,925
N_samples=86,846
GRM/GCTA/ Dense and sparse genetic relatedness matrices (GRM) generated with GCTA aggCOVID_v4.2_aggV2_HQSNPs_common.grm.{bin, N.bin ,id}
aggCOVID_v4.2_aggV2_HQSNPs_common_sparse0.05.grm.{sp, id}
N_samples=86,846
relatedness/ Unrelated sets of individuals.
Files without prefix "optim" do not take into account COVID-19 status when creating related and unrelated sets of individuals.
File with prefix "optim" was used in the primary GWAS analysis and preferentially retained individuals that were COVID-19 severe or mild from related pairs with 100K participants.
aggCOVID_v4.2_aggV2_unrelatedset.id
aggCOVID_v4.2_aggV2_relatedset.id
aggCOVID_v4.2_aggV2_optim_unrelatedset.id
N_unrel=65,060
N_rel=21,786
N_optim_unrel=65,025
PCA/ Principal components calculated per genetically-inferred ancestry.
${pop}=ALL, AFR, AMR, EAS, EUR, SAS
aggCOVID_v4.2_aggV2_${pop}.proj.eigenvec N_ALL=86,846
N_AFR=2,484
N_AMR=312
N_EAS=827
N_EUR=68,559
N_SAS=8,109
ancestry/ Ancestry probabilities and assigned ancestry for participants aggCOVID_v4.2_aggV2_predanc.txt N_samples= 86,846
GWAS/phenofiles/ Phenotype files with covariates used for the main GWAS analyses.
${pop}=AFR, EAS, EUR, SAS
aggCOVID_v4.2_aggV2_anc${pop}_sev_vs_mld_aggV2_unreloptim.pheno AFR: N_case=440; N_control=1,350
EAS: N_case=274; N_control=366
EUR: N_case=5,989; N_control=42,891
SAS: N_case=788; N_control=3,793
GWAS/control_control_AF_filter/ Allele frequency comparison for a smaller set of samples that were processed with both Genomics England pipeline 2.0 and Illumina NSV4 across all variants that segregate in both datasets.
${pop}=AFR, EAS, EUR, SAS
aggCOVID_v4.2_aggV2_AFfilter_gel2_nsv4_${pop}.txt AFR: N_samples=354; N_var=33,232,684
EAS: N_samples=81; N_var=11,682,778
EUR: N_samples=3,157;N_var=55,187,051
SAS: N_samples=373; N_var=24,913,384
GWAS/summaries/ GWAS summary statistics from analysis using SAIGE and the meta-analysed statistics using METAL.
${pop}=AFR, EAS, EUR, SAS
aggCOVID_v4.2_aggV2_${pop}_sev_vs_mld_aggV2_unreloptim_bimulti_AFcontrolcontrol0.01_allvariants_nodups.txt
gwas.meta.genomicc.final.txt
META: Nvar=8,121,396
AFR: Nvar=15,012,409
EAS: Nvar=6,000,811
EUR: Nvar=8,121,457
SAS: Nvar=9,092,222
GWAS/genomic_data_subset/ Subset of genomic data split by genetically inferred ancestry only for variants and individuals (post sample-QC, unrelated) included in the per-ancestry GWAS analysis. Useful for downstream analyses that couple genomic data with GWAS summaries such as fine-mapping.
${pop}=AFR, EAS, EUR, SAS
${chr}=chr1-chr22, chrX
plink/merged_bychr/aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups_${pop}.{bed,fam,bim}
plink/whole_genome_merged/aggCOVID_v4.2_aggV2_finalmerge_nodups_${pop}.{bed,fam,bim}
vcf/merged_bychr
aggCOVID_v4.2_aggV2_${chr}_finalmerge_nodups_${pop}.{vcf.gz,vcf.gz.tbi}
N_AFR=1,790
N_EAS=640
N_EUR=48,880
N_SAS=4,581
GWAS/finemapping/ Results from fine-mapping GWAS hits for EUR and SAS summaries as .txt file, R-objects and .png plots are provided susieR_results/EUR/aggCOVID_v4.2_aggV2_EUR_sev_vs_mld_aggV2_${sentinel_variant}.{txt,Rdata,png}
susieR_results/SAS/aggCOVID_v4.2_aggV2_SAS_sev_vs_mld_aggV2_${sentinel_variant}.{txt,Rdata,png}
17 fine-mapped regions for EUR and one for SAS
TWAS/summaries Results from TWAS analysis using GTEX_V8 pre-trained models for Lung, whole blood, meta-analysis using eQTL models across tissues, meta-analysis using sQTL models across tissues twas_Lung.csv
twas_Whole_Blood.csv
metatwas.csv
metatwas_sqtl_genomicc.txt
Nvar_lung=12,485
Nvar_Whole_blood=10,473
Nvar_metatwas=21,813
Nvar_metatwas_sqtl=117,610
TWAS/coloc Colocalisation results using GTEX_V8 lung, whole blood and eqtlgen data. coloc_gtexv8_eqtl_lung.txt
coloc_gtexv8_eqtl_whole_blood.txt
coloc_eQTLGen.txt
71 loci in all files
HLA/ HLA calls from HIBAG-HLA caller and HLA association summary statistics hibag_vcf_all_aggCOVID_v4.2_aggV2_probthres_0.5.vcf.gz
hibag_vcf_all_aggCOVID_v4.2_aggV2_probthres_0.5.vcf.gz.csi
hibag_tsv_all_aggCOVID_v4.2_aggV2.tsv
aggCOVID_v4.2_aggV2_sev_vs_mld_aggV2_EUR_unrelated_masked_manual_hibag_probthres_0.5_rerun.SAIGE.gwas.txt