AggV2 code book

This code book provides sample snippets to help you use aggV2 in your analyses. These include using bedtools to find the correct chunk file and using bcftools to query the aggregate files themselves.

This aggregate dataset contains data from a subset of participants who have since been withdrawn from research. Use of their data in any new analysis is not permitted. It is therefore extremely important to remove these samples from your analyses and ensure that you use only samples included in the latest data release.

The list of samples for consented participants can be found in the aggregate_gvcf_sample_stats table in LabKey for the latest data release.

For the main programme version 19 (31st October 2024) data release, the list of consented samples is detailed in the samples file located in the folder /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/docs/
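As a minimal sketch of applying this restriction with bcftools, the function below keeps only the samples listed in a consented-samples file. It assumes the file holds one sample ID per line. The function name and all three paths are placeholders for illustration, not names taken from the data release.

```shell
#!/usr/bin/env bash
# Sketch: keep only consented samples when reading an aggV2 chunk.
# subset_to_consented is a hypothetical helper; every path is supplied by the caller.

subset_to_consented() {
    local samples_file="$1"   # consented sample IDs, one per line (assumed format)
    local chunk_vcf="$2"      # an indexed aggV2 chunk
    local out_vcf="$3"        # where to write the subset

    # Stop early if the sample list is missing or empty.
    if [ ! -s "$samples_file" ]; then
        echo "Sample list not found or empty: $samples_file" >&2
        return 1
    fi

    # -S keeps only the listed samples; -Oz writes a bgzipped VCF.
    bcftools view -S "$samples_file" -Oz -o "$out_vcf" "$chunk_vcf"
}
```

After `module load bcftools/1.16`, a call might look like `subset_to_consented <samples file> <chunk file> consented_subset.vcf.gz`, where the first two arguments are the real paths for your data release.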

Overview

The code snippets assume that you are working in the HPC environment and that you submit jobs to the cluster. Please see About the HPC for more information.

Feedback and Requests

For any issues related to the aggV2 aggregation or companion datasets, please contact the Genomics England Service Desk and include "aggV2" in the title or description of your inquiry.

Applications

The majority of queries to aggV2 can be implemented using the applications below:

Application	Description
bcftools	A set of utilities that manipulate variant calls in the Variant Call Format (VCF). Use version 1.16 via `module load bcftools/1.16`
split-vep	A bcftools plug-in to parse VEP annotation (comes with bcftools version 1.10.2-GCC-8.3.0).
LabKey APIs	The LabKey client libraries (APIs) provide programmatic access to the clinical/phenotype data.
R / Python	For downstream processing.
bedtools	To intersect, merge, count, complement, and shuffle genomic intervals. Use version 2.31.0 via `module load bedtools/2.31.0`
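A short sketch of loading the two command-line tools in a job script, using the module names from the table above. The function name `load_aggv2_tools` is hypothetical, and the sketch assumes the `module` command is available in the shell that runs the job.

```shell
#!/usr/bin/env bash
# Sketch: load the command-line tools and report which versions are active.
# load_aggv2_tools is a hypothetical helper name.

load_aggv2_tools() {
    # Module names as listed in the applications table.
    module load bcftools/1.16 || return 1
    module load bedtools/2.31.0 || return 1

    # Print the versions that are on PATH after loading.
    bcftools --version | head -n 1
    bedtools --version
}
```

Calling `load_aggv2_tools` at the top of a script makes the tool versions visible in the job log, which helps when reproducing an analysis later.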

Test data

We provide test data comprising the header and the first 1,000 lines of several chunks, for all samples in aggV2. Index (".csi") files are also provided. The data are not synthetic, so they should be treated with the same considerations as other participant data.

Use these files to build and prototype scripts and workflows, as run times are much shorter than on full-size aggV2 chunks. The files can be found at:

/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/test_data
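A minimal sketch for prototyping against the test data with bcftools. The helper names `preview_chunk` and `count_samples` are hypothetical, and the chunk file is passed as an argument because the file names in the folder are not listed here.

```shell
#!/usr/bin/env bash
# Sketch: quick checks on a test chunk before running on full-size data.
# preview_chunk and count_samples are hypothetical helper names.

TEST_DATA_DIR=/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/test_data

# Print chromosome, position, REF and ALT for the first n records (default 5).
preview_chunk() {
    local chunk="$1"
    local n="${2:-5}"
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' "$chunk" | head -n "$n"
}

# Count the samples named in the chunk header.
count_samples() {
    bcftools query -l "$1" | wc -l
}
```

For example, `ls "$TEST_DATA_DIR"` shows the available chunks, and `preview_chunk "$TEST_DATA_DIR/<chunk file>" 10` prints the first ten sites of the chosen one.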

Code book structure

We have divided the Code Book into the following sections: