
AggV3 code book - identifying the correct subshard¶

For any query using AggV3, you must first identify the correct subshard for your genomic region of interest.

There are two ways to identify the relevant subshards for your analysis:

Use the shard lookup tool to pull out the shards for a given locus.
Query the shard BED files with bedtools.

Shard BED files¶

We provide shard BED files for different purposes, each listing the subshard names and the full paths to the corresponding data files. You can find these at:

Multiallelic genotype VCFs, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/genomic_data/multiallelic_shards.bed
Biallelic genotype VCFs, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/genomic_data/biallelic_shards.bed
Biallelic genotype PGEN files, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/genomic_data/pgen_shards.bed
Aggregation sites VCFs, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/genomic_data/sites_shards.bed
Functional annotation VCFs, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/functional_annotation/2025-12-24/functional_annotation_shards.bed
Quality control VCFs, s3://512426816668-gel-data-resources/dragen3.7.8/AggV3_resources/manifests/site_qc/2026-01-06/siteqc_shards.bed

Each shard BED file contains one line for each of the 3,166 subshards. The exact fields depend on the BED file you're using:

Multiallelic

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the multiallelic vcf	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/data/shard-msvcf/shard-1/subshard-1/dragen.vcf.gz`
full path to the multiallelic vcf index	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/data/shard-msvcf/shard-1/subshard-1/dragen.vcf.gz.tbi`

Biallelic

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the biallelic vcf	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/vcf/dragen.vcf.gz`
full path to the biallelic vcf index	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/vcf/dragen.vcf.gz.tbi`

PGEN files

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the PGEN file	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/pgen/dragen.pgen`
full path to the PVAR file	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/pgen/dragen.pvar`
full path to the PSAM file	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/pgen/dragen.psam`

Sites VCFs

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the vcf	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/vcf/dragen_sites.vcf.gz`
full path to the vcf index	`s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/shard-1/subshard-1/postproc/vcf/dragen_sites.vcf.gz.tbi`

Functional annotation

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the biallelic functional annotation vcf	`s3://357851407625-germline-aggregate-v3-supporting-data/functional-annotation_2025-12-24/shard-1/subshard-1/dragen.gel.annotated.vcf.gz`
full path to the biallelic functional annotation vcf index	`s3://357851407625-germline-aggregate-v3-supporting-data/functional-annotation_2025-12-24/shard-1/subshard-1/dragen.gel.annotated.vcf.gz.tbi`

Quality control

Description	Example
chromosome	`chr1`
subshard start position, 0-based (as it appears in the Illumina files MINUS 1)	`10060`
subshard end position	`1111562`
chr:start-end	`chr1:10061-1111562`
shard	`1`
subshard	`1`
full path to the siteQC vcf	`s3://357851407625-germline-aggregate-v3-supporting-data/base-site-qc_2026-01-06/shard-1/subshard-1/dragen.gel.siteqc.vcf.gz`
full path to the siteQC vcf index	`s3://357851407625-germline-aggregate-v3-supporting-data/base-site-qc_2026-01-06/shard-1/subshard-1/dragen.gel.siteqc.vcf.gz.tbi`
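The relationship between the 0-based start column and the chr:start-end column in the tables above can be checked with a short awk sketch. It uses only the example values shown in the tables and assumes the BED line is tab-delimited with chromosome, start and end in the first three columns.

```shell
# Rebuild the 1-based region string from the 0-based BED start.
# Input values are the example row from the field tables.
region=$(printf 'chr1\t10060\t1111562\n' |
  awk 'BEGIN { FS = "\t" } { print $1 ":" ($2 + 1) "-" $3 }')

echo "$region"
```

Only the start is shifted by one; the end position is the same in both representations.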

To find the right subshard file, you will need to:

Create a BED file of your regions of interest.
Intersect your BED file against the shard BED file.

Create your own BED file¶

First, create a regions file of your genes, variants or regions of interest. This must be a three-column tab-delimited file of chromosome, start, and stop (with an optional fourth column holding an identifier, e.g. a gene name). The file should have the .bed extension. There is no limit to how many lines you can have in this file.

Please pre-sort your data by chromosome and then by start position (sort -k1,1 -k2,2n in.bed > in.sorted.bed).

Example:

chr2    213005363   213151603   IKZF2
chr7    50304716    50405101    IKZF1

You can create this file within a CloudOS interactive session, or create it elsewhere and upload it to CloudOS.
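As a minimal sketch, the example file above could be written and sorted from the command line as follows. The filenames my_regions.bed and my_regions.sorted.bed are illustrative choices, not names required by any tool.

```shell
# Write the example regions with printf so the separators are real tab characters.
printf 'chr7\t50304716\t50405101\tIKZF1\n'  > my_regions.bed
printf 'chr2\t213005363\t213151603\tIKZF2\n' >> my_regions.bed

# Sort by chromosome, then numerically by start position.
sort -k1,1 -k2,2n my_regions.bed > my_regions.sorted.bed

cat my_regions.sorted.bed
```

Using printf avoids a common problem where an editor silently replaces tabs with spaces, which makes the file unreadable as BED.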

Intersect the two files¶

Now you can intersect your BED file with the shard BED file, either in an interactive session or as a bash script.

Interactive session

Open an interactive session and mount the BED file to the session.
Open the command line interface.
Load bedtools in the command line interface. The easiest way to do this is using conda:

conda install bedtools
Run bedtools intersect:

bedtools intersect -wo -a my_regions.bed -b mounted-data-readonly/shard_manifest.bed

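This will print tab-delimited output with one line for each overlap between a region and a subshard. In most cases that is one line per region in your regions file, but a region that spans a subshard boundary produces one line per overlapping subshard. Each line contains the columns from your BED file, then the columns from the shard BED file, then a final column added by -wo giving the number of overlapping bases.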
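The file paths can then be pulled out of the intersect output by column position. The sketch below is illustrative only: it mocks a single output line so that it runs without bedtools, combining a hypothetical region (regionA) with the example multiallelic row from the field table, and it assumes a four-column regions file. With a three-column regions file, every column number shifts down by one.

```shell
# Mock one line of `bedtools intersect -wo` output (illustrative, not real results).
# Columns 1-4: hypothetical region; columns 5-12: example multiallelic shard row;
# column 13: overlap in bases, as appended by -wo.
vcf='s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/data/shard-msvcf/shard-1/subshard-1/dragen.vcf.gz'
printf 'chr1\t100000\t200000\tregionA\tchr1\t10060\t1111562\tchr1:10061-1111562\t1\t1\t%s\t%s.tbi\t100000\n' \
  "$vcf" "$vcf" > intersect.tsv

# The VCF path is column 4 + 7 = 11 and its index is column 12.
# sort -u collapses regions that fall in the same subshard.
cut -f11,12 intersect.tsv | sort -u > subshard_files.tsv

cat subshard_files.tsv
```

In practice you would redirect the real bedtools output to intersect.tsv instead of creating it with printf.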

Mount the VCF(s) and index(es) to your interactive session¶

If you're working in an interactive session, you can now work with the subshard by mounting it to your interactive session. You have two options:

Mount only the subshard VCF and its index to your session. This loads more quickly, but may be laborious if you are querying multiple regions.
Mount the entire shard data folder to your session. This is more appropriate if you're querying multiple regions, but it takes longer to mount all the files.

Mounting multiple shard VCFs

All of the shard VCFs of the same type have the same filename (e.g. dragen.vcf.gz for the genotype VCFs). This means that if you mount multiple files directly, they will all appear in mounted-data-readonly or filesystems under the same filename, and the filesystem will not be able to differentiate between them. If you're using multiple shard VCFs, we recommend mounting the parent folders to avoid this.
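A short sketch of how the parent folder, and a distinguishing label, can be derived from a subshard VCF path. The path is the multiallelic example from the field table; the label format shard-N_subshard-N is a hypothetical naming choice for your own output files, and how a mounted folder appears inside the session is not shown here.

```shell
# Example subshard VCF path, taken from the multiallelic field table.
vcf='s3://357851407625-germline-aggregate-v3/data/euw2-dragen-igg-20250430075006-msvcf-version-1/data/shard-msvcf/shard-1/subshard-1/dragen.vcf.gz'

# dirname strips the shared filename, leaving the subshard folder to mount.
folder=$(dirname "$vcf")
echo "$folder"

# Build a label from the shard and subshard folder names (hypothetical convention)
# so that results from different subshards do not overwrite each other.
label="$(basename "$(dirname "$folder")")_$(basename "$folder")"
echo "$label"
```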

Bash script

Go to Batch analysis and select Run Pipeline.
Search for bedtools and select a bedtools container.

If you cannot find a bedtools container, select Import, then Bash and paste in the path to a bedtools container:
Select executable script and enter bedtools intersect -wo, followed by the parameters -a (your BED file) and -b (the relevant shard BED file).
Choose your project and run the analysis.