Querying aggregate VCF files to find participants by genotypes¶
Give us feedback on this tutorial
The aggregate VCFs, AggV2 for germline variants and somAgg for somatic variants, are aggregates of gVCF files. You can query these aggregates for genomic loci.
Identifying the correct chunk¶
As these aggregrates are split into 1371 chunks, all queries for loci must begin by identifying the correct chunk in AggV2 or somAgg.
There are BED files listing all the chunks, allowing you to use bedtools
to intersect against your own BED files. Create a BED file listing your genomic loci of interest, for example:
chr2 213006985 213006985 variant
You can then intersect against the AggV2 or somAgg chunk bed files:
module load bedtools/2.31.0
bedtools intersect -wo -a my_regions.bed -b /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/chunk_names/aggV2_chunk_names.bed | cut -f 1-4,10-11
module load bedtools/2.31.0
bedtools intersect -wo -a my_regions.bed -b /gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/chunk_names/somAgg_chunk_names.bed | cut -f 1-4,10-11
This will give you a tsv file listing the correct chunk to use, for example:
chr2 213006985 213006985 variant /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_211052166_213676386.vcf.gz /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP/gel_mainProgramme_aggV2_chr2_211052166_213676386_VEPannot.vcf.gz
chr2 213006985 213006985 variant /gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/genomic_data/somAgg_dr12_chr2_211052166_213676386.vcf.gz
Query the VCF¶
You can then use bcftools to query for your loci in AggV2 or somAgg.
module load bcftools/1.16
bcftools query -r chr2:213006985 \
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_211052166_213676386.vcf.gz > variant_genotypes.tsv
module load bcftools/1.16
bcftools query -r chr2:213006985 \
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/
/genomic_data/somAgg_dr12_chr2_211052166_213676386.vcf.gz > somatic_variant_genotypes.tsv
This will give an output like:
LP1234567-DNA_A01 chr2 213006985 G A . . PASS 0/0
LP2345678-DNA_B02 chr2 213006985 G A . . PASS 0/1
LP3456789-DNA_C03 chr2 213006985 G A . . PASS 1/1
LP4567890-DNA_D04 chr2 213006985 G A . . PASS 0/0
LP5678901-DNA_E05 chr2 213006985 G A . . PASS 0/1
LP6789012-DNA_F06 chr2 213006985 G A . . PASS 0/0
AggV2 was created in release 8 and completed in release 10, whereas somAgg was created in release 12. This means that neither contain participants who were added to the dataset after those releases and both include some participants who have since withdrawn consent. This means that you should always cross reference any results against the current release to ensure there are no participants in your results who are no longer consented.
Both aggregates only contain genomes that were aligned to GRCh38.