Querying aggregate VCF files to find participants by genotypes
Identifying the correct chunk
BED files listing all the chunks of each aggregate are provided, so you can use bedtools to intersect them with a BED file of your own. Create a BED file listing your genomic loci of interest, for example:
chr2 213006985 213006985 variant
You can then intersect this file against the AggV2 or somAgg chunk BED files:
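The intersect step could look something like the sketch below. It assumes bedtools is already available on your PATH. `variant.bed` and the chunk BED path are placeholders for illustration, not real file locations, and `find_chunks` is a hypothetical helper name. The columns in the output depend on the layout of the chunk BED file, so you may need to trim them with cut.

```shell
# Placeholder paths (assumptions): your loci file and the chunk BED file
# for AggV2 or somAgg. Replace both before running.
LOCI_BED="variant.bed"
CHUNK_BED="/path/to/chunk_names.bed"

# Hypothetical helper: list every chunk that overlaps each locus.
#   -wa  write the original locus line from file A
#   -wb  append the matching chunk line from file B
find_chunks() {
    bedtools intersect -wa -wb -a "$1" -b "$2"
}

# Example call, left commented out because the paths above are placeholders:
# find_chunks "$LOCI_BED" "$CHUNK_BED" > variant_chunks.tsv
```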
This will give you a TSV file listing the correct chunk to use, for example:
chr2 213006985 213006985 variant /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_211052166_213676386.vcf.gz /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP/gel_mainProgramme_aggV2_chr2_211052166_213676386_VEPannot.vcf.gz
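If you want to pass the chunk paths on to later commands without copying them by hand, one option is to read them out of the TSV. The sketch below recreates the example row so it runs on its own. The file name `variant_chunks.tsv` is an assumption, as is the column order (genotype VCF in column 5, annotation VCF in column 6), which is taken from the example row.

```shell
# Recreate the example row as a tab-separated file so the snippet is self-contained.
# variant_chunks.tsv is a hypothetical name for the bedtools output.
chunks_tsv="variant_chunks.tsv"
agg_dir="/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2"
chunk="gel_mainProgramme_aggV2_chr2_211052166_213676386"
printf 'chr2\t213006985\t213006985\tvariant\t%s\t%s\n' \
    "${agg_dir}/genomic_data/${chunk}.vcf.gz" \
    "${agg_dir}/functional_annotation/VEP/${chunk}_VEPannot.vcf.gz" > "$chunks_tsv"

# Column 5 holds the genotype VCF and column 6 the VEP-annotated VCF.
genomic_vcf=$(cut -f 5 "$chunks_tsv" | head -n 1)
annotation_vcf=$(cut -f 6 "$chunks_tsv" | head -n 1)

# Build the region string for `bcftools query -r` from columns 1 and 2.
region=$(awk -F '\t' 'NR == 1 { print $1 ":" $2 }' "$chunks_tsv")

echo "region: $region"
echo "genomic VCF: $genomic_vcf"
echo "annotation VCF: $annotation_vcf"
```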
Query the VCF
Use bcftools query on the chunk you identified, restricting the query to your position with -r and printing one line per sample.

For AggV2:

module load bio/BCFtools/1.10.2-GCC-8.3.0
bcftools query -r chr2:213006985 \
  -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
  /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_211052166_213676386.vcf.gz > variant_genotypes.tsv

For somAgg:

module load bio/BCFtools/1.10.2-GCC-8.3.0
bcftools query -r chr2:213006985 \
  -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%INFO/OLD_MULTIALLELIC\t%INFO/OLD_CLUMPED\t%FILTER\t%GT\n]' \
  /gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/genomic_data/somAgg_dr12_chr2_211052166_213676386.vcf.gz > somatic_variant_genotypes.tsv
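This will give a tab-separated output with one row per sample and the following columns: sample ID, chromosome, position, reference allele, alternate allele, OLD_MULTIALLELIC, OLD_CLUMPED, filter and genotype.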
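To go from this per-sample table to a list of samples carrying the variant, you could filter on the genotype column. The sketch below uses mock input: the sample IDs and alleles are invented for illustration, and the file name is hypothetical. It assumes the genotype is the ninth column, as in the format string above, and that the alternate allele is coded as 1.

```shell
# Mock input with the same nine columns as the query output.
# Sample IDs and alleles are invented for illustration only.
genotypes_tsv="variant_genotypes_example.tsv"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    SAMPLE_A chr2 213006985 G A . . PASS 0/0 \
    SAMPLE_B chr2 213006985 G A . . PASS 0/1 \
    SAMPLE_C chr2 213006985 G A . . PASS 1/1 \
    SAMPLE_D chr2 213006985 G A . . PASS ./. > "$genotypes_tsv"

# Split the genotype (column 9) on "/" or "|" and keep samples
# where at least one allele is the alternate allele (coded 1).
carriers=$(awk -F '\t' '{
    n = split($9, allele, /[\/|]/)
    for (i = 1; i <= n; i++) {
        if (allele[i] == "1") { print $1; break }
    }
}' "$genotypes_tsv" | sort -u)

echo "$carriers"
```

Homozygous reference (0/0) and missing (./.) genotypes are dropped, so only the heterozygous and homozygous alternate samples remain.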
AggV2 was created in release 8 and completed in release 10, whereas somAgg was created in release 12. Neither aggregate therefore contains participants who were added to the dataset after those releases, and both include some participants who have since withdrawn consent. Always cross-reference your results against the current release to make sure they contain no participants who are no longer consented.
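One way to apply this check is to keep only the rows whose sample appears in a list taken from the current release. The sketch below is a minimal illustration with mock data. How you obtain that list is not covered here, and the example assumes it uses the same identifiers as the first column of your results. Both file names and all sample IDs are hypothetical.

```shell
# Mock inputs (hypothetical names and IDs):
#   consented: one sample ID per line, taken from the current release
#   results:   tab-separated results with the sample ID in column 1
consented="consented_samples_example.txt"
results="carrier_genotypes_example.tsv"
printf '%s\n' SAMPLE_B SAMPLE_D > "$consented"
printf '%s\t%s\n' SAMPLE_B 0/1 SAMPLE_C 1/1 > "$results"

# While reading the first file, remember each consented ID.
# While reading the second file, print rows whose first column is in that set.
kept=$(awk -F '\t' 'FILENAME == ARGV[1] { keep[$1]; next } ($1 in keep)' \
    "$consented" "$results")

echo "$kept"
```

Here SAMPLE_C is removed because it is not in the consented list, while SAMPLE_B is kept.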
Both aggregates only contain genomes that were aligned to GRCh38.