AggV2 code book - combining queries¶
To extract variants and samples that comply to both a set of genotype and functional annotation filters, you must intersect the genotype VCFs with the functional annotation VCFs in two steps.
An example question would be: "I want to extract the samples in aggV2 who are homozygous alt for missense (or worse) rare variants within the gene IKZF1".
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data
bcftools version
Please use bcftools version 1.16 via: module load bcftools/1.16
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
Step A: Extract variants by functional annotation¶
The first step is to extract variants in the functional annotation VCF that are missense or worse of the gene of interest. The output is a VCF.gz file called my_annotation.vcf.gz
.
Step B: Intersect functional annotation VCF with genotype VCF¶
The second step is to intersect the variants from the functional annotation filter VCF (my_annotation.vcf.gz
) with the genotype VCF for the correct chunk. This is done using bcftools isec which will match variants in both VCFs by chrom, pos, as well as ref and alt alleles.
Make sure that the genotype VCF is passed in first and the functional annotation VCF passed in second.
The -i command specifies inclusion criteria for the genotype VCF - in this case we retrieve homozygous alt genotypes with an allele frequency < 0.05. The -e- command specifies that no inclusion/exclusion criteria is needed for the functional annotation VCF. The -p option sets the output folder name (called intersect in this case). The -n=2 option specifies that variants outputted much be present in both files.
We write a VCF.gz file called 0000.vcf.gz
in the intersect folder which will contain rare missense or worse variants in IKZF1 for which there is at least one sample with a homozygous alt genotype.
Step C: Process results¶
We can now process the intersected results to find these samples of interest using bcftools query as shown below. This will print out a tab-delimited file called my_results.tsv
.
Output:
...
LP1234567-DNA_A01 chr7 50327751 G T . . PASS 1/1
LP2345678-DNA_B02 chr7 50368157 C A . . PASS 1/1
LP3456789-DNA_C03 chr7 50400079 G C . . PASS 1/1
...
Pulling data on a subset of samples¶
A common query involves wanting to find variants for a subset of participants. The code below takes an input list of samples, pulls data for variants where AC in the subset of samples is greater than one, and finally queries the functional data for these variants. This is done across the genome and submitted as an array job, with 100 permitted to run at a time. This means that all the chunks run in parallel, which is critical for querying the whole aggregate.
The subset of samples must be supplied as a tab-separated, single column file with one platekey per line.
If you wanted to do this for a subset of chunks, then you must change the input file (named bedfile in the script below). Then update the number of array jobs to submit the relevant number of jobs.