AggV2 Code Book - Functional annotation queries
The site QC statistics for chromosome X have now been fixed.
Disclaimer
Genomics England imposes no restrictions on access to, or use of, the data provided and the software used to analyse and present it.
Some of the data and software included in the distribution may be subject to third-party constraints. Users of the data and software are solely responsible for establishing the nature of and complying with any such restrictions.
Overview
The bcftools plugin split-vep (available with bcftools version 1.10.2-foss-2018b) is used throughout this page to query and parse the functional annotation produced by VEP. Please read the split-vep documentation for more information.
bcftools version
Please use bcftools version 1.10.2 via: module load bio/BCFtools/1.10.2-GCC-8.3.0
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
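As a small illustration of the line-continuation notation, the sketch below writes one command over several lines. The values are arbitrary placeholders chosen for the example, not data from aggV2.

```shell
# One command written over four lines; the shell joins them before running it.
# The '\' must be the very last character on the line (no trailing spaces).
joined=$(printf '%s %s %s\n' \
    "chr7" \
    "50319076" \
    "IKZF1")
echo "$joined"
```

If a space follows the '\', the shell treats the next line as a separate command, which is a common cause of "command not found" errors when copying multi-line snippets.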
The queries below work with the functional annotation data files present in:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP
You can list all of the available VEP annotations with the -l option of split-vep. Every annotation listed can be extracted and/or used as a filter.
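A minimal sketch of such a listing command is given here. The file path in ANNOT_VCF is a hypothetical placeholder (substitute one of the annotation files from the VEP directory above), and the helper name list_vep_fields is introduced only for illustration.

```shell
# Hypothetical placeholder path: replace with a real annotation file
# from the functional_annotation/VEP directory.
ANNOT_VCF="${ANNOT_VCF:-/path/to/annotation_VEP.vcf.gz}"

# List the index and name of every VEP subfield in the file.
list_vep_fields() {
    bcftools +split-vep \
        -l \
        "$1"
}

# Only run when bcftools is loaded and the file is readable.
if command -v bcftools >/dev/null 2>&1 && [ -r "$ANNOT_VCF" ]; then
    list_vep_fields "$ANNOT_VCF" | head
else
    echo "bcftools or $ANNOT_VCF not available; nothing was run" >&2
fi
```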
Output:
...
1 Consequence
2 IMPACT
3 SYMBOL
4 Gene
5 Feature_type
6 Feature
7 BIOTYPE
8 EXON
9 INTRON
10 HGVSc
...
Extract variants above threshold for aggV2 derived allele frequencies
Question: "I want to find common variants in genetically inferred European samples".
Script: Use bcftools query with the -i flag to specify the allele frequency cut-off, and the -f flag to determine the output attributes and format.
Note: Use the functional annotation data in the VEP_99 directory to access aggV2 allele frequencies.
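A query of this kind might look like the sketch below. Several details are assumptions rather than facts taken from this page: the file path is a placeholder, the INFO tag name AF_EUR is a guess at how the European allele frequency is labelled (check the VCF header for the real tag), and the 0.05 cut-off is only an illustrative definition of "common".

```shell
# Assumptions: placeholder path, guessed INFO tag name, illustrative cut-off.
AF_VCF="${AF_VCF:-/path/to/VEP_99/annotation_chunk.vcf.gz}"
AF_TAG="${AF_TAG:-AF_EUR}"   # hypothetical; confirm with: bcftools view -h "$AF_VCF"
AF_MIN="${AF_MIN:-0.05}"

# Print position, alleles and allele frequency for variants above the cut-off.
common_variants() {
    bcftools query \
        -i "INFO/${AF_TAG}>${AF_MIN}" \
        -f "%CHROM\t%POS\t%REF\t%ALT\t%INFO/${AF_TAG}\n" \
        "$1"
}

# Only run when bcftools is loaded and the file is readable.
if command -v bcftools >/dev/null 2>&1 && [ -r "$AF_VCF" ]; then
    common_variants "$AF_VCF" | head -5
else
    echo "bcftools or $AF_VCF not available; nothing was run" >&2
fi
```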
Output: The first five lines are printed to screen.
...
chr1 10108 C T 0.073
chr1 10109 A T 0.083
chr1 10147 C A 0.063
chr1 10150 C T 0.067
chr1 10177 A C 0.132
...
Extract variants for a gene of interest
Question: "I want to extract all variants in the gene IKZF1 and view some basic annotation".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i flag. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
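The description above can be sketched as follows. The input path and the output file name are hypothetical placeholders. The field names in the -f string are standard VEP subfields chosen to match the columns in the example output, but it is worth confirming them against the annotation list for your file.

```shell
# Hypothetical placeholders: input file and output file name.
ANNOT_VCF="${ANNOT_VCF:-/path/to/annotation_VEP.vcf.gz}"
GENE="${GENE:-IKZF1}"
OUT_TSV="${OUT_TSV:-${GENE}_variants.tsv}"

# -i keeps annotations for the gene of interest,
# -d prints one line per transcript annotation,
# -f sets the tab-delimited output columns.
gene_variants() {
    bcftools +split-vep \
        -i "SYMBOL=\"${GENE}\"" \
        -d \
        -f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\n' \
        "$1"
}

# Only run when bcftools is loaded and the file is readable.
if command -v bcftools >/dev/null 2>&1 && [ -r "$ANNOT_VCF" ]; then
    gene_variants "$ANNOT_VCF" > "$OUT_TSV"
else
    echo "bcftools or $ANNOT_VCF not available; nothing was run" >&2
fi
```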
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50319076 G A IKZF1 ENST00000641948 non_coding_transcript_exon_variant rs374267123
chr7 50319076 G A IKZF1 ENST00000642219 synonymous_variant rs374267123
chr7 50319076 G A IKZF1 ENST00000645066 synonymous_variant rs374267123
chr7 50319076 G A IKZF1 ENST00000646110 synonymous_variant rs374267123
chr7 50319076 G C IKZF1 ENST00000331340 missense_variant rs374267123
chr7 50319076 G C IKZF1 ENST00000343574 missense_variant rs374267123
...
* data have been randomised and subset
Extract variants for a gene of interest with a damaging prediction
Question: "I want to view the variants in the gene IKZF1 that are missense or worse and rare in gnomAD".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - that are missense or worse (-s and -S flag) and rare in Europeans (use the -c flag for numeric conversion). There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
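One way to combine these flags is sketched below. The following are assumptions rather than details from this page: the input path and output file name are placeholders, the frequency subfield gnomAD_NFE_AF is a guess at which VEP annotation holds the European frequency (check the annotation list for your file), and the 0.01 cut-off is only an illustrative definition of "rare". The severity scale path is the one given further down this page.

```shell
# Hypothetical placeholders and assumed values (see the text above).
ANNOT_VCF="${ANNOT_VCF:-/path/to/annotation_VEP.vcf.gz}"
GENE="${GENE:-IKZF1}"
AF_FIELD="${AF_FIELD:-gnomAD_NFE_AF}"   # assumed subfield name
MAX_AF="${MAX_AF:-0.01}"                # illustrative cut-off
SEVERITY="${SEVERITY:-/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt}"

# -c converts the frequency subfield to Float so it can be compared numerically,
# -s :missense+ selects consequences that are missense or more severe,
# -S supplies the severity scale explicitly,
# -d prints one line per transcript annotation.
rare_damaging_variants() {
    bcftools +split-vep \
        -i "SYMBOL=\"${GENE}\" && ${AF_FIELD}<${MAX_AF}" \
        -c "${AF_FIELD}:Float" \
        -s :missense+ \
        -S "$SEVERITY" \
        -d \
        -f "%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\t%${AF_FIELD}\n" \
        "$1"
}

# Only run when bcftools is loaded and the file is readable.
if command -v bcftools >/dev/null 2>&1 && [ -r "$ANNOT_VCF" ]; then
    rare_damaging_variants "$ANNOT_VCF" > "${GENE}_rare_damaging.tsv"
else
    echo "bcftools or $ANNOT_VCF not available; nothing was run" >&2
fi
```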
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50368156 C T IKZF1 ENST00000413698 missense_variant rs558055360 0
chr7 50368201 A G IKZF1 ENST00000413698 missense_variant rs117111762 0.002
chr7 50368251 A G IKZF1 ENST00000413698 missense_variant rs573829014 0
chr7 50368281 G A IKZF1 ENST00000413698 missense_variant rs562525663 0
chr7 50368296 G A IKZF1 ENST00000413698 missense_variant rs544990441 0
chr7 50376578 G A IKZF1 ENST00000331340 missense_variant rs144637662&COSM6972304 0.001
chr7 50376689 G A IKZF1 ENST00000331340 missense_variant rs549930725 0.001
chr7 50400076 G A IKZF1 ENST00000331340 missense_variant rs148169768 0
chr7 50400355 G C IKZF1 ENST00000331340 missense_variant rs529231990 0
...
* data have been randomised and subset
Please note that when specifying severity constraints using the -s flag, you should currently specify the severity scale explicitly using the -S flag. This is due to a known bug in bcftools 1.10.x (outdated default severity scale), which will be resolved in bcftools 1.11. The severity scale can be found in the location below:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt
Additional annotation queries
Below are some additional queries using split-vep that extract useful annotation.