somAgg code book functional annotation queries¶
There is a very useful plugin for bcftools called split-vep (available with bcftools version 1.11.2-foss-2018b) which we will make use of here to query and parse the VEP-like functional annotation from CellBase. Please read the split-vep documentation for more information.
bcftools version
Please use bcftools version 1.10.2 via: module load **bio/BCFtools/1.11.2-GCC-8.3.0**
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
You can view all of the available CellBase annotations using the command below - all annotations can be extracted and / or filtered for. Note that, split-vep has been designed to work with VEP and VEP-like annotation. The Cellbase annotation provided in somAgg is VEP-like. This requires the annotation field to be given explicitly to split-vep, which is done by using the flag -a CT
.
Output:
...
0 GeneName
1 TranscriptID
2 CDSchange
3 ProteinChange
4 Consequence
5 siftDescription
6 siftScore
7 polyphenDescription
8 polyphenScore
...
Extract variants for a gene of interest¶
Question: "I want to extract all variants in the gene IKZF1 and view some basic annotation".
Script: Use bcftools split-vep. Use the flag -a CT
to tell split-vep what the annotation field is called. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i flag. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50299131 G A IKZF1 ENST00000484847 upstream_variant .
chr7 50299139 A AGT IKZF1 ENST00000484847 upstream_gene_variant .
chr7 50299147 C T IKZF1 ENST00000484847 upstream_variant .
chr7 50299148 G A IKZF1 ENST00000484847 upstream_variant .
...
* data have been randomised and subset
Extract variants for a gene of interest with a damaging prediction¶
Question: "I want to view the variants that that are missense or worse in the gene IKZF1".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s and -S flag). Use the flag -a CT
to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50327647 G T IKZF1 ENST00000413698 missense_variant p.(SER17ILE)
chr7 50327647 GC G IKZF1 ENST00000413698 frameshift_variant p.(SER17SER)
chr7 50327653 C T IKZF1 ENST00000413698 missense_variant p.(PRO19LEU)
chr7 50327661 G C IKZF1 ENST00000413698 missense_variant p.(ASP22HIS)
chr7 50327661 G A IKZF1 ENST00000413698 missense_variant p.(ASP22ASN)
...
* data have been randomised and subset
Please note that when using bcftools 1.11.x specifying severity constraints via the -s flag, as above, is redudant. However, you may refer to the example to provide your own severity scale if needed. The one above can be found at:
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt
Extract variants for a gene of interest – Genomics England Interpretation pipeline¶
Question: "I want to extract per sample variants as presented in the analysis_csv (LabKey table cancer_analysis) in the gene IKZF1".
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s and -S flag). Use the flag -a CT
to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50327647 G T IKZF1 ENST00000413698 missense_variant p.(SER17ILE)
chr7 50327647 GC G IKZF1 ENST00000413698 frameshift_variant p.(SER17SER)
chr7 50327653 C T IKZF1 ENST00000413698 missense_variant p.(PRO19LEU)
chr7 50327661 G C IKZF1 ENST00000413698 missense_variant p.(ASP22HIS)
chr7 50327661 G A IKZF1 ENST00000413698 missense_variant p.(ASP22ASN)
...
* data have been randomised and subset
The list of transcripts used by the Cancer Programme Genomics England Interpretation pipeline can be found here:
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/LIST_OF_CANONICAL_TRANSCRIPTS.v1.11.tsv
Alternatively, the MANE project is also available on our HPC, at:
/public_data_resources/ensembl_cds/MANE/v0.95/