somAgg code book functional annotation queries¶
Disclaimer
Genomics England imposes no restrictions on access to, or use of, the data provided and the software used to analyse and present it.
Some of the data and software included in the distribution may be subject to third-party constraints. Users of the data and software are solely responsible for establishing the nature of and complying with any such restrictions.
We can use split-vep to query and parse the functional annotation from VEP.
bcftools version
Please use bcftools version 1.16 via: module load bcftools/1.16
Snippets
Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.
The queries below work with the functional annotation data files present in:
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/functional_annotation/VEP
View CellBase annotations¶
You can view all of the available CellBase annotations using the command below - all annotations can be extracted and/or filtered for. Note that, split-vep has been designed to work with VEP and VEP-like annotation. The Cellbase annotation provided in somAgg is VEP-like. This requires the annotation field to be given explicitly to split-vep, which is done by using the flag -a CT
.
Output:
...
0 GeneName
1 TranscriptID
2 CDSchange
3 ProteinChange
4 Consequence
5 siftDescription
6 siftScore
7 polyphenDescription
8 polyphenScore
...
Extract variants for a gene of interest¶
Question: I want to extract all variants in the gene IKZF1 and view some basic annotation.
Script: Use bcftools split-vep. Use the flag -a CT
to tell split-vep what the annotation field is called. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i
flag. There will be one annotation per line (for each transcript - using the -d
flag). The -f
option formats the output with the attributes included. The >
character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f
command above.
...
chr7 50299131 G A IKZF1 ENST00000484847 upstream_variant .
chr7 50299139 A AGT IKZF1 ENST00000484847 upstream_gene_variant .
chr7 50299147 C T IKZF1 ENST00000484847 upstream_variant .
chr7 50299148 G A IKZF1 ENST00000484847 upstream_variant .
...
* data have been randomised and subset
Extract variants for a gene of interest with a damaging prediction¶
Question: I want to view the variants that that are missense or worse in the gene IKZF1.
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s
and -S
flag). Use the flag -a CT
to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d
flag). The -f
option formats the output with the attributes included. The >
character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.
...
chr7 50327647 G T IKZF1 ENST00000413698 missense_variant p.(SER17ILE)
chr7 50327647 GC G IKZF1 ENST00000413698 frameshift_variant p.(SER17SER)
chr7 50327653 C T IKZF1 ENST00000413698 missense_variant p.(PRO19LEU)
chr7 50327661 G C IKZF1 ENST00000413698 missense_variant p.(ASP22HIS)
chr7 50327661 G A IKZF1 ENST00000413698 missense_variant p.(ASP22ASN)
...
* data have been randomised and subset
When using bcftools 1.11.x specifying severity constraints via the -s
flag, as above, is redudant. However, you may refer to the example to provide your own severity scale if needed. The one above can be found at:
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt
Extract variants for a gene of interest – Genomics England Interpretation pipeline¶
Question: I want to extract per sample variants as presented in the analysis_csv
(LabKey table cancer_analysis
) in the gene IKZF1.
Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s
and -S
flag). Use the flag -a CT
to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d
flag). The -f
option formats the output with the attributes included. The >
character writes the output to a tab-delimited file.
Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f
command above.
...
chr7 50327647 G T IKZF1 ENST00000413698 missense_variant p.(SER17ILE)
chr7 50327647 GC G IKZF1 ENST00000413698 frameshift_variant p.(SER17SER)
chr7 50327653 C T IKZF1 ENST00000413698 missense_variant p.(PRO19LEU)
chr7 50327661 G C IKZF1 ENST00000413698 missense_variant p.(ASP22HIS)
chr7 50327661 G A IKZF1 ENST00000413698 missense_variant p.(ASP22ASN)
...
* data have been randomised and subset
The list of transcripts used by the Cancer Programme Genomics England Interpretation pipeline can be found here:
/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/LIST_OF_CANONICAL_TRANSCRIPTS.v1.11.tsv
Alternatively, the MANE project is also available on our HPC, at:
/public_data_resources/ensembl_cds/MANE/v0.95/