The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

somAgg code book functional annotation queries¶

Disclaimer

Genomics England imposes no restrictions on access to, or use of, the data provided and the software used to analyse and present it.

Some of the data and software included in the distribution may be subject to third-party constraints. Users of the data and software are solely responsible for establishing the nature of and complying with any such restrictions.

We can use split-vep (available with bcftools version 1.10.2-foss-2018b) to query and parse the functional annotation from VEP.

bcftools version

Please use bcftools version 1.10.2 via: module load bio/BCFtools/1.10.2-GCC-8.3.0

Snippets

Within each snippet shown below, most lines end with the '\' character. This is not part of the command but a shorthand notation meaning "keep reading the next line as part of a single command." We do this to split each snippet over multiple lines so it is easier to read.

The queries below work with the functional annotation data files present in:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/functional_annotation/VEP

View CellBase annotations¶

You can view all of the available CellBase annotations using the command below - all annotations can be extracted and/or filtered for. Note that, split-vep has been designed to work with VEP and VEP-like annotation. The Cellbase annotation provided in somAgg is VEP-like. This requires the annotation field to be given explicitly to split-vep, which is done by using the flag -a CT.

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT -l somAgg_dr12_chr21_21996481_24481186.vcf.gz

Output:

...
0   GeneName
1   TranscriptID
2   CDSchange
3   ProteinChange
4   Consequence
5   siftDescription
6   siftScore
7   polyphenDescription
8   polyphenScore
...

Extract variants for a gene of interest¶

Question: I want to extract all variants in the gene IKZF1 and view some basic annotation.

Script: Use bcftools split-vep. Use the flag -a CT to tell split-vep what the annotation field is called. This example will output variants annotated against all transcripts for the gene of interest - IKZF1 - using the -i flag. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT somAgg_dr12_chr7_48864936_51531027.vcf.gz \
-i 'GeneName="IKZF1"' -d \
-f '%CHROM\t%POS\t%REF\t%ALT\t%GeneName\t%TranscriptID\t%Consequence\t%CDSchange\n' \
> IKZF1_variants.tsv
</pre>

Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.

...
chr7    50299131    G   A   IKZF1   ENST00000484847 upstream_variant        .
chr7    50299139    A   AGT IKZF1   ENST00000484847 upstream_gene_variant   .
chr7    50299147    C   T   IKZF1   ENST00000484847 upstream_variant        .
chr7    50299148    G   A   IKZF1   ENST00000484847 upstream_variant        .
...

* data have been randomised and subset

Extract variants for a gene of interest with a damaging prediction¶

Question: I want to view the variants that that are missense or worse in the gene IKZF1.

Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s and -S flag). Use the flag -a CT to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT \
-i 'GeneName="IKZF1"' -d  \
-f '%CHROM\t%POS\t%REF\t%ALT\t%GeneName\t%TranscriptID\t%Consequence\t%ProteinChange\n'  \
-c GeneName,TranscriptID,Consequence,ProteinChange  \
-s worst:missense+  \
-S ../additional_data/vep_severity_scale/VEP_severity_scale_2020.txt  \
somAgg_dr12_chr7_48864936_51531027.vcf.gz    \
> ../IKZF1_worse_missense.tsv

Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.

...
chr7    50327647    G   T   IKZF1   ENST00000413698 missense_variant    p.(SER17ILE)
chr7    50327647    GC  G   IKZF1   ENST00000413698 frameshift_variant  p.(SER17SER)
chr7    50327653    C   T   IKZF1   ENST00000413698 missense_variant    p.(PRO19LEU)
chr7    50327661    G   C   IKZF1   ENST00000413698 missense_variant    p.(ASP22HIS)
chr7    50327661    G   A   IKZF1   ENST00000413698 missense_variant    p.(ASP22ASN)
...

* data have been randomised and subset

When using bcftools 1.11.x specifying severity constraints via the -s flag, as above, is redudant. However, you may refer to the example to provide your own severity scale if needed. The one above can be found at:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/vep_severity_scale/VEP_severity_scale_2020.txt

Extract variants for a gene of interest – Genomics England Interpretation pipeline¶

Question: I want to extract per sample variants as presented in the analysis_csv (LabKey table cancer_analysis) in the gene IKZF1.

Script: Use bcftools split-vep. This example will output variants annotated against all transcripts for the gene of interest (IKZF1) that are missense or worse (-s and -S flag). Use the flag -a CT to tell split-vep what the annotation field is called. There will be one annotation per line (for each transcript - using the -d flag). The -f option formats the output with the attributes included. The > character writes the output to a tab-delimited file.

#!/bin/bash

module load bio/BCFtools/1.11.2-GCC-8.3.0

bcftools +split-vep -a CT \
-i 'GeneName="IKZF1" & TranscriptID="ENST00000331340" & FMT/FILTER="PASS" & (Consequence="frameshift_variant" | Consequence="inframe_deletion" | Consequence="inframe_insertion" | Consequence="missense_variant" | Consequence="splice_acceptor_variant" | Consequence="splice_donor_variant" | Consequence="splice_region_variant" | Consequence="start_lost" | Consequence="stop_gained" | Consequence="stop_lost")' -d  \
-f '[%CHROM\t%POS\t%REF\t%ALT\t%SAMPLE\t%GeneName\t%TranscriptID\t%Consequence\t%CDSchange\t%ProteinChange\n]'  \
-c GeneName,TranscriptID,Consequence,CDSchange,ProteinChange  \
somAgg_dr12_chr7_48864936_51531027.vcf.gz    \
> IKZF1_interpreted.tsv

Output: The output is a tab-delimited file in long-format - where each annotation is on a separate line across all variants. The columns are in the same order as stated in the -f command above.

...
chr7    50327647    G   T   IKZF1   ENST00000413698 missense_variant    p.(SER17ILE)
chr7    50327647    GC  G   IKZF1   ENST00000413698 frameshift_variant  p.(SER17SER)
chr7    50327653    C   T   IKZF1   ENST00000413698 missense_variant    p.(PRO19LEU)
chr7    50327661    G   C   IKZF1   ENST00000413698 missense_variant    p.(ASP22HIS)
chr7    50327661    G   A   IKZF1   ENST00000413698 missense_variant    p.(ASP22ASN)
...

* data have been randomised and subset

The list of transcripts used by the Cancer Programme Genomics England Interpretation pipeline can be found here:

/gel_data_resources/main_programme/aggregation/aggregated_somatic_strelka/somAgg/v0.2/additional_data/LIST_OF_CANONICAL_TRANSCRIPTS.v1.11.tsv

Alternatively, the MANE project is also available on our HPC, at:

/public_data_resources/ensembl_cds/MANE/v0.95/