
How do I identify compound heterozygous mutations within the Genomics England Dataset?


I'd like to know how to identify participants who have pathogenic variants in both copies of the target gene. I have a list of pathogenic variants and I can identify participants who have a specific variant. However, it is difficult to identify participants who have different pathogenic variants in each copy of the gene (compound heterozygous). Could you let me know how I can do it?


Unfortunately, there is currently no centralised resource to assess compound heterozygous variants within the Research Environment. A research group are currently phasing the entire dataset, so once this is ready, one will be able to interrogate compound hets more easily. We don't have a deadline for this release yet.

There are, however, a few ways to go about assessing compound hets, though some scripting and limitations are involved.

IVA: Firstly, on a family-by-family basis, one can look at interpreted families within IVA using the 'interpretation portal' tab. From here you can click on a family, and then select the 'compound heterozygous' filter. You can then filter by, for example, specific genes and variant consequence types.

Tiering: Secondly, if the genes you are interested in are in PanelApp and applied to a family (based on the participant's recruited disease), you will be able to mine the tiering data for compound het mutations as these are flagged within the Rare Disease Interpretation Pipeline. These are based on the filters shown below:

Single sample Filters:
* Affected samples are not 'reference_homozygous' or 'alternate_homozygous'
* NonAffected samples are not 'alternate_homozygous'

Single sample Selection:
* At least one affected is 'heterozygous' or 'alternate_hemizygous'

Family Filter (see note 1):
* Father and mother are not both reference homozygous for the same variant in the pair.

Special Filter:
* None of the NonAffected members of the family are heterozygous for both variants in the pair.
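As a rough illustration of how the filters above combine for one pair of variants, here is a minimal sketch in awk. It is one reading of the listed rules, not the Rare Disease Interpretation Pipeline's actual code. The input layout (one line per family member with role, affection status, and a zygosity label for each variant in the pair) and the function name `check_variant_pair` are assumptions made for this example.

```shell
# Input: one line per family member, whitespace-separated (hypothetical layout):
#   ROLE  AFFECTED(yes/no)  ZYGOSITY_VARIANT_1  ZYGOSITY_VARIANT_2
# Prints PASS if the variant pair satisfies all four filters, otherwise FAIL.
check_variant_pair() {
  awk '
    {
      for (v = 1; v <= 2; v++) {
        z = $(v + 2)
        if ($2 == "yes") {
          # Single sample filter: affected must not be homozygous (ref or alt)
          if (z == "reference_homozygous" || z == "alternate_homozygous") fail = 1
          # Single sample selection: record an affected carrier of this variant
          if (z == "heterozygous" || z == "alternate_hemizygous") carrier[v] = 1
        } else if (z == "alternate_homozygous") {
          # Single sample filter: non-affected must not be alternate homozygous
          fail = 1
        }
        # Count parents that are reference homozygous for this variant
        if (($1 == "father" || $1 == "mother") && z == "reference_homozygous") refhom[v]++
      }
      # Special filter: no non-affected member heterozygous for both variants
      if ($2 == "no" && $3 == "heterozygous" && $4 == "heterozygous") fail = 1
    }
    END {
      for (v = 1; v <= 2; v++) {
        if (!carrier[v]) fail = 1       # no affected carrier of this variant
        if (refhom[v] == 2) fail = 1    # family filter: both parents ref hom
      }
      print (fail ? "FAIL" : "PASS")
    }
  ' "$@"
}

# Example trio: affected proband carries both variants, one from each parent.
printf '%s\n' \
  'proband yes heterozygous heterozygous' \
  'mother  no  heterozygous reference_homozygous' \
  'father  no  reference_homozygous heterozygous' |
  check_variant_pair
```

For the example trio this prints PASS. If the unaffected mother were heterozygous for both variants, the special filter would reject the pair.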

You can access the tiering data in LabKey and filter for segregation pattern = "CompoundHeterozygous".

Scripting: Thirdly, you can script this yourself if you know the structure of the family and the affection status of the individuals. Below is an example of some logic you can apply to flag potential compound hets.

bcftools view -i 'FORMAT/GT[0]="het" && (FORMAT/GT[1]="ref" | FORMAT/GT[1]="het") && (FORMAT/GT[2]="ref" | FORMAT/GT[2]="het")' test.vcf.gz |\
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n' |\
sed -e 's/0\/1/1/g;s/1\/0/1/g;s/0\/0/0/g;s/1\/1/2/g' |\
awk '$6 != $7 {print $0}'

Something like the above is possible. First merge the VCFs of the proband, mother, and father (test.vcf.gz) for the regions you are interested in, with the samples in that order, since the filter refers to them by index (0, 1, 2). Then include only variants where the proband is het, the mother is ref, and the father is het, or where the proband is het, the mother is het, and the father is ref. Finally, query out those positions and print the results to end up with something like:

chr7 117587778 G T 1 1 0
chr7 117627759 T G 1 0 1

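This shows each variant with the proband, mother, and father genotypes, coded as the number of alternate alleles. Here one variant is inherited from the mother and the other from the father, which is the pattern expected for a compound het when both variants lie in the same gene. It's possible to query the annotation (gene and consequence type) after VEP annotation.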
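Once a gene name is available for each variant, the per-variant lines can be grouped to flag genes with at least one maternally inherited and one paternally inherited het variant in the proband. The sketch below assumes the gene symbol has been added as an eighth column, which the pipeline above does not do. The function name `flag_compound_hets` and the gene labels are illustrative only.

```shell
# Flag genes with at least one maternally and one paternally inherited
# heterozygous variant in the proband (candidate compound hets).
# Input columns (whitespace-separated):
#   CHROM POS REF ALT PROBAND MOTHER FATHER GENE
# Genotypes are coded 0/1/2 as in the output above; the GENE column is an
# assumption (for example, taken from VEP annotation).
flag_compound_hets() {
  awk '
    $5 == 1 && $6 == 1 && $7 == 0 { maternal[$8]++ }   # inherited from mother
    $5 == 1 && $6 == 0 && $7 == 1 { paternal[$8]++ }   # inherited from father
    END {
      # Report gene, number of maternal variants, number of paternal variants
      for (gene in maternal)
        if (gene in paternal)
          printf "%s\t%d\t%d\n", gene, maternal[gene], paternal[gene]
    }
  ' "$@" | sort
}

# Example: two variants in one gene (one from each parent) and a second gene
# with only a maternal variant. Gene labels are placeholders.
printf '%s\n' \
  'chr7 117587778 G T 1 1 0 GENE_A' \
  'chr7 117627759 T G 1 0 1 GENE_A' \
  'chr1 1000 A C 1 1 0 GENE_B' |
  flag_compound_hets
```

In this example only GENE_A is reported, with one maternal and one paternal variant. This checks inheritance pattern only; whether each variant is pathogenic still has to be checked against your own variant list.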

Last updated

This page was last updated on the 14 Feb 2020.

  1. Each pair of variants in the gene is taken together for the family filter.