Troubleshooting the gene-variant workflow¶
Why is my workflow encountering memory errors?¶
The most likely cause is that the genes being processed contain an unusually high number of variants, which exhausts memory during the aggregation steps of the workflow. Longer genes can cause the same problem. Examples of such genes include:
Gene name | Ensembl ID |
---|---|
TNNI3 | ENSG00000129991 |
KIF1A | ENSG00000130294 |
ACTN2 | ENSG00000077522 |
PRKAG2 | ENSG00000106617 |
RYR2 | ENSG00000198626 |
SCN5A | ENSG00000183873 |
CRELD1 | ENSG00000163703 |
BBS10 | ENSG00000179941 |
To run the workflow on such problematic genes, the following memory allocations may be used.
In quick_merge.wdl:
Task to be edited | Parameter | Default | Change to... |
---|---|---|---|
split | memory | 1GB | 2GB |
split | cpu | 1 | 1 |
split | queue | short | short |
first_round_merge | memory | 20GB | 32GB |
first_round_merge | cpu | 1 | 2 |
first_round_merge | queue | short | medium |
second_round_merge | memory | 10 | 48GB |
second_round_merge | cpu | 1 | 3 |
second_round_merge | queue | short | medium |
In annotation.wdl:
Task to be edited | Parameter | Default | Change to... |
---|---|---|---|
fill_tags_query | memory allocation | 2GB | 5GB |
fill_tags_query | number of cores | 1 | 1 |
fill_tags_query | queue | short | short |
annotate | memory allocation | 1GB | 5GB |
annotate | number of cores | 4 | 4 |
annotate | queue | short | short |
sum_and_annotate | memory allocation | 5GB | 10GB |
sum_and_annotate | number of cores | 1 | 1 |
sum_and_annotate | queue | short | short |
Note: A version of the workflow with these memory allocations is not actively supported.
Questions relating to autosomal hemizygosity¶
What do the AC_variant and NS_variant columns mean?¶
The AC and NS columns are produced by the bcftools fill-tags plugin.
Column | Description |
---|---|
AC_variant | Allele count in genotypes. This is the number of alleles for the given variant counted from the genotypes listed in the VCF. |
NS_variant | Number of samples with data. This is the number of samples which have been included for this particular variant, and is equal to the number of samples with any variant at the given chromosomal position. |
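To make the two definitions concrete, the following is a minimal Python sketch of how an allele count and a sample count could be derived from a list of genotype strings for a variant with a single ALT allele. It is an illustration only and not the bcftools implementation; the function name `allele_count_and_ns` and the choice to count a sample as soon as it has at least one called allele are assumptions.

```python
def allele_count_and_ns(genotypes):
    """Return (AC, NS) for a single-ALT variant from a list of GT strings.

    Hypothetical helper for illustration; not part of bcftools or the workflow.
    """
    ac = 0  # number of ALT alleles seen across all genotypes
    ns = 0  # number of samples with at least one called allele (assumption)
    for gt in genotypes:
        # Treat phased (|) and unphased (/) separators the same way.
        alleles = gt.replace("|", "/").split("/")
        called = [a for a in alleles if a != "."]
        if called:
            ns += 1
        ac += sum(1 for a in called if a == "1")
    return ac, ns


# Three diploid calls, one missing genotype and one haploid ALT call.
example_genotypes = ["0/1", "1/1", "0/0", "./.", "1"]
print(allele_count_and_ns(example_genotypes))
```

In this sketch a haploid ALT call contributes one allele to AC, whereas a homozygous diploid ALT call contributes two.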
Why is AC_Hemi_variant > 0 when my gene of interest is not on a sex chromosome?¶
For autosomal variants, the majority of samples will have diploid genotypes (e.g. 0/1). However, some samples will have haploid (hemizygous-like) calls (e.g. 1) for certain variants. A haploid call indicates that, in that sample, the position lies within a deletion called on the other chromosome, so only one copy of the position is present and only one allele can be genotyped.
These haploid calls are not produced as part of the aggregation procedure, but are seen in the single-sample gVCFs.
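As a rough illustration of how haploid ALT calls can be told apart from diploid ones, the Python sketch below counts genotypes that consist of a single non-reference allele. The function name `count_haploid_alt` is hypothetical, and the idea that a column such as AC_Hemi_variant counts exactly these calls is an assumption, not something confirmed by the workflow.

```python
def count_haploid_alt(genotypes):
    """Count samples whose genotype is a single called ALT allele (e.g. "1").

    Hypothetical helper for illustration; not taken from the workflow.
    """
    count = 0
    for gt in genotypes:
        # Treat phased (|) and unphased (/) separators the same way.
        alleles = gt.replace("|", "/").split("/")
        # Haploid means exactly one allele; skip reference ("0") and missing (".").
        if len(alleles) == 1 and alleles[0] not in ("0", "."):
            count += 1
    return count


# Only the single "1" call is a haploid ALT genotype here.
print(count_haploid_alt(["0/1", "1", "0", "1/1", "."]))
```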
Worked Example¶
In the single-sample gVCF, we have identified the following variant, where the genotype is represented as a haploid ALT call:
CHROM | POS | REF | ALT | GT | Description |
---|---|---|---|---|---|
chr1 | 2118756 | A | T | 1 | Haploid ALT genotype identified |
On closer inspection of the single-sample gVCF, we see a heterozygous call (0/1) for a 2 bp deletion (TGA > T) recorded at position 2118754, 2 bp upstream of the variant. The deletion removes bases 2118755 - 2118756. The A > T SNP above is therefore represented as haploid, because it lies within a deletion called on the other chromosome.
Note that reference calls within the deleted span are also haploid (for example, the G reference call at position 2118755).
CHROM | POS | REF | ALT | GT | Description |
---|---|---|---|---|---|
chr1 | 2118754 | TGA | T | 0/1 | 2bp deletion of bases GA from position 2118755 - 2118756. Called as heterozygous (diploid). |
chr1 | 2118755 | G | . | 0 | We know the G base in position 2118755 is deleted on one chromosome, but on the other it is REF - therefore the hemizygous genotype 0 is called (haploid). |
chr1 | 2118756 | A | T | 1 | We know the A in position 2118756 is deleted on one chromosome, but on the other it is ALT - therefore a hemizygous genotype 1 is called (haploid). |
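The positional reasoning in this example can be checked with a short Python sketch. It assumes a simple deletion written in the usual VCF style, where ALT is the leading base(s) of REF, and works out which positions are removed. The function names `deleted_interval` and `within_deletion` are hypothetical and are not part of the workflow.

```python
def deleted_interval(pos, ref, alt):
    """Return the 1-based inclusive (start, end) of bases removed by a simple
    deletion, or None if the record is not a simple deletion.

    Assumes ALT is a prefix of REF (the padding base), as in TGA > T.
    """
    if len(ref) > len(alt) and ref.startswith(alt):
        return pos + len(alt), pos + len(ref) - 1
    return None


def within_deletion(site_pos, del_pos, del_ref, del_alt):
    """True if site_pos falls inside the bases removed by the deletion."""
    interval = deleted_interval(del_pos, del_ref, del_alt)
    return interval is not None and interval[0] <= site_pos <= interval[1]


# Deletion from the worked example: chr1:2118754 TGA > T
print(deleted_interval(2118754, "TGA", "T"))           # (2118755, 2118756)
print(within_deletion(2118756, 2118754, "TGA", "T"))   # True: the A > T SNP
print(within_deletion(2118754, 2118754, "TGA", "T"))   # False: the padding base
```

Both position 2118755 (the G reference call) and position 2118756 (the A > T SNP) fall inside the deleted interval, which is consistent with their haploid genotypes in the table above.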