Known issues and limitations¶
Containers¶
Failure to pull singularity image¶
Occasionally, network issues cause image pulls to timeout. If the workflow fails with a status 143 error, copy the images to your cache.
Copy the required images to your singularity cache
cp /gel_data_resources/workflows/input_material/rdp_small_variant/containers/* $HOME/.singularity/cache/
Parameters¶
Compute resources (cpus, memory, queue)¶
Longer genes, or genes that contain a large number of variants, may require additional memory in the merge step. Merge resources are defined dynamically as
merge_queue = { task.attempt < 2 ? 'short' : 'medium' }
merge_memory = { 16.GB + (16.GB * (task.attempt - 1)) }
merge_cpus = { 4 + (4 * (task.attempt - 1)) }
You can set these on the command line, e.g., --merge_cpus 4
. If you encounter a resource problem for a particular gene, try using a larger value on the command line, while maintaining the recommended cpu:memory ratios (1 cpu per 4GB memory HPC; 1 cpu per 6GB memory CloudOS)
Note
Set any Nextflow workflow parameter on the command line with --<parameter_name> <value>
. For a full list of defined parameters nextflow config -flat -profile hpc main.nf | grep params. | grep --color=auto --color "="
Processes¶
Normalisation and variant representation¶
The normalisation and left-alignment happens in 3 steps (excerpt from workflow process second_round_merge
below)
bcftools merge \
--file-list chunk.list \
--merge both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
| bcftools norm --fasta-ref ${reference_fasta} -m+both \
| bcftools norm --fasta-ref ${reference_fasta} -m-both \
-Ob -o ${build}_merged_normalized_1.bcf
The bcftools norm -m
(or --multiallelics
) option "split(s) multiallelic sites into biallelic records (-) or joins biallelic sites into multiallelic records (+). An optional type string can follow which controls variant types which should be split or merged together: If only SNP records should be split or merged, specify snps; if both SNPs and indels should be merged separately into two records, specify both; if SNPs and indels should be merged into a single record, specify any." The both
option indicates "SNPs and indels should be merged separately into two records".
Running bcftools norm -m-both
once produces duplicated variants for positions where both MNPs and SNPs are observed. Normalisation decomposes the MNP into separate SNPs that do not combine with canonical SNPs (not derived from MNPs), as they have different associated metrics (QUAL, DP, GQ, etc.). The alternative would be to have multiple entries for the same SNP, with different allele counts (AN, AC, AF).
See detailed documentation on variant duplication. (Note. The documentation describes characteristics of aggV2 in which vt
was used for the normalisation. The same considerations apply and vt
and bcftools norm
give identical output.)
Output¶
Additional INFO tags¶
Additional bcftools
INFO tags are computed with the plugin fill-tags
. These per variant tags are derived from the merged single-sample VCF and combined with the VEP annotation output in the final annotated variant file.
ID | Description | Meaning |
---|---|---|
AN | Total number of alleles in called genotypes | |
AC | Allele count in genotypes | |
NS | Number of samples with data | The number of samples which have been included for a particular variant, and is equal to the number of samples with any variant at the given chromosomal position |
AC_Hom | Allele counts in homozygous genotypes | The number of alleles for the given variant counted from the genotypes listed in the multi-sample VCF |
AC_Het | Allele counts in heterozygous genotypes | |
AC_Hemi | Allele counts in hemizygous genotypes | |
MAF | Frequency of the second most common allele |
Annotated variant file columns¶
- *_variant columns are
bcftools
output - *_annotation columns are
VEP
annotation output - *_sample columns derived from filtering on bi-allelic genotype
Name | Description |
---|---|
CHROM_variant | chromosome |
POS_variant | position |
ID_variant | identifier |
REF_variant | reference base(s) |
ALT_variant | alternate base(s) |
(...) | bcftools +fill-tags columns described above |
DUP_variant | |
(...) | VEP annotation output |
Het_samples | samples from output VCF selected with bcftools -i 'GT="het"' |
Hom_samples | samples from output VCF selected with bcftools -i 'GT="AA"' |
Hemi_samples | samples from output VCF selected with bcftools -i 'GT="A"' |
Het/Hemi variant and sample count inconsistency¶
We are currently investigating an inconsistency in the variant and sample counts in the annotated variant output file, for a small number of variants. This affects hemizygous and heterozygous variant (AC_Het_variant
, AC_Hemi_variant
) and sample (Het_sample
, Hemi_sample
) counts.
Autosomal hemizygosity
For autosomal variants, the majority of samples will have diploid genotypes (e.g. 0/1). However, some samples will have haploid (hemizygous-like) calls (e.g. 1) for certain variants, where a variant lies within a deletion.
CHROM | POS | REF | ALT | S1 |
---|---|---|---|---|
chr1 | 2118754 | TGA | T | 0/1 |
chr1 | 2118755 | G | . | 0 |
chr1 | 2118756 | A | T | 1 |
In the example above, position 2118754 is a 2 basepair deletion. Bases GA are in positions 2118755 and 2118756, respectively. Sample S1 is heterozygous REF/ALT (diploid) at position 2118754; they have the deletion on one chromosome. At position 2118755, S1 is REF (haploid). At position 2118756, S1 is ALT (haploid). These haploid calls are not an artefact of the workflow, but are seen in the single-sample VCFs, and will have AC_Hemi_variant
> 0 in the annotated variant output.
Cannot combine diploid with haploid genotype¶
For a small number of genes, the workflow SECOND_ROUND_MERGE
process fails with the error: cannot combine diploid with haploid genotype, e.g.,
Error at X:31132900: cannot combine diploid with haploid genotype
Lines total/split/realigned/skipped: 564/23/33/0
Error at chrX:100301248: cannot combine diploid with haploid genotype
Lines total/split/realigned/skipped: 984/40/12/0
The error arises because bcftools norm
join (-m+both
) cannot combine the decomposed variants from the previous split step (-m-both
), in the repeated normalisation. For example, DMD for 4 test samples (M, F, M, F) includes both haploid and diploid genotypes at chrX:31709584 after the first norm -m-both
.
chrX 31708374 ./. ./. ./. 0/1
chrX 31709555 ./. 0/1 0 0/1
chrX 31709555 ./. 0/0 1 0/0
chrX 31709584 ./. 0/1 ./. 0/0
chrX 31709584 ./. 0/0 ./. 0/1
chrX 31709584 . . 1 .
chrX 31710931 . . 1 .
We are currently working on a solution.
Memory allocation for VEP annotation on CloudOS¶
On CloudOS, the awsbatch executor copies the VEP cache (~15GB) to the compute instance for each Nextflow task (i.e. each query gene). With multiple tasks being assigned to the same compute instance, we have observed that the annotation process can complete, with all columns in the annotated variant file output, but with a "download failed" message and "cannot allocate memory" error reported in the log. We are currently exploring solutions to this issue.
Warning
Check the logs to be sure the VEP annotation has run succesfully.