The Aggregate Variant Testing workflow changelog
Please always use the latest available version unless explicitly instructed otherwise. Older versions may still be available in the RE but will not be supported by our team.
This release adds RVtests as an additional method for running rare variant tests. Note that it is implemented following the method described in Nature, so not all the functionality of RVtests is available. In particular, this implementation of RVtests does not use covariates.
Fixed a bug in functional annotation filtering: variant consequence was not taken into account if you were filtering on an annotation (e.g. gnomAD frequency) while also allowing the inclusion of variants where that annotation was missing. This led to more variants passing the filter than expected.
Made improvements to the phenotype file processing, so now phenotype files with multiple blank lines at the end no longer cause workflow issues.
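The corrected functional annotation filtering logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the workflow's actual code: the function name `passes_filter`, its parameters, and the 0.001 frequency threshold are all hypothetical. The point it shows is that the consequence check applies whether or not the annotation is present.

```python
from typing import Optional, Set


def passes_filter(
    consequence: str,
    gnomad_af: Optional[float],
    allowed_consequences: Set[str],
    max_af: float,
    allow_missing: bool,
) -> bool:
    """Return True if a variant passes both the consequence and annotation checks.

    All names here are hypothetical; gnomad_af is None when the
    annotation is missing for the variant.
    """
    # The consequence check is always evaluated. The bug described above
    # corresponds to skipping this check for variants with a missing annotation.
    consequence_ok = consequence in allowed_consequences

    if gnomad_af is None:
        # Missing annotation: pass this part only if missing values are allowed.
        annotation_ok = allow_missing
    else:
        annotation_ok = gnomad_af <= max_af

    return consequence_ok and annotation_ok


allowed = {"missense_variant", "stop_gained"}

# An intronic variant with no gnomAD frequency must still be rejected.
print(passes_filter("intron_variant", None, allowed, 0.001, True))    # False
# A missense variant with no gnomAD frequency is kept when missing is allowed.
print(passes_filter("missense_variant", None, allowed, 0.001, True))  # True
```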
This version includes new options in the inputs.json file, so you cannot reuse an inputs.json file from version 3.
Major update and reworking of the entire pipeline. Please see the v3.x documentation page for an overview of all the new features. The below list is just a few highlights.
Now takes either BGEN or PGEN files as input for genomic data, instead of VCF (annotation input unchanged).
Can now run on any number of phenotypes, as long as all are defined in your phenotype file.
Functional filtering is now more flexible and allows AND and OR filtering in the same run.
Includes Regenie as an additional program for burden testing.
Minor updates to the options file.
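As an illustration of combining AND and OR filtering in a single run, the sketch below represents a filter as a list of OR-ed groups, where every condition inside a group is AND-ed. This representation is an assumption made for the example and is not the workflow's configuration syntax; the field names (`consequence`, `af`, `cadd`) and the thresholds are likewise hypothetical.

```python
from typing import Callable, Dict, List

Variant = Dict[str, object]
Predicate = Callable[[Variant], bool]


def passes(variant: Variant, or_groups: List[List[Predicate]]) -> bool:
    """A variant passes if ALL predicates hold in at least ONE group."""
    return any(all(pred(variant) for pred in group) for group in or_groups)


# Hypothetical predicates over hypothetical annotation fields.
def is_lof(v: Variant) -> bool:
    return v["consequence"] in {"stop_gained", "frameshift_variant"}


def is_missense(v: Variant) -> bool:
    return v["consequence"] == "missense_variant"


def is_rare(v: Variant) -> bool:
    return v["af"] < 0.001


def is_predicted_deleterious(v: Variant) -> bool:
    return v["cadd"] >= 20


# (LoF AND rare) OR (missense AND rare AND predicted deleterious)
FILTER = [
    [is_lof, is_rare],
    [is_missense, is_rare, is_predicted_deleterious],
]

variant = {"consequence": "missense_variant", "af": 0.0002, "cadd": 25.0}
print(passes(variant, FILTER))  # True: the second group matches
```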
New functional annotation files (produced using VEP v99 in July 2021) are now used by default.
Fixed the options file, so that now a task job that fails while running (transient job failures, as opposed to jobs executing fully but exiting with an error code) will be run again up to 5 times before stopping the workflow.
MIT-style license attached.
Fixed a bug in differential missingness checks when processing indels.
Empty output is now allowed and does not crash the workflow.
The workflow is now tested for biallelic indels, too.
New memory and queue requirements make it easier to run on large cohorts with default settings.
Changed the default memory value for task create_regions_files, which was causing the workflow to crash on large cohorts.
Changed the declaration type of the memory value in the config file from Float to Int, to avoid issues with LSF flags on the HPC.
New input options make it clearer how to run the workflow using a pre-computed GRM.
A new filter for differential missingness is added to the GRM creation step.
There is a new output file with counts of variants in each MAC category used.
During the VEP functional annotation filtering step, if the empty string is provided as the value for the variable "vep_severity_to_include", then all variants are accepted. This is the same behaviour as "bcftools +split-vep -s worst:".
Only autosomes are used to create the GRM for SAIGE-GENE, because some of the chrX files occasionally gave errors similar to reported bugs in indexing of sex chromosomes.
You can now use chrX with both the "aggV2" and the "aggV2_PASS_UTRplus_proteincodinggenes" input variant datasets.
In the case of gene-based input, i.e. a "chromosome" file or a "gene" file, during the VEP functional annotation filtering step, all variants for each gene are now included or excluded according to the "worst" Consequence on any transcript for that gene. In the case of coordinate-based inputs, the "worst" Consequence at each location is selected, as in previous behaviour. In the case of "groups" input, the VEP functional annotation filtering step is skipped.
Input genes, groups, or coordinate blocks that are split across more than one "chunk" of the input variant dataset are now processed as a whole, after the "chunks" are resized appropriately. Therefore, output results do not have a "__chunkXXXX" specification appended to each gene/group/coordinate-block name any more.
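The gene-based "worst" Consequence filtering described above, together with the empty-string behaviour of "vep_severity_to_include", can be sketched as follows. This is one reading of the changelog entries rather than the workflow's implementation: the severity list is an abbreviated, illustrative subset of VEP consequence terms, the comma-separated format of the setting is an assumption, and the function and gene names are hypothetical.

```python
from typing import List, Tuple

# Abbreviated, illustrative subset of VEP consequence terms,
# ordered from most to least severe.
SEVERITY_ORDER = [
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "synonymous_variant",
    "intron_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY_ORDER)}

# One annotation per (gene, transcript, consequence) for a single variant.
Annotation = Tuple[str, str, str]


def worst_consequence(consequences: List[str]) -> str:
    """Return the most severe term (lowest rank) from a non-empty list."""
    return min(consequences, key=lambda term: RANK[term])


def keep_variant(annotations: List[Annotation], gene: str,
                 severity_to_include: str) -> bool:
    """Decide whether a variant is kept for a given gene.

    severity_to_include is assumed to be a comma-separated list of terms;
    the empty string accepts every variant.
    """
    if severity_to_include == "":
        return True
    # Only transcripts of the gene being tested are considered.
    in_gene = [cons for g, _transcript, cons in annotations if g == gene]
    if not in_gene:
        return False
    allowed = set(severity_to_include.split(","))
    return worst_consequence(in_gene) in allowed


annotations = [
    ("GENE_A", "transcript_1", "missense_variant"),
    ("GENE_A", "transcript_2", "stop_gained"),
    ("GENE_B", "transcript_3", "synonymous_variant"),
]

# For GENE_A the worst consequence is stop_gained, so a missense-only
# filter excludes the variant even though one transcript is missense.
print(keep_variant(annotations, "GENE_A", "missense_variant"))  # False
print(keep_variant(annotations, "GENE_A", "stop_gained"))       # True
print(keep_variant(annotations, "GENE_B", ""))                  # True
```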
You can now also specify inputs as SAIGE-GENE-like "groups" of variants.
Differential missingness filters have been introduced.
The GRM used by SAIGE-GENE can now be created very quickly by specifying the relevant plink files as an input.
You can choose to use the full "aggV2", or the much smaller "aggV2_PASS_UTRplus_proteincodinggenes", as an input variant dataset.
First release - working by coordinate, i.e. following the boundaries of the input variant dataset's chunks strictly.