
Aggregate Variant Testing (AVT) workflow

pipeline version: v4.0.0

This version is a beta release

The Nextflow AVT v4.0.0 beta release is now available. We encourage early adopters to provide insights and feedback to improve the final version via our Product board roadmap. Alternatively, please raise a JIRA ticket with the subject line "AVT Nextflow beta release". This feedback will inform further development and subsequent quarterly releases.

The Aggregate Variant Testing (AVT) workflow allows you to run rare variant association tests (SAIGE-GENE+, regenie, Rvtests) on Genomics England data. It compares a case and control cohort to determine the cumulative effects of multiple rare variants within genomic regions defined by coordinates, variants, genes, or chromosomes. As of v4.0.0, the analysis is run on three "branches" (one for each of the programs mentioned above), which can be enabled or disabled independently for each run using input parameters.
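For intuition, the core idea of aggregate (burden-style) testing can be illustrated with a minimal sketch: rare variants within a region (here, a gene) are collapsed into a per-sample carrier indicator, and carrier counts are then compared between cases and controls. This is an illustrative toy example, not the workflow's code; sample and genotype data are invented.

```python
# Minimal sketch of rare-variant collapsing (toy data, not the workflow's code).
# Genotypes: rows = samples, columns = rare variants in one gene (0/1/2 alt alleles).
genotypes = {
    "case_1":    [0, 1, 0],
    "case_2":    [1, 0, 0],
    "control_1": [0, 0, 0],
    "control_2": [0, 0, 1],
    "control_3": [0, 0, 0],
}

def carries_rare_variant(row):
    """A sample is a carrier if it has at least one alternate allele in the gene."""
    return any(g > 0 for g in row)

carriers = {sample: carries_rare_variant(row) for sample, row in genotypes.items()}

# Count carriers among cases and controls; association tests compare these counts.
case_carriers = sum(1 for s, c in carriers.items() if c and s.startswith("case"))
control_carriers = sum(1 for s, c in carriers.items() if c and s.startswith("control"))
print(case_carriers, control_carriers)  # 2 carriers among cases, 1 among controls
```

Collapsing in this way recovers power that per-variant tests lack, because each rare variant on its own is carried by too few samples to test.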

Basic use

Prerequisite: Copy the LSF submission script /gel_data_resources/workflows/rdp_aggregate_variant_testing/v4.0.0/submit.sh to your own folder.

  1. Update your submission script.

    At a minimum, you must update your LSF project code, check that the working directory is correct, and run this from a folder you can write in (see submission script below).

    Additionally, set the parameters of the nextflow command to values relevant to your analysis.
    See Parameters for the full list of input parameters: you can set them in the submission script by prefixing each name with a double dash and appending a value, as the example submission script below does for --outdir results.
    Refer to Input for input file types and formats.

    submit.sh
    #!/bin/bash
    
    #BSUB -P <project_code>
    #BSUB -q inter
    #BSUB -R rusage[mem=10000]
    #BSUB -M 12000
    #BSUB -J avt
    #BSUB -o logs/%J_avt.stdout
    #BSUB -e logs/%J_avt.stderr
    
    module purge
    module load singularity/4.1.1 nextflow/23.10
    
    LSF_JOB_ID=${LSB_JOBID:-default}
    export NXF_LOG_FILE="logs/${LSF_JOB_ID}_avt.log"
    
    SINGULARITY_TMPDIR=/re_scratch/$USER
    export SINGULARITY_TMPDIR
    
    avt="/gel_data_resources/workflows/rdp_aggregate_variant_testing/v4.0.0"
    
    nextflow run "${avt}"/main.nf \
        --project_code "<project_code>" \
        --outdir results \
        --output_file_name_suffix "avt_output" \
        --run_saige_gene true \
        --run_regenie true \
        --run_fishers_test true \
        -work-dir "/re_scratch/$USER" \
        -profile cluster \
        -ansi-log false \
        -resume
    
  2. Run the workflow.

    bsub < submit.sh
    

    Default parameters will run the workflow on an initial cohort of 147 cases and 7452 controls, on chromosomes 21 and 22 only. Any arguments added to the submission script will override the defaults.

    Refer to Output for references to the third-party programs run by the workflow, which are useful for interpreting the output files.

Design

The workflow is written in Nextflow DSL2 and includes the following steps (the corresponding Nextflow module name is given in parentheses).

Figure. Aggregate Variant Testing DAG

pre-processing

  • Make sure input files are valid (VALIDATE_INPUT_FILES)
  • Filter inputs according to the user's criteria (INPUT_FILTERING)
  • Prepare user input cohort file (NORMALISE_COHORT_FILE)
  • Perform the first round of filtering on the genomic data (COHORT_FREQUENCY_REGION_TYPE_FILTERING)
  • Remove samples present in the cohort files but not in the genomic data (FILTER_SAMPLES_IN_COHORT_FILE)
  • Extract the list of masks from the variant list (CREATE_MASK_LIST)
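The FILTER_SAMPLES_IN_COHORT_FILE step above is conceptually a set intersection: cohort samples with no genotypes in the genomic data are dropped. A minimal sketch with invented sample IDs:

```python
# Sketch of dropping cohort samples that are missing from the genomic data
# (what FILTER_SAMPLES_IN_COHORT_FILE does conceptually; IDs are invented).
cohort_samples = ["GE001", "GE002", "GE003", "GE004"]
genomic_data_samples = {"GE001", "GE003", "GE004", "GE099"}

# Keep only cohort samples that also appear in the genomic data,
# preserving the cohort file's original order.
filtered_cohort = [s for s in cohort_samples if s in genomic_data_samples]
print(filtered_cohort)  # ['GE001', 'GE003', 'GE004']
```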

optional annotation filtering

  • Extract headers for functional annotation fields (GET_ANNOTATION_HEADERS)
  • Get the functional annotations for each variant (EXTRACT_VARIANTS_WITH_ANNOTATIONS)
  • Filter the extracted annotations based on the user's variant mask(s) (FILTER_VARIANTS_ON_ANNOTATIONS)
  • Aggregate filtered variants into a whole genome file (AGGREGATE_FILTERED_VARIANTS)
  • Filter the genomic data based on variant annotations as specified by the user (ANNOTATION_FILTERING)
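Annotation filtering as described above amounts to keeping only variants whose functional annotation matches one of the user's masks. The sketch below shows the idea with invented consequence terms and mask names; the workflow's actual mask syntax is described in its Input documentation.

```python
# Sketch of filtering variants on functional annotations according to masks
# (conceptually what FILTER_VARIANTS_ON_ANNOTATIONS does; data are invented).
variants = [
    {"id": "chr21:100:A:T", "consequence": "stop_gained"},
    {"id": "chr21:150:G:C", "consequence": "synonymous_variant"},
    {"id": "chr21:200:T:G", "consequence": "missense_variant"},
]

# A "mask" names the set of annotations whose variants should be kept.
masks = {
    "LoF_only":     {"stop_gained", "frameshift_variant"},
    "LoF_missense": {"stop_gained", "frameshift_variant", "missense_variant"},
}

def apply_mask(variants, allowed):
    return [v["id"] for v in variants if v["consequence"] in allowed]

filtered = {name: apply_mask(variants, allowed) for name, allowed in masks.items()}
print(filtered["LoF_missense"])  # the LoF and missense variants survive this mask
```

Each mask thus yields its own filtered variant set, which is why downstream results are reported per mask.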

genomic data filtering

  • Convert plink2 PGEN format files to plink1 BED format (CONVERT_PGEN_TO_BED)
  • Identify differentially missing sites per binary phenotype in the filtered genomic data (TEST_GENOMIC_DATA_FOR_DIFFERENTIAL_MISSINGNESS)
  • Aggregate differentially missing sites for genomic data (AGG_DIFF_MISSING_GD)
  • Filter the genomic data to remove any variants that are identified as being differentially missing between cases and controls (FILTER_GENOMIC_DATA_FOR_DIFFERENTIAL_MISSINGNESS)
  • Convert plink2 PGEN format files to BGEN format (CONVERT_PGEN_TO_BGEN)
  • Create BGEN index files for a BGEN fileset (INDEX_BGEN_FILES)
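Differential missingness means a variant's genotypes are missing at different rates in cases versus controls, which can create spurious associations. The workflow applies a formal statistical test; the sketch below uses a simple missingness-rate difference with an illustrative threshold, on invented data, just to show the idea.

```python
# Sketch of flagging differentially missing variants between cases and controls.
# The real workflow applies a formal statistical test; a plain rate-difference
# threshold is used here only for illustration. None = missing genotype.
variants = {
    "chr21:100:A:T": {"cases": [0, 1, None, 0],       "controls": [0, 0, 1, 0]},
    "chr21:200:G:C": {"cases": [None, None, 1, None], "controls": [0, 0, 0, 1]},
}

def missing_rate(genotypes):
    return sum(g is None for g in genotypes) / len(genotypes)

THRESHOLD = 0.5  # illustrative cut-off, not the workflow's actual value

flagged = [
    v for v, groups in variants.items()
    if abs(missing_rate(groups["cases"]) - missing_rate(groups["controls"])) > THRESHOLD
]
print(flagged)  # only the second variant exceeds the threshold
```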

grm filtering

  • Filter pre-computed GRM files to include only samples in user cohort (FILTER_GRM_FILES_TO_COHORT)
  • Test pre-computed GRM files for differential missingness for each binary phenotype (TEST_GRM_FOR_DIFFERENTIAL_MISSINGNESS)
  • Aggregate GRM differentially missing sites (AGG_DIFF_MISSING_GRM)
  • Filter the GRM files to remove variants that are differentially missing in any phenotype (FILTER_GRM_FOR_DIFFERENTIAL_MISSINGNESS)

association testing

  • Create the group files used by SAIGE-GENE and REGENIE for testing (CREATE_GROUP_FILES)
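Conceptually, CREATE_GROUP_FILES maps each gene to the variants (and their annotations) it contains, which is what SAIGE-GENE and REGENIE consume for region-based tests. The sketch below builds such a grouping from an annotated variant list; the exact group-file formats are defined by those tools' own documentation, and all gene/variant names here are invented.

```python
# Sketch of grouping variants by gene for downstream group files.
# The actual group-file formats are defined by SAIGE-GENE and REGENIE;
# gene and variant identifiers below are invented.
variant_annotations = [
    ("chr21:100:A:T", "GENE1", "missense"),
    ("chr21:150:G:C", "GENE1", "pLoF"),
    ("chr21:900:T:G", "GENE2", "pLoF"),
]

groups = {}
for variant_id, gene, annotation in variant_annotations:
    groups.setdefault(gene, []).append((variant_id, annotation))

# Emit one line per gene listing its member variants.
for gene, members in sorted(groups.items()):
    variant_ids = " ".join(v for v, _ in members)
    print(f"{gene}\t{variant_ids}")
```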

regenie

  • Filter the pre-selected HQ SNP variants to common variants for REGENIE GRM (REGENIE_FILTER_GRM_FILES)
  • Run step 1 of REGENIE (REGENIE_STEP_1)
  • Run step 2 of REGENIE (REGENIE_STEP_2)
  • Aggregate REGENIE association summary statistics (REGENIE_AGGREGATE_SUMMARY_STATISTICS)
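The first REGENIE step above restricts the pre-selected HQ SNPs to common variants, i.e. those above a minor allele frequency (MAF) cut-off. A minimal sketch of a MAF filter, with an illustrative cut-off and invented genotypes (the workflow's actual threshold and file handling are not reproduced here):

```python
# Sketch of filtering variants to common ones by minor allele frequency (MAF).
# The workflow's actual cut-off and file formats are not reproduced here.
def maf(genotypes):
    """Minor allele frequency from diploid genotypes coded 0/1/2."""
    alt_alleles = sum(genotypes)
    total_alleles = 2 * len(genotypes)
    freq = alt_alleles / total_alleles
    return min(freq, 1 - freq)

variants = {
    "snp_common": [1, 2, 1, 0, 1],  # 5 alt alleles of 10 -> MAF 0.5
    "snp_rare":   [0, 0, 1, 0, 0],  # 1 alt allele of 10  -> MAF 0.1
}

MAF_CUTOFF = 0.2  # illustrative value only
common = [v for v, g in variants.items() if maf(g) >= MAF_CUTOFF]
print(common)  # ['snp_common']
```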

SAIGE-GENE

  • Subset GRM files to contain variants of MAC 10 or greater (SAIGE_SUBSET_FILES_FOR_SPARSE_GRM)
  • Create the sparse GRM used in SAIGE-GENE (SAIGE_CREATE_SPARSE_GRM)
  • Fit the null GLMM with SAIGE-GENE (SAIGE_FIT_NULLGLMM)
  • Run the SKAT-O, SKAT and burden tests with SAIGE-GENE (SAIGE_RUN_SPA_TESTS)
  • Aggregate SAIGE-GENE association summary statistics (SAIGE_AGGREGATE_SUMMARY_STATISTICS)
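The sparse-GRM subsetting step keeps variants with a minor allele count (MAC) of 10 or greater. Unlike MAF, MAC is an absolute count of the rarer allele, so the cut-off does not scale with cohort size. A minimal sketch with invented genotypes:

```python
# Sketch of subsetting variants to minor allele count (MAC) >= 10, as done
# before building the sparse GRM. Genotype data are invented.
def mac(genotypes):
    """Minor allele count from diploid genotypes coded 0/1/2."""
    alt = sum(genotypes)
    ref = 2 * len(genotypes) - alt
    return min(alt, ref)

variants = {
    "var_high_mac": [1] * 10 + [0, 0],  # 10 alt alleles, 14 ref -> MAC 10
    "var_low_mac":  [1] + [0] * 11,     # 1 alt allele           -> MAC 1
}

kept = [v for v, g in variants.items() if mac(g) >= 10]
print(kept)  # ['var_high_mac']
```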

Fisher's test (using Rvtests)

  • Extract the variant information in the PGEN files into a matrix for collapsing (EXTRACT_VARIANTS_FOR_COLLAPSING)
  • Create a pseudo VCF file from the variant matrix (CREATE_PER_CHROMOSOME_PSEUDO_VCF)
  • Merge the per-chromosome VCF files used in the Fisher's test branch of the workflow (CREATE_WHOLE_GENOME_PSEUDO_VCF)
  • Run Fisher's exact test on the pseudo VCF file with Rvtests (RUN_FISHERS_TEST_USING_RVTESTS)
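The final step runs Fisher's exact test (via Rvtests) on the collapsed data. For intuition, here is a self-contained two-sided Fisher's exact test on a 2x2 table of carrier counts; this is an illustrative reimplementation, not the Rvtests code, and the counts are invented.

```python
# Illustrative two-sided Fisher's exact test on a 2x2 table (not Rvtests code).
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def hypergeom_p(x):
        # P(X = x) under the hypergeometric distribution with these margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_observed = hypergeom_p(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    eps = 1e-12  # tolerate floating-point ties
    return sum(hypergeom_p(x) for x in range(lo, hi + 1)
               if hypergeom_p(x) <= p_observed + eps)

# Invented 2x2 table: rows = cases/controls, columns = carriers/non-carriers.
p = fisher_exact_two_sided(8, 2, 1, 9)
print(round(p, 4))  # small p-value: carrier status differs between groups
```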