AVT detailed input file overview

The input file is divided into seven sections. Each section covers a related set of inputs, e.g. the SAIGE-GENE settings.

Section 1 - Setup

These settings define who you are (for HPC usage), the name and genome build of the cohort, and a set of reference files that point to the genomic and functional annotation data for the cohort.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| lsf_project_code | The project code needed to run jobs on the HPC. Full list is located at LSF Project Codes. Only relevant when running in the RE, not CloudRE. | re_gecip_cardiovascular | Mandatory |
| cohort_name | Name for your cohort that is described by the variable 'cohort_file_lists'. | "aggV2" | Mandatory |
| cohort_genome_build | The genome build that your cohort is aligned to. This is used later to fetch the correct gene positions. | "GRCh38" | Mandatory |
| cohort_file_lists | This setting defines a cohort, with three associated file options. Each cohort described here must have a unique name.<br>'cohort_functional_annotation_file_list' is mandatory.<br>* cohort_bgen_file_list is a filepath to a 3-column tsv, where the columns are file.bgen, file.sample, file.bgen.bgi<br>* cohort_pgen_file_list is a filepath to a 3-column tsv, where the columns are file.pgen, file.psam, file.pvar<br>* cohort_functional_annotation_file_list is a filepath to a 2-column tsv, where the columns are file.vcf, file.vcf.{tbi,csi}<br>One of either 'cohort_bgen_file_list' or 'cohort_pgen_file_list' must be supplied, and the other must be left as "". | See example input settings for section 1. | Mandatory |
| run_saige | Controls whether to run SAIGE-GENE for burden testing. Independent of Regenie, i.e. both can be set to true. | true | Not Mandatory |
| run_regenie | Controls whether to run Regenie for burden testing. Independent of SAIGE-GENE, i.e. both can be set to true. | false | Not Mandatory |

If both 'run_saige' and 'run_regenie' are set to false, the workflow will not run any association analyses and will only output the results of the first step, i.e. a list of variants that pass functional filtering.
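The cohort_file_lists rules in the table above can be checked before a run is submitted. The following is a minimal sketch, assuming the inputs JSON has been loaded into a Python dict; check_cohort_file_lists is a hypothetical helper and is not part of the workflow.

```python
def check_cohort_file_lists(cohort_file_lists):
    """Return a list of problems found in a cohort_file_lists mapping.

    Hypothetical helper: it only mirrors the rules described in the table.
    """
    problems = []
    for cohort, files in cohort_file_lists.items():
        bgen = files.get("cohort_bgen_file_list", "")
        pgen = files.get("cohort_pgen_file_list", "")
        annotation = files.get("cohort_functional_annotation_file_list", "")
        # Exactly one of the bgen/pgen lists must be non-empty.
        if bool(bgen) == bool(pgen):
            problems.append(f"{cohort}: supply exactly one of the bgen or pgen file lists")
        # The functional annotation list is always required.
        if not annotation:
            problems.append(f"{cohort}: functional annotation file list is missing")
    return problems


example = {
    "aggV2": {
        "cohort_bgen_file_list": "",
        "cohort_pgen_file_list": "aggV2_pgen_file_list.txt",
        "cohort_functional_annotation_file_list": "aggV2_functional_annotation_list.txt",
    },
    # Deliberately invalid entry: both genotype lists set, no annotation list.
    "broken_cohort": {
        "cohort_bgen_file_list": "a.txt",
        "cohort_pgen_file_list": "b.txt",
        "cohort_functional_annotation_file_list": "",
    },
}
problems = check_cohort_file_lists(example)
for problem in problems:
    print(problem)
```

Only the deliberately invalid broken_cohort entry is reported; the aggV2 entry satisfies both rules.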

Example input settings for section 1
"master_aggregate_variant_testing.lsf_project_code": "bio",
"master_aggregate_variant_testing.cohort_name": "aggV2",
"master_aggregate_variant_testing.cohort_genome_build": "GRCh38",

"master_aggregate_variant_testing.cohort_file_lists": {
    "aggV2": {
        "cohort_bgen_file_list": "",
        "cohort_pgen_file_list": "aggV2_pgen_file_list.txt",
        "cohort_functional_annotation_file_list": "aggV2_functional_annotation_list.txt"
    },
    "custom_cohort": {
        "cohort_bgen_file_list": "custom_cohort_bgen_list.txt",
        "cohort_pgen_file_list": "",
        "cohort_functional_annotation_file_list": "custom_cohort_functional_annotation_list.txt"
    }
},

"master_aggregate_variant_testing.run_saige": true,
"master_aggregate_variant_testing.run_regenie": false,

Section 2 - Input and output definition

This section defines all the input and output files that you need to change or be aware of for running the workflow.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| output_file_name | The suffix for the output test statistics that are produced by SAIGE-GENE or Regenie. The workflow will prefix this name with the phenotype, program and mask used. | "output.txt" | Mandatory |
| phenotype_input_file | Filepath that points to the phenotype file that you wish to use for the analysis. This should be a table using the column separators of your choice, and should contain a column of sample identifiers (e.g. platekey or IID), one or more phenotype columns, and all columns that you want to use as covariates.<br>For binary traits, cases should be coded as 1 and controls should be coded as 0.<br>This workflow might work for quantitative traits, but they have not been tested. | "/path/to/pkd_cohort.txt" | Mandatory |
| phenotype_column_delimiter | The delimiter used to separate columns in your phenotype file. Good choices for readability are spaces or tabs. | "\t" | Mandatory |
| phenotype_sample_column | The column in the phenotype file that lists the sample identifiers for each participant in the cohort. | "plate_key" | Mandatory |
| phenotype_array_for_testing | An array that lists all the different phenotypes that you wish to test in your cohort. Can have one or more comma separated entries. Each phenotype is tested in parallel. Each phenotype in the array must correspond to a column in your phenotype file. | ["pkd"] | Mandatory |
| sex_column | The name of the column that records the sex of each sample. These are coded values that typically encode male as 0 and female as 1. | "sex" | Mandatory |
| male_coding | How males are coded in the sex column. | 0 | Mandatory |
| covarColList | A comma separated list of all covariates that you wish to include in the model for testing. Each covariate listed must be a column in the phenotype file. Not all covariate columns in the phenotype file need to be used as covariates in the test. | "age,sex,pc1,pc2" | Mandatory |
| chrX_female_only_cohort | deprecate? | false | Mandatory |
| chrX_male_only_cohort | deprecate? | false | Mandatory |
| chromosomes_input_file | A filepath that points to the chromosome input file.<br>Please only provide ONE of chromosome, gene, region or variant input files. | "/path/to/chromosomes.txt" | Not Mandatory |
| genes_input_file | A filepath that points to the genes input file.<br>Please only provide ONE of chromosome, gene, region or variant input files. | "/path/to/genes.txt" | Not Mandatory |
| coordinates_input_file | A filepath that points to the region input file.<br>Please only provide ONE of chromosome, gene, region or variant input files. | "/path/to/coordinates.bed" | Not Mandatory |
| variant_input_file | A filepath that points to the variant input file.<br>Please only provide ONE of chromosome, gene, region or variant input files. | "/path/to/variants.tsv" | Not Mandatory |
| gene_exclusion_file | A filepath that points to a gene exclusion file. | "/path/to/exclude_genes.tsv" | Not Mandatory |
| variant_exclusion_file | A filepath that points to a variant exclusion file. | "/path/to/exclude_variants.tsv" | Not Mandatory |
| precomputed_plink_files_for_grm | An array that lists the filepaths to the .bed, .bim and .fam files that contain the high quality common and rare variants that are needed by SAIGE-GENE for GRM creation and variance estimation, and by REGENIE for PRS estimation. Details on how these were created for aggV2 can be found here. | See the example input for section 2 | Mandatory |

Set all unneeded file inputs to null.

Example input settings for section 2
"master_aggregate_variant_testing.output_file_name": "output.txt",

"master_aggregate_variant_testing.phenotype_input_file": "/re_gecip/BRS/example.pheno",
"master_aggregate_variant_testing.phenotype_column_delimiter": "\t",
"master_aggregate_variant_testing.phenotype_sample_column": "plate_key",
"master_aggregate_variant_testing.phenotype_array_for_testing": ["pkd"],
"master_aggregate_variant_testing.sex_column": "sex",
"master_aggregate_variant_testing.male_coding": 0,
"master_aggregate_variant_testing.covarColList": "ancestry,age,sex,age2,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20",

"master_aggregate_variant_testing.chrX_female_only_cohort": false,
"master_aggregate_variant_testing.chrX_male_only_cohort": false,

"master_aggregate_variant_testing.part_1_variant_selection.chromosomes_input_file": "input_user_data/chromosomes.txt",
"master_aggregate_variant_testing.part_1_variant_selection.genes_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.coordinates_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.gene_exclusion_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_exclusion_file": null,

"master_aggregate_variant_testing.precomputed_plink_files_for_grm": [
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bed",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bim",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.fam"
],
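
Several of the section 2 settings must agree with each other and with the phenotype file. The sketch below shows one way to check this ahead of a run, assuming the inputs have been loaded into a Python dict and the phenotype file uses a single-character delimiter. check_section2 is a hypothetical helper, not part of the workflow, and it treats more than one selection input file as the only selection error.

```python
import csv
import io

PREFIX = "master_aggregate_variant_testing."
SELECTION = PREFIX + "part_1_variant_selection."
SELECTION_KEYS = [
    SELECTION + "chromosomes_input_file",
    SELECTION + "genes_input_file",
    SELECTION + "coordinates_input_file",
    SELECTION + "variant_input_file",
]


def check_section2(inputs, phenotype_handle):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # At most one of the chromosome/gene/region/variant inputs should be set.
    supplied = [key for key in SELECTION_KEYS if inputs.get(key)]
    if len(supplied) > 1:
        problems.append("more than one selection input file supplied")
    # Read only the header row of the phenotype file.
    delimiter = inputs[PREFIX + "phenotype_column_delimiter"]
    header = next(csv.reader(phenotype_handle, delimiter=delimiter))
    needed = [inputs[PREFIX + "phenotype_sample_column"], inputs[PREFIX + "sex_column"]]
    needed += inputs[PREFIX + "phenotype_array_for_testing"]
    needed += inputs[PREFIX + "covarColList"].split(",")
    for column in dict.fromkeys(needed):  # de-duplicate while keeping order
        if column not in header:
            problems.append(f"column '{column}' not found in phenotype file")
    return problems


inputs = {
    PREFIX + "phenotype_column_delimiter": "\t",
    PREFIX + "phenotype_sample_column": "plate_key",
    PREFIX + "phenotype_array_for_testing": ["pkd"],
    PREFIX + "sex_column": "sex",
    PREFIX + "covarColList": "age,sex,PC1",
    SELECTION + "chromosomes_input_file": "input_user_data/chromosomes.txt",
    SELECTION + "genes_input_file": None,
}
# An in-memory stand-in for a phenotype file that lacks the PC1 column.
phenotype_file = io.StringIO("plate_key\tpkd\tsex\tage\n")
problems = check_section2(inputs, phenotype_file)
print(problems)
```

In practice the handle would come from opening the file named in phenotype_input_file rather than from io.StringIO.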

Section 3 - Functional filtering

This section covers the settings used to filter variants based on their functional annotation attributes. Only variants that pass at least one of the masks defined here will be included in downstream analysis.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| use_snps_only | A true/false flag to set if you want to analyse SNP variants only, or if you want to include INDEL variants as well. | true | Mandatory |
| use_vep_filtering | A true/false flag to set if you want to perform functional annotation filtering. Generally set to true unless you already have a variant list that you do not wish to perform any additional filtering on. | true | Mandatory |
| use_ensembl_protein_coding_genes_only | A true/false flag to set if you want to restrict your analysis to genes that have the biotype protein_coding in Ensembl. Setting to false will include genes of all biotypes, e.g. lncRNA. | true | Mandatory |
| use_genes_only | A true/false flag to set if you are analysing Ensembl defined genes only, or you wish to include custom regions that lie outside / overlap genes. | true | Mandatory |
| functional_annotation_filter_masks | This setting details the various 'masks' that are used to bin each variant. The format for each mask section is as follows:<br>* Mask name<br>* AND filter settings - any variant will have to pass ALL filters in this section to be included (zero or more specific filters)<br>* OR filter settings - any variant will have to pass ANY filter in this section to be included (zero or more specific filters)<br>The rules that govern the functional filtering are:<br>* Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by bcftools +split-vep internally.<br>* The AND and OR filters are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block.<br>* If the annotation field annotation and the VEP severity field vep_severity_to_include are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow.<br>* vep_severity_to_include operates in an identical manner to bcftools +split-vep -s. You can specify an exact consequence, e.g. stop_gained, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. missense+, to include all variants that are at least as severe as missense. If it is left blank, then all consequences will be considered.<br>* include_missing can be set to no or yes. If set to no then only variants that pass the filter will be included. If set to yes then variants that pass the filter and variants that are annotated with missing entries by VEP, i.e. '.', '-' or '', will be included. This can be useful for example when wanting to filter on CADD_PHRED scores and also include INDELs, as many INDELs do not have a CADD_PHRED score.<br>* Comparators can be >, >=, ==, <=, < for float values, or == for string values. The workflow will exit with an error if the wrong comparator type is detected.<br>Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. Having the same name for multiple filters means that some filters will get skipped, and lead to confusing output. | See the example input in section 3 | Not mandatory |
| mask_rank | This setting defines the order of the strictness for each mask, so that a variant is assigned to the correct mask in the case that it passes multiple masks. The lower the number, the more strict the mask is. | See the example input in section 3 | Not mandatory |
| regenie_masks | This setting defines which masks can be grouped together when testing with REGENIE. | See the example input in section 3 | Not mandatory |

Example input settings for section 3
"master_aggregate_variant_testing.part_1_variant_selection.use_snps_only": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_vep_filtering": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_ensembl_protein_coding_genes_only": true,
"master_aggregate_variant_testing.use_genes_only": true,

"master_aggregate_variant_testing.part_1_variant_selection.functional_annotation_filter_masks": {
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
},

"master_aggregate_variant_testing.part_1_variant_selection.mask_rank": {
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
},

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_masks": {
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
},
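
The AND/OR rules and the role of mask_rank can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, vep_severity_to_include is not modelled (so the missense+ restriction is ignored), and the workflow's own filtering, which relies on bcftools +split-vep and its own scripts, may differ in detail.

```python
import operator

COMPARATORS = {
    ">": operator.gt, ">=": operator.ge, "==": operator.eq,
    "<=": operator.le, "<": operator.lt,
}
MISSING = {".", "-", ""}  # values treated as missing annotations


def is_skipped(flt):
    # A filter with no annotation and no severity is skipped entirely.
    return not flt["annotation"] and not flt["vep_severity_to_include"]


def passes_filter(annotations, flt):
    """Evaluate one filter against a dict of annotation name -> string value."""
    if not flt["annotation"]:
        return True  # severity-only filter: severity is not modelled here
    value = annotations.get(flt["annotation"], "")
    if value in MISSING:
        return flt["include_missing"] == "yes"
    compare = COMPARATORS[flt["comparator"]]
    try:
        return compare(float(value), float(flt["condition"]))
    except ValueError:
        # Non-numeric values are compared as strings, which only supports ==.
        if flt["comparator"] != "==":
            raise ValueError("string annotations only support the == comparator")
        return value == flt["condition"]


def passes_mask(annotations, mask):
    """All active AND filters must pass, plus at least one active OR filter (if any)."""
    and_filters = [f for f in mask["and_mask"].values() if not is_skipped(f)]
    or_filters = [f for f in mask["or_mask"].values() if not is_skipped(f)]
    and_ok = all(passes_filter(annotations, f) for f in and_filters)
    or_ok = any(passes_filter(annotations, f) for f in or_filters) if or_filters else True
    return and_ok and or_ok


def assign_mask(annotations, masks, mask_rank):
    """Return the strictest (lowest-ranked) mask the variant passes, or None."""
    passed = [name for name, mask in masks.items() if passes_mask(annotations, mask)]
    return min(passed, key=mask_rank.get) if passed else None


EMPTY = {"annotation": "", "comparator": "", "condition": "",
         "include_missing": "", "vep_severity_to_include": ""}
LOF_HC = {"annotation": "LoF", "comparator": "==", "condition": "HC",
          "include_missing": "no", "vep_severity_to_include": ""}
CADD = {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10",
        "include_missing": "no", "vep_severity_to_include": "missense+"}
masks = {
    "LoF": {"and_mask": {"filter1": LOF_HC}, "or_mask": {"filter1": EMPTY}},
    "missense": {"and_mask": {"filter1": EMPTY},
                 "or_mask": {"filter1": CADD, "filter2": LOF_HC}},
}
mask_rank = {"LoF": 1, "missense": 2}

variant_a = {"LoF": "HC", "CADD_PHRED": "35"}   # passes both masks
variant_b = {"LoF": ".", "CADD_PHRED": "12.5"}  # passes the missense mask only
variant_c = {"LoF": ".", "CADD_PHRED": "."}     # passes neither mask
for variant in (variant_a, variant_b, variant_c):
    print(assign_mask(variant, masks, mask_rank))
```

The first variant passes both masks and is assigned to LoF because that mask has the lower rank; the third has only missing annotations with include_missing set to no, so it is assigned to no mask.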

Section 4 - Site wide variant filtering

This section describes the filters used for site-wide filtering of variants.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| differential_missingness_pvalue | The p-value threshold for testing differential missingness between cases and controls. Sites that have a smaller p-value than this setting will be removed. | 10e-5 | Mandatory |
| upstream_downstream_length | A value in basepairs that allows for padding of input regions. For example, you may wish to pad genes by 10kb to potentially capture regulatory regions. Set to 0 by default. | 10000 | Mandatory |
| use_max_mac | A true/false setting on whether to use the minor allele count as the upper bound for rare variant filtering. Useful for capturing very rare variants in large cohorts. | false | Mandatory |
| final_max_mac_to_exclude | The maximum MAC for variants to include, if using the use_max_mac setting. Variants with a higher MAC than this value will be removed. | 20 | Mandatory |
| use_max_maf | A true/false setting on whether to use the minor allele frequency as the upper bound for rare variant filtering. Useful for capturing rare variants in small cohorts. | true | Mandatory |
| final_max_maf_to_exclude | The maximum MAF for variants to include, if using the use_max_maf setting. Variants with a higher MAF than this value will be removed. | 0.005 | Mandatory |
| max_missingness | The upper threshold for missing data for each site. Sites that have a higher percentage of missing values than this value will be excluded. | 0.05 | Mandatory |

Use only one of MAC or MAF for filtering.

MAF filtering is generally better for small cohorts, where a MAC of 20 might actually correspond to a MAF of 1% or more.
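
To see why, note that a diploid autosomal site genotyped in N samples with no missing calls carries 2N alleles, so MAF = MAC / (2N). The short sketch below works this through for a few cohort sizes; the helper name is illustrative and the calculation ignores missingness and sex chromosomes.

```python
def mac_to_maf(mac, n_samples):
    """Convert a minor allele count to a frequency, assuming diploid
    autosomal genotypes and no missing calls (2 alleles per sample)."""
    return mac / (2 * n_samples)


for n_samples in (1_000, 10_000, 50_000):
    maf = mac_to_maf(20, n_samples)
    print(f"N={n_samples}: MAC 20 corresponds to MAF {maf:.4%}")
```

With 1,000 samples a MAC of 20 is already a MAF of 1%, whereas with 50,000 samples the same count is a MAF of 0.02%.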

Example input settings for section 4
"master_aggregate_variant_testing.differential_missingness_pvalue": "10e-5",
"master_aggregate_variant_testing.part_1_variant_selection.upstream_downstream_length": 0,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_mac": false,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_mac_to_exclude": 20,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_maf": true,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_maf_to_exclude": 0.005,
"master_aggregate_variant_testing.part_1_variant_selection.max_missingness": 0.05,

Section 5 - SAIGE-GENE settings

These settings allow you to tweak various aspects of SAIGE-GENE at runtime. Most settings have been left at the default value recommended by the developers. We recommend only changing these settings if you have used SAIGE-GENE in the past. For more information on these options please consult the SAIGE-GENE documentation.

Settings that exist in the program but are not listed in the table below, such as input and output files, are handled internally.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| LOCO | A true/false setting for using leave-one-chromosome-out when testing with SAIGE-GENE. | "FALSE" | Mandatory |
| saige_step0_options | The settings used to build the sparse GRM that is used for SAIGE-GENE. This should be left as default. | See the example input in section 5 | Mandatory |
| saige_step1_options | The settings that are used to fit the null GLMM in SAIGE-GENE. Most of these can be left as default.<br>* To perform rare variant testing, IsSparseKin should be left as TRUE.<br>* MaleCode and FemaleCode should be set to the same values as in your phenotype file (0,1 by default).<br>* traitType can be set to quantitative and invNormalize can be set to true to test for quantitative traits. Note that this setting has not been tested in this workflow, and may not work. | See the example input in section 5 | Mandatory |
| saige_step2_options | The settings that are used for running the rare variant tests in SAIGE-GENE. Most can be left as default.<br>* method_to_CollapseUltraRare implements SAIGE-GENE+ when set to absence_or_presence. To disable this, set to '' (an empty string). | See the example input in section 5 | Mandatory |

Example input settings for section 5
"master_aggregate_variant_testing.part_2_saige_testing.LOCO": "FALSE",

"master_aggregate_variant_testing.part_2_saige_testing.saige_step0_options": [
    "--relatednessCutoff=0.125",
    "--numRandomMarkerforSparseKin=2000"
],

"master_aggregate_variant_testing.part_2_saige_testing.saige_step1_options": [
    "--IsSparseKin='TRUE'",
    "--traitType='binary'",
    "--invNormalize='FALSE'",
    "--outputPrefix='null_glmm'",
    "--outputPrefix_varRatio='null_glmm_var_ratio'",
    "--minMAFforGRM=0.01",
    "--sexCol=''",
    "--FemaleCode='1'",
    "--FemaleOnly='FALSE'",
    "--MaleCode='0'",
    "--MaleOnly='FALSE'",
    "--noEstFixedEff='FALSE'",
    "--tol=0.02",
    "--maxiter=20",
    "--tolPCG=1e-5",
    "--maxiterPCG=500",
    "--SPAcutoff=2",
    "--numRandomMarkerforVarianceRatio=30",
    "--skipModelFitting='FALSE'",
    "--tauInit='0,0'",
    "--traceCVcutoff=0.0025",
    "--ratioCVcutoff=0.001",
    "--isCateVarianceRatio='TRUE'",
    "--isCovariateTransform='TRUE'",
    "--cateVarRatioMinMACVecExclude='0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--cateVarRatioMaxMACVecInclude='1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--useSparseSigmaforInitTau='FALSE'",
    "--minCovariateCount=-1",
    "--includeNonautoMarkersforVarRatio='FALSE'",
    "--memoryChunk=2"
],
"master_aggregate_variant_testing.part_2_saige_testing.saige_step2_options": [
    "--IsDropMissingDosages='FALSE'",
    "--IsSparse='TRUE'",
    "--IsOutputAFinCaseCtrl='TRUE'",
    "--IsOutputNinCaseCtrl='TRUE'",
    "--IsOutputHetHomCountsinCaseCtrl='TRUE'",
    "--IsSingleVarinGroupTest='TRUE'",
    "--IsOutputPvalueNAinGroupTestforBinary='TRUE'",
    "--IsAccountforCasecontrolImbalanceinGroupTest='TRUE'",
    "--IsOutputBETASEinBurdenTest='TRUE'",
    "--is_rewrite_XnonPAR_forMales='FALSE'",
    "--minMAF=0",
    "--maxMAFforGroupTest=0.5",
    "--minMAC=0",
    "--numLinesOutput=10000",
    "--condition=''",
    "--kernel='linear.weighted'",
    "--method='optimal.adj'",
    "--weights.beta.rare=1,25",
    "--weights.beta.common=1,25",
    "--weightMAFcutoff=0.01",
    "--r.corr=0",
    "--dosageZerodCutoff=0.2",
    "--weightsIncludeinGroupFile='FALSE'",
    "--weights_for_G2_cond=''",
    "--SPAcutoff=2",
    "--chrom=chr1",
    "--method_to_CollapseUltraRare='absence_or_presence'",
    "--MACCutoff_to_CollapseUltraRare=10",
    "--DosageCutoff_for_UltraRarePresence=0.5"
],
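
Because each step's options are a plain array of --name=value strings, changing one option means replacing one array entry. A minimal sketch of doing that programmatically, assuming the input JSON has been loaded into Python; set_option is a hypothetical helper and not part of the workflow.

```python
def set_option(options, name, value):
    """Return a copy of an options list with --name=value replaced in place,
    or appended if the option is not present yet."""
    prefix = f"--{name}="
    new_entry = f"{prefix}{value}"
    updated = []
    found = False
    for entry in options:
        if entry.startswith(prefix):
            updated.append(new_entry)
            found = True
        else:
            updated.append(entry)
    if not found:
        updated.append(new_entry)
    return updated


# A shortened stand-in for saige_step2_options.
step2 = [
    "--minMAF=0",
    "--method='optimal.adj'",
    "--method_to_CollapseUltraRare='absence_or_presence'",
]
# Disable SAIGE-GENE+ collapsing by setting the option to an empty string.
updated = set_option(step2, "method_to_CollapseUltraRare", "''")
print(updated)
```

Matching on the full --name= prefix keeps --method and --method_to_CollapseUltraRare distinct, and the original list is left unchanged.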

Section 6 - Regenie settings

These settings allow you to tweak various aspects of Regenie at runtime. Most settings have been left as defaults. Check https://rgcgithub.github.io/regenie/options/#burden-testing for a full list of options.

| Setting name | Description | Example value | Mandatory? |
| --- | --- | --- | --- |
| regenie_step1_options | Options for step 1 of REGENIE. | See the example inputs in section 6 | Mandatory |
| regenie_step2_options | Options for step 2 of REGENIE. | See the example inputs in section 6 | Mandatory |

Example input settings for section 6
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step1_options": [
    "--step 1",
    "--bt",
    "--lowmem",
    "--lowmem-prefix .",
    "--bsize 1000",
    "--ref-first",
    "--loocv"
],
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step2_options": [
    "--step 2",
    "--minMAC 1",
    "--aaf-bins 0.005",
    "--build-mask 'max'",
    "--bt",
    "--firth --approx",
    "--firth-se",
    "--bsize 1000",
    "--ref-first",
    "--singleton-carrier"
],
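
Each array entry is a fragment of the Regenie command line, and an entry may hold a flag with its value (--bsize 1000) or several flags at once (--firth --approx). The sketch below, which assumes the entries follow POSIX shell quoting, shows which tokens each entry contributes. to_argv is a hypothetical helper; how the workflow itself assembles the command is not covered here.

```python
import shlex


def to_argv(options):
    """Split each option entry into shell-style tokens and flatten the result."""
    argv = []
    for entry in options:
        argv.extend(shlex.split(entry))
    return argv


# A shortened stand-in for regenie_step2_options.
step2 = ["--step 2", "--build-mask 'max'", "--firth --approx", "--bsize 1000"]
tokens = to_argv(step2)
print(tokens)
```

Note that the quotes around 'max' are consumed by the shell-style split, so the value reaches the program as max.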

Section 7 - Workflow resources

This section details workflow resources. You should NOT change these, with the possible exception of the CPU and memory settings if the workflow is crashing due to a lack of resources.

Warning

Changing any setting apart from CPU and memory settings will almost guarantee that the workflow will break. Lowering CPU and memory requirements too far will also break the workflow.
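
If a task fails for lack of memory, one cautious approach is to raise only the memory settings and leave everything else untouched. A minimal sketch, assuming the inputs are loaded as a Python dict and that memory keys end in _memory as they do in the example for this section; scale_memory is a hypothetical helper.

```python
def scale_memory(inputs, factor):
    """Return a copy of inputs with every *_memory value multiplied by factor.

    All other settings (containers, resource files, thread counts) are copied
    through unchanged.
    """
    scaled = {}
    for key, value in inputs.items():
        if key.endswith("_memory"):
            scaled[key] = int(value * factor)
        else:
            scaled[key] = value
    return scaled


prefix = "master_aggregate_variant_testing.part_2_saige_testing."
inputs = {
    prefix + "saige_step_2_threads": 4,
    prefix + "saige_step_2_memory": 64000,
}
bumped = scale_memory(inputs, 1.5)
print(bumped)
```

The thread count is left as it was, and the original dict is not modified.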

Example input settings for section 7
"master_aggregate_variant_testing.tools_container": "quay.io/alexander-stuckey/gwas_avt",
"master_aggregate_variant_testing.part_2_saige_testing.saige_container": "quay.io/alexander-stuckey/saige:0.44.6.5",
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_container": "quay.io/alexander-stuckey/regenie",

"master_aggregate_variant_testing.genome_build_to_ensembl_coordinates_files": {
    "GRCh37": "resources/Ensembl_87_genes_coordinates_GRCh37.tsv",
    "GRCh38": "resources/Ensembl_98_genes_coordinates_GRCh38.tsv"
},
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale": "resources/vep_severity_scale_2020_bcftools_splitvep.txt",

"master_aggregate_variant_testing.pheno_plink_helper_python_script": "resources/pheno_helper_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.python_input_filtering_script": "resources/input_filtering.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale_ensembl_translation": "resources/vep_severity_scale_translation_2020_Ensembl_to_bcftools_splitvep.tsv",
"master_aggregate_variant_testing.part_1_variant_selection.vep_filtering_python_script": "resources/vep_filtering_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_bcftools_translation_and_ranking": "resources/vep_severity_bcftools_translation_and_ranking.tsv",

"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_threads": 4,
"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_memory": 16000,

"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_cpus": 1,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_memory": 4000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_memory": 20000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000,

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_cpus": 1,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_memory": 8000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_cpus": 8,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_memory": 32000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_memory": 16000

The full input file

``` json
{
"master_aggregate_variant_testing.lsf_project_code": "bio",

"master_aggregate_variant_testing.cohort_name": "aggV2",
"master_aggregate_variant_testing.cohort_genome_build": "GRCh38",

"master_aggregate_variant_testing.cohort_file_lists": {
    "aggV2": {
        "cohort_bgen_file_list": "example_bgen_list.txt",
        "cohort_pgen_file_list": "",
        "cohort_functional_annotation_file_list": "example_functional_list.txt"
    }
},

"master_aggregate_variant_testing.run_saige": true,
"master_aggregate_variant_testing.run_regenie": false,

"master_aggregate_variant_testing.output_file_name": "output.txt",

"master_aggregate_variant_testing.phenotype_input_file": "/re_gecip/BRS/example.pheno",
"master_aggregate_variant_testing.phenotype_column_delimiter": "\t",
"master_aggregate_variant_testing.phenotype_sample_column": "plate_key",
"master_aggregate_variant_testing.phenotype_array_for_testing": ["covid"],
"master_aggregate_variant_testing.sex_column": "sex",
"master_aggregate_variant_testing.male_coding": 0,
"master_aggregate_variant_testing.covarColList": "ancestry,age,sex,age2,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20",

"master_aggregate_variant_testing.chrX_female_only_cohort": false,
"master_aggregate_variant_testing.chrX_male_only_cohort": false,

"master_aggregate_variant_testing.part_1_variant_selection.chromosomes_input_file": "input_user_data/chromosomes.txt",
"master_aggregate_variant_testing.part_1_variant_selection.genes_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.coordinates_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.gene_exclusion_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_exclusion_file": null,

"master_aggregate_variant_testing.part_1_variant_selection.use_snps_only": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_vep_filtering": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_ensembl_protein_coding_genes_only": true,
"master_aggregate_variant_testing.use_genes_only": true,

"master_aggregate_variant_testing.part_1_variant_selection.functional_annotation_filter_masks": {
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
},

"master_aggregate_variant_testing.part_1_variant_selection.mask_rank": {
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
},

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_masks": {
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
},

"master_aggregate_variant_testing.differential_missingness_pvalue": "10e-5",
"master_aggregate_variant_testing.part_1_variant_selection.upstream_downstream_length": 0,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_mac": false,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_mac_to_exclude": 20,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_maf": true,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_maf_to_exclude": 0.005,
"master_aggregate_variant_testing.part_1_variant_selection.max_missingness": 0.05,

"master_aggregate_variant_testing.precomputed_plink_files_for_grm": [
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bed",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bim",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.fam"
],

"master_aggregate_variant_testing.part_2_saige_testing.LOCO": "FALSE",

"master_aggregate_variant_testing.part_2_saige_testing.saige_step0_options": [
    "--relatednessCutoff=0.125",
    "--numRandomMarkerforSparseKin=2000"
],

"master_aggregate_variant_testing.part_2_saige_testing.saige_step1_options": [
    "--IsSparseKin='TRUE'",
    "--traitType='binary'",
    "--invNormalize='FALSE'",
    "--outputPrefix='null_glmm'",
    "--outputPrefix_varRatio='null_glmm_var_ratio'",
    "--minMAFforGRM=0.01",
    "--sexCol=''",
    "--FemaleCode='1'",
    "--FemaleOnly='FALSE'",
    "--MaleCode='0'",
    "--MaleOnly='FALSE'",
    "--noEstFixedEff='FALSE'",
    "--tol=0.02",
    "--maxiter=20",
    "--tolPCG=1e-5",
    "--maxiterPCG=500",
    "--SPAcutoff=2",
    "--numRandomMarkerforVarianceRatio=30",
    "--skipModelFitting='FALSE'",
    "--tauInit='0,0'",
    "--traceCVcutoff=0.0025",
    "--ratioCVcutoff=0.001",
    "--isCateVarianceRatio='TRUE'",
    "--isCovariateTransform='TRUE'",
    "--cateVarRatioMinMACVecExclude='0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--cateVarRatioMaxMACVecInclude='1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--useSparseSigmaforInitTau='FALSE'",
    "--minCovariateCount=-1",
    "--includeNonautoMarkersforVarRatio='FALSE'",
    "--memoryChunk=2"
],

"master_aggregate_variant_testing.part_2_saige_testing.saige_step2_options": [
    "--IsDropMissingDosages='FALSE'",
    "--IsSparse='TRUE'",
    "--IsOutputAFinCaseCtrl='TRUE'",
    "--IsOutputNinCaseCtrl='TRUE'",
    "--IsOutputHetHomCountsinCaseCtrl='TRUE'",
    "--IsSingleVarinGroupTest='TRUE'",
    "--IsOutputPvalueNAinGroupTestforBinary='TRUE'",
    "--IsAccountforCasecontrolImbalanceinGroupTest='TRUE'",
    "--IsOutputBETASEinBurdenTest='TRUE'",
    "--is_rewrite_XnonPAR_forMales='FALSE'",
    "--minMAF=0",
    "--maxMAFforGroupTest=0.5",
    "--minMAC=0",
    "--numLinesOutput=10000",
    "--condition=''",
    "--kernel='linear.weighted'",
    "--method='optimal.adj'",
    "--weights.beta.rare=1,25",
    "--weights.beta.common=1,25",
    "--weightMAFcutoff=0.01",
    "--r.corr=0",
    "--dosageZerodCutoff=0.2",
    "--weightsIncludeinGroupFile='FALSE'",
    "--weights_for_G2_cond=''",
    "--SPAcutoff=2",
    "--chrom=chr1",
    "--method_to_CollapseUltraRare='absence_or_presence'",
    "--MACCutoff_to_CollapseUltraRare=10",
    "--DosageCutoff_for_UltraRarePresence=0.5"
],

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step1_options": [
    "--step 1",
    "--bt",
    "--lowmem",
    "--lowmem-prefix .",
    "--bsize 1000",
    "--ref-first",
    "--loocv"
],

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step2_options": [
    "--step 2",
    "--minMAC 1",
    "--aaf-bins 0.005",
    "--build-mask 'max'",
    "--bt",
    "--firth --approx",
    "--firth-se",
    "--bsize 1000",
    "--ref-first",
    "--singleton-carrier"
],

"master_aggregate_variant_testing.tools_container": "quay.io/alexander-stuckey/gwas_avt",
"master_aggregate_variant_testing.part_2_saige_testing.saige_container": "quay.io/alexander-stuckey/saige:0.44.6.5",
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_container": "quay.io/alexander-stuckey/regenie",

"master_aggregate_variant_testing.genome_build_to_ensembl_coordinates_files": {
    "GRCh37": "resources/Ensembl_87_genes_coordinates_GRCh37.tsv",
    "GRCh38": "resources/Ensembl_98_genes_coordinates_GRCh38.tsv"
},
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale": "resources/vep_severity_scale_2020_bcftools_splitvep.txt",

"master_aggregate_variant_testing.pheno_plink_helper_python_script": "resources/pheno_helper_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.python_input_filtering_script": "resources/input_filtering.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale_ensembl_translation": "resources/vep_severity_scale_translation_2020_Ensembl_to_bcftools_splitvep.tsv",
"master_aggregate_variant_testing.part_1_variant_selection.vep_filtering_python_script": "resources/vep_filtering_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_bcftools_translation_and_ranking": "resources/vep_severity_bcftools_translation_and_ranking.tsv",

"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_threads": 4,
"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_memory": 16000,

"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_cpus": 1,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_memory": 4000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_memory": 20000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000,

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_cpus": 1,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_memory": 8000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_cpus": 8,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_memory": 32000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_memory": 16000

}

```
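
Before submitting a run, it can help to sanity-check the parts of the inputs file that are easy to get wrong by hand. The sketch below is not part of the AVT workflow: the function names `check_plink_trio` and `find_bad_resources` are hypothetical, and the rules they apply are assumptions drawn only from the example above (the GRM list holds one `.bed`/`.bim`/`.fam` set sharing a prefix, and every `_threads`, `_cpus` and `_memory` setting is a positive integer).

```python
import json
from pathlib import PurePosixPath

# Key prefix used throughout the example inputs file above.
PREFIX = "master_aggregate_variant_testing."


def check_plink_trio(paths):
    """Return the shared file prefix if `paths` is one .bed/.bim/.fam set.

    Assumption: the precomputed GRM input is a single PLINK file set,
    as in the example above. Raises ValueError otherwise.
    """
    by_suffix = {
        PurePosixPath(p).suffix: str(PurePosixPath(p).with_suffix(""))
        for p in paths
    }
    if len(paths) != 3 or set(by_suffix) != {".bed", ".bim", ".fam"}:
        raise ValueError("expected exactly one .bed, one .bim and one .fam file")
    stems = set(by_suffix.values())
    if len(stems) != 1:
        raise ValueError("the .bed/.bim/.fam files do not share a prefix")
    return stems.pop()


def find_bad_resources(inputs):
    """Return keys ending in _threads/_cpus/_memory whose value is not a positive int."""
    bad = []
    for key, value in inputs.items():
        if key.endswith(("_threads", "_cpus", "_memory")):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                bad.append(key)
    return bad


if __name__ == "__main__":
    # A tiny in-memory stand-in for the inputs file; the paths are made up.
    example = json.loads("""
    {
      "master_aggregate_variant_testing.precomputed_plink_files_for_grm": [
        "/path/to/plink_set.bed",
        "/path/to/plink_set.bim",
        "/path/to/plink_set.fam"
      ],
      "master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
      "master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000
    }
    """)
    print(check_plink_trio(example[PREFIX + "precomputed_plink_files_for_grm"]))
    print(find_bad_resources(example))
```

To check a real file, replace the in-memory example with `json.load(open(...))` on your own inputs file. Because the file must be valid JSON, `json.load` will also catch trailing commas or unbalanced brackets introduced while editing the option lists.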