AVT detailed input file overview

The input file is broken down into seven sections. Each section covers a related set of inputs, e.g. the SAIGE-GENE settings.

Section 1 - Setup

These settings define who you are (for HPC usage), the name and genome build of the cohort, and a set of reference files that point to the genomic and functional annotation data for the cohort.

Setting name	Description	Example value	Mandatory?
lsf_project_code	The project code needed to run jobs on the HPC. The full list is located at LSF Project Codes. Only relevant when running in the RE, not CloudRE.	`re_gecip_cardiovascular`	Mandatory
cohort_name	Name for your cohort that is described by the variable 'cohort_file_lists'.	`"aggV2"`	Mandatory
cohort_genome_build	The genome build that your cohort is aligned to. This is used later to fetch the correct gene positions.	`"GRCh38"`	Mandatory
cohort_file_lists	Defines one or more cohorts, each with three associated file list options. Each cohort described here must have a unique name. 'cohort_bgen_file_list' is a filepath to a 3-column tsv, where the columns are file.bgen, file.sample, file.bgen.bgi. 'cohort_pgen_file_list' is a filepath to a 3-column tsv, where the columns are file.pgen, file.psam, file.pvar. 'cohort_functional_annotation_file_list' is a filepath to a 2-column tsv, where the columns are file.vcf, file.vcf.{tbi,csi}; this list is mandatory. Exactly one of 'cohort_bgen_file_list' or 'cohort_pgen_file_list' must be supplied, and the other must be left as "".	See example input settings for section 1.	Mandatory
run_saige	Controls whether to run SAIGE-GENE for burden testing. Independent of Regenie i.e. both can be set to true.	`true`	Not Mandatory
run_regenie	Controls whether to run Regenie for burden testing. Independent of SAIGE-GENE i.e. both can be set to true.	`false`	Not Mandatory

If both 'run_saige' and 'run_regenie' are set to false, then the workflow will not run any association analyses, and just output the results of the first step, i.e. a list of variants that pass functional filtering.
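The rules for 'cohort_file_lists' can be checked before submitting a run. The following is a minimal Python sketch, not part of the workflow: the helper name `check_cohort_file_lists` is hypothetical, and it assumes that 'cohort_name' must match one of the names defined in 'cohort_file_lists', as it does in the example settings.

```python
def check_cohort_file_lists(cohort_file_lists, cohort_name):
    """Return a list of problems with the named cohort (empty list if none)."""
    # Assumption: cohort_name refers to a key of cohort_file_lists.
    if cohort_name not in cohort_file_lists:
        return [f"cohort '{cohort_name}' is not defined in cohort_file_lists"]
    cohort = cohort_file_lists[cohort_name]
    problems = []
    # An empty string ("") means the option is not supplied.
    has_bgen = bool(cohort.get("cohort_bgen_file_list"))
    has_pgen = bool(cohort.get("cohort_pgen_file_list"))
    if has_bgen == has_pgen:
        problems.append(
            "supply exactly one of cohort_bgen_file_list or cohort_pgen_file_list"
        )
    if not cohort.get("cohort_functional_annotation_file_list"):
        problems.append("cohort_functional_annotation_file_list is mandatory")
    return problems


cohorts = {
    "aggV2": {
        "cohort_bgen_file_list": "",
        "cohort_pgen_file_list": "aggV2_pgen_file_list.txt",
        "cohort_functional_annotation_file_list": "aggV2_functional_annotation_list.txt",
    }
}
print(check_cohort_file_lists(cohorts, "aggV2"))
```

An empty list means the cohort definition satisfies the rules described in the table above.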

Example input settings for section 1

"master_aggregate_variant_testing.lsf_project_code": "bio",
"master_aggregate_variant_testing.cohort_name": "aggV2",
"master_aggregate_variant_testing.cohort_genome_build": "GRCh38",

"master_aggregate_variant_testing.cohort_file_lists": {
    "aggV2": {
        "cohort_bgen_file_list": "",
        "cohort_pgen_file_list": "aggV2_pgen_file_list.txt",
        "cohort_functional_annotation_file_list": "aggV2_functional_annotation_list.txt"
    },
    "custom_cohort": {
        "cohort_bgen_file_list": "custom_cohort_bgen_list.txt",
        "cohort_pgen_file_list": "",
        "cohort_functional_annotation_file_list": "custom_cohort_functional_annotation_list.txt"
    }
},

"master_aggregate_variant_testing.run_saige": true,
"master_aggregate_variant_testing.run_regenie": false,

Section 2 - Input and output definition

This section defines all the input and output files that you need to change or be aware of for running the workflow.

Setting name	Description	Example value	Mandatory?
output_file_name	The suffix for the output test statistics that are produced by saige-gene or regenie. The workflow will prefix this name with the phenotype, program and mask used.	`"output.txt"`	Mandatory
phenotype_input_file	Filepath that points to the phenotype file that you wish to use for the analysis. This should be a table using the column separators of your choice, and should contain a column of sample identifiers (e.g. platekey or IID), one or more phenotype columns, and all columns that you want to use as covariates. For binary traits, cases should be coded as 1 and controls should be coded as 0. This workflow might work for quantitative traits, but they have not been tested.	`"/path/to/pkd_cohort.txt"`	Mandatory
phenotype_column_delimiter	The delimiter used to separate columns in your phenotype file. Good choices for readability are spaces or tabs.	`"\t"`	Mandatory
phenotype_sample_column	The column in the phenotype file that lists the sample identifiers for each participant in the cohort.	`"plate_key"`	Mandatory
phenotype_array_for_testing	An array that lists all the different phenotypes that you wish to test in your cohort. Can have one or more comma separated entries. Each phenotype is tested in parallel. Each phenotype in the array must correspond to a column in your phenotype file.	`["pkd"]`	Mandatory
sex_column	The name of the column that records the sex of each sample. These are coded values that typically encode male as 0 and female as 1.	`"sex"`	Mandatory
male_coding	How males are coded in the sex column.	`0`	Mandatory
covarColList	A comma separated list of all covariates that you wish to include in the model for testing. Each covariate listed must be a column in the phenotype file. Not all covariate columns in the phenotype file need to be used as covariates in the test.	`"age,sex,pc1,pc2"`	Mandatory
chrX_female_only_cohort	deprecate?	`false`	Mandatory
chrX_male_only_cohort	deprecate?	`false`	Mandatory
chromosomes_input_file	A filepath that points to the chromosome input file. Please only provide ONE of chromosome, gene, region or variant input files.	`"/path/to/chromosomes.txt"`	Not Mandatory
genes_input_file	A filepath that points to the genes input file. Please only provide ONE of chromosome, gene, region or variant input files.	`"/path/to/genes.txt"`	Not Mandatory
coordinates_input_file	A filepath that points to the region input file. Please only provide ONE of chromosome, gene, region or variant input files.	`"/path/to/coordinates.bed"`	Not Mandatory
variant_input_file	A filepath that points to the variant input file. Please only provide ONE of chromosome, gene, region or variant input files.	`"/path/to/variants.tsv"`	Not Mandatory
gene_exclusion_file	A filepath that points to a gene exclusion file.	`"/path/to/exclude_genes.tsv"`	Not Mandatory
variant_exclusion_file	A filepath that points to a variant exclusion file.	`"/path/to/exclude_variants.tsv"`	Not Mandatory
precomputed_plink_files_for_grm	An array that lists the filepaths to the .bed, .bim and .fam files that contain the high quality common and rare variants that are needed by SAIGE-GENE for GRM creation and variance estimation, and by REGENIE for PRS estimation. Details on how these were created for aggV2 can be found here.	See the example input for section 2	Mandatory

Set all unneeded file inputs to null.
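The "only ONE of chromosome, gene, region or variant input files" rule can be sketched as a quick check in Python. This is an illustration under stated assumptions rather than something the workflow provides: the helper name `supplied_region_inputs` is hypothetical, and the settings are assumed to have been loaded with `json.load`, so that a JSON `null` becomes `None`.

```python
PREFIX = "master_aggregate_variant_testing.part_1_variant_selection."
REGION_INPUTS = (
    "chromosomes_input_file",
    "genes_input_file",
    "coordinates_input_file",
    "variant_input_file",
)


def supplied_region_inputs(inputs):
    """Return the names of the region-defining inputs that are not null."""
    return [name for name in REGION_INPUTS if inputs.get(PREFIX + name) is not None]


inputs = {
    PREFIX + "chromosomes_input_file": "input_user_data/chromosomes.txt",
    PREFIX + "genes_input_file": None,
    PREFIX + "coordinates_input_file": None,
    PREFIX + "variant_input_file": None,
}

supplied = supplied_region_inputs(inputs)
if len(supplied) != 1:
    raise SystemExit(f"expected exactly one region input, found: {supplied}")
print(supplied)
```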

Example input settings for section 2

"master_aggregate_variant_testing.output_file_name": "output.txt",

"master_aggregate_variant_testing.phenotype_input_file": "/re_gecip/BRS/example.pheno",
"master_aggregate_variant_testing.phenotype_column_delimiter": "\t",
"master_aggregate_variant_testing.phenotype_sample_column": "plate_key",
"master_aggregate_variant_testing.phenotype_array_for_testing": ["pkd"],
"master_aggregate_variant_testing.sex_column": "sex",
"master_aggregate_variant_testing.male_coding": 0,
"master_aggregate_variant_testing.covarColList": "ancestry,age,sex,age2,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20",

"master_aggregate_variant_testing.chrX_female_only_cohort": false,
"master_aggregate_variant_testing.chrX_male_only_cohort": false,

"master_aggregate_variant_testing.part_1_variant_selection.chromosomes_input_file": "input_user_data/chromosomes.txt",
"master_aggregate_variant_testing.part_1_variant_selection.genes_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.coordinates_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.gene_exclusion_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_exclusion_file": null,

"master_aggregate_variant_testing.precomputed_plink_files_for_grm": [
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bed",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bim",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.fam"
],

Section 3 - Functional filtering

This section covers the settings that filter variants based on their functional annotation attributes. Only variants that pass at least one of these filters will be included in downstream analysis.

Setting Name	Description	Example Value	Mandatory?
use_snps_only	A true/false flag to set if you want to analyse SNP variants only, or if you want to include INDEL variants as well	`true`	Mandatory
use_vep_filtering	A true/false flag to set if you want to perform functional annotation filtering. Generally set to true unless you already have a variant list that you do not wish to perform any additional filtering on.	`true`	Mandatory
use_ensembl_protein_coding_genes_only	A true/false flag to set if you want to restrict your analysis to genes that have the biotype protein_coding in Ensembl. Setting to false will include genes of all biotypes, e.g. lncRNA.	`true`	Mandatory
use_genes_only	A true/false flag to set if you are analysing Ensembl defined genes only, or you wish to include custom regions that lie outside / overlap genes.	`true`	Mandatory
functional_annotation_filter_masks	This setting details the various 'masks' that are used to bin each variant. Each mask consists of a mask name, an AND block (a variant must pass ALL filters in this block to be included; zero or more filters) and an OR block (a variant must pass ANY filter in this block to be included; zero or more filters). The rules that govern the functional filtering are: (1) Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by `bcftools +split-vep` internally. (2) The AND and OR blocks are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block. (3) If both the `annotation` field and the VEP severity field `vep_severity_to_include` are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow. (4) `vep_severity_to_include` operates in an identical manner to `bcftools +split-vep -s`. You can specify an exact consequence, e.g. `stop_gained`, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. `missense+`, to include all variants that are at least as severe as `missense`. If it is left blank, then all consequences will be considered. (5) `include_missing` can be set to `no` or `yes`. If set to `no` then only variants that pass the filter will be included. If set to `yes` then variants that pass the filter and variants that are annotated with missing entries by VEP, i.e. '.', '-' or '', will be included. This can be useful, for example, when filtering on CADD_PHRED scores while also including INDELs, as many INDELs do not have a CADD_PHRED score. (6) Comparators can be >, >=, ==, <=, < for float values, or == for string values. The workflow will exit with an error if the wrong comparator type is detected. Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. Reusing a name for multiple filters means that some filters will be skipped, which leads to confusing output.	See the example input in section 3	Not mandatory
mask_rank	This setting defines the order of strictness of the masks, so that a variant is assigned to the correct mask in the case that it passes multiple masks. The lower the number, the more strict the mask is.	See the example input in section 3	Not mandatory
regenie_masks	This setting defines which masks can be grouped together when testing with REGENIE.	See the example input in section 3	Not mandatory
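The way the AND and OR blocks combine can be illustrated with a short Python sketch. It is not the workflow's implementation (the workflow uses `bcftools +split-vep` internally). The function names are hypothetical, `vep_severity_to_include` is deliberately not modelled, and the sketch assumes that a block whose filters are all empty imposes no constraint.

```python
import operator

COMPARATORS = {">": operator.gt, ">=": operator.ge, "==": operator.eq,
               "<=": operator.le, "<": operator.lt}
MISSING = {".", "-", ""}  # missing-value codes listed in the table above


def passes_filter(variant, flt):
    """Evaluate one filter against a variant's annotations (dict of strings)."""
    value = variant.get(flt["annotation"], "")
    if value in MISSING:
        return flt["include_missing"] == "yes"
    compare = COMPARATORS[flt["comparator"]]
    try:
        # Numeric comparison when both sides parse as floats.
        return compare(float(value), float(flt["condition"]))
    except ValueError:
        # String annotations: only equality is allowed.
        if flt["comparator"] != "==":
            raise ValueError("string annotations only support '=='")
        return value == flt["condition"]


def active_filters(block):
    """Keep filters that name an annotation (simplification: severity ignored)."""
    return [f for f in block.values() if f["annotation"]]


def passes_mask(variant, mask):
    """A variant must pass ALL AND filters and at least ONE OR filter."""
    and_filters = active_filters(mask.get("and_mask", {}))
    or_filters = active_filters(mask.get("or_mask", {}))
    and_ok = all(passes_filter(variant, f) for f in and_filters)
    # Assumption: an OR block with no active filters does not exclude anything.
    or_ok = not or_filters or any(passes_filter(variant, f) for f in or_filters)
    return and_ok and or_ok


empty = {"annotation": "", "comparator": "", "condition": "",
         "include_missing": "", "vep_severity_to_include": ""}
missense = {
    "and_mask": {"filter1": empty},
    "or_mask": {
        "filter1": {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10",
                    "include_missing": "no", "vep_severity_to_include": "missense+"},
        "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC",
                    "include_missing": "no", "vep_severity_to_include": ""},
    },
}

print(passes_mask({"CADD_PHRED": "23.1", "LoF": "."}, missense))  # True
print(passes_mask({"CADD_PHRED": ".", "LoF": "."}, missense))     # False
```

Setting `include_missing` to `yes` on the CADD_PHRED filter would let the second variant through, which is the INDEL use case described in the table.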

Example input settings for section 3

"master_aggregate_variant_testing.part_1_variant_selection.use_snps_only": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_vep_filtering": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_ensembl_protein_coding_genes_only": true,
"master_aggregate_variant_testing.use_genes_only": true,

"master_aggregate_variant_testing.part_1_variant_selection.functional_annotation_filter_masks": {
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
},

"master_aggregate_variant_testing.part_1_variant_selection.mask_rank": {
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
},

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_masks": {
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
},
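One reading of how 'mask_rank' and 'regenie_masks' interact is sketched below in Python. The helper names are hypothetical and the workflow's internal logic may differ: the sketch assumes that a variant passing several masks is assigned to the one with the lowest rank, and that each REGENIE mask is a comma separated list of mask names.

```python
mask_rank = {"LoF": 1, "missense": 2, "synonymous": 3}
regenie_masks = {
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous",
}


def assign_mask(passed_masks, mask_rank):
    """Pick the strictest (lowest-ranked) mask among those a variant passes."""
    if not passed_masks:
        return None
    return min(passed_masks, key=lambda name: mask_rank[name])


def regenie_groups_for(mask_name, regenie_masks):
    """List the REGENIE masks that include variants assigned to mask_name."""
    return [group for group, members in regenie_masks.items()
            if mask_name in members.split(",")]


assigned = assign_mask(["missense", "LoF"], mask_rank)
print(assigned)                                     # LoF
print(regenie_groups_for(assigned, regenie_masks))  # ['strict_lof', 'mild_lof']
```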

Section 4 - Site wide variant filtering

This section describes the filters used for site-wide filtering of variants.

Setting name	Description	Example value	Mandatory?
differential_missingness_pvalue	The p-value threshold for testing differential missingness between cases and controls. Sites that have a smaller p-value than this setting will be removed.	`10e-5`	Mandatory
upstream_downstream_length	A value in basepairs that allows for padding of input regions. For example, you may wish to pad genes by 10kb to potentially capture regulatory regions. Set to 0 by default.	`10000`	Mandatory
use_max_mac	A true/false setting on whether to use the minor allele count as the upper bound for rare variant filtering. Useful for capturing very rare variants in large cohorts.	`false`	Mandatory
final_max_mac_to_exclude	The maximum MAC for variants to include, if using the use_max_mac setting. Variants with a higher MAC than this value will be removed.	`20`	Mandatory
use_max_maf	A true/false setting on whether to use the minor allele frequency as the upper bound for rare variant filtering. Useful for capturing rare variants in small cohorts.	`true`	Mandatory
final_max_maf_to_exclude	The maximum MAF for variants to include if using the use_max_maf setting. Variants with a higher MAF than this value will be removed.	`0.005`	Mandatory
max_missingness	The upper threshold for missing data for each site. Sites that have a higher percentage of missing values than this value will be excluded.	`0.05`	Mandatory

Use only one of MAC or MAF for filtering.

MAF filtering is generally better for small cohorts, where a MAC of 20 might actually correspond to a MAF of 1% or more.
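The arithmetic behind this can be made concrete with a small Python sketch. It assumes diploid autosomal sites with no missing genotypes, so that MAF = MAC / (2 × number of samples); the function name is illustrative only.

```python
def mac_to_maf(mac, n_samples):
    """Convert a minor allele count to a frequency (diploid, no missing calls)."""
    return mac / (2 * n_samples)


# A MAC of 20 corresponds to very different frequencies as cohort size changes.
for n_samples in (1000, 10000, 100000):
    print(n_samples, mac_to_maf(20, n_samples))
```

Under these assumptions a MAC of 20 in 1,000 samples is a MAF of 1%, whereas in 100,000 samples it is 0.01%.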

Example input settings for section 4

"master_aggregate_variant_testing.differential_missingness_pvalue": "10e-5",
"master_aggregate_variant_testing.part_1_variant_selection.upstream_downstream_length": 0,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_mac": false,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_mac_to_exclude": 20,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_maf": true,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_maf_to_exclude": 0.005,
"master_aggregate_variant_testing.part_1_variant_selection.max_missingness": 0.05,

Section 5 - SAIGE-GENE settings

These settings allow you to tweak various aspects of SAIGE-GENE at runtime. Most settings have been left at the default value recommended by the developers. We recommend only changing these settings if you have used SAIGE-GENE in the past. For more information on these options please consult the SAIGE-GENE documentation.

Settings that are not present in the table below, but are present in the program are handled internally, such as input files and output files.

Setting name	Description	Example value	Mandatory?
LOCO	A true/false setting for using leave-one-chromosome-out when testing with SAIGE-GENE	`"FALSE"`	Mandatory
saige_step0_options	The settings used to build the sparse GRM that is used for SAIGE-GENE. This should be left as default.	See the example input in section 5	Mandatory
saige_step1_options	The settings that are used to fit the null GLMM in SAIGE-GENE. Most of these can be left as default. To perform rare variant testing, `IsSparseKin` should be left as `TRUE`. `MaleCode` and `FemaleCode` should be set to the same values as in your phenotype file (0,1 by default). `traitType` can be set to `quantitative` and `invNormalize` can be set to `true` to test for quantitative traits. Note that this setting has not been tested in this workflow, and may not work.	See the example input in section 5	Mandatory
saige_step2_options	The settings that are used for running the rare variant tests in SAIGE-GENE. Most can be left as default. `method_to_CollapseUltraRare` implements SAIGE-GENE+ when set to `absence_or_presence`. To disable this, set to '' (an empty string).	See the example input in section 5	Mandatory

Example input settings for section 5

"master_aggregate_variant_testing.part_2_saige_testing.LOCO": "FALSE",

"master_aggregate_variant_testing.part_2_saige_testing.saige_step0_options": [
    "--relatednessCutoff=0.125",
    "--numRandomMarkerforSparseKin=2000"
],

"master_aggregate_variant_testing.part_2_saige_testing.saige_step1_options": [
    "--IsSparseKin='TRUE'",
    "--traitType='binary'",
    "--invNormalize='FALSE'",
    "--outputPrefix='null_glmm'",
    "--outputPrefix_varRatio='null_glmm_var_ratio'",
    "--minMAFforGRM=0.01",
    "--sexCol=''",
    "--FemaleCode='1'",
    "--FemaleOnly='FALSE'",
    "--MaleCode='0'",
    "--MaleOnly='FALSE'",
    "--noEstFixedEff='FALSE'",
    "--tol=0.02",
    "--maxiter=20",
    "--tolPCG=1e-5",
    "--maxiterPCG=500",
    "--SPAcutoff=2",
    "--numRandomMarkerforVarianceRatio=30",
    "--skipModelFitting='FALSE'",
    "--tauInit='0,0'",
    "--traceCVcutoff=0.0025",
    "--ratioCVcutoff=0.001",
    "--isCateVarianceRatio='TRUE'",
    "--isCovariateTransform='TRUE'",
    "--cateVarRatioMinMACVecExclude='0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--cateVarRatioMaxMACVecInclude='1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--useSparseSigmaforInitTau='FALSE'",
    "--minCovariateCount=-1",
    "--includeNonautoMarkersforVarRatio='FALSE'",
    "--memoryChunk=2"
],
"master_aggregate_variant_testing.part_2_saige_testing.saige_step2_options": [
    "--IsDropMissingDosages='FALSE'",
    "--IsSparse='TRUE'",
    "--IsOutputAFinCaseCtrl='TRUE'",
    "--IsOutputNinCaseCtrl='TRUE'",
    "--IsOutputHetHomCountsinCaseCtrl='TRUE'",
    "--IsSingleVarinGroupTest='TRUE'",
    "--IsOutputPvalueNAinGroupTestforBinary='TRUE'",
    "--IsAccountforCasecontrolImbalanceinGroupTest='TRUE'",
    "--IsOutputBETASEinBurdenTest='TRUE'",
    "--is_rewrite_XnonPAR_forMales='FALSE'",
    "--minMAF=0",
    "--maxMAFforGroupTest=0.5",
    "--minMAC=0",
    "--numLinesOutput=10000",
    "--condition=''",
    "--kernel='linear.weighted'",
    "--method='optimal.adj'",
    "--weights.beta.rare=1,25",
    "--weights.beta.common=1,25",
    "--weightMAFcutoff=0.01",
    "--r.corr=0",
    "--dosageZerodCutoff=0.2",
    "--weightsIncludeinGroupFile='FALSE'",
    "--weights_for_G2_cond=''",
    "--SPAcutoff=2",
    "--chrom=chr1",
    "--method_to_CollapseUltraRare='absence_or_presence'",
    "--MACCutoff_to_CollapseUltraRare=10",
    "--DosageCutoff_for_UltraRarePresence=0.5"
],
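Because `MaleCode` should match the coding used in the phenotype file, it can be worth comparing it with the `male_coding` setting from section 2 before a run. The Python sketch below is an illustration only: `parse_options` is a hypothetical helper, and it assumes every entry has the `--name=value` form used in the example above.

```python
def parse_options(options):
    """Turn ["--name='value'", ...] into {"name": "value", ...}."""
    parsed = {}
    for option in options:
        # Split on the first "=" only, so values such as "-1" stay intact.
        key, _, value = option.lstrip("-").partition("=")
        parsed[key] = value.strip("'")
    return parsed


step1 = parse_options([
    "--traitType='binary'",
    "--FemaleCode='1'",
    "--MaleCode='0'",
    "--minCovariateCount=-1",
])
male_coding = 0  # value of the male_coding setting from section 2

codes_match = step1["MaleCode"] == str(male_coding)
print(codes_match)  # True
```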

Section 6 - Regenie settings

These settings allow you to tweak various aspects of Regenie at runtime. Most settings have been left as defaults. Check https://rgcgithub.github.io/regenie/options/#burden-testing for a full list of options.

Setting name	Description	Example value	Mandatory?
regenie_step1_options	Options for step 1 of REGENIE.	See the example inputs in section 6	Mandatory
regenie_step2_options	Options for step 2 of REGENIE.	See the example inputs in section 6	Mandatory

Example input settings for section 6

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step1_options": [
    "--step 1",
    "--bt",
    "--lowmem",
    "--lowmem-prefix .",
    "--bsize 1000",
    "--ref-first",
    "--loocv"
],
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step2_options": [
    "--step 2",
    "--minMAC 1",
    "--aaf-bins 0.005",
    "--build-mask 'max'",
    "--bt",
    "--firth --approx",
    "--firth-se",
    "--bsize 1000",
    "--ref-first",
    "--singleton-carrier"
],

Section 7 - Workflow resources

This section details workflow resources. You should NOT change these, with the possible exception of the CPU and memory settings if the workflow is crashing due to lack of resources.

Warning

Changing any setting apart from CPU and memory settings will almost guarantee that the workflow will break. Lowering CPU and memory requirements too far will also break the workflow.
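If a task does run out of memory, its memory settings can be raised without touching anything else. The Python sketch below is a hedged illustration: `scale_memory` is a hypothetical helper, the setting names are taken from the example for this section, and the values are kept in whatever unit the workflow expects.

```python
def scale_memory(inputs, key_fragment, factor):
    """Return a copy of inputs with matching *_memory settings multiplied by factor."""
    updated = dict(inputs)
    for key, value in inputs.items():
        # Only memory settings of the named task are changed.
        if key.endswith("_memory") and key_fragment in key:
            updated[key] = int(value * factor)
    return updated


inputs = {
    "master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
    "master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000,
    "master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_memory": 16000,
}

bumped = scale_memory(inputs, "saige_step_2", 1.5)
for key, value in bumped.items():
    print(key, value)
```

Only the SAIGE step 2 memory value changes here; thread counts and the settings of other tasks are left as they were.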

Example input settings for section 7

"master_aggregate_variant_testing.tools_container": "quay.io/alexander-stuckey/gwas_avt",
"master_aggregate_variant_testing.part_2_saige_testing.saige_container": "quay.io/alexander-stuckey/saige:0.44.6.5",
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_container": "quay.io/alexander-stuckey/regenie",

"master_aggregate_variant_testing.genome_build_to_ensembl_coordinates_files": {
    "GRCh37": "resources/Ensembl_87_genes_coordinates_GRCh37.tsv",
    "GRCh38": "resources/Ensembl_98_genes_coordinates_GRCh38.tsv"
},
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale": "resources/vep_severity_scale_2020_bcftools_splitvep.txt",

"master_aggregate_variant_testing.pheno_plink_helper_python_script": "resources/pheno_helper_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.python_input_filtering_script": "resources/input_filtering.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale_ensembl_translation": "resources/vep_severity_scale_translation_2020_Ensembl_to_bcftools_splitvep.tsv",
"master_aggregate_variant_testing.part_1_variant_selection.vep_filtering_python_script": "resources/vep_filtering_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_bcftools_translation_and_ranking": "resources/vep_severity_bcftools_translation_and_ranking.tsv",

"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_threads": 4,
"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_memory": 16000,

"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_cpus": 1,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_memory": 4000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_memory": 20000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000,

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_cpus": 1,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_memory": 8000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_cpus": 8,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_memory": 32000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_memory": 16000

The full input file

``` bash linenums="1"
{
"master_aggregate_variant_testing.lsf_project_code": "bio",

"master_aggregate_variant_testing.cohort_name": "aggV2",
"master_aggregate_variant_testing.cohort_genome_build": "GRCh38",

"master_aggregate_variant_testing.cohort_file_lists": {
    "aggV2": {
        "cohort_bgen_file_list": "example_bgen_list.txt",
        "cohort_pgen_file_list": "",
        "cohort_functional_annotation_file_list": "example_functional_list.txt"
    }
},

"master_aggregate_variant_testing.run_saige": true,
"master_aggregate_variant_testing.run_regenie": false,

"master_aggregate_variant_testing.output_file_name": "output.txt",

"master_aggregate_variant_testing.phenotype_input_file": "/re_gecip/BRS/example.pheno",
"master_aggregate_variant_testing.phenotype_column_delimiter": "\t",
"master_aggregate_variant_testing.phenotype_sample_column": "plate_key",
"master_aggregate_variant_testing.phenotype_array_for_testing": ["covid"],
"master_aggregate_variant_testing.sex_column": "sex",
"master_aggregate_variant_testing.male_coding": 0,
"master_aggregate_variant_testing.covarColList": "ancestry,age,sex,age2,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20",

"master_aggregate_variant_testing.chrX_female_only_cohort": false,
"master_aggregate_variant_testing.chrX_male_only_cohort": false,

"master_aggregate_variant_testing.part_1_variant_selection.chromosomes_input_file": "input_user_data/chromosomes.txt",
"master_aggregate_variant_testing.part_1_variant_selection.genes_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.coordinates_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_input_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.gene_exclusion_file": null,
"master_aggregate_variant_testing.part_1_variant_selection.variant_exclusion_file": null,

"master_aggregate_variant_testing.part_1_variant_selection.use_snps_only": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_vep_filtering": true,
"master_aggregate_variant_testing.part_1_variant_selection.use_ensembl_protein_coding_genes_only": true,
"master_aggregate_variant_testing.use_genes_only": true,

"master_aggregate_variant_testing.part_1_variant_selection.functional_annotation_filter_masks": {
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
},

"master_aggregate_variant_testing.part_1_variant_selection.mask_rank": {
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
},

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_masks": {
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
},

"master_aggregate_variant_testing.differential_missingness_pvalue": "10e-5",
"master_aggregate_variant_testing.part_1_variant_selection.upstream_downstream_length": 0,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_mac": false,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_mac_to_exclude": 20,
"master_aggregate_variant_testing.part_1_variant_selection.use_max_maf": true,
"master_aggregate_variant_testing.part_1_variant_selection.final_max_maf_to_exclude": 0.005,
"master_aggregate_variant_testing.part_1_variant_selection.max_missingness": 0.05,

"master_aggregate_variant_testing.precomputed_plink_files_for_grm": [
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bed",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.bim",
    "/re_gecip/BRS/genomicc_data/WGS_analysis/aggregate_variant_tests/aggCOVID_V2/output/EUR/loftee_hc/snps_only/chr1/glob-7b047a45b923da427eab0cce842a1ad3/plink_set_for_tests_all_samples.fam"
],

"master_aggregate_variant_testing.part_2_saige_testing.LOCO": "FALSE",

"master_aggregate_variant_testing.part_2_saige_testing.saige_step0_options": [
    "--relatednessCutoff=0.125",
    "--numRandomMarkerforSparseKin=2000"
],

"master_aggregate_variant_testing.part_2_saige_testing.saige_step1_options": [
    "--IsSparseKin='TRUE'",
    "--traitType='binary'",
    "--invNormalize='false'",
    "--outputPrefix='null_glmm'",
    "--outputPrefix_varRatio='null_glmm_var_ratio'",
    "--minMAFforGRM=0.01",
    "--sexCol=''",
    "--FemaleCode='1'",
    "--FemaleOnly='FALSE'",
    "--MaleCode='0'",
    "--MaleOnly='FALSE'",
    "--noEstFixedEff='FALSE'",
    "--tol=0.02",
    "--maxiter=20",
    "--tolPCG=1e-5",
    "--maxiterPCG=500",
    "--SPAcutoff=2",
    "--numRandomMarkerforVarianceRatio=30",
    "--skipModelFitting='FALSE'",
    "--tauInit='0,0'",
    "--traceCVcutoff=0.0025",
    "--ratioCVcutoff=0.001",
    "--isCateVarianceRatio='TRUE'",
    "--isCovariateTransform='TRUE'",
    "--cateVarRatioMinMACVecExclude='0.5,1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--cateVarRatioMaxMACVecInclude='1.5,2.5,3.5,4.5,5.5,10.5,20.5'",
    "--useSparseSigmaforInitTau='FALSE'",
    "--minCovariateCount=-1",
    "--includeNonautoMarkersforVarRatio='FALSE'",
    "--memoryChunk=2"
],
"master_aggregate_variant_testing.part_2_saige_testing.saige_step2_options": [
    "--IsDropMissingDosages='FALSE'",
    "--IsSparse='TRUE'",
    "--IsOutputAFinCaseCtrl='TRUE'",
    "--IsOutputNinCaseCtrl='TRUE'",
    "--IsOutputHetHomCountsinCaseCtrl='TRUE'",
    "--IsSingleVarinGroupTest='TRUE'",
    "--IsOutputPvalueNAinGroupTestforBinary='TRUE'",
    "--IsAccountforCasecontrolImbalanceinGroupTest='TRUE'",
    "--IsOutputBETASEinBurdenTest='TRUE'",
    "--is_rewrite_XnonPAR_forMales='FALSE'",
    "--minMAF=0",
    "--maxMAFforGroupTest=0.5",
    "--minMAC=0",
    "--numLinesOutput=10000",
    "--condition=''",
    "--kernel='linear.weighted'",
    "--method='optimal.adj'",
    "--weights.beta.rare=1,25",
    "--weights.beta.common=1,25",
    "--weightMAFcutoff=0.01",
    "--r.corr=0",
    "--dosageZerodCutoff=0.2",
    "--weightsIncludeinGroupFile='FALSE'",
    "--weights_for_G2_cond=''",
    "--SPAcutoff=2",
    "--chrom=chr1",
    "--method_to_CollapseUltraRare='absence_or_presence'",
    "--MACCutoff_to_CollapseUltraRare=10",
    "--DosageCutoff_for_UltraRarePresence=0.5"
],

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step1_options": [
    "--step 1",
    "--bt",
    "--lowmem",
    "--lowmem-prefix .",
    "--bsize 1000",
    "--ref-first",
    "--loocv"
],

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step2_options": [
    "--step 2",
    "--minMAC 1",
    "--aaf-bins 0.005",
    "--build-mask 'max'",
    "--bt",
    "--firth --approx",
    "--firth-se",
    "--bsize 1000",
    "--ref-first",
    "--singleton-carrier"
],

"master_aggregate_variant_testing.tools_container": "quay.io/alexander-stuckey/gwas_avt",
"master_aggregate_variant_testing.part_2_saige_testing.saige_container": "quay.io/alexander-stuckey/saige:0.44.6.5",
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_container": "quay.io/alexander-stuckey/regenie",

"master_aggregate_variant_testing.genome_build_to_ensembl_coordinates_files": {
    "GRCh37": "resources/Ensembl_87_genes_coordinates_GRCh37.tsv",
    "GRCh38": "resources/Ensembl_98_genes_coordinates_GRCh38.tsv"
},
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale": "resources/vep_severity_scale_2020_bcftools_splitvep.txt",

"master_aggregate_variant_testing.pheno_plink_helper_python_script": "resources/pheno_helper_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.python_input_filtering_script": "resources/input_filtering.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_scale_ensembl_translation": "resources/vep_severity_scale_translation_2020_Ensembl_to_bcftools_splitvep.tsv",
"master_aggregate_variant_testing.part_1_variant_selection.vep_filtering_python_script": "resources/vep_filtering_script.py",
"master_aggregate_variant_testing.part_1_variant_selection.vep_severity_bcftools_translation_and_ranking": "resources/vep_severity_bcftools_translation_and_ranking.tsv",

"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_threads": 4,
"master_aggregate_variant_testing.part_1_variant_selection.variant_extraction_memory": 16000,

"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_0_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_threads": 8,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_cpus": 1,
"master_aggregate_variant_testing.part_2_saige_testing.saige_filter_differential_missingness_memory": 4000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_create_GRM_memory": 20000,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_threads": 4,
"master_aggregate_variant_testing.part_2_saige_testing.saige_step_2_memory": 64000,

"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_cpus": 1,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_filter_GRM_memory": 8000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_cpus": 8,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_1_memory": 16000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_create_set_filter_diff_missing_memory": 32000,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_cpus": 2,
"master_aggregate_variant_testing.part_2_regenie_testing.regenie_step_2_memory": 16000

}

```
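
Because the inputs file is long and edited by hand, a quick sanity check before submitting the workflow can catch simple mistakes. The following is a minimal sketch, not part of the AVT workflow itself: the function names `check_inputs` and `load_and_check` are hypothetical, and the rules are assumptions drawn only from the example above (every entry in a `*_options` list is a string beginning with `--`, and every `*_threads`, `*_cpus` and `*_memory` setting is a positive integer).

```python
import json

# Common prefix of the setting names shown in the example inputs file.
PREFIX = "master_aggregate_variant_testing."


def check_inputs(inputs: dict) -> list:
    """Return a list of human-readable problems found in an inputs dict.

    Hypothetical helper: the rules below are assumptions based on the
    example file, not checks performed by the workflow.
    """
    problems = []
    for key, value in inputs.items():
        short = key[len(PREFIX):] if key.startswith(PREFIX) else key

        if short.endswith("_options"):
            # Option lists: each entry should be a string starting with "--".
            if not isinstance(value, list):
                problems.append(f"{short}: expected a list of option strings")
                continue
            for opt in value:
                if not (isinstance(opt, str) and opt.startswith("--")):
                    problems.append(f"{short}: suspicious option {opt!r}")

        elif short.endswith(("_threads", "_cpus", "_memory")):
            # Resource settings: positive integers (bool is excluded because
            # Python treats True/False as integers).
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(
                    f"{short}: expected a positive integer, got {value!r}"
                )
    return problems


def load_and_check(path: str) -> list:
    """Load a JSON inputs file from `path` and run check_inputs on it."""
    with open(path) as handle:
        return check_inputs(json.load(handle))


# Small in-memory example: the memory value is quoted by mistake.
example = {
    PREFIX + "part_2_regenie_testing.regenie_step1_options": ["--step 1", "--bt"],
    PREFIX + "part_2_saige_testing.saige_step_2_memory": "64000",
}
for problem in check_inputs(example):
    print(problem)
```

To check a file on disk, call `load_and_check` with the path to your own inputs file. The checks are deliberately loose: they do not confirm that a flag is valid for the installed SAIGE or Regenie version, only that the entries have the expected shape.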