AVT input files¶

There are a number of different input files you can use with the AVT workflow.

| File | Used for | Parameter(s) | Default |
| --- | --- | --- | --- |
| Region input | A text file that specifies the region of the genome to be analysed | `--region_input_file` | `${projectDir}/input/chromosomes_subset.txt` (chr21 and chr22) |
| Exclusion data | A text file that specifies regions of the genome to be excluded from analysis | `--exclusion_data_file` | `false` |
| Input cohort | A list of cases and controls with covariates | `--input_cohort_file` | `/gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt` |
| Functional annotation filter masks | Filters the variants in your output by their consequences on genes | `--functional_annotation_filter_masks` | `${projectDir}/input/functional_annotation_filter_masks.json` |
| Mask ranks | Defines the order of strictness of the annotation labels, so that a variant that passes more than one label is assigned to the correct one | `--mask_rank` | `${projectDir}/input/mask_rank.json` |
| Regenie masks | REGENIE functional annotation "masks" | `--regenie_masks` | `${projectDir}/input/regenie_masks.json` |
| Genomic data | Paths to the files for the input variant dataset | `--genomic_data` | `${projectDir}/input/aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv` |
| Annotation VCF list | Paths to the files with functional annotations for the input variant dataset | `--vcf_files` | `${projectDir}/input/aggV2_functional_annotations_list_by_chunks_VEP105.tsv` |
| Pre-computed PLINK | A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples | `--precomputed_plink_files_for_grm_bed`, `--precomputed_plink_files_for_grm_bim`, `--precomputed_plink_files_for_grm_fam` | `/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bed`, `/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bim`, `/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.fam` |
| Consequence severity ranking | Translation between Ensembl and bcftools consequence names, with a severity ranking | `--consequence_severity_ranking_file` | `${projectDir}/resources/VEP_severity_bcftools_translation_and_ranking.tsv` |
| Gene coordinates | A tab-delimited file of gene coordinates | `--ensembl_gene_list` | `${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38.tsv` |
| Protein-coding gene coordinates | A tab-delimited file of protein-coding gene coordinates | `--ensembl_gene_list_protein_coding` | `${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38_protein_coding.tsv` |

Region input `--region_input_file`¶

You can run the AVT workflow on specific regions: whole chromosomes, genomic regions, specific genes or a custom set of variants.

The region input is a text file that specifies the region of the genome to be analysed. The workflow detects the type of region in your file and reports it (`chr_mode`, `gene_mode`, `region_mode` or `variant_mode`) in the log file at the beginning of the run.

The `${projectDir}/input/` directory includes an example of each region input file type:

chromosomes: A single-column, one per line text file of required chromosomes (chr1-chr22, chrX, chrY). For a whole genome analysis, list all chromosomes.
chromosomes.txt
```
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
```
genes: A single-column, one per line text file of gene HGNC symbols or Ensembl IDs, with no header.
genes.txt
```
POLR3K
SNRNP25
ENSG00000007384
NPRL3
HBA2
LUC7L
ARHGDIG
PKD1
BRCA1
```

coordinates: A BED format file of gene coordinates, with header and name column. Please always use the column names found in the header of the example file.

coordinates.bed

```
#CHROM  START   END NAME
chr16   2025356 2039026 region1
chr16   2039815 2047866 region2
chr16   2047967 2089491 region3
chr16   2088710 2135898 region4
chr16   2090195 2090284 region5
chr16   2091436 2095433 region6
chr16   2094830 2097026 region7
chr16   2106669 2106753 region8
chr16   2112335 2113342 region9
chr16   2119207 2120248 region10
```

variants: A tab-delimited, four-column file, with no header. It has the following fields:

- the variant ID as `CHROM:POS_REF_ALT`
- the name of the region that includes that variant (usually the gene name)
- the "most severe" annotation label (i.e. the one that will be used by SAIGE-GENE and REGENIE for that variant)
- a comma-separated list of all annotation labels (this information is used only by the Fisher's test branch of the workflow, which will then use that variant when running tests for each of those labels)

Annotation labels therefore need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

variants.tsv

```
chr5:68296298_G_GC  PIK3R1  LoF LoF,missense
chr5:68296298_G_GT  PIK3R1  LoF LoF,missense
chr5:68297418_C_CG  PIK3R1  LoF LoF,missense
chr5:68297596_C_T   PIK3R1  LoF LoF,missense
chr5:69094339_G_T   SLC30A5 LoF LoF,missense
chr5:69100862_CT_C  SLC30A5 LoF LoF,missense
chr5:69108437_G_A   SLC30A5 LoF LoF,missense
chr5:69115374_C_A   SLC30A5 LoF LoF,missense
chr5:69116085_C_CTT SLC30A5 LoF LoF,missense
chr5:69118506_C_T   SLC30A5 LoF LoF,missense
chr5:69121897_T_C   SLC30A5 LoF LoF,missense
chr5:69123427_T_A   SLC30A5 LoF LoF,missense
chr5:69128082_AT_A  SLC30A5 LoF LoF,missense
chr5:69128127_C_T   SLC30A5 LoF LoF,missense
chr5:69128133_G_C   SLC30A5 LoF LoF,missense
chr5:69168323_CT_C  CCNB1   LoF LoF,missense
chr5:69171453_G_GT  CCNB1   LoF LoF,missense
chr5:69175436_GAACT_G   CCNB1   LoF LoF,missense
chr5:69175519_TC_T  CCNB1   LoF LoF,missense
chr5:69177248_CTACAACA_C    CCNB1   LoF LoF,missense
chr5:69177311_AATGTAGTC_A   CCNB1   LoF LoF,missense
chr5:69177337_T_TA  CCNB1   LoF LoF,missense
```
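
As an illustration of the four fields of a variant-mode file, here is a minimal Python sketch that splits one row into its parts. The `VariantRow` type and the `parse_variant_row` function are hypothetical names used only for this example; they are not part of the AVT workflow.

```python
from typing import List, NamedTuple


class VariantRow(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    region: str
    most_severe_label: str
    all_labels: List[str]


def parse_variant_row(line: str) -> VariantRow:
    """Split one row of a variant-mode file into its four fields."""
    # The file is tab-delimited; split() also tolerates runs of spaces.
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, found {len(fields)}: {line!r}")
    variant_id, region, most_severe, label_list = fields
    # The variant ID has the form CHROM:POS_REF_ALT.
    chrom, rest = variant_id.split(":", 1)
    pos, ref, alt = rest.split("_")
    return VariantRow(chrom, int(pos), ref, alt, region,
                      most_severe, label_list.split(","))


row = parse_variant_row("chr5:69175436_GAACT_G\tCCNB1\tLoF\tLoF,missense")
print(row.chrom, row.pos, row.ref, row.alt, row.all_labels)
```

A check like this can catch malformed IDs or a wrong column count before a run is submitted.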

The default run uses a chromosome subset file including the two GRCh38 chromosomes chr21 and chr22.

chromosomes_subset.txt

```
chr21
chr22
```
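
The workflow's own detection logic is not described on this page. The following is only a sketch of how the four region file layouts could be told apart, for example to sanity-check a file before a run. The function name and the regular expressions are assumptions made for this example.

```python
import re

# Assumed patterns: chr1-chr22, chrX, chrY and CHROM:POS_REF_ALT variant IDs.
CHROMOSOME = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")
VARIANT_ID = re.compile(r"^chr[0-9XY]+:[0-9]+_[ACGTN]+_[ACGTN]+$")


def guess_region_mode(lines):
    """Return chr_mode, gene_mode, region_mode or variant_mode."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("the region file is empty")
    # BED-style coordinate files start with a '#CHROM START END NAME' header.
    if rows[0][0] == "#CHROM":
        return "region_mode"
    # Variant files have four columns and a variant ID in the first one.
    if all(len(row) == 4 and VARIANT_ID.match(row[0]) for row in rows):
        return "variant_mode"
    # Chromosome and gene files are both single-column.
    if all(len(row) == 1 for row in rows):
        if all(CHROMOSOME.match(row[0]) for row in rows):
            return "chr_mode"
        return "gene_mode"
    raise ValueError("could not recognise the region file layout")


print(guess_region_mode(["chr21", "chr22"]))
```

The mode reported in the workflow log remains the authoritative answer; this sketch only mirrors the file descriptions given in this section.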

Exclusion data file `--exclusion_data_file`¶

Either `false` or a valid path to a text file containing the genomic regions to be excluded. The regions must be of the same type as those in the region input file, e.g. gene symbols or IDs for "gene mode", or a variant list for "variant mode".

Input cohort file `--input_cohort_file`¶

Your cohort of cases and controls. You can build your own case/control cohort using your preferred method. We provide some tutorials on building cohorts using our Labkey API.

This is a tab-separated file with a header. It should include a unique sample identifier, one or more phenotype columns and covariates, such as age, sex and principal components.

You need to specify the columns that contain each of the value types in your submission script.

| Parameter | Notes | Default |
| --- | --- | --- |
| `--input_cohort_file` | Case/control cohort | `"/gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt"` |
| `--cohort_sample_column` | The column in your cohort file that specifies the platekey | `"Platekey"` |
| `--cohort_sex_column` | The column in your cohort file that specifies the sex | `"sex"` |
| `--control_coding` | How controls are coded in your input file | `0` |
| `--phenotype_array` | List of columns in your input file that contain phenotype data (in the example below this would be `disease_A`) | `"status"` |
| `--phenotype_type_array` | List of phenotype types that correspond to the list of phenotype columns; these can be `b` (binary) or `q` (quantitative) | `"b"` |
| `--covariates` | List of columns of covariates in your input file | `"age,sex,age.age,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20"` |
| `--categorical_covariates` | List of columns of discrete covariates in your input file | `"sex"` |

The workflow will run on the cohort specified in the input, as described below. It is your responsibility to ensure that the cohort only includes participants whose data you are allowed to process and analyse. Ensure that you are only using samples included in the latest data release (current main programme data release: 19 (31st October 2024)). Please ask via Service desk if unsure.

This example includes the minimal required data: a column for the sample names (`plate_key`), a column for the phenotype status of the participant (`disease_A`) and the sex. It also includes an optional but commonly used covariate column (`age`):

```
plate_key   disease_A   age sex
sample1     0           25  0
sample2     0           50  1
sample3     1           33  0
```
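
Because the column names in the cohort file must match the values you pass to the cohort parameters, it can help to check the header before submitting. The sketch below is one way to do this in Python; `check_cohort` is a hypothetical helper, not part of the workflow, and it only checks the columns you list explicitly.

```python
import csv
import io


def check_cohort(handle, sample_column, required_columns):
    """Check the header and sample uniqueness of a tab-separated cohort file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in [sample_column, *required_columns]
               if name not in header]
    if missing:
        raise ValueError(f"columns missing from cohort file: {missing}")
    samples = [row[sample_column] for row in reader]
    if len(samples) != len(set(samples)):
        raise ValueError("sample identifiers are not unique")
    return len(samples)


# The example cohort from this page, written with real tab separators.
example = (
    "plate_key\tdisease_A\tage\tsex\n"
    "sample1\t0\t25\t0\n"
    "sample2\t0\t50\t1\n"
    "sample3\t1\t33\t0\n"
)
print(check_cohort(io.StringIO(example), "plate_key", ["disease_A", "age", "sex"]))  # prints 3
```

For a real file, pass an open file handle instead of the `io.StringIO` object.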

Functional annotation filter masks file `--functional_annotation_filter_masks`¶

You can filter variants in your output by the consequences they have on genes. Use the functional annotation filter masks file to create a list of filters for each consequence type.

The whole functional annotation filtering section is skipped in "variant mode" - in that case, you provide the relevant information for variant filtering in the region input file.

This is a JSON file. It comprises functional annotation labels for variants, for example "LoF", "missense" and "synonymous", and within each of these, the filters you want all variants to pass ("and_mask") and optional filters ("or_mask").

The format for each functional annotation label section is as follows:

Functional annotation label name
- AND mask settings - any variant will have to pass ALL filters in this section to be included
  - Zero or more specific filters
- OR mask settings - any variant will have to pass ANY filter in this section to be included
  - Zero or more specific filters

Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

The functional annotation labels defined in this file will be used in the mask ranks file, the SAIGE saige_masks parameter, and the REGENIE masks file, which all need to be adjusted accordingly.

The rules that govern the functional filtering are:

- Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by `bcftools +split-vep` internally.
- The AND and OR filters are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block.
- If the annotation field `annotation` and the VEP severity field `vep_severity_to_include` are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow.
- `vep_severity_to_include` operates in an identical manner to `bcftools +split-vep -s`. You can specify an exact consequence, e.g. `stop_gained`, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. `missense+`, to include all variants that are at least as severe as missense. If it is left blank, then all consequences will be considered.
- `include_missing` can be set to "yes" or "no". If set to "no" then only variants that pass the filter will be included. If set to "yes" then variants that pass the filter and variants that are annotated with missing entries by VEP, i.e. '.', '-' or '', will be included. This can be useful, for example, when you want to filter on CADD_PHRED scores and also include INDELs, as many INDELs do not have a CADD_PHRED score.
- Comparators can be ">", ">=", "==", "<=", "<" for float values, or "==" for string values. The workflow will exit with an error if the wrong comparator type is detected.

Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. Having the same name for multiple filters means that some filters will get skipped, and lead to confusing output.
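
Duplicate filter names are easy to miss because many JSON parsers silently keep only the last value for a repeated key (Python's `json` module behaves this way by default). As a hedged illustration, the sketch below loads a masks file and raises an error if any name is repeated within the same JSON object. The function names are hypothetical and this check is not part of the workflow.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from key/value pairs, failing on repeated keys."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate names in one JSON object: {duplicates}")
    return dict(pairs)


def load_masks_strict(text):
    """Parse JSON text, rejecting duplicate names at every nesting level."""
    # object_pairs_hook is called for every JSON object, innermost first.
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


bad = '{"LoF": {"and_mask": {"filter1": {}, "filter1": {}}}}'
try:
    load_masks_strict(bad)
except ValueError as error:
    print(error)
```

Reusing `filter1` in both the AND block and the OR block of the same mask is accepted, because those are separate JSON objects.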

functional_annotation_filter_masks.json

```
{
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""},
            "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
}
```
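
To make the AND/OR rules concrete, here is a simplified Python sketch of how one mask could be evaluated against the annotations of a single variant. It is not the workflow's implementation (which uses `bcftools +split-vep`). The function names are hypothetical, and two simplifications are assumed: `vep_severity_to_include` is ignored, so any filter with an empty `annotation` is treated as skipped, and an OR block with no active filters is treated as passed.

```python
import operator

COMPARATORS = {
    ">": operator.gt, ">=": operator.ge, "==": operator.eq,
    "<=": operator.le, "<": operator.lt,
}
MISSING = {".", "-", ""}  # values treated as missing VEP entries


def passes_filter(flt, annotations):
    """Apply one filter to a dict of annotation name -> string value."""
    value = annotations.get(flt["annotation"], "")
    if value in MISSING:
        return flt["include_missing"] == "yes"
    try:
        return COMPARATORS[flt["comparator"]](float(value), float(flt["condition"]))
    except ValueError:
        # Not numeric: only equality is meaningful for string values.
        if flt["comparator"] != "==":
            raise ValueError(f"comparator {flt['comparator']!r} needs float values")
        return value == flt["condition"]


def passes_mask(mask, annotations):
    """All active AND filters must pass, plus at least one active OR filter."""
    and_filters = [f for f in mask.get("and_mask", {}).values() if f["annotation"]]
    or_filters = [f for f in mask.get("or_mask", {}).values() if f["annotation"]]
    and_ok = all(passes_filter(f, annotations) for f in and_filters)
    or_ok = not or_filters or any(passes_filter(f, annotations) for f in or_filters)
    return and_ok and or_ok


# The "missense" mask from the example file.
missense = {
    "and_mask": {
        "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES",
                    "include_missing": "no", "vep_severity_to_include": ""},
    },
    "or_mask": {
        "filter1": {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10",
                    "include_missing": "no", "vep_severity_to_include": "missense+"},
        "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC",
                    "include_missing": "no", "vep_severity_to_include": ""},
    },
}
print(passes_mask(missense, {"CANONICAL": "YES", "CADD_PHRED": "23.1", "LoF": "."}))
```

With `include_missing` set to "no", a canonical variant without a CADD_PHRED score and without a high-confidence LoF annotation fails the OR block in this sketch, which is the INDEL situation described in the rules above.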

Mask ranks file `--mask_rank`¶

The mask rank file defines the order of strictness of the annotation labels, so that a variant that passes more than one label is assigned to the correct one. The lower the number, the stricter the annotation label.

The file is a JSON file of ranks for functional annotation labels (for example "LoF", "missense" and "synonymous"). Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

The whole functional annotation filtering section is skipped in "variant mode" - in that case, the user provides the relevant information for variant filtering in the region input file.

Make sure that your mask rank file is compatible with the functional annotation filter masks file.

mask_rank.json

```
{
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
}
```
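
As a small illustration of how the ranks resolve a variant that passes several labels, the sketch below picks the label with the lowest rank number. `strictest_label` is a hypothetical helper written for this example, not a workflow function.

```python
def strictest_label(passed_labels, mask_rank):
    """Return the passed label with the lowest rank, or None if none passed."""
    unknown = [label for label in passed_labels if label not in mask_rank]
    if unknown:
        raise KeyError(f"labels without a rank: {unknown}")
    if not passed_labels:
        return None
    return min(passed_labels, key=lambda label: mask_rank[label])


mask_rank = {"LoF": 1, "missense": 2, "synonymous": 3}
print(strictest_label(["missense", "LoF"], mask_rank))  # prints LoF
```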

Regenie masks file `--regenie_masks`¶

A JSON file of REGENIE functional annotation "masks" (each composed of one functional annotation label or a combination of them). Please see the REGENIE docs for more detail. Also see our Known issues and limitations page.

parameter: `--regenie_masks`
default: `${projectDir}/input/regenie_masks.json`

The default is shown below as an example. Set this parameter to your own file, which must be compatible with your functional annotation filter masks file.

regenie_masks.json

```
{
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
}
```
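
Since the mask ranks file and the REGENIE masks file both refer to labels defined in the functional annotation filter masks file, a quick cross-check can catch typos. The sketch below assumes that "compatible" means every label used in those two files is defined in the filter masks file; `undefined_labels` is a hypothetical helper, not part of the workflow.

```python
def undefined_labels(filter_masks, mask_rank, regenie_masks):
    """List labels used in the rank and REGENIE files but never defined."""
    defined = set(filter_masks)
    used_in_regenie = {
        label.strip()
        for combination in regenie_masks.values()
        for label in combination.split(",")
    }
    return {
        "mask_rank": sorted(set(mask_rank) - defined),
        "regenie_masks": sorted(used_in_regenie - defined),
    }


# Only the top-level labels of the filter masks file matter for this check.
filter_masks = {"LoF": {}, "missense": {}, "synonymous": {}}
mask_rank = {"LoF": 1, "missense": 2, "synonymous": 3}
regenie_masks = {"strict_lof": "LoF", "mild_lof": "LoF,missense", "control": "synonymous"}
print(undefined_labels(filter_masks, mask_rank, regenie_masks))
```

In practice the three dictionaries would come from `json.load` on your own files; empty lists in the result mean no undefined labels were found.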

Genomic data file `--genomic_data`¶

A four-column, tab-delimited file, with no header. This file contains paths to the files for the input variant dataset, i.e. to the per-chromosome multi-sample PLINK pgen/psam/pvar-format files that contain genomic data. The default input variant dataset is aggV2.

head aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv

```
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pvar
chr11   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pvar
chr12   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pvar
chr13   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pvar
chr14   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pvar
```
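
If you build your own genomic data file, a short check of its layout can save a failed run. The sketch below reads the four columns into a dictionary keyed by chromosome and checks the file extensions. `read_genomic_data` is a hypothetical helper, and the paths in the usage example are placeholders, not real locations.

```python
def read_genomic_data(lines):
    """Map each chromosome to its (pgen, psam, pvar) paths."""
    paths = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        # The file is tab-delimited; split() also tolerates runs of spaces.
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 columns, found {len(fields)}")
        chrom, pgen, psam, pvar = fields
        for path, suffix in ((pgen, ".pgen"), (psam, ".psam"), (pvar, ".pvar")):
            if not path.endswith(suffix):
                raise ValueError(f"line {number}: {path} should end with {suffix}")
        paths[chrom] = (pgen, psam, pvar)
    return paths


# Placeholder paths for illustration only.
rows = ["chr21\t/data/chr21.pgen\t/data/chr21.psam\t/data/chr21.pvar"]
print(read_genomic_data(rows)["chr21"][0])  # prints /data/chr21.pgen
```

The check is about layout only; it does not confirm that the files exist or that they match the annotation VCF list.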

Annotation VCF list file `--vcf_files`¶

A two-column, tab-delimited file, with no header. This file contains paths to the files with functional annotations for the input variant dataset, i.e. paths to the annotated multi-sample VCF files corresponding to the PLINK files included in the genomic data file. The default input variant dataset is aggV2.

head aggV2_functional_annotations_list_by_chunks_VEP105.tsv

```
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_10064150_12359131_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_101007881_103656324_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_103656325_106340362_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_106340363_108834875_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_108834876_111348918_annotated.vcf.gz
```

Pre-computed PLINK files¶

A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples contained in the input variant dataset. The default input variant dataset is aggV2.

There are three parameters, one for each file of the PLINK bed/bim/fam set:

- `--precomputed_plink_files_for_grm_bed`
- `--precomputed_plink_files_for_grm_bim`
- `--precomputed_plink_files_for_grm_fam`

Consequence severity ranking file `--consequence_severity_ranking_file`¶

We do not recommend that you alter the contents of this file. See the Ensembl predicted data page: https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

VEP_severity_bcftools_translation_and_ranking.tsv

```
# This should normally not be edited by users.
# See https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
ensembl_annotation  bcftools_annotation ranking
transcript_ablation transcript_ablation 20
splice_acceptor_variant splice_acceptor 19
splice_donor_variant    splice_donor    19
stop_gained stop_gained 18
frameshift_variant  frameshift  18
stop_lost   stop_lost   18
start_lost  start_lost  18
transcript_amplification    transcript_amplification    15
inframe_insertion   inframe 14
inframe_deletion    inframe 14
missense_variant    missense    14
protein_altering_variant    protein_altering    14
splice_region_variant   splice_region   13
splice_donor_5th_base_variant   splice_region   13
splice_donor_region_variant splice_region   13
splice_polypyrimidine_tract_variant splice_region   13
incomplete_terminal_codon_variant   incomplete_terminal_codon   12
start_retained_variant  start_retained  11
stop_retained_variant   stop_retained   11
synonymous_variant  synonymous  11
coding_sequence_variant coding_sequence 10
mature_miRNA_variant    mature_miRNA    10
5_prime_UTR_variant 5_prime_utr 9
3_prime_UTR_variant 3_prime_utr 9
non_coding_transcript_exon_variant  non_coding_transcript_exon  8
intron_variant  intron  7
NMD_transcript_variant  NMD_transcript  7
non_coding_transcript_variant   non_coding_transcript   6
upstream_gene_variant   upstream    5
downstream_gene_variant downstream  5
TFBS_ablation   TFBS    4
TFBS_amplification  TFBS    4
TF_binding_site_variant TF_binding_site 4
regulatory_region_ablation  regulatory  3
regulatory_region_amplification regulatory  3
regulatory_region_variant   regulatory  3
feature_elongation  feature_elongation  2
feature_truncation  feature_truncation  2
intergenic_variant  intergenic  1
```
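
To show how a "consequence or worse" selector such as `missense+` relates to a ranking table like this one, here is a hedged Python sketch. It assumes that the selector names come from the `bcftools_annotation` column and that "at least as severe" means a ranking greater than or equal to that of the named consequence, ties included. How the workflow and `bcftools +split-vep -s` resolve this internally is not described on this page, and the function names are hypothetical.

```python
def load_ranking(lines):
    """Read bcftools annotation name -> ranking from the three-column table."""
    ranking = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if fields[0] == "ensembl_annotation":
            continue  # skip the header row
        _ensembl_name, bcftools_name, rank = fields
        ranking[bcftools_name] = int(rank)
    return ranking


def expand_severity(selector, ranking):
    """Expand 'name' or 'name+' into the set of matching consequences."""
    if selector.endswith("+"):
        threshold = ranking[selector[:-1]]
        return {name for name, rank in ranking.items() if rank >= threshold}
    return {selector}


# A short excerpt of the table above.
excerpt = [
    "# comment line",
    "ensembl_annotation\tbcftools_annotation\tranking",
    "stop_gained\tstop_gained\t18",
    "missense_variant\tmissense\t14",
    "synonymous_variant\tsynonymous\t11",
]
ranking = load_ranking(excerpt)
print(sorted(expand_severity("missense+", ranking)))  # prints ['missense', 'stop_gained']
```

Under this assumption, consequences that share a ranking value (for example `inframe` and `missense`, both 14 in the full table) would be selected together by `missense+`.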

Gene coordinates file `--ensembl_gene_list`¶

This file must be consistent with the Genomic Data and the Annotation Data.

A tab-delimited file of gene coordinates (Ensembl 105 GRCh38, in the default case), with header.

Protein-coding gene coordinates file `--ensembl_gene_list_protein_coding`¶

This file must be consistent with the Genomic Data and the Annotation Data.

A tab-delimited file of protein-coding gene coordinates (Ensembl 105 GRCh38 protein-coding, in the default case), with header.
