AVT input files

There are a number of different input files you can use with the AVT workflow.

File, what it is used for, parameter(s) and default:

  • Region input: a text file that specifies the region of the genome that needs to be analysed
    • parameter: --region_input_file
    • default: ${projectDir}/input/chromosomes_subset.txt (chr21 and chr22)
  • Exclusion data: a text file that specifies regions of the genome to be excluded from analysis
    • parameter: --exclusion_data_file
    • default: false
  • Input cohort: a list of cases and controls with covariates
    • parameter: --input_cohort_file
    • default: /gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt
  • Functional annotation filter masks: filter out variants in your output
    • parameter: --functional_annotation_filter_masks
    • default: ${projectDir}/input/functional_annotation_filter_masks.json
  • Mask ranks: defines the order of strictness for each annotation label, so that a variant is assigned to the correct label if it passes multiple ones
    • parameter: --mask_rank
    • default: ${projectDir}/input/mask_rank.json
  • Regenie masks: regenie functional annotation "masks"
    • parameter: --regenie_masks
    • default: ${projectDir}/input/regenie_masks.json
  • Genomic data: paths to the files for the input variant dataset
    • parameter: --genomic_data
    • default: ${projectDir}/input/aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv
  • Annotation VCF list: paths to the files with functional annotations for the input variant dataset
    • parameter: --vcf_files
    • default: ${projectDir}/input/aggV2_functional_annotations_list_by_chunks_VEP105.tsv
  • Pre-computed PLINK: a pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples
    • parameters: --precomputed_plink_files_for_grm_bed, --precomputed_plink_files_for_grm_bim, --precomputed_plink_files_for_grm_fam
    • defaults (bed, bim and fam respectively):
      /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bed
      /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bim
      /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.fam
  • Consequence severity ranking
    • parameter: --consequence_severity_ranking_file
    • default: ${projectDir}/resources/VEP_severity_bcftools_translation_and_ranking.tsv
  • Gene coordinates: a tab-delimited file of gene coordinates
    • parameter: --ensembl_gene_list
    • default: ${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38.tsv
  • Protein-coding gene coordinates: a tab-delimited file of protein-coding gene coordinates
    • parameter: --ensembl_gene_list_protein_coding
    • default: ${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38_protein_coding.tsv

Region input --region_input_file

You can run the AVT workflow on specific regions: whole chromosomes, genomic regions, specific genes or a custom set of variants.

The region input is a text file that specifies the region of the genome that needs to be analysed. The workflow detects the type of regions in your file and reports the detected mode (chr_mode, gene_mode, region_mode or variant_mode) in the log file at the beginning of the run.

The ${projectDir}/input/ directory includes an example of each region input file type:

  • chromosomes: A single-column, one per line text file of required chromosomes (chr1-chr22, chrX, chrY). For a whole genome analysis, list all chromosomes.

    chromosomes.txt
    chr1
    chr2
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr20
    chr21
    chr22
    chrX
    chrY
    
  • genes: A single-column, one per line text file of gene HGNC symbols or Ensembl IDs, with no header.

    genes.txt
    POLR3K
    SNRNP25
    ENSG00000007384
    NPRL3
    HBA2
    LUC7L
    ARHGDIG
    PKD1
    BRCA1
    
  • coordinates: A BED format file of gene coordinates, with header and name column. Please always use the column names found in the header of the example file.

    coordinates.bed
    #CHROM  START   END NAME
    chr16   2025356 2039026 region1
    chr16   2039815 2047866 region2
    chr16   2047967 2089491 region3
    chr16   2088710 2135898 region4
    chr16   2090195 2090284 region5
    chr16   2091436 2095433 region6
    chr16   2094830 2097026 region7
    chr16   2106669 2106753 region8
    chr16   2112335 2113342 region9
    chr16   2119207 2120248 region10
    
  • variants: A tab-delimited, four-column file, with no header. It has the following fields:

    1. the variant ID as CHROM:POS_REF_ALT
    2. the name of the region that includes that variant (usually the gene name)
    3. the "most severe" annotation label (i.e. the one that will be used by SAIGE-GENE and REGENIE for that variant)
    4. a comma-separated list of all annotation labels. This information is used only by the Fisher's test branch of the workflow, which uses that variant when running tests for each of those labels. Annotation labels therefore need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

    variants.tsv
    chr5:68296298_G_GC  PIK3R1  LoF LoF,missense
    chr5:68296298_G_GT  PIK3R1  LoF LoF,missense
    chr5:68297418_C_CG  PIK3R1  LoF LoF,missense
    chr5:68297596_C_T   PIK3R1  LoF LoF,missense
    chr5:69094339_G_T   SLC30A5 LoF LoF,missense
    chr5:69100862_CT_C  SLC30A5 LoF LoF,missense
    chr5:69108437_G_A   SLC30A5 LoF LoF,missense
    chr5:69115374_C_A   SLC30A5 LoF LoF,missense
    chr5:69116085_C_CTT SLC30A5 LoF LoF,missense
    chr5:69118506_C_T   SLC30A5 LoF LoF,missense
    chr5:69121897_T_C   SLC30A5 LoF LoF,missense
    chr5:69123427_T_A   SLC30A5 LoF LoF,missense
    chr5:69128082_AT_A  SLC30A5 LoF LoF,missense
    chr5:69128127_C_T   SLC30A5 LoF LoF,missense
    chr5:69128133_G_C   SLC30A5 LoF LoF,missense
    chr5:69168323_CT_C  CCNB1   LoF LoF,missense
    chr5:69171453_G_GT  CCNB1   LoF LoF,missense
    chr5:69175436_GAACT_G   CCNB1   LoF LoF,missense
    chr5:69175519_TC_T  CCNB1   LoF LoF,missense
    chr5:69177248_CTACAACA_C    CCNB1   LoF LoF,missense
    chr5:69177311_AATGTAGTC_A   CCNB1   LoF LoF,missense
    chr5:69177337_T_TA  CCNB1   LoF LoF,missense
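
To make the four-column layout above concrete, here is a minimal Python sketch that parses one line of a variant-mode file and splits the variant ID into its parts. The function name parse_variant_line and the dictionary keys are hypothetical and chosen for illustration; this is not the workflow's own parser.

```python
def parse_variant_line(line):
    """Parse one tab-delimited line of a variant-mode region file (illustrative only)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-delimited fields, found {len(fields)}")
    variant_id, region, most_severe_label, all_labels = fields

    # The variant ID has the form CHROM:POS_REF_ALT, e.g. chr5:68296298_G_GC.
    chrom, rest = variant_id.split(":", 1)
    pos, ref, alt = rest.split("_")

    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "region": region,
        "most_severe_label": most_severe_label,
        "all_labels": all_labels.split(","),
    }


record = parse_variant_line("chr5:68296298_G_GC\tPIK3R1\tLoF\tLoF,missense")
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["all_labels"])
```

Running a check like this over your own file before submitting can catch lines with spaces instead of tabs or a malformed variant ID.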
    

The default run uses a chromosome subset file including the two GRCh38 chromosomes chr21 and chr22.

chromosomes_subset.txt
chr21
chr22
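
The workflow's own detection logic is not described on this page. As a rough guide to how the four file layouts differ, the Python sketch below guesses the mode from the file content. The function name guess_region_mode and the regular expressions are assumptions made for illustration, so always confirm the mode reported in the workflow log.

```python
import re

# Assumed patterns, based only on the example files shown above.
CHROM_RE = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")
VARIANT_RE = re.compile(r"^chr[0-9XY]+:\d+_[ACGTN]+_[ACGTN]+$")


def guess_region_mode(lines):
    """Guess which mode a region input file corresponds to (illustrative only)."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("region file is empty")
    first = rows[0]
    if first[0].upper() == "#CHROM":
        return "region_mode"   # BED-style file with a header
    if len(first) == 4 and VARIANT_RE.match(first[0]):
        return "variant_mode"  # four columns starting with CHROM:POS_REF_ALT
    if all(len(row) == 1 and CHROM_RE.match(row[0]) for row in rows):
        return "chr_mode"      # one chromosome name per line
    if all(len(row) == 1 for row in rows):
        return "gene_mode"     # one gene symbol or Ensembl ID per line
    raise ValueError("could not recognise the region file format")


print(guess_region_mode(["chr21", "chr22"]))
print(guess_region_mode(["POLR3K", "ENSG00000007384"]))
```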

Exclusion data file --exclusion_data_file

Set this to either false (the default) or a valid path to a text file containing genomic regions to be excluded. The regions must be of the same type as the region input file, e.g. gene symbols or IDs for "gene mode", a variant list for "variant mode".

Input cohort file --input_cohort_file

Your cohort of cases and controls. You can build your own case/control cohort using your preferred method. We provide some tutorials on building cohorts using our Labkey API.

This is a tab-separated file with a header. It should include a unique sample identifier, one or more phenotype columns and covariates, such as age, sex and principal components.

You need to specify the columns that contain each of the value types in your submission script.

Parameter, notes and default:

  • --input_cohort_file: case/control cohort
    • default: "/gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt"
  • --cohort_sample_column: the column in your cohort file that specifies the platekey
    • default: "Platekey"
  • --cohort_sex_column: the column in your cohort file that specifies the sex
    • default: "sex"
  • --control_coding: how controls are coded in your input file
    • default: 0
  • --phenotype_array: list of columns in your input file that contain phenotype data
    • default: "status"
  • --phenotype_type_array: list of phenotype types that correspond to the list of phenotype columns
    • default: "b"
  • --covariates: list of columns of covariates in your input file
    • default: "age,sex,age.age,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20"
  • --categorical_covariates: list of columns of discrete covariates in your input file
    • default: "sex"

The workflow will run on the cohort specified in the input, as described below. It is your responsibility to ensure that the cohort only includes participants whose data you are allowed to process and analyse. Ensure that you are only using samples included in the latest data release (current main programme data release: 19, 31st October 2024). If you are unsure, please ask via the Service desk.

This example includes the minimal required data (a sample identifier column, plate_key, and a phenotype column, disease_A) plus two commonly used covariate columns (age and sex):

plate_key   disease_A   age sex
sample1     0           25  0
sample2     0           50  1
sample3     1           33  0
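
Note that the column names in this example (plate_key, disease_A) differ from the parameter defaults ("Platekey", "status"), so a file like this needs --cohort_sample_column and --phenotype_array set to match. The Python sketch below shows one way you might check a cohort file against the column names you intend to pass before submitting. The function name check_cohort is hypothetical and the check is not part of the workflow.

```python
import csv
import io
from collections import Counter


def check_cohort(handle, sample_column, phenotype_columns, covariates):
    """Return (missing column names, duplicated sample IDs) for a tab-separated cohort file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    required = [sample_column, *phenotype_columns, *covariates]
    missing = [name for name in required if name not in header]

    # Sample identifiers should be unique, so count how often each one occurs.
    counts = Counter()
    if sample_column in header:
        counts.update(row[sample_column] for row in reader)
    duplicates = sorted(sample for sample, n in counts.items() if n > 1)
    return missing, duplicates


# The example cohort from above, written with real tab characters.
example = (
    "plate_key\tdisease_A\tage\tsex\n"
    "sample1\t0\t25\t0\n"
    "sample2\t0\t50\t1\n"
    "sample3\t1\t33\t0\n"
)

print(check_cohort(io.StringIO(example), "plate_key", ["disease_A"], ["age", "sex"]))
# Using the parameter defaults against this file reports the missing columns.
print(check_cohort(io.StringIO(example), "Platekey", ["status"], ["age", "sex"]))
```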

Functional annotation filter masks file --functional_annotation_filter_masks

You can filter variants in your output by the consequences they have on genes. Use the functional annotation filter masks file to create a list of filters for each consequence type.

The whole functional annotation filtering section is skipped in "variant mode". In that case, you provide the relevant variant filtering information in the region input file.

This is a JSON file. It comprises functional annotation labels for variants, for example "LoF", "missense" and "synonymous", and within each of these, the filters you want all variants to pass ("and_mask") and optional filters ("or_mask").

The format for each functional annotation label section is as follows:

  • Functional annotation label name
    • AND mask settings - any variant will have to pass ALL filters in this section to be included
      • Zero or more specific filters
    • OR mask settings - any variant will have to pass ANY filter in this section to be included
      • Zero or more specific filters

Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

The functional annotation labels defined in this file will be used in the mask ranks file, the SAIGE saige_masks parameter, and the REGENIE masks file, which all need to be adjusted accordingly.

The rules that govern the functional filtering are:

  • Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by bcftools +split-vep internally.
  • The AND and OR filters are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block.
  • If the annotation field annotation and the VEP severity field vep_severity_to_include are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow.
  • vep_severity_to_include operates in an identical manner to bcftools +split-vep -s. You can specify an exact consequence, e.g. stop_gained, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. missense+, to include all variants that are at least as severe as missense. If it is left blank, then all consequences will be considered.
  • include_missing can be set to "yes" or "no". If set to "no" then only variants that pass the filter will be included. If set to "yes" then variants that pass the filter and variants that are annotated with missing entries by VEP i.e. '.', '-' or '' will be included. This can be useful for example when wanting to filter on CADD_PHRED scores and also include INDELs, as many INDELs do not have a CADD_PHRED score.
  • Comparators can be ">", ">=", "==", "<=", "<" for float values, or "==" for string values. The workflow will exit with an error if the wrong comparator type is detected.
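
The rules above can be sketched in Python as follows. This is only an illustration of the AND/OR logic, the include_missing behaviour and the comparator types. The workflow itself filters with bcftools +split-vep, the function names passes_filter and passes_mask are hypothetical, and vep_severity_to_include is deliberately not modelled here (a filter with an empty annotation is simply treated as skipped).

```python
import operator

COMPARATORS = {
    ">": operator.gt, ">=": operator.ge, "==": operator.eq,
    "<=": operator.le, "<": operator.lt,
}
MISSING_VALUES = {".", "-", ""}


def passes_filter(annotations, flt):
    """Return True/False for one filter, or None if the filter is empty and skipped."""
    if flt["annotation"] == "":
        return None
    value = annotations.get(flt["annotation"], "")
    if value in MISSING_VALUES:
        return flt["include_missing"] == "yes"
    compare = COMPARATORS[flt["comparator"]]
    try:
        # Float comparison when both the value and the condition are numeric.
        return compare(float(value), float(flt["condition"]))
    except ValueError:
        # String values only support "==".
        if flt["comparator"] != "==":
            raise ValueError("string values only support the '==' comparator")
        return value == flt["condition"]


def passes_mask(annotations, mask):
    """A variant must pass ALL active AND filters and ANY active OR filter."""
    and_results = [passes_filter(annotations, f) for f in mask["and_mask"].values()]
    or_results = [passes_filter(annotations, f) for f in mask["or_mask"].values()]
    and_results = [r for r in and_results if r is not None]
    or_results = [r for r in or_results if r is not None]
    return all(and_results) and (any(or_results) or not or_results)


missense_mask = {
    "and_mask": {
        "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES",
                    "include_missing": "no", "vep_severity_to_include": ""},
    },
    "or_mask": {
        "filter1": {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10",
                    "include_missing": "no", "vep_severity_to_include": "missense+"},
        "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC",
                    "include_missing": "no", "vep_severity_to_include": ""},
    },
}

variant = {"CANONICAL": "YES", "CADD_PHRED": "23.1", "LoF": "."}
print(passes_mask(variant, missense_mask))
```

In this example the variant passes the AND block (canonical transcript) and the first OR filter (CADD_PHRED of 23.1 is at least 10), so it is kept even though its LoF annotation is missing.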

Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. If multiple filters share the same name, some of them will be skipped, which leads to confusing output.

functional_annotation_filter_masks.json
{
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""},
            "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
}
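
Duplicate filter names are easy to miss because many JSON parsers accept them silently. Python's json module, for example, keeps only the last value for a repeated key. How the workflow itself parses the file is not stated here, so the sketch below is just one way you might check your own file before a run; the function names reject_duplicate_keys and load_masks are hypothetical.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate keys in JSON object: {duplicates}")
    return dict(pairs)


def load_masks(text):
    """Parse a filter masks JSON string, failing on repeated names in any block."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


bad = '{"LoF": {"and_mask": {"filter1": {}, "filter1": {}}, "or_mask": {}}}'
try:
    load_masks(bad)
except ValueError as err:
    print("problem found:", err)
```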

Mask ranks file --mask_rank

The mask rank file defines the order of strictness for each annotation label, so that a variant is assigned to the correct label if it passes multiple ones. The lower the number, the stricter the annotation label.

The file is a JSON file of ranks for functional annotation labels (for example "LoF", "missense" and "synonymous"). Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

The whole functional annotation filtering section is skipped in "variant mode". In that case, you provide the relevant variant filtering information in the region input file.

Make sure that your mask rank file is compatible with the functional annotation filter masks file.

mask_rank.json
{
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
}
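
As an illustration of how the ranks resolve a variant that passes several labels, the Python sketch below picks the label with the lowest rank. The function name strictest_label is hypothetical, and this is a reading of the description above rather than the workflow's own code.

```python
def strictest_label(passed_labels, mask_rank):
    """Return the passed label with the lowest rank (the strictest), or None if none passed."""
    unknown = [label for label in passed_labels if label not in mask_rank]
    if unknown:
        raise KeyError(f"labels missing from the mask rank file: {unknown}")
    if not passed_labels:
        return None
    return min(passed_labels, key=lambda label: mask_rank[label])


mask_rank = {"LoF": 1, "missense": 2, "synonymous": 3}
print(strictest_label(["missense", "LoF"], mask_rank))
```

With the default ranks, a variant that passes both the "LoF" and "missense" filters is assigned to "LoF".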

Regenie masks file --regenie_masks

A JSON file of regenie functional annotation "masks" (composed of one or a combination of functional annotation labels) - please see the REGENIE docs for more detail. Also see our Known issues and limitations page.

  • parameter: --regenie_masks
  • default: ${projectDir}/input/regenie_masks.json

See the default below as an example - set this to your own file, compatible with your functional annotation filter masks file.

regenie_masks.json
{
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
}
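
Because each mask is built from functional annotation labels, every label it mentions needs to exist in your other mask files. The Python sketch below shows one possible consistency check; the function name undefined_labels is hypothetical and the check is not performed by this code on your behalf during a run.

```python
def undefined_labels(regenie_masks, defined_labels):
    """Map each REGENIE mask name to the labels it uses that are not defined elsewhere."""
    problems = {}
    for mask_name, labels in regenie_masks.items():
        missing = [label for label in labels.split(",") if label not in defined_labels]
        if missing:
            problems[mask_name] = missing
    return problems


regenie_masks = {"strict_lof": "LoF", "mild_lof": "LoF,missense", "control": "synonymous"}
# Labels taken from the default mask rank file shown earlier on this page.
defined_labels = {"LoF", "missense", "synonymous"}

print(undefined_labels(regenie_masks, defined_labels))
```

An empty result means every label used by the masks is defined.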

Genomic data file --genomic_data

A four-column, tab-delimited file, with no header. This file contains paths to the files for the input variant dataset, i.e. to the per-chromosome, multi-sample PLINK pgen/psam/pvar-format files that contain genomic data. The default input variant dataset is aggV2.

head aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pvar
chr11   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pvar
chr12   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pvar
chr13   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pvar
chr14   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pvar
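
If you supply your own genomic data file, it can help to confirm that every chromosome in your region input has an entry. The Python sketch below reads the four-column layout described above and reports chromosomes without data. The function names and the /path/to/ placeholder paths are hypothetical.

```python
import csv
import io


def read_genomic_data(handle):
    """Read the four-column file into {chromosome: {"pgen": ..., "psam": ..., "pvar": ...}}."""
    paths = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if len(row) != 4:
            raise ValueError(f"expected 4 tab-delimited columns, found {len(row)}: {row}")
        chrom, pgen, psam, pvar = row
        paths[chrom] = {"pgen": pgen, "psam": psam, "pvar": pvar}
    return paths


def chromosomes_without_data(requested, paths):
    """List requested chromosomes that have no entry in the genomic data file."""
    return [chrom for chrom in requested if chrom not in paths]


# Placeholder paths for illustration only.
example = "chr21\t/path/to/chr21.pgen\t/path/to/chr21.psam\t/path/to/chr21.pvar\n"
paths = read_genomic_data(io.StringIO(example))
print(chromosomes_without_data(["chr21", "chr22"], paths))
```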

Annotation VCF list file --vcf_files

A two-column, tab-delimited file with no header. It contains paths to the files with functional annotations for the input variant dataset, i.e. paths to annotated multi-sample VCF files corresponding to the PLINK files included in the genomic data file. The default input variant dataset is aggV2.

head aggV2_functional_annotations_list_by_chunks_VEP105.tsv
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_10064150_12359131_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_101007881_103656324_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_103656325_106340362_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_106340363_108834875_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_108834876_111348918_annotated.vcf.gz
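
The default file lists several annotation chunks per chromosome. Assuming, as the file names above suggest, that each name ends in _START_END_annotated.vcf.gz, the Python sketch below finds the chunks that overlap a region of interest. The naming convention, the inclusive overlap test, the function name chunks_overlapping and the /path/to/ directory are all assumptions for illustration.

```python
import re

# Assumed naming convention: ..._<start>_<end>_annotated.vcf.gz
CHUNK_RE = re.compile(r"_(\d+)_(\d+)_annotated\.vcf\.gz$")


def chunks_overlapping(rows, chrom, start, end):
    """Return paths of chunks on `chrom` whose coordinates overlap start..end."""
    hits = []
    for row_chrom, path in rows:
        match = CHUNK_RE.search(path)
        if row_chrom != chrom or match is None:
            continue
        chunk_start, chunk_end = int(match.group(1)), int(match.group(2))
        if chunk_start <= end and chunk_end >= start:
            hits.append(path)
    return hits


rows = [
    ("chr10", "/path/to/gel_mainProgramme_aggV2_chr10_10064150_12359131_annotated.vcf.gz"),
    ("chr10", "/path/to/gel_mainProgramme_aggV2_chr10_101007881_103656324_annotated.vcf.gz"),
]
print(chunks_overlapping(rows, "chr10", 12000000, 12500000))
```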

Pre-computed PLINK files

A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples contained in the input variant dataset. The default input variant dataset is aggV2.

There are three parameters:

  • --precomputed_plink_files_for_grm_bed
  • --precomputed_plink_files_for_grm_bim
  • --precomputed_plink_files_for_grm_fam

Consequence severity ranking file --consequence_severity_ranking_file

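We do not recommend that you alter the contents of this file. For background on the consequence terms, see the Ensembl predicted data page (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html).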

VEP_severity_bcftools_translation_and_ranking.tsv
# This should normally not be edited by users.
# See https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
ensembl_annotation  bcftools_annotation ranking
transcript_ablation transcript_ablation 20
splice_acceptor_variant splice_acceptor 19
splice_donor_variant    splice_donor    19
stop_gained stop_gained 18
frameshift_variant  frameshift  18
stop_lost   stop_lost   18
start_lost  start_lost  18
transcript_amplification    transcript_amplification    15
inframe_insertion   inframe 14
inframe_deletion    inframe 14
missense_variant    missense    14
protein_altering_variant    protein_altering    14
splice_region_variant   splice_region   13
splice_donor_5th_base_variant   splice_region   13
splice_donor_region_variant splice_region   13
splice_polypyrimidine_tract_variant splice_region   13
incomplete_terminal_codon_variant   incomplete_terminal_codon   12
start_retained_variant  start_retained  11
stop_retained_variant   stop_retained   11
synonymous_variant  synonymous  11
coding_sequence_variant coding_sequence 10
mature_miRNA_variant    mature_miRNA    10
5_prime_UTR_variant 5_prime_utr 9
3_prime_UTR_variant 3_prime_utr 9
non_coding_transcript_exon_variant  non_coding_transcript_exon  8
intron_variant  intron  7
NMD_transcript_variant  NMD_transcript  7
non_coding_transcript_variant   non_coding_transcript   6
upstream_gene_variant   upstream    5
downstream_gene_variant downstream  5
TFBS_ablation   TFBS    4
TFBS_amplification  TFBS    4
TF_binding_site_variant TF_binding_site 4
regulatory_region_ablation  regulatory  3
regulatory_region_amplification regulatory  3
regulatory_region_variant   regulatory  3
feature_elongation  feature_elongation  2
feature_truncation  feature_truncation  2
intergenic_variant  intergenic  1
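
To show how the ranking relates to a vep_severity_to_include value such as missense+, the Python sketch below reads the three-column layout and lists the bcftools consequence names whose rank is at least that of the named consequence. It interprets "at least as severe" as a rank greater than or equal to the threshold, which is an assumption; bcftools +split-vep remains the authority on how -s is applied. The function names are hypothetical.

```python
def read_ranking(lines):
    """Build {bcftools_annotation: rank} from the ranking file lines."""
    ranking = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        ensembl_name, bcftools_name, rank = line.split()
        if not rank.isdigit():
            continue  # skip the column header line
        ranking[bcftools_name] = int(rank)
    return ranking


def consequences_included(ranking, severity):
    """Expand 'missense+' style values; an exact name selects only itself."""
    if severity.endswith("+"):
        cutoff = ranking[severity[:-1]]
        return {name for name, rank in ranking.items() if rank >= cutoff}
    return {severity}


# A few rows copied from the file above.
lines = [
    "# This should normally not be edited by users.",
    "ensembl_annotation\tbcftools_annotation\tranking",
    "stop_gained\tstop_gained\t18",
    "frameshift_variant\tframeshift\t18",
    "inframe_insertion\tinframe\t14",
    "missense_variant\tmissense\t14",
    "splice_region_variant\tsplice_region\t13",
    "synonymous_variant\tsynonymous\t11",
]
ranking = read_ranking(lines)
print(sorted(consequences_included(ranking, "missense+")))
```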

Gene coordinates file --ensembl_gene_list

This file must be consistent with the Genomic Data and the Annotation Data.

A tab-delimited file of gene coordinates (Ensembl 105 GRCh38, in the default case), with header.

Protein-coding gene coordinates file --ensembl_gene_list_protein_coding

This file must be consistent with the Genomic Data and the Annotation Data.

A tab-delimited file of protein-coding gene coordinates (Ensembl 105 GRCh38 protein-coding, in the default case), with header.