Input files

User region input file

A user-defined text file of variable format which specifies the region of the genome that needs to be processed.

  • parameter: --region_input_file
  • default: ${projectDir}/input/chromosomes_subset.txt
  • header: (no header)

This input file determines the modality for the AVT workflow run, which is one of four:

  1. "chromosome mode": useful to run whole chromosomes or whole genome analysis
  2. "gene mode": useful to run the analysis on a specific set of genes, which must be part of the gene coordinates file
  3. "region mode": useful to run the analysis on custom regions rather than Ensembl genes
  4. "variant mode": useful to skip the variant selection part of the workflow and run the analysis on a custom set of variants

The region input file type is auto-detected and the workflow mode is set appropriately to chr_mode, gene_mode, region_mode, or variant_mode - this is displayed in the log file at the beginning of the run.
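
The auto-detection logic itself is not documented on this page. As a rough illustration of how the four layouts can be told apart, here is a minimal Python sketch; the function name `guess_mode` and the exact rules are assumptions made for this example, not the workflow's implementation.

```python
import re

# Chromosome names accepted in "chromosome mode" (chr1-chr22, chrX, chrY).
CHROMOSOMES = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY"}
# Variant IDs follow CHROM:POS_REF_ALT, e.g. chr5:68296298_G_GC.
VARIANT_ID = re.compile(r"^chr[0-9XY]+:\d+_[ACGTN]+_[ACGTN]+$")


def guess_mode(lines):
    """Guess the workflow mode from the lines of a region input file (heuristic)."""
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("region input file is empty")
    if rows[0][0].upper() == "#CHROM":
        return "region_mode"  # BED-style file with a header line
    if all(len(r) == 1 and r[0] in CHROMOSOMES for r in rows):
        return "chr_mode"
    if all(len(r) == 4 and VARIANT_ID.match(r[0]) for r in rows):
        return "variant_mode"
    if all(len(r) == 1 for r in rows):
        return "gene_mode"  # single column of symbols or Ensembl IDs
    raise ValueError("unrecognised region input file layout")
```

Running a check like this before launching a run can catch a malformed file early; the mode actually chosen by the workflow is the one reported in its log.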

The ${projectDir}/input/ directory includes examples of all region file input types - respectively:

  • chromosomes: A single-column, one per line text file of required chromosomes (chr1-chr22, chrX, chrY) - for a whole genome analysis, simply list all chromosomes.
  • genes: A single-column, one per line text file of gene HGNC symbols or Ensembl IDs, with no header.
  • coordinates: A BED format file of gene coordinates, with header and name column. Please always use the column names found in the header of the example file.
  • variants: A tab-delimited, 4-column file, with no header. Annotation labels need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters. The fields are:
    • (field 1) variant ID, formatted as CHROM:POS_REF_ALT
    • (field 2) name of the region that includes the variant (usually the gene name)
    • (field 3) the "most severe" annotation label, i.e. the one that SAIGE-GENE and REGENIE will use for that variant
    • (field 4) comma-separated list of all annotation labels for the variant (as of AVT v4.1.0, only the Fisher's test branch of the workflow uses this field, testing the variant under each listed label)
chromosomes.txt
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
chrY
coordinates.bed
#CHROM  START   END NAME
chr16   2025356 2039026 region1
chr16   2039815 2047866 region2
chr16   2047967 2089491 region3
chr16   2088710 2135898 region4
chr16   2090195 2090284 region5
chr16   2091436 2095433 region6
chr16   2094830 2097026 region7
chr16   2106669 2106753 region8
chr16   2112335 2113342 region9
chr16   2119207 2120248 region10
genes.txt
POLR3K
SNRNP25
ENSG00000007384
NPRL3
HBA2
LUC7L
ARHGDIG
PKD1
BRCA1
variants.tsv
chr5:68296298_G_GC  PIK3R1  LoF LoF,missense
chr5:68296298_G_GT  PIK3R1  LoF LoF,missense
chr5:68297418_C_CG  PIK3R1  LoF LoF,missense
chr5:68297596_C_T   PIK3R1  LoF LoF,missense
chr5:69094339_G_T   SLC30A5 LoF LoF,missense
chr5:69100862_CT_C  SLC30A5 LoF LoF,missense
chr5:69108437_G_A   SLC30A5 LoF LoF,missense
chr5:69115374_C_A   SLC30A5 LoF LoF,missense
chr5:69116085_C_CTT SLC30A5 LoF LoF,missense
chr5:69118506_C_T   SLC30A5 LoF LoF,missense
chr5:69121897_T_C   SLC30A5 LoF LoF,missense
chr5:69123427_T_A   SLC30A5 LoF LoF,missense
chr5:69128082_AT_A  SLC30A5 LoF LoF,missense
chr5:69128127_C_T   SLC30A5 LoF LoF,missense
chr5:69128133_G_C   SLC30A5 LoF LoF,missense
chr5:69168323_CT_C  CCNB1   LoF LoF,missense
chr5:69171453_G_GT  CCNB1   LoF LoF,missense
chr5:69175436_GAACT_G   CCNB1   LoF LoF,missense
chr5:69175519_TC_T  CCNB1   LoF LoF,missense
chr5:69177248_CTACAACA_C    CCNB1   LoF LoF,missense
chr5:69177311_AATGTAGTC_A   CCNB1   LoF LoF,missense
chr5:69177337_T_TA  CCNB1   LoF LoF,missense
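
To make the variants file layout concrete, the following Python sketch splits one tab-delimited row into its parts. The `VariantRow` type and `parse_variant_row` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Tuple


class VariantRow(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    region: str               # usually the gene name
    most_severe: str          # annotation label in the third field
    labels: Tuple[str, ...]   # all annotation labels from the fourth field


def parse_variant_row(line):
    """Parse one row of a variants file: ID, region, most severe label, all labels."""
    variant_id, region, most_severe, labels = line.rstrip("\n").split("\t")
    chrom, rest = variant_id.split(":", 1)   # CHROM : POS_REF_ALT
    pos, ref, alt = rest.split("_")
    return VariantRow(chrom, int(pos), ref, alt, region, most_severe,
                      tuple(labels.split(",")))
```

A row with the wrong number of fields raises a `ValueError` at the unpacking step, which is a cheap way to spot rows that were saved with spaces instead of tabs.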

The default run uses a chromosome subset file including the two GRCh38 chromosomes chr21 and chr22.

chromosomes_subset.txt
chr21
chr22

Exclusion data file

Either false or a user-defined valid path to a text file containing genomic regions to be excluded. The regions must be of the same type as the user region input file, e.g. gene symbols or IDs for "gene mode", a variant list for "variant mode".

  • parameter: --exclusion_data_file
  • default: false

Input cohort file

A user-defined, whitespace-delimited text file with a header (the column separator is auto-detected). It includes a unique sample identifier column (parameter cohort_sample_column), one or more phenotype columns (parameter phenotype_array, with the trait type of each, "b" = binary or "q" = quantitative, given in parameter phenotype_type_array), and covariate columns for the association tests (those used in the analysis are listed in parameter covariates). The phenotype control value is specified by parameter control_coding.

  • parameter: --input_cohort_file
  • default: /gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt
  • header (in the default case): Platekey age age.age age.sex sex status PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20

You can build your own case/control cohort using your preferred method. We provide some tutorials on building cohorts using our Labkey API.

IMPORTANT NOTE:

The workflow will run on the cohort specified in the input, as described below. It is your responsibility to ensure that the cohort only includes participants whose data you are allowed to process and analyse. Ensure that you are only using samples included in the latest data release (current main programme data release: 18 (21st December 2023)). Please ask via Service desk if unsure.

This example includes the minimal required data (plate_key and disease_A) plus two commonly used covariate columns (age and sex):

plate_key   disease_A   age sex
sample1     0           25  0
sample2     0           50  1
sample3     1           33  0
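
Before a run, it can help to confirm that the cohort file really contains the columns named in the parameters. The Python sketch below is one possible pre-flight check; `check_cohort` and the specific checks are assumptions for this example, not part of the workflow.

```python
def check_cohort(text, sample_column, phenotypes, phenotype_types, covariates=()):
    """Return a list of problems found in a whitespace-delimited cohort table."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, records = rows[0], rows[1:]
    problems = [f"missing column: {name}"
                for name in (sample_column, *phenotypes, *covariates)
                if name not in header]
    if any(len(r) != len(header) for r in records):
        problems.append("rows with a different number of fields than the header")
    if problems:
        return problems
    # Sample identifiers must be unique.
    ids = [r[header.index(sample_column)] for r in records]
    if len(ids) != len(set(ids)):
        problems.append("duplicate sample identifiers")
    # A binary ("b") trait should not take more than two distinct values.
    for name, kind in zip(phenotypes, phenotype_types):
        values = {r[header.index(name)] for r in records}
        if kind == "b" and len(values) > 2:
            problems.append(f"{name}: more than two values for a binary trait")
    return problems
```

An empty list means none of these checks failed; it does not guarantee that the workflow will accept the file.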

Functional annotation filter masks file

A user-defined JSON file of functional annotation labels for variants (in the default file these are: "LoF", "missense", "synonymous"). Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

Please note that the whole functional annotation filtering section is skipped in "variant mode" - in that case, the user provides the relevant information for variant filtering in the user region input file.

  • parameter: --functional_annotation_filter_masks
  • default: ${projectDir}/input/functional_annotation_filter_masks.json

See the default below as an example - set this to your own file, defining your own required annotation labels that will be used to bin each variant.
Also note that the functional annotation labels defined in this file will be used in the mask ranks file, the SAIGE saige_masks parameter, and the REGENIE masks file, which all need to be adjusted accordingly.

The format for each functional annotation label section is as follows:

  • Functional annotation label name
    • AND mask settings - any variant will have to pass ALL filters in this section to be included
      • Zero or more specific filters
    • OR mask settings - any variant will have to pass ANY filter in this section to be included
      • Zero or more specific filters

The rules that govern the functional filtering are:

  • Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by bcftools +split-vep internally.
  • The AND and OR filters are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block.
  • If the annotation field annotation and the VEP severity field vep_severity_to_include are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow.
  • vep_severity_to_include operates in an identical manner to bcftools +split-vep -s. You can specify an exact consequence, e.g. stop_gained, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. missense+, to include all variants that are at least as severe as missense. If it is left blank, then all consequences will be considered.
  • include_missing can be set to "yes" or "no". If set to "no" then only variants that pass the filter will be included. If set to "yes" then variants that pass the filter and variants that are annotated with missing entries by VEP i.e. '.', '-' or '' will be included. This can be useful for example when wanting to filter on CADD_PHRED scores and also include INDELs, as many INDELs do not have a CADD_PHRED score.
  • Comparators can be ">", ">=", "==", "<=", "<" for float values, or "==" for string values. The workflow will exit with an error if the wrong comparator type is detected.
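
The rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow's code: it takes a variant as a plain dictionary of annotation values, does not model vep_severity_to_include (filters with an empty annotation are simply skipped here), and treats an OR block whose filters are all skipped as placing no restriction.

```python
import operator

COMPARATORS = {">": operator.gt, ">=": operator.ge, "==": operator.eq,
               "<=": operator.le, "<": operator.lt}
MISSING = {".", "-", ""}  # values treated as missing annotations


def compare(value, comparator, condition):
    """Compare numerically when both sides are numbers, else as strings ("==" only)."""
    try:
        left, right = float(value), float(condition)
    except ValueError:
        if comparator != "==":
            raise ValueError(f"comparator {comparator!r} needs numeric values")
        left, right = value, condition
    return COMPARATORS[comparator](left, right)


def passes_filter(variant, flt):
    """Return True/False, or None when the filter is empty and therefore skipped."""
    if not flt["annotation"]:
        return None  # assumption: severity-only filters are not modelled in this sketch
    value = variant.get(flt["annotation"], "")
    if value in MISSING:
        return flt["include_missing"] == "yes"
    return compare(value, flt["comparator"], flt["condition"])


def passes_mask(variant, mask):
    """A variant must pass ALL active AND filters and ANY active OR filter."""
    and_results = [r for r in (passes_filter(variant, f) for f in mask["and_mask"].values())
                   if r is not None]
    or_results = [r for r in (passes_filter(variant, f) for f in mask["or_mask"].values())
                  if r is not None]
    return all(and_results) and (any(or_results) if or_results else True)
```

With a mask that requires CANONICAL == YES in the AND block and either CADD_PHRED >= 10 or LoF == HC in the OR block, a canonical variant with CADD_PHRED 23.1 passes, while one with CADD_PHRED 3.2 and no LoF annotation does not.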

Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. If several filters share a name, some of them will be skipped, which leads to confusing output.
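
Duplicate names are easy to miss because many JSON parsers accept them silently; Python's `json` module, for instance, keeps only the last value for a repeated key. The sketch below uses the standard `object_pairs_hook` argument to reject such files up front. The helper names are illustrative, and how the workflow itself reads the file is not specified here.

```python
import json


def reject_duplicate_keys(pairs):
    """Build a dict from one JSON object's key/value pairs, failing on repeated keys."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate names in one JSON object: {duplicates}")
    return dict(pairs)


def load_masks(text):
    """Parse the masks JSON, raising ValueError if any block repeats a name."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)
```

The hook is called once per JSON object, so the same filter name may still appear in different blocks (for example "filter1" in both and_mask and or_mask), which matches the rule above.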

functional_annotation_filter_masks.json
{
    "LoF": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""},
            "filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
        }
    },
    "missense": {
        "and_mask": {
            "filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "CADD_PHRED",  "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
            "filter2": {"annotation": "LoF",  "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
        }
    },
    "synonymous": {
        "and_mask": {
            "filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
        },
        "or_mask": {
            "filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
        }
    }
}

Mask ranks file

A user-defined JSON file of ranks for functional annotation labels (in the default file these are: "LoF", "missense", "synonymous"). Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.

Please note that the whole functional annotation filtering section is skipped in "variant mode" - in that case, the user provides the relevant information for variant filtering in the user region input file.

  • parameter: --mask_rank
  • default: ${projectDir}/input/mask_rank.json

See the default below as an example - set this to your own file, compatible with your functional annotation filter masks file.
This setting defines the order of strictness of the annotation labels, so that a variant is assigned to the correct label when it passes more than one. The lower the number, the stricter the annotation label.

mask_rank.json
{
    "LoF": 1,
    "missense": 2,
    "synonymous": 3
}
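
As a small illustration of how ranks resolve a variant that passes several labels, the following Python sketch picks the label with the lowest rank. The function name `assign_label` is an assumption for this example.

```python
def assign_label(passed_labels, mask_rank):
    """Return the strictest (lowest-ranked) label among those a variant passed, or None."""
    ranked = [label for label in passed_labels if label in mask_rank]
    return min(ranked, key=mask_rank.__getitem__) if ranked else None


mask_rank = {"LoF": 1, "missense": 2, "synonymous": 3}
strictest = assign_label(["missense", "LoF"], mask_rank)  # "LoF"
```

This is consistent with the variants file example, where a variant listed under both LoF and missense carries LoF as its "most severe" label.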

Regenie masks file

A user-defined JSON file of REGENIE functional annotation "masks" (each composed of one functional annotation label or a combination of several) - please see the REGENIE docs for more detail. Also see our Known issues and limitations page.

  • parameter: --regenie_masks
  • default: ${projectDir}/input/regenie_masks.json

See the default below as an example - set this to your own file, compatible with your functional annotation filter masks file.

regenie_masks.json
{
    "strict_lof": "LoF",
    "mild_lof": "LoF,missense",
    "control": "synonymous"
}
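
Because every label used in a REGENIE mask has to exist in the other annotation files, a quick cross-check can save a failed run. This Python sketch compares the REGENIE masks against the mask ranks; the function name is hypothetical and the check is a suggestion rather than something the workflow is documented to do.

```python
def undefined_labels(regenie_masks, mask_rank):
    """Map each REGENIE mask name to the labels it uses that mask_rank does not define."""
    problems = {}
    for mask_name, labels in regenie_masks.items():
        unknown = [label.strip() for label in labels.split(",")
                   if label.strip() not in mask_rank]
        if unknown:
            problems[mask_name] = unknown
    return problems
```

With the default files the result is an empty dictionary; a typo such as "misense" in one mask would be reported under that mask's name.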

Genomic data file

A 4-column, tab-delimited file, with no header. Contains paths to the files for the input variant dataset, i.e. to the per chromosome multi-sample PLINK pgen/psam/pvar-format files that contain genomic data. The default input variant dataset is aggV2.

  • parameter: --genomic_data
  • default: ${projectDir}/input/aggV2_pgen_list.tsv
  • header: None. Columns are:
    • (chromosome) chromosome name in GRCh38 format, e.g., chr1
    • (pgen) path to multi-sample masked pgen file
    • (psam) path to multi-sample masked psam file
    • (pvar) path to multi-sample masked pvar file
head aggV2_pgen_list.tsv
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr10_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr10_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr10_masked.pvar
chr11   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr11_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr11_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr11_masked.pvar
chr12   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr12_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr12_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr12_masked.pvar
chr13   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr13_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr13_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr13_masked.pvar
chr14   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr14_masked.pgen   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr14_masked.psam   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen/by_chromosome/gel_mainProgramme_aggV2_chr14_masked.pvar
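
If you supply your own genomic data file, a short parser can confirm the 4-column layout before the workflow starts. The Python sketch below is illustrative only; the function name is made up, and the file-suffix check is an assumption that would need relaxing if your files use other extensions.

```python
import csv
import io


def read_genomic_data(text):
    """Return {chromosome: (pgen, psam, pvar)} from the 4-column, headerless TSV."""
    table = {}
    for n, row in enumerate(csv.reader(io.StringIO(text), delimiter="\t"), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 4:
            raise ValueError(f"line {n}: expected 4 columns, found {len(row)}")
        chrom, pgen, psam, pvar = row
        # Assumption: each path ends with the extension named by its column.
        for path, suffix in ((pgen, ".pgen"), (psam, ".psam"), (pvar, ".pvar")):
            if not path.endswith(suffix):
                raise ValueError(f"line {n}: {path} does not end with {suffix}")
        table[chrom] = (pgen, psam, pvar)
    return table
```

Swapped columns, a common mistake when building the list by hand, are caught by the suffix check.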

Annotation VCF list file

A 2-column tab-delimited file with no header. Contains paths to the files with functional annotations for the input variant dataset, i.e. paths to annotated multi-sample VCF files corresponding to the PLINK files included in the genomic data file. The default input variant dataset is aggV2.

  • parameter: --vcf_files
  • default: ${projectDir}/input/aggV2_functional_list.tsv
  • header: None. Columns are:
    • (chromosome) chromosome name in GRCh38 format, e.g., chr1
    • (vcf) path to annotated multi-sample VCF
head aggV2_functional_list.tsv
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_10064150_12359131_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_101007881_103656324_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_103656325_106340362_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_106340363_108834875_annotated.vcf.gz
chr10   /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_108834876_111348918_annotated.vcf.gz
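
As the example shows, one chromosome can appear on several rows, one per annotated VCF chunk. The Python sketch below groups the rows by chromosome and reports requested chromosomes that have no annotation file; both function names are hypothetical.

```python
from collections import defaultdict


def vcfs_by_chromosome(text):
    """Group the annotated VCF paths of a 2-column, headerless TSV by chromosome."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, vcf = line.split("\t")
        groups[chrom].append(vcf)
    return dict(groups)


def missing_chromosomes(requested, groups):
    """Return the requested chromosomes that have no annotation VCF listed."""
    return [chrom for chrom in requested if chrom not in groups]
```

This kind of check is mainly useful with a custom list, where a chromosome in the region input file might have been left out of the annotation list.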

Precomputed PLINK files for GRM

A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples contained in the input variant dataset. The default input variant dataset is aggV2.

  • parameter:
    • --precomputed_plink_files_for_grm_bed
    • --precomputed_plink_files_for_grm_bim
    • --precomputed_plink_files_for_grm_fam
  • default:
    • /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bed
    • /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bim
    • /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.fam

Consequence severity ranking file

Info

Users are advised not to alter the contents of this file. See the Ensembl predicted data page.

A tab-delimited file, with header.

  • parameter: --consequence_severity_ranking_file
  • default: ${projectDir}/resources/VEP_severity_bcftools_translation_and_ranking.tsv
  • header: ensembl_annotation bcftools_annotation ranking
VEP_severity_bcftools_translation_and_ranking.tsv
# This should normally not be edited by users.
# See https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
ensembl_annotation  bcftools_annotation ranking
transcript_ablation transcript_ablation 20
splice_acceptor_variant splice_acceptor 19
splice_donor_variant    splice_donor    19
stop_gained stop_gained 18
frameshift_variant  frameshift  18
stop_lost   stop_lost   18
start_lost  start_lost  18
transcript_amplification    transcript_amplification    15
inframe_insertion   inframe 14
inframe_deletion    inframe 14
missense_variant    missense    14
protein_altering_variant    protein_altering    14
splice_region_variant   splice_region   13
splice_donor_5th_base_variant   splice_region   13
splice_donor_region_variant splice_region   13
splice_polypyrimidine_tract_variant splice_region   13
incomplete_terminal_codon_variant   incomplete_terminal_codon   12
start_retained_variant  start_retained  11
stop_retained_variant   stop_retained   11
synonymous_variant  synonymous  11
coding_sequence_variant coding_sequence 10
mature_miRNA_variant    mature_miRNA    10
5_prime_UTR_variant 5_prime_utr 9
3_prime_UTR_variant 3_prime_utr 9
non_coding_transcript_exon_variant  non_coding_transcript_exon  8
intron_variant  intron  7
NMD_transcript_variant  NMD_transcript  7
non_coding_transcript_variant   non_coding_transcript   6
upstream_gene_variant   upstream    5
downstream_gene_variant downstream  5
TFBS_ablation   TFBS    4
TFBS_amplification  TFBS    4
TF_binding_site_variant TF_binding_site 4
regulatory_region_ablation  regulatory  3
regulatory_region_amplification regulatory  3
regulatory_region_variant   regulatory  3
feature_elongation  feature_elongation  2
feature_truncation  feature_truncation  2
intergenic_variant  intergenic  1
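
To show how a ranking table like this can support a "consequence or worse" selection such as missense+, here is a minimal Python sketch. It is an assumption that the "+" suffix is resolved against the ranking column in this way; the workflow delegates severity handling to bcftools +split-vep, and the function names below are invented for the example.

```python
def load_ranking(text):
    """Map bcftools-style consequence names to their rank, skipping '#' comment lines."""
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    header, records = rows[0], rows[1:]
    name_col = header.index("bcftools_annotation")
    rank_col = header.index("ranking")
    return {r[name_col]: int(r[rank_col]) for r in records}


def expand_severity(spec, ranking):
    """Expand 'missense+' to every consequence ranked at least as high as missense."""
    if not spec:
        return set(ranking)  # blank means all consequences are considered
    if spec.endswith("+"):
        threshold = ranking[spec[:-1]]
        return {name for name, rank in ranking.items() if rank >= threshold}
    return {spec}  # exact consequence only
```

Note that several consequences share a rank in the table, so under this reading a "+" selection also includes everything tied with the named consequence.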

Gene coordinates file

Info

This file must be consistent with the Genomic Data and the Annotation Data.

A tab-delimited file of gene coordinates (Ensembl 105 GRCh38, in the default case), with header.

  • parameter: --ensembl_gene_list
  • default: ${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38.tsv
  • header: chrom start end gene_symbol gene_id gene_biotype
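
In "gene mode" every requested gene must be present in this file, so it can be worth checking a gene list against it beforehand. The Python sketch below looks names up by either gene_symbol or gene_id using the header given above; the function name and the sample values in the usage note are hypothetical.

```python
import csv
import io


def find_genes(requested, coordinates_tsv):
    """Split requested symbols/IDs into found coordinate rows and unknown names."""
    reader = csv.DictReader(io.StringIO(coordinates_tsv), delimiter="\t")
    index = {}
    for row in reader:
        # Index each gene under both its symbol and its Ensembl ID.
        index[row["gene_symbol"]] = row
        index[row["gene_id"]] = row
    found = {name: index[name] for name in requested if name in index}
    missing = [name for name in requested if name not in index]
    return found, missing
```

For example, with a one-gene table containing the made-up symbol GENE_A, a request for GENE_A and NOPE returns the row for GENE_A and lists NOPE as missing.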