AVT input files
There are a number of different input files you can use with the AVT workflow.
File | Used for | Parameter(s) | Default |
---|---|---|---|
Region input | A text file that specifies the region of the genome that needs to be analysed. | --region_input_file | ${projectDir}/input/chromosomes_subset.txt (chr21 and chr22) |
Exclusion data | A text file that specifies regions of the genome to be excluded from analysis. | --exclusion_data_file | false |
Input cohort | A list of cases and controls with covariates. | --input_cohort_file | /gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt |
Functional annotation filter masks | Filter out variants in your output. | --functional_annotation_filter_masks | ${projectDir}/input/functional_annotation_filter_masks.json |
Mask ranks | Defines the order of strictness for each annotation label, so that a variant is assigned to the correct label if it passes more than one. | --mask_rank | ${projectDir}/input/mask_rank.json |
Regenie masks | REGENIE functional annotation "masks". | --regenie_masks | ${projectDir}/input/regenie_masks.json |
Genomic data | Paths to the files for the input variant dataset. | --genomic_data | ${projectDir}/input/aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv |
Annotation VCF list | Paths to the files with functional annotations for the input variant dataset. | --vcf_files | ${projectDir}/input/aggV2_functional_annotations_list_by_chunks_VEP105.tsv |
Pre-computed PLINK | A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples. | --precomputed_plink_files_for_grm_bed --precomputed_plink_files_for_grm_bim --precomputed_plink_files_for_grm_fam | /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bed /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.bim /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/HQ_SNPs/GELautosomes_LD_pruned_1kgp3Intersect_common_and_rare_for_AVT_mpv10.fam |
Consequence severity ranking | A table that translates Ensembl VEP consequence terms to bcftools names and ranks them by severity. | --consequence_severity_ranking_file | ${projectDir}/resources/VEP_severity_bcftools_translation_and_ranking.tsv |
Gene coordinates | A tab-delimited file of gene coordinates. | --ensembl_gene_list | ${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38.tsv |
Protein-coding gene coordinates | A tab-delimited file of protein-coding gene coordinates. | --ensembl_gene_list_protein_coding | ${projectDir}/resources/Ensembl_105_genes_coordinates_GRCh38_protein_coding.tsv |
Region input --region_input_file
You can run the AVT workflow to focus on specific regions, either whole chromosomes, genomic regions, specific genes or a custom set of variants.
The region input is a text file that specifies the region of the genome that needs to be analysed. The workflow detects the type of regions in your file and reports it (chr_mode, gene_mode, region_mode, or variant_mode) in the log file at the beginning of the run.
The ${projectDir}/input/ directory includes examples of all region file input types, respectively:
- chromosomes: A single-column text file listing the required chromosomes (chr1-chr22, chrX, chrY), one per line. For a whole genome analysis, list all chromosomes.
- genes: A single-column text file of gene HGNC symbols or Ensembl IDs, one per line, with no header.
- coordinates: A BED format file of gene coordinates, with header and name column. Please always use the column names found in the header of the example file.
  coordinates.bed
  #CHROM START END NAME
  chr16 2025356 2039026 region1
  chr16 2039815 2047866 region2
  chr16 2047967 2089491 region3
  chr16 2088710 2135898 region4
  chr16 2090195 2090284 region5
  chr16 2091436 2095433 region6
  chr16 2094830 2097026 region7
  chr16 2106669 2106753 region8
  chr16 2112335 2113342 region9
  chr16 2119207 2120248 region10
- variants: A tab-delimited, four-column file, with no header. It has the following fields:
  - the variant ID as CHROM:POS_REF_ALT
  - the name of the region that includes that variant (usually the gene name)
  - the "most severe" annotation label (i.e. the one that will be used by SAIGE-GENE and REGENIE for that variant)
  - a comma-separated list of all annotation labels (this information is used only by the Fisher's test branch of the workflow which will then use that variant when running tests for each of those labels). Annotation labels therefore need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.
  variants.tsv
  chr5:68296298_G_GC PIK3R1 LoF LoF,missense
  chr5:68296298_G_GT PIK3R1 LoF LoF,missense
  chr5:68297418_C_CG PIK3R1 LoF LoF,missense
  chr5:68297596_C_T PIK3R1 LoF LoF,missense
  chr5:69094339_G_T SLC30A5 LoF LoF,missense
  chr5:69100862_CT_C SLC30A5 LoF LoF,missense
  chr5:69108437_G_A SLC30A5 LoF LoF,missense
  chr5:69115374_C_A SLC30A5 LoF LoF,missense
  chr5:69116085_C_CTT SLC30A5 LoF LoF,missense
  chr5:69118506_C_T SLC30A5 LoF LoF,missense
  chr5:69121897_T_C SLC30A5 LoF LoF,missense
  chr5:69123427_T_A SLC30A5 LoF LoF,missense
  chr5:69128082_AT_A SLC30A5 LoF LoF,missense
  chr5:69128127_C_T SLC30A5 LoF LoF,missense
  chr5:69128133_G_C SLC30A5 LoF LoF,missense
  chr5:69168323_CT_C CCNB1 LoF LoF,missense
  chr5:69171453_G_GT CCNB1 LoF LoF,missense
  chr5:69175436_GAACT_G CCNB1 LoF LoF,missense
  chr5:69175519_TC_T CCNB1 LoF LoF,missense
  chr5:69177248_CTACAACA_C CCNB1 LoF LoF,missense
  chr5:69177311_AATGTAGTC_A CCNB1 LoF LoF,missense
  chr5:69177337_T_TA CCNB1 LoF LoF,missense
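The four-field layout of the variants file can be checked before a run. The following is a minimal sketch, not part of the workflow: the function name parse_variant_line and the VariantRow type are hypothetical, and restricting alleles to A/C/G/T is an assumption made for illustration.

```python
import re
from typing import NamedTuple, Tuple

# Pattern for the CHROM:POS_REF_ALT variant ID described above.
# Assumption for this sketch: alleles contain only A, C, G and T.
VARIANT_ID = re.compile(r"^(chr(?:[1-9]|1\d|2[0-2]|X|Y)):(\d+)_([ACGT]+)_([ACGT]+)$")


class VariantRow(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    region: str
    most_severe_label: str
    all_labels: Tuple[str, ...]


def parse_variant_line(line: str) -> VariantRow:
    """Parse one tab-delimited line of a variants region file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    variant_id, region, most_severe, labels = fields
    match = VARIANT_ID.match(variant_id)
    if match is None:
        raise ValueError(f"variant ID is not CHROM:POS_REF_ALT: {variant_id!r}")
    chrom, pos, ref, alt = match.groups()
    return VariantRow(chrom, int(pos), ref, alt, region, most_severe,
                      tuple(labels.split(",")))


row = parse_variant_line("chr5:68296298_G_GC\tPIK3R1\tLoF\tLoF,missense")
print(row.chrom, row.pos, row.ref, row.alt, row.all_labels)
```

Running such a check over every line catches space-delimited files and malformed IDs before the workflow starts.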
The default run uses a chromosome subset file that includes the two GRCh38 chromosomes chr21 and chr22.
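A chromosome list like the default subset can be generated or checked with a few lines of Python. This is an illustrative sketch only: read_chromosome_list is a hypothetical helper, and the accepted names are taken from the chr1-chr22, chrX, chrY range given above.

```python
# Chromosome names accepted in a chromosomes region file, per the list above.
VALID_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def read_chromosome_list(text: str) -> list:
    """Return the chromosome names from one-per-line text, rejecting unknown names."""
    names = [line.strip() for line in text.splitlines() if line.strip()]
    unknown = [name for name in names if name not in VALID_CHROMOSOMES]
    if unknown:
        raise ValueError(f"unrecognised chromosome names: {unknown}")
    return names


# The default subset described above, and a whole-genome listing.
default_subset = read_chromosome_list("chr21\nchr22\n")
whole_genome_text = "\n".join(VALID_CHROMOSOMES) + "\n"
print(default_subset)
print(len(read_chromosome_list(whole_genome_text)))
```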
Exclusion data file --exclusion_data_file
Either false or a valid path to a text file containing genomic regions to be excluded. The regions must be of the same type as those in the region input file, e.g. gene symbols or IDs for "gene mode", or a variant list for "variant mode".
Input cohort file --input_cohort_file
Your cohort of cases and controls. You can build your own case/control cohort using your preferred method. We provide some tutorials on building cohorts using our Labkey API.
This is a tab-separated file with a header. It should include a unique sample identifier, one or more phenotype columns and covariates, such as age, sex and principal components.
You need to specify the columns that contain each of the value types in your submission script.
Parameter | Notes | Default |
---|---|---|
--input_cohort_file | Case/control cohort | "/gel_data_resources/workflows/input_material/RDP_tools_aggregateVariantTestingWorkflow/auxiliary_files/input/cohort.txt" |
--cohort_sample_column | The column in your cohort file that specifies the platekey | "Platekey" |
--cohort_sex_column | The column in your cohort file that specifies the sex | "sex" |
--control_coding | How controls are coded in your input file | 0 |
--phenotype_array | List of columns in your input file that contain phenotype data | "status" |
--phenotype_type_array | List of phenotype types that correspond to the list of phenotype columns | "b" |
--covariates | List of columns of covariates in your input file | "age,sex,age.age,age.sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,PC11,PC12,PC13,PC14,PC15,PC16,PC17,PC18,PC19,PC20" |
--categorical_covariates | List of columns of discrete covariates in your input file | "sex" |
The workflow will run on the cohort specified in the input, as described below. It is your responsibility to ensure that the cohort only includes participants whose data you are allowed to process and analyse. Ensure that you are only using samples included in the latest data release (current main programme data release: 19 (31st October 2024)). Please ask via Service desk if unsure.
This example includes the minimal required data (plate_key and disease_A) plus two commonly used covariate columns (age and sex):
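As a hedged illustration of a file with those four columns, the sketch below builds a tiny cohort in memory and checks that the columns you would name in your submission script are present. The sample identifiers, values and sex coding are invented for illustration, and check_cohort_columns is a hypothetical helper rather than part of the workflow.

```python
import csv
import io

# Hypothetical cohort content: identifiers and values are invented.
COHORT_TEXT = (
    "plate_key\tdisease_A\tage\tsex\n"
    "SAMPLE_0001\t1\t54\t0\n"
    "SAMPLE_0002\t0\t61\t1\n"
)


def check_cohort_columns(text, sample_column, phenotype_columns, covariates):
    """Check a tab-separated cohort for required columns and unique sample IDs.

    Returns the number of samples found.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    required = [sample_column, *phenotype_columns, *covariates]
    missing = [column for column in required if column not in header]
    if missing:
        raise ValueError(f"columns missing from cohort file: {missing}")
    samples = [row[sample_column] for row in reader]
    duplicates = sorted({s for s in samples if samples.count(s) > 1})
    if duplicates:
        raise ValueError(f"sample identifiers are not unique: {duplicates}")
    return len(samples)


# These arguments mirror --cohort_sample_column, --phenotype_array and --covariates.
n_samples = check_cohort_columns(COHORT_TEXT, "plate_key", ["disease_A"], ["age", "sex"])
print(n_samples)
```

For a file like this one, the matching workflow settings would be --cohort_sample_column "plate_key", --phenotype_array "disease_A" and --covariates "age,sex".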
Functional annotation filter masks file --functional_annotation_filter_masks
You can filter variants in your output by the consequences they have on genes. Use the functional annotation filter masks file to create a list of filters for each consequence type.
The whole functional annotation filtering section is skipped in "variant mode" - in that case, the user provides the relevant information for variant filtering in the user region input file.
This is a JSON file. It comprises functional annotation labels for variants, for example "LoF", "missense" and "synonymous", and within each of these, the filters that every variant must pass ("and_mask") and the filters of which a variant must pass at least one ("or_mask").
The format for each functional annotation label section is as follows:
- Functional annotation label name
  - AND mask settings - any variant will have to pass ALL filters in this section to be included
    - Zero or more specific filters
  - OR mask settings - any variant will have to pass ANY filter in this section to be included
    - Zero or more specific filters
Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.
The functional annotation labels defined in this file will be used in the mask ranks file, the SAIGE saige_masks parameter, and the REGENIE masks file, which all need to be adjusted accordingly.
The rules that govern the functional filtering are:
- Any annotation in the CSQ section of the functional annotation VCF can be used as an annotation to filter on. This is done by bcftools +split-vep internally.
- The AND and OR filters are themselves combined with AND, therefore a variant must pass all filters in the AND block, AND one or more filters in the OR block.
- If the annotation field annotation and the VEP severity field vep_severity_to_include are empty, the filter is skipped. Leaving filters empty does not impact the running of the workflow.
- vep_severity_to_include operates in an identical manner to bcftools +split-vep -s. You can specify an exact consequence, e.g. stop_gained, and only variants with that consequence will be retained. Alternatively, you can specify a consequence or worse, e.g. missense+, to include all variants that are at least as severe as missense. If it is left blank, then all consequences will be considered.
- include_missing can be set to "yes" or "no". If set to "no" then only variants that pass the filter will be included. If set to "yes" then variants that pass the filter and variants that are annotated with missing entries by VEP i.e. '.', '-' or '' will be included. This can be useful for example when wanting to filter on CADD_PHRED scores and also include INDELs, as many INDELs do not have a CADD_PHRED score.
- Comparators can be ">", ">=", "==", "<=", "<" for float values, or "==" for string values. The workflow will exit with an error if the wrong comparator type is detected.
Make sure to give each filter a unique name WITHIN the AND and OR blocks for each mask. If several filters share the same name, some of them will be skipped, which leads to confusing output.
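Repeated filter names are easy to miss because most JSON parsers accept them silently; Python's json module, for instance, keeps only the last value for a repeated key. The sketch below shows one way to detect repeated names before a run. Whether the workflow drops filters for the same reason is an assumption, and load_masks_strict is a hypothetical helper.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if one JSON object repeats a key."""
    keys = [key for key, _ in pairs]
    duplicates = sorted({key for key in keys if keys.count(key) > 1})
    if duplicates:
        raise ValueError(f"duplicate names in one JSON object: {duplicates}")
    return dict(pairs)


def load_masks_strict(text):
    """Parse a filter masks JSON string, refusing repeated names in any block."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


# Two filters share the name "filter1" inside the same AND block.
bad_text = (
    '{"LoF": {"and_mask": {'
    '"filter1": {"annotation": "CANONICAL"}, '
    '"filter1": {"annotation": "LoF"}}}}'
)

# The default parser silently keeps only the last "filter1".
print(len(json.loads(bad_text)["LoF"]["and_mask"]))

try:
    load_masks_strict(bad_text)
except ValueError as error:
    print(error)
```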
functional_annotation_filter_masks
{
"LoF": {
"and_mask": {
"filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""},
"filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
},
"or_mask": {
"filter1": {"annotation": "", "comparator": "", "condition": "","include_missing": "","vep_severity_to_include": ""}
}
},
"missense": {
"and_mask": {
"filter1": {"annotation": "CANONICAL", "comparator": "==", "condition": "YES", "include_missing": "no", "vep_severity_to_include": ""}
},
"or_mask": {
"filter1": {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10","include_missing": "no", "vep_severity_to_include": "missense+"},
"filter2": {"annotation": "LoF", "comparator": "==", "condition": "HC", "include_missing": "no", "vep_severity_to_include": ""}
}
},
"synonymous": {
"and_mask": {
"filter1": {"annotation": "Consequence", "comparator": "==", "condition": "synonymous_variant", "include_missing": "no", "vep_severity_to_include": ""}
},
"or_mask": {
"filter1": {"annotation": "", "comparator": "", "condition": "", "include_missing": "", "vep_severity_to_include": ""}
}
}
}
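The way the AND and OR blocks combine can be sketched in Python. This is an illustration of the rules listed above, not the workflow's implementation, which filters with bcftools +split-vep. Assumptions in this sketch: vep_severity_to_include is not modelled, a block whose filters are all skipped counts as passed, and a value is compared as a float whenever both sides parse as numbers. The two masks are copied from the example file.

```python
import operator

COMPARATORS = {
    ">": operator.gt, ">=": operator.ge, "==": operator.eq,
    "<=": operator.le, "<": operator.lt,
}
MISSING_VALUES = {".", "-", ""}


def is_skipped(flt):
    """A filter with empty annotation and empty severity is skipped."""
    return not flt["annotation"] and not flt["vep_severity_to_include"]


def passes_filter(flt, annotations):
    if not flt["annotation"]:
        return True  # severity-only filter: severity is not modelled here
    value = annotations.get(flt["annotation"], "")
    if value in MISSING_VALUES:
        return flt["include_missing"] == "yes"
    compare = COMPARATORS[flt["comparator"]]
    try:
        return compare(float(value), float(flt["condition"]))
    except ValueError:
        # Non-numeric values are strings, for which only "==" is allowed.
        if flt["comparator"] != "==":
            raise ValueError("only '==' can be used with string values") from None
        return value == flt["condition"]


def passes_mask(mask, annotations):
    """All AND filters must pass, and at least one OR filter (if any are set)."""
    and_filters = [f for f in mask["and_mask"].values() if not is_skipped(f)]
    or_filters = [f for f in mask["or_mask"].values() if not is_skipped(f)]
    and_ok = all(passes_filter(f, annotations) for f in and_filters)
    or_ok = not or_filters or any(passes_filter(f, annotations) for f in or_filters)
    return and_ok and or_ok


EMPTY = {"annotation": "", "comparator": "", "condition": "",
         "include_missing": "", "vep_severity_to_include": ""}
CANONICAL = {"annotation": "CANONICAL", "comparator": "==", "condition": "YES",
             "include_missing": "no", "vep_severity_to_include": ""}
LOF_HC = {"annotation": "LoF", "comparator": "==", "condition": "HC",
          "include_missing": "no", "vep_severity_to_include": ""}
CADD = {"annotation": "CADD_PHRED", "comparator": ">=", "condition": "10",
        "include_missing": "no", "vep_severity_to_include": "missense+"}

LOF_MASK = {"and_mask": {"filter1": CANONICAL, "filter2": LOF_HC},
            "or_mask": {"filter1": EMPTY}}
MISSENSE_MASK = {"and_mask": {"filter1": CANONICAL},
                 "or_mask": {"filter1": CADD, "filter2": LOF_HC}}

print(passes_mask(LOF_MASK, {"CANONICAL": "YES", "LoF": "HC"}))
print(passes_mask(MISSENSE_MASK, {"CANONICAL": "YES", "CADD_PHRED": ".", "LoF": ""}))
```

In the second call the CADD_PHRED score is missing and include_missing is "no", so neither OR filter passes and the variant is rejected; setting include_missing to "yes" on that filter would let it through.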
Mask ranks file --mask_rank
The mask rank file defines the order of strictness for each annotation label, so that a variant is assigned to the correct label if it passes more than one. The lower the number, the stricter the annotation label is.
The file is a JSON file of ranks for functional annotation labels (for example "LoF", "missense" and "synonymous"). Annotation labels will be used in the SAIGE-GENE, REGENIE, and Fisher's test "branches" of the workflow, and need to be compatible with those in the relevant SAIGE-GENE and REGENIE input files and parameters.
The whole functional annotation filtering section is skipped in "variant mode" - in that case, the user provides the relevant information for variant filtering in the region input file.
Make sure that your mask rank file is compatible with the functional annotation filter masks file.
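The compatibility check and the role of the ranks can be sketched as follows. The layout of the mask rank file is not shown on this page, so the flat label-to-rank mapping and the rank values below are assumptions, as is the reading that a variant passing several labels is assigned to the one with the lowest rank. Both function names are hypothetical.

```python
import json

# Assumed layout: a flat JSON object mapping each label to its rank.
# The rank values are invented for illustration.
MASK_RANK_TEXT = '{"LoF": 1, "missense": 2, "synonymous": 3}'


def strictest_label(passed_labels, ranks):
    """Pick the label with the lowest rank among those a variant passes."""
    return min(passed_labels, key=lambda label: ranks[label])


def mismatched_labels(ranks, filter_masks):
    """Labels that appear in only one of the two files."""
    return sorted(set(ranks) ^ set(filter_masks))


ranks = json.loads(MASK_RANK_TEXT)
print(strictest_label(["missense", "LoF"], ranks))

# A filter masks file that lacks the "synonymous" label is flagged.
print(mismatched_labels(ranks, {"LoF": {}, "missense": {}}))
```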
Regenie masks file --regenie_masks
A JSON file of regenie functional annotation "masks" (composed of one or a combination of functional annotation labels) - please see the REGENIE docs for more detail. Also see our Known issues and limitations page.
- parameter: --regenie_masks
- default: ${projectDir}/input/regenie_masks.json
See the default below as an example - set this to your own file, compatible with your functional annotation filter masks file.
Genomic data file --genomic_data
A four-column, tab-delimited file, with no header. This file contains paths to the files for the input variant dataset, i.e. the per-chromosome, multi-sample PLINK pgen/psam/pvar-format files that contain genomic data. The default input variant dataset is aggV2.
head aggV2_pgen_list_by_chromosomes_all_variants_biallelic_and_multiallelic.tsv
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pgen /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.psam /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr10_allvariants_masked.pvar
chr11 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pgen /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.psam /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr11_allvariants_masked.pvar
chr12 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pgen /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.psam /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr12_allvariants_masked.pvar
chr13 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pgen /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.psam /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr13_allvariants_masked.pvar
chr14 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pgen /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.psam /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/pgen_allvariants/by_chromosome/gel_mainProgramme_aggV2_chr14_allvariants_masked.pvar
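If you supply your own genomic data list, a quick structural check can catch column-order mistakes. This is a minimal sketch assuming the column order shown above (chromosome, then pgen, psam and pvar paths); the paths in the example are hypothetical placeholders and parse_genomic_data_line is not part of the workflow.

```python
EXPECTED_SUFFIXES = (".pgen", ".psam", ".pvar")


def parse_genomic_data_line(line):
    """Split one line into (chromosome, {"pgen": ..., "psam": ..., "pvar": ...})."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    chrom, *paths = fields
    for path, suffix in zip(paths, EXPECTED_SUFFIXES):
        if not path.endswith(suffix):
            raise ValueError(f"{chrom}: expected a {suffix} file, got {path!r}")
    return chrom, dict(zip(("pgen", "psam", "pvar"), paths))


# Hypothetical paths, used only to demonstrate the layout.
example_line = "chr10\t/data/my_chr10.pgen\t/data/my_chr10.psam\t/data/my_chr10.pvar"
chrom, files = parse_genomic_data_line(example_line)
print(chrom, files["pvar"])
```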
Annotation VCF list file --vcf_files
A two-column, tab-delimited file with no header. It contains paths to the files with functional annotations for the input variant dataset, i.e. paths to the annotated multi-sample VCF files corresponding to the PLINK files included in the genomic data file. The default input variant dataset is aggV2.
head aggV2_functional_annotations_list_by_chunks_VEP105.tsv
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_10064150_12359131_annotated.vcf.gz
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_101007881_103656324_annotated.vcf.gz
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_103656325_106340362_annotated.vcf.gz
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_106340363_108834875_annotated.vcf.gz
chr10 /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP_105/gel_mainProgramme_aggV2_chr10_108834876_111348918_annotated.vcf.gz
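Because each chromosome is split across several annotation chunks, it can help to group the list by chromosome and confirm that every chromosome in your genomic data file has at least one chunk. The sketch below is illustrative: both function names and the example paths are hypothetical.

```python
from collections import defaultdict


def group_annotation_chunks(lines):
    """Map each chromosome to the list of annotated VCF chunk paths listed for it."""
    chunks = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        chrom, path = line.rstrip("\n").split("\t")
        chunks[chrom].append(path)
    return dict(chunks)


def chromosomes_without_annotation(genomic_chromosomes, chunks):
    """Chromosomes present in the genomic data list but absent from the VCF list."""
    return [chrom for chrom in genomic_chromosomes if chrom not in chunks]


# Hypothetical paths, used only to demonstrate the layout.
vcf_list = [
    "chr21\t/data/annot_chr21_1_5000000_annotated.vcf.gz",
    "chr21\t/data/annot_chr21_5000001_9000000_annotated.vcf.gz",
]
chunks = group_annotation_chunks(vcf_list)
print(len(chunks["chr21"]))
print(chromosomes_without_annotation(["chr21", "chr22"], chunks))
```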
Pre-computed PLINK files
A pre-computed set of PLINK files that contain high quality SNP variants from across the genome for all samples contained in the input variant dataset. The default input variant dataset is aggV2.
There are three parameters:
--precomputed_plink_files_for_grm_bed
--precomputed_plink_files_for_grm_bim
--precomputed_plink_files_for_grm_fam
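The three default files share one path prefix and differ only in extension. If your own files follow the same convention (an assumption), the three arguments can be derived from the prefix, as in this sketch; grm_parameters and the example prefix are hypothetical.

```python
def grm_parameters(prefix):
    """Build the three pre-computed PLINK arguments from a shared file prefix."""
    return {
        "--precomputed_plink_files_for_grm_bed": f"{prefix}.bed",
        "--precomputed_plink_files_for_grm_bim": f"{prefix}.bim",
        "--precomputed_plink_files_for_grm_fam": f"{prefix}.fam",
    }


def as_command_line(parameters):
    """Join flag and path pairs into a single argument string."""
    return " ".join(f"{flag} {path}" for flag, path in parameters.items())


# Hypothetical prefix, used only to demonstrate the layout.
parameters = grm_parameters("/data/my_hq_snps")
print(as_command_line(parameters))
```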
Consequence severity ranking file --consequence_severity_ranking_file
We do not recommend that you alter the contents of this file. For the meaning of the consequence terms, see the Ensembl predicted data documentation.
VEP_severity_bcftools_translation_and_ranking.tsv
# This should normally not be edited by users.
# See https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
ensembl_annotation bcftools_annotation ranking
transcript_ablation transcript_ablation 20
splice_acceptor_variant splice_acceptor 19
splice_donor_variant splice_donor 19
stop_gained stop_gained 18
frameshift_variant frameshift 18
stop_lost stop_lost 18
start_lost start_lost 18
transcript_amplification transcript_amplification 15
inframe_insertion inframe 14
inframe_deletion inframe 14
missense_variant missense 14
protein_altering_variant protein_altering 14
splice_region_variant splice_region 13
splice_donor_5th_base_variant splice_region 13
splice_donor_region_variant splice_region 13
splice_polypyrimidine_tract_variant splice_region 13
incomplete_terminal_codon_variant incomplete_terminal_codon 12
start_retained_variant start_retained 11
stop_retained_variant stop_retained 11
synonymous_variant synonymous 11
coding_sequence_variant coding_sequence 10
mature_miRNA_variant mature_miRNA 10
5_prime_UTR_variant 5_prime_utr 9
3_prime_UTR_variant 3_prime_utr 9
non_coding_transcript_exon_variant non_coding_transcript_exon 8
intron_variant intron 7
NMD_transcript_variant NMD_transcript 7
non_coding_transcript_variant non_coding_transcript 6
upstream_gene_variant upstream 5
downstream_gene_variant downstream 5
TFBS_ablation TFBS 4
TFBS_amplification TFBS 4
TF_binding_site_variant TF_binding_site 4
regulatory_region_ablation regulatory 3
regulatory_region_amplification regulatory 3
regulatory_region_variant regulatory 3
feature_elongation feature_elongation 2
feature_truncation feature_truncation 2
intergenic_variant intergenic 1
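To see how a "consequence or worse" setting such as missense+ relates to this ranking, the sketch below loads a short excerpt of the table and lists every bcftools name ranked at least as high as the named consequence. It illustrates the idea only: the actual selection is performed by bcftools +split-vep -s, and load_ranking and expand_severity are hypothetical helpers.

```python
# A short excerpt of the ranking table above, tab-delimited.
RANKING_TEXT = """\
# This should normally not be edited by users.
ensembl_annotation\tbcftools_annotation\tranking
stop_gained\tstop_gained\t18
frameshift_variant\tframeshift\t18
inframe_deletion\tinframe\t14
missense_variant\tmissense\t14
synonymous_variant\tsynonymous\t11
intron_variant\tintron\t7
"""


def load_ranking(text):
    """Return {bcftools_annotation: ranking}, skipping comments and the header."""
    rows = [line for line in text.splitlines()
            if line.strip() and not line.startswith("#")]
    ranks = {}
    for row in rows[1:]:  # rows[0] is the column header
        _ensembl_name, bcftools_name, rank = row.split("\t")
        ranks[bcftools_name] = int(rank)
    return ranks


def expand_severity(spec, ranks):
    """List the consequences selected by an exact name, 'name+', or blank."""
    if not spec:
        return sorted(ranks)  # blank: all consequences are considered
    if spec.endswith("+"):
        threshold = ranks[spec[:-1]]
        return sorted(name for name, rank in ranks.items() if rank >= threshold)
    return [spec]


ranks = load_ranking(RANKING_TEXT)
print(expand_severity("missense+", ranks))
print(expand_severity("stop_gained", ranks))
```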
Gene coordinates file --ensembl_gene_list
This file must be consistent with the Genomic Data and the Annotation Data.
A tab-delimited file of gene coordinates (Ensembl 105 GRCh38, in the default case), with header.
Protein-coding gene coordinates file --ensembl_gene_list_protein_coding
This file must be consistent with the Genomic Data and the Annotation Data.
A tab-delimited file of protein-coding gene coordinates (Ensembl 105 GRCh38 protein-coding, in the default case), with header.