Detailed examples on how to query the aggregated dataset¶
Here are some example use cases for this version of the Aggregate Variant Testing (AVT) workflow - for each case, we show what needs to be modified in the workflow files, in particular in the "input variables" file.
The examples are not necessarily mutually exclusive, so we recommend looking at the simplest ones (at the top) first in any case.
Example 1 - I want to run AVT for all protein-coding genes in chr20, only missense variants or worse, low allele frequency in gnomAD¶
This is a basic use case. All protein coding genes in chr20 will be processed.
The input variables file needs to be modified in a few places - for example, this file would work (please see details below):
input variables file for Example 1
Please note that this is very similar to the default content for this file. The most important variables to consider are:
lsf_project_code: change this to your HPC project code, as usual
input_variants_dataset: this was set to the name of the dataset containing Genomics England variants for protein-coding genes only
phenotype_input_file: this was set to the path of an example phenotype file you can use - if you change this to your own cohort's phenotype file, remember to amend the following few variables as appropriate
precomputed_plink_files_for_grm: this was set to an empty array, meaning that the GRM will be computed from scratch in this case
part_1_inputs.chromosomes_input_file: this was set to the path of the file containing the chromosome name
part_2_filtering.info_bcftools_expressions_to_include: we are including further filters here, for example processing only SNPs, which can be removed or altered if needed
part_2_filtering.vep_severity_to_include: this was set to accept only variants annotated by VEP with a Consequence of missense or worse
part_2_filtering.functional_annotation_filters: this was set to include only one condition, gnomAD frequency < 0.001, and to include variants that are missing from gnomAD altogether
part_4_testing.saige_output_file_name: this was set to the name of the output file, to be created in the folder specified in the "options.json" file
Moreover, you need to make sure that the file referred to by variable "part_1_inputs.chromosomes_input_file
", i.e. the file found at "input_user_data/chromosomes.txt", contains only one line, equal to "chr20".
Finally, you need to specify your HPC project code in the file called "submit_workflow.sh", too.
Example 2 - I want to run AVT for my regions with my own set of variants in protein coding genes¶
Assuming all variants of interest have an Ensembl Consequence of at least 3_prime_UTR_variant or worse, we can use the "aggV2_PASS_UTRplus_proteincodinggenes" dataset again.
The input variables file needs to be modified in a few places - for example, this file would work (please see details below):
input variables file for Example 2
Please note that this is very similar to the default content for this file. The most important variables to consider are:
lsf_project_code: change this to your HPC project code, as usual
input_variants_dataset: this was set to the name of the dataset containing Genomics England variants for protein-coding genes only
phenotype_input_file: this was set to the path of an example phenotype file you can use - if you change this to your own cohort's phenotype file, remember to amend the following few variables as appropriate
precomputed_plink_files_for_grm: this time, this was set to an array of existing plink files corresponding to the same cohort as the phenotype file, meaning that the GRM will be re-used and the whole part_3 of the workflow will be skipped
part_1_inputs.chromosomes_input_file: this was set to the path of the file containing the chromosome name
part_1_inputs.groups_input_file: this was set to the path of the file containing the custom variants, split by custom region/group
part_2_filtering.use_vep_filtering: this is set to false, as VEP annotation filtering cannot be used for custom variant groups (note that this will be set to false automatically by the workflow at runtime anyway, if the groups_input_file variable is provided)
part_2_filtering.info_bcftools_expressions_to_include: we are including further filters here, for example processing only SNPs, which can be removed or altered if needed
part_2_filtering.vep_severity_to_include: the value provided for this variable is irrelevant (here it is left as the default), as we are not going to use VEP annotation filtering
part_2_filtering.functional_annotation_filters: the value provided for this variable is irrelevant (here it is left as the default), as we are not going to use VEP annotation filtering
part_4_testing.saige_output_file_name: this was set to the name of the output file, to be created in the folder specified in the "options.json" file
Then, your own regions and variants need to be provided in the file referred to by variable "part_1_inputs.groups_input_file
", i.e. the file found at "input_user_data/groups.tsv". The file format is described in the main documentation page; briefly, each line represents a region (group of variants), so the first field in the line is an identifier for the group, followed by TAB and then a TAB-separated list of the variants in that group, formatted as CHR:POS_REF/ALT
. Please note that the final list of variants processed may be smaller, because in this example we are still performing some filtering and masking on the custom variants, as shown by the fact that variables "part_2_filtering.use_main_filtering" and "part_2_filtering.use_masking" are set to "true".
Moreover, you need to make sure that the file referred to by variable "part_1_inputs.chromosomes_input_file
", i.e. the file found at "input_user_data/chromosomes.txt", contains only one line, equal to the name of the chromosome you want to process. The "groups_input_file
" file discussed just above will be subset to this chromosome, meaning that only variants on this chromosome will be processed in this run.
Finally, you need to specify your HPC project code in the file called "submit_workflow.sh", too.
Example 3 - I want to run AVT on selected parts of the genome and I have the coordinates¶
Assuming that some of the locations of interest may lie outside protein-coding genes, we will use the full Genomics England variant dataset for this example use case - please note that this results on average in much longer run times compared to the other dataset.
The input variables file needs to be modified in a few places - for example, this file would work (please see details below):
input variables file for Example 3
Please note that this is very similar to the default content for this file. The most important variables to consider are:
lsf_project_code: change this to your HPC project code, as usual
input_variants_dataset: this was set to the name of the dataset containing all of the Genomics England variants
phenotype_input_file: this was set to the path of an example phenotype file you can use - if you change this to your own cohort's phenotype file, remember to amend the following few variables as appropriate
precomputed_plink_files_for_grm: this was set to an array of existing plink files corresponding to the same cohort as the phenotype file, meaning that the GRM will be re-used and the whole part_3 of the workflow will be skipped, saving considerable time for this large variant dataset
part_1_inputs.chromosomes_input_file: this was set to the path of the file containing the chromosome name
part_1_inputs.coordinates_input_file: this was set to the path of a BED file containing the locations of interest
part_2_filtering.use_vep_filtering: this is set to true again, as VEP annotation filtering is available for this use case (note that it would be set to false automatically by the workflow at runtime if the groups_input_file variable were provided)
part_2_filtering.info_bcftools_expressions_to_include: we are including further filters here, for example processing only SNPs, which can be removed or altered if needed
part_2_filtering.vep_severity_to_include: this was set to a Consequence of synonymous or worse
part_2_filtering.functional_annotation_filters: the same filter for gnomAD frequencies was used as in Example 1; also, a filter was added that selects variants annotated with a CADD PHRED value >= 10, and excludes variants with no CADD PHRED annotation
part_4_testing.saige_output_file_name: this was set to the name of the output file, to be created in the folder specified in the "options.json" file
Then, your own genome locations of interest need to be provided in a BED file, referred to by variable "part_1_inputs.coordinates_input_file
", i.e. the file found at "input_user_data/coordinates.bed". Each line of the BED file will be interpreted as a group of variants, meainng that all variants found at locations included in that line will be processed together for the actual Aggregate Variant Testing. Please note that in this use case, variants can be filtered and masked like in Example 1, including being filtered by VEP annotation values - for each individual locus, all variants will be processed together regardless of the gene they affect, because no gene information is present in the input (for instance, only the variant and transcript with worst Consequence will be processed for each locus, even if two genes span that locus). Also, in this use case the longest stretch of genome that is allowed for one line of the BED file is 3 Mbp.
Moreover, you need to make sure that the file referred to by variable "part_1_inputs.chromosomes_input_file
", i.e. the file found at "input_user_data/chromosomes.txt", contains only one line, equal to the name of the chromosome you want to process. The "coordinates_input_file
" file discussed just above will be subset to this chromosome, meaning that only variants on this chromosome will be processed in this run.
Finally, you need to specify your HPC project code in the file called "submit_workflow.sh", too.
Example 4 - I want to run AVT on chrX¶
Please note that the workflow has not been tested extensively for chrX analysis. The workflow must always be run separately for chrX, and currently only on a single-sex cohort (female samples only, or male samples only). This example use case is for a male-only cohort, so that we can highlight a couple of variables that can be used for non-PAR regions in the case of male samples - but the workflow can be run in a very similar way on female-only cohorts.
We want to run AVT on all protein coding genes in chrX for our cohort - therefore, this use case is similar to Example 1 from that perspective.
The input variables file needs to be modified in a few places - for example, this file would work (please see details below):
input variables file for Example 4
Please note that this is very similar to the default content for this file. The most important variables to consider are:
lsf_project_code: change this to your HPC project code, as usual
input_variants_dataset: this was set to the name of the dataset containing Genomics England variants for protein-coding genes only
phenotype_input_file: this was set to the path of an example phenotype file you can use (male samples only) - if you change this to your own cohort's phenotype file, remember to amend the following few variables as appropriate
chrX_male_only_cohort: this was set to true, as the cohort consists of male samples only and this is a chrX analysis
precomputed_plink_files_for_grm: this was set to an empty array, meaning that the GRM will be computed from scratch in this case
part_1_inputs.chromosomes_input_file: this was set to the path of the file containing the chromosome name
part_2_filtering.info_bcftools_expressions_to_include: we are including further filters here, for example processing only SNPs, which can be removed or altered if needed
part_2_filtering.vep_severity_to_include: this was set to accept only variants annotated by VEP with a Consequence of missense or worse
part_2_filtering.functional_annotation_filters: this was set to include only one condition, gnomAD frequency < 0.001, and to include variants that are missing from gnomAD altogether
part_4_testing.sampleFile_male: this was set to the path of a text file containing the IDs of all male samples in the phenotype file, one per line
part_4_testing.is_rewrite_XnonPAR_forMales: this was set to "TRUE", because we would like SAIGE-GENE to duplicate alleles in non-PAR regions of chrX for male samples
part_4_testing.saige_output_file_name: this was set to the name of the output file, to be created in the folder specified in the "options.json" file
Moreover, you need to make sure that the file referred to by variable "part_1_inputs.chromosomes_input_file
", i.e. the file found at "input_user_data/chromosomes.txt", contains only one line, equal to "chrX".
Finally, you need to specify your HPC project code in the file called "submit_workflow.sh", too.