Functional annotation workflow¶
The Functional Annotation workflow can simplify the process of annotating VCFs using VEP. The workflow takes a list of file-paths pointing to VCFs of any type, from which it strips the genotype data prior to annotating the files using VEP with your specified options. In the current version, VEP99 is applied.
The workflow is written in WDL and is run using the Cromwell engine. It relies on external tools in order to execute the workflow. These are:
Instructions¶
To run the workflow you will need to:
- Copy the files to your working directory
- Create an input file
- Edit
submit_workflow.sh
- Edit
workflow_inputs_b38.json
- Submit the job to the HPC
Copy the files to your working directory¶
The workflow files are available in:
/gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/
Copy them to your working folder like:
cp /gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/setup_workflow.sh /full/path/to/target_directory
Create an input file¶
Create an input file (for instance my_input.txt
) containing a file paths to your target VCFs, for example:
/gel_data_resources/workflows/input_material/test_vcf_1000genomes/CCDG_13607_B01_GRM_WGS_2019-02-19_chr17.recalibrated_variants.BRCA.flanks.vcf.gz
Edit submit_workflow.sh
¶
In this file you will need to edit:
#BSUB -P
: Edit this to your LSF ptoject codeexport TMPDIR=/re_scratch/<your_scratch_location>
: edit to your tmp directory, we suggest/re_scratch/<your_GECIP>/<username>/
or/re_scratch/discovery_forum/<your_discovery_forum>/<username>/
- optionally
#BSUB -cwd /filepath/to/your/workflow_folder/
: the default is set to your current working directory, but you can change this if you prefer.
Edit workflow_inputs_b38.json
¶
Edit the workflow_inputs_b38.json
to provide the input file name you made.
Change the following line:
"functional_annotation_workflow.input_file": "input_example_b38.txt"
to:
"functional_annotation_workflow.input_file": "my_input.txt"
If you are a Discovery Forum member, refer to the workflow_inputs_b38_commercial.json
as the inputs file. The regular workflow_inputs_b38.json
will refer to data resources that are limited to academic use only, and may result in your workflow crashing. If you do however have a licence available for a particular resource that you are unable to access, please reach out to us via the Genomics England Service Desk.
Submit the job to the HPC¶
In the terminal (HPC), enter the following line:
bsub < ./submit_workflow.sh
The workflow, will be submitted to the outputs/data/
".
- The workflow will crash if the same the final output folder already contains files with the same file name. For example, if you ran the workflow for the dataset X on day one, and you do so unchanged on day two, it will crash trying to copy the final data to the output folder. This is done in order to protect your outputs from accidental overwriting - if you do not require a particular output file any more please remove it before re-running the workflow in the same location.
- In order to save disk space, please delete the content of output folders "
cromwell-workflow-logs
" and "cromwell-executions
" after you have checked the run was successful. Those folders simply contain redundant data and files that can be useful for debugging when the run fails. Additionally, the content of folders "final_output/call_logs
" and "final_output/workflow_logs
" can also be deleted safely after a successful run if not needed.
Inputs and outputs¶
Type | Type | File or Folder | Description | Requires editing |
---|---|---|---|---|
Input | File | input_example_b37.txt | .txt file containing a list of VCF files. | Yes, can be provided by user. Example only. |
Input | File | input_example_b38.txt | .txt file containing a list of VCF files. | Yes, can be provided by user. Example only. |
Input | File | functional_annotation_workflow.wdl | Main workflow script. | no |
Input | File | workflow_inputs_b37.json | Specifies runtime inputs such as the VCF list location. GRCh37 only. | yes (for to provide input file) |
Input | File | workflow_inputs_b38.json | Specifies runtime inputs such as the VCF list location. GRCh38 only. | yes (for to provide input file) |
Input | File | workflow_inputs_b38_commercial.json | Specifies runtime inputs such as the VCF list location. GRCh38 only. For Discovery Forum members. Filtered for commercially accessible datasets. | yes (for to provide input file) |
Input | File | workflow_options.json | Specifies output dir for overall output and output dir for log-files. | no (optional) |
Input | File | cromwell.conf | Configuration file on how Cromwell interacts with the LSF scheduler. | no |
Input | File | submit_workflow.sh | The final submission script to run the workflow. | yes (for project code, tmp directory, and optionally working directory) |
Output | File | final_output/data/~{output_name_prefix}~{fixed_input}~{output_name_suffix}.vcf.gz | The annotated VCF file. | NA |
Output | File | final_output/data/~{output_name_prefix}~{fixed_input}~{output_name_suffix}.vcf.gz.csi | VCF.csi (index) corresponding to the annotated VCF. | NA |
Output | Folder | final_output/call_logs/ | Contains all the log files that are generated during the query. | NA |
Output | Folder | final_output/workflow_logs/ | Will contain the running workflow log. | NA |
Output | Folder | cromwell-workflow-logs/ | Will contain the overall workflow log. | NA |
Output | Folder | cromwell-executions/ | All the temporary files for each run are stored within the genetic_report folder. Once the jobs have been completed, they will be copied over to the final output folder specified by the user. This will also contain the cromwell database for callcaching. In case an identical query is being done, the workflow will check this database to see if it has been done before, and simply extract the data again without running the full script again. |
NA |
Workflow files¶
The workflow files are available in:
/gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/
The workflow is contained within a single file specified by the .wdl suffix. This is:
functional_annotation_workflow.wdl
(the Cromwell workflow file needed by the HPC to manage the annotation job)
There are a few files available that specify the runtime inputs. However, depending on which genome build your VCFs are based on, you will need one of these. There is also a specific input file if you are a Discovery Forum member as some datasets are restricted to academic use only. These are:
workflow_inputs*
(specifies runtime inputs such as the gene list location, and Release Programme Version)_b37.json
_b38.json
_b38_commercial.json
There is another file that specifies remaining options for the workflow. This is:
workflow_options.json
(specifies output dir for overall output and output dir for log-files)
Finally, there is a file detailing how Cromwell interacts with the LSF scheduler. This is:
cromwell.conf
(this file should never be changed)
Beyond the general configuration files, we also provide two example files. One for VCFs called on genomes aligned to build 37 and one for VCFs called on genomes aligned build 38.
input_example_b37.txt
input_example_b38.txt
Further reading¶
In some cases you may wish to change the workflow configurations by editing the input files. Please refer to the list below to see which file you need to edit for which topic.
- How do I change the output directory?
- Please refer to
workflow_options.json
- Please refer to
- How do I change the prefix and suffix of the output files?
- Please refer to the
workflow_inputs_*.json
files
- Please refer to the
- How do I specify which genome build my VCFs are built upon?
- Please refer to the
submit_workflow.sh
file
- Please refer to the
workflow_options.json¶
While the workflow_options.json does not need to be adjusted, you can edit the file to change the output directory. See in-line comments:
workflow_options.json
workflow_inputs_b37|8.json¶
The workflow inputs JSON files contain the VEP options used within the workflow, there are separate files for GRCh37 and GRCh38 (and also a separate one for Discovery Forum members). We recommend against modifying this file too extensively.
The input file argument on line four of the relevant file will need to be edited to point the workflow at the list of files you wish to have annotated.
Please refer to in-line comments on line six and line seven to edit the prefix and suffix of the output files, respectively.
Should you wish to use a VEP plugin that is not listed in the example script below you should include these between lines 38 and 42 for the build38 file. While we host all of the officially listed plugins PolyPhen2 scores cannot be used by our Discovery Forum users unless their Company/Organisation separately purchase a commercial licence from PolyPhen2. This restriction does not apply to our Research Network members. Due to these restrictions however, we have also provided an workflow_inputs_*.json specifically for our Discovery Forum members.
workflow_inputs_b38.json
workflow_inputs_b38_commercial.json
workflow_inputs_b37.json
The submit_workflow.sh¶
This is the main script that will need to use to submit your annotation job. As mentioned in the Usage section, you will need to ensure that you include the correct project code for your work on line two (See this page for more information on project codes). Second, you will need to set the correct TMPDIR variable on line 12 (The TMPDIR variable must be set. Finally, to provide your workflow with the information to match the genome build of your VCF and the annotation workflow, you will need to edit the relevant inputs JSON file on line 16.