Skip to content

Functional annotation workflow

The Functional Annotation workflow can simplify the process of annotating VCFs using VEP. The workflow takes a list of file-paths pointing to VCFs of any type, from which it strips the genotype data prior to annotating the files using VEP with your specified options. In the current version, VEP99 is applied.

The workflow is written in WDL and is run using the Cromwell engine. It relies on external tools in order to execute the workflow. These are:

Instructions

To run the workflow you will need to:

  1. Copy the files to your working directory
  2. Create an input file
  3. Edit submit_workflow.sh
  4. Edit workflow_inputs_b38.json
  5. Submit the job to the HPC

Copy the files to your working directory

The workflow files are available in:

/gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/

Copy them to your working folder like:

cp /gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/setup_workflow.sh /full/path/to/target_directory

Create an input file

Create an input file (for instance my_input.txt) containing a file paths to your target VCFs, for example:

/gel_data_resources/workflows/input_material/test_vcf_1000genomes/CCDG_13607_B01_GRM_WGS_2019-02-19_chr17.recalibrated_variants.BRCA.flanks.vcf.gz

Edit submit_workflow.sh

In this file you will need to edit:

  • #BSUB -P: Edit this to your LSF ptoject code
  • export TMPDIR=/re_scratch/<your_scratch_location>: edit to your tmp directory, we suggest /re_scratch/<your_GECIP>/<username>/ or /re_scratch/discovery_forum/<your_discovery_forum>/<username>/
  • optionally #BSUB -cwd /filepath/to/your/workflow_folder/: the default is set to your current working directory, but you can change this if you prefer.

Edit workflow_inputs_b38.json

Edit the workflow_inputs_b38.json to provide the input file name you made.

Change the following line:

"functional_annotation_workflow.input_file": "input_example_b38.txt"

to:

"functional_annotation_workflow.input_file": "my_input.txt"

If you are a Discovery Forum member, refer to the workflow_inputs_b38_commercial.json as the inputs file. The regular workflow_inputs_b38.json will refer to data resources that are limited to academic use only, and may result in your workflow crashing. If you do however have a licence available for a particular resource that you are unable to access, please reach out to us via the Genomics England Service Desk.

Submit the job to the HPC

In the terminal (HPC), enter the following line:

bsub < ./submit_workflow.sh

The workflow, will be submitted to the queue. This will generate subsequent tasks which are submitted to the HPC in the queue to allow annotation of multiple VCFs in parallel. This may lead up to a few hundred or thousand jobs being submitted from your username, depending on the number of input VCFs you have provided. The medium queue will run for 24 hours at most, so if your VCFs are very large, it may be required to subset your VCFs. The main output files will be saved in a folder called by default "outputs/data/".

  • The workflow will crash if the same the final output folder already contains files with the same file name. For example, if you ran the workflow for the dataset X on day one, and you do so unchanged on day two, it will crash trying to copy the final data to the output folder. This is done in order to protect your outputs from accidental overwriting - if you do not require a particular output file any more please remove it before re-running the workflow in the same location.
  • In order to save disk space, please delete the content of output folders "cromwell-workflow-logs" and "cromwell-executions" after you have checked the run was successful. Those folders simply contain redundant data and files that can be useful for debugging when the run fails. Additionally, the content of folders "final_output/call_logs" and "final_output/workflow_logs" can also be deleted safely after a successful run if not needed.

Inputs and outputs

Type Type File or Folder Description Requires editing
Input File input_example_b37.txt .txt file containing a list of VCF files. Yes, can be provided by user. Example only.
Input File input_example_b38.txt .txt file containing a list of VCF files. Yes, can be provided by user. Example only.
Input File functional_annotation_workflow.wdl Main workflow script. no
Input File workflow_inputs_b37.json Specifies runtime inputs such as the VCF list location. GRCh37 only. yes (for to provide input file)
Input File workflow_inputs_b38.json Specifies runtime inputs such as the VCF list location. GRCh38 only. yes (for to provide input file)
Input File workflow_inputs_b38_commercial.json Specifies runtime inputs such as the VCF list location. GRCh38 only. For Discovery Forum members. Filtered for commercially accessible datasets. yes (for to provide input file)
Input File workflow_options.json Specifies output dir for overall output and output dir for log-files. no (optional)
Input File cromwell.conf Configuration file on how Cromwell interacts with the LSF scheduler. no
Input File submit_workflow.sh The final submission script to run the workflow. yes (for project code, tmp directory, and optionally working directory)
Output File final_output/data/~{output_name_prefix}~{fixed_input}~{output_name_suffix}.vcf.gz The annotated VCF file. NA
Output File final_output/data/~{output_name_prefix}~{fixed_input}~{output_name_suffix}.vcf.gz.csi VCF.csi (index) corresponding to the annotated VCF. NA
Output Folder final_output/call_logs/ Contains all the log files that are generated during the query. NA
Output Folder final_output/workflow_logs/ Will contain the running workflow log. NA
Output Folder cromwell-workflow-logs/ Will contain the overall workflow log. NA
Output Folder cromwell-executions/ All the temporary files for each run are stored within the genetic_report folder. Once the jobs have been completed, they will be copied over to the final output folder specified by the user.
This will also contain the cromwell database for callcaching. In case an identical query is being done, the workflow will check this database to see if it has been done before, and simply extract the data again without running the full script again.
NA

Workflow files

The workflow files are available in:

/gel_data_resources/workflows/BRS_tools_functional_annotation_workflow/v1.1/

The workflow is contained within a single file specified by the .wdl suffix. This is:

  • functional_annotation_workflow.wdl (the Cromwell workflow file needed by the HPC to manage the annotation job)

There are a few files available that specify the runtime inputs. However, depending on which genome build your VCFs are based on, you will need one of these. There is also a specific input file if you are a Discovery Forum member as some datasets are restricted to academic use only. These are:

  • workflow_inputs* (specifies runtime inputs such as the gene list location, and Release Programme Version)
    • _b37.json
    • _b38.json
    • _b38_commercial.json

There is another file that specifies remaining options for the workflow. This is:

  • workflow_options.json (specifies output dir for overall output and output dir for log-files)

Finally, there is a file detailing how Cromwell interacts with the LSF scheduler. This is:

  • cromwell.conf (this file should never be changed)

Beyond the general configuration files, we also provide two example files. One for VCFs called on genomes aligned to build 37 and one for VCFs called on genomes aligned build 38.

  • input_example_b37.txt
  • input_example_b38.txt

Further reading

In some cases you may wish to change the workflow configurations by editing the input files. Please refer to the list below to see which file you need to edit for which topic.

  • How do I change the output directory?
    • Please refer to workflow_options.json
  • How do I change the prefix and suffix of the output files?
    • Please refer to the workflow_inputs_*.json files
  • How do I specify which genome build my VCFs are built upon?
    • Please refer to the submit_workflow.sh file

workflow_options.json

While the workflow_options.json does not need to be adjusted, you can edit the file to change the output directory. See in-line comments:

workflow_options.json
{
    "final_workflow_outputs_dir": "outputs/data/" // edit this line to provide an overall output location
    "final_workflow_log_dir": "outputs/workflow_logs/", // edit this line to provide an output location for the logs
    "final_call_logs_dir": "outputs/call_logs/", // edit this line  to provide an output location for the specific task logs
    "use_relative_output_paths": true,
    "write_to_cache": true,
    "read_from_cache": true,
    "default_runtime_attributes": {
        "maxRetries": 5
    }
}

workflow_inputs_b37|8.json

The workflow inputs JSON files contain the VEP options used within the workflow, there are separate files for GRCh37 and GRCh38 (and also a separate one for Discovery Forum members). We recommend against modifying this file too extensively.

The input file argument on line four of the relevant file will need to be edited to point the workflow at the list of files you wish to have annotated.

Please refer to in-line comments on line six and line seven to edit the prefix and suffix of the output files, respectively.

Should you wish to use a VEP plugin that is not listed in the example script below you should include these between lines 38 and 42 for the build38 file. While we host all of the officially listed plugins PolyPhen2 scores cannot be used by our Discovery Forum users unless their Company/Organisation separately purchase a commercial licence from PolyPhen2. This restriction does not apply to our Research Network members. Due to these restrictions however, we have also provided an workflow_inputs_*.json specifically for our Discovery Forum members.

workflow_inputs_b38.json
{
    "functional_annotation_workflow.threads":4,
    "functional_annotation_workflow.bcftools": "bio/BCFtools/1.9-foss-2018b",
    "functional_annotation_workflow.vep": "bio/VEP/99.1-foss-2019a-Perl-5.28.1",
    "functional_annotation_workflow.input_file": "input_example_b38.txt",
    "functional_annotation_workflow.PERL5LIB": "/resources/tools/manual_apps/software/bio/Bio-DB-BigFile/foss-2019a-Perl-5.28.1/lib/perl5/x86_64-linux-thread-multi:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38:$PERL5LIB",
    "functional_annotation_workflow.output_name_prefix":"", // rename "" to change the output file prefix
    "functional_annotation_workflow.output_name_suffix":"annotated", // rename "annotated" to change the output file suffix
    "functional_annotation_workflow.vep_options": [
        "--assembly GRCh38",
        "--dir_cache /resources/data/vep.caches/helix/99",
        "--cache_version 99",
        "--verbose",
        "--no_stats",
        "--fasta /public_data_resources/reference/GRCh38/GRCh38Decoy_no_alt.fa",
        "--ccds",
        "--uniprot",
        "--hgvs",
        "--symbol",
        "--numbers",
        "--domains",
        "--regulatory",
        "--canonical",
        "--protein",
        "--biotype",
        "--tsl",
        "--appris",
        "--gene_phenotype",
        "--af",
        "--af_1kg",
        "--af_esp",
        "--max_af",
        "--pubmed",
        "--variant_class",
        "--mane",
        "--overlaps",
        "--plugin dbNSFP,/tools/apps/restricted_academic/software/bio/dbNSFP/4.0/dbNSFP4.0a.txt.gz,LRT_score,MutationTaster_score,SIFT_score,SIFT_converted_rankscore,SIFT_pred,SIFT4G_score,SIFT4G_converted_rankscore,Polyphen2_HDIV_score,Polyphen2_HDIV_rankscore,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_rankscore,Polyphen2_HVAR_pred,REVEL_score,REVEL_rankscore,MutPred_score,MutPred_rankscore,MutPred_protID,PrimateAI_pred",
        "--plugin CADD,/public_data_resources/CADD/v1.6/GRCh38/whole_genome_SNVs.tsv.gz,/public_data_resources/CADD/v1.6/GRCh38/gnomad.genomes.r3.0.indel.tsv.gz",
        "--plugin SpliceAI,snv=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.snv.hg38.vcf.gz,indel=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.indel.hg38.vcf.gz",
        "--plugin SpliceRegion",
        "--plugin LoF,loftee_path:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38,human_ancestor_fa:/public_data_resources/vep_resources/Build-38/human_ancestor.fa.gz,gerp_bigwig:/public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,conservation_file:/public_data_resources/vep_resources/Build-38/loftee.sql",
        "--custom /public_data_resources/gnomad/v3/gnomad.genomes.r3.0.sites.vcf.bgz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_female",
        "--custom /public_data_resources/phylop100way/hg38.phyloP100way.bw,PhyloP,bigwig",
        "--custom /public_data_resources/TOPMed/allele_frequencies/bravo-dbsnp-all.vcf.gz,topmedg,vcf,exact,0,AF,SVM",
        "--custom /public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,GERP,bigwig",
        "--fork 4",
        "--compress_output bgzip"
    ]

}
workflow_inputs_b38_commercial.json
{
    "functional_annotation_workflow.threads":4,
    "functional_annotation_workflow.bcftools": "bio/BCFtools/1.9-foss-2018b",
    "functional_annotation_workflow.vep": "bio/VEP/99.1-foss-2019a-Perl-5.28.1",
    "functional_annotation_workflow.input_file": "input_example_b38.txt",
    "functional_annotation_workflow.PERL5LIB": "/resources/tools/manual_apps/software/bio/Bio-DB-BigFile/foss-2019a-Perl-5.28.1/lib/perl5/x86_64-linux-thread-multi:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38:$PERL5LIB",
    "functional_annotation_workflow.output_name_prefix":"", // rename "" to change the output file prefix
    "functional_annotation_workflow.output_name_suffix":"annotated", // rename "annotated" to change the output file suffix
    "functional_annotation_workflow.vep_options": [
        "--assembly GRCh38",
        "--dir_cache /resources/data/vep.caches/helix/99",
        "--cache_version 99",
        "--verbose",
        "--no_stats",
        "--fasta /public_data_resources/reference/GRCh38/GRCh38Decoy_no_alt.fa",
        "--ccds",
        "--uniprot",
        "--hgvs",
        "--symbol",
        "--numbers",
        "--domains",
        "--regulatory",
        "--canonical",
        "--protein",
        "--biotype",
        "--tsl",
        "--appris",
        "--gene_phenotype",
        "--af",
        "--af_1kg",
        "--af_esp",
        "--max_af",
        "--pubmed",
        "--variant_class",
        "--mane",
        "--overlaps",
        "--plugin dbNSFP,/public_data_resources/dbNSFP/dbNSFP4.2c/dbNSFP4.2c.txt.gz,LRT_score,MutationTaster_score,SIFT_score,SIFT_converted_rankscore,SIFT_pred,SIFT4G_score,SIFT4G_converted_rankscore,MutPred_score,MutPred_rankscore,MutPred_protID,PrimateAI_pred"
        "--plugin SpliceAI,snv=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.snv.hg38.vcf.gz,indel=/public_data_resources/SpliceAI/Predicting_splicing_from_primary_sequence-66029966/genome_scores_v1.3/spliceai_scores.raw.indel.hg38.vcf.gz",
        "--plugin SpliceRegion",
        "--plugin LoF,loftee_path:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh38,human_ancestor_fa:/public_data_resources/vep_resources/Build-38/human_ancestor.fa.gz,gerp_bigwig:/public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,conservation_file:/public_data_resources/vep_resources/Build-38/loftee.sql",
        "--custom /public_data_resources/gnomad/v3/gnomad.genomes.r3.0.sites.vcf.bgz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_female",
        "--custom /public_data_resources/phylop100way/hg38.phyloP100way.bw,PhyloP,bigwig",
        "--custom /public_data_resources/TOPMed/allele_frequencies/bravo-dbsnp-all.vcf.gz,topmedg,vcf,exact,0,AF,SVM",
        "--custom /public_data_resources/vep_resources/Build-38/gerp_conservation_scores.homo_sapiens.GRCh38.bw,GERP,bigwig",
        "--fork 4",
        "--compress_output bgzip"
    ]

}
workflow_inputs_b37.json
{
      "functional_annotation_workflow.threads":4,
      "functional_annotation_workflow.bcftools": "bio/BCFtools/1.9-foss-2018b",
      "functional_annotation_workflow.vep": "bio/VEP/99.1-foss-2019a-Perl-5.28.1",
      "functional_annotation_workflow.input_file": "input_example_v37.txt",
      "functional_annotation_workflow.PERL5LIB": "/resources/tools/manual_apps/software/bio/Bio-DB-BigFile/foss-2019a-Perl-5.28.1/lib/perl5/x86_64-linux-thread-multi:/resources/tools/apps/software/bio    /VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh37:$PERL5LIB",
      "functional_annotation_workflow.output_name_prefix":"",
      "functional_annotation_workflow.output_name_suffix":"annotated",
      "functional_annotation_workflow.vep_options": [
          "--assembly GRCh37",
          "--dir_cache /resources/data/vep.caches/helix/99",
          "--cache_version 99",
          "--verbose",
          "--no_stats",
          "--fasta /public_data_resources/reference/GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa",
          "--ccds",
          "--uniprot",
          "--hgvs",
          "--symbol",
          "--numbers",
          "--domains",
          "--regulatory",
          "--canonical",
          "--protein",
          "--biotype",
          "--tsl",
          "--appris",
          "--gene_phenotype",
          "--af",
          "--af_1kg",
          "--af_esp",
          "--max_af",
          "--pubmed",
          "--variant_class",
          "--mane",
          "--overlaps",
          "--plugin LoF,loftee_path:/resources/tools/apps/software/bio/VEP/99.1-foss-2019a-Perl-5.28.1/Plugins/loftee-GRCh37,human_ancestor_fa:/public_data_resources/vep_resources/Build-37/human_ances    tor.fa.gz,gerp_file:/public_data_resources/vep_resources/Build-37/GERP_scores.final.sorted.txt.gz,conservation_file:/public_data_resources/vep_resources/Build-37/phylocsf_gerp.sql",
          "--custom /public_data_resources/gnomad/GRCh37/genomes/gnomad.genomes.r2.0.1.sites.9.vcf.gz,gnomADg,vcf,exact,0,AF,AF_afr,AF_amr,AF_asj,AF_eas,AF_sas,AF_fin,AF_nfe,AF_oth,AF_ami,AF_male,AF_f    emale",
          "--fork 4",
          "--compress_output bgzip"
      ]

 }

The submit_workflow.sh

This is the main script that will need to use to submit your annotation job. As mentioned in the Usage section, you will need to ensure that you include the correct project code for your work on line two (See this page for more information on project codes). Second, you will need to set the correct TMPDIR variable on line 12 (The TMPDIR variable must be set. Finally, to provide your workflow with the information to match the genome build of your VCF and the annotation workflow, you will need to edit the relevant inputs JSON file on line 16.

submit_workflow.sh
#BSUB -q inter
##BSUB -P <your_project_code> # edit this line
##BSUB -o logs/%J.stdout
##BSUB -e logs/%J.stderr
##BSUB -cwd .
##BSUB -n 2
##BSUB -R "rusage[mem=36000] span[hosts=1]"

module load bio/cromwell/51
mkdir -p logs

tstamp=$(date +%Y_%m_%d_%Hh%M)

export TMPDIR=/re_scratch/<your_scratch_location> # edit this line to include your temp_dir

java -Dconfig.file=cromwell.conf -jar $CROMWELL_JAR \
 run functional_annotation_workflow.wdl \
 -i workflow_inputs_b38.json \ # this will need to be changed to workflow_inputs_b37.json if your input data is aligned to hg19/GRCh37
 -o workflow_options.json \
 -m metadata_"$tstamp".json

Last update: November 8, 2023