
I'm interested in a phenotype and I want to know what variants are related

How to use this page

Below you can switch between three categories: no code, existing tools and from scratch. Please select the version that matches your skills and the scale of the task you want to do.

Category | Scale | Skills needed | Overview | Audience
no code | small | basic IT skills | Uses no-code tools in the RE | Clinicians and biologists without coding or command line skills
existing tools | large | command line and limited coding | Uses pipelines generated in-house to carry out standardised analyses | Bioinformaticians/computational biologists doing standard analyses
from scratch | large | command line, coding and common bioinformatics tools | Illustrates the steps you might follow using common bioinformatics tools to carry out custom analyses | Bioinformaticians/computational biologists doing custom analyses

The instructions in each section include links to the relevant pages in the documentation. Links are tagged as:

  • Tutorials
  • Tools - descriptive
  • Data - descriptive
  • Pre-made workflows
  • Reference lists/tables

Find participants with a phenotype

To search for phenotypes our participants were recruited for or phenotypes that can be identified from participants' medical records, you can use Participant Explorer or LabKey:

The IVA variant browser also allows you to search for variants by various filters, including HPO terms. This will give you all the variants found in participants who have a particular phenotype, not all variants found to be causal for a phenotype. Note that we only have HPO terms for rare disease participants, and that these are only those phenotypes identified on recruitment. Look at:

Once you've familiarised yourself with the tools, you can use them to create a cohort of participants with your phenotype of interest and filter their variants.

You can compile the data you've found in a text editor, but if you prefer a word processor or spreadsheet, we have LibreOffice available:

You may be able to perform some statistical analysis on your data and identify correlations using LibreOffice Calc (LO Calc). However, some analyses may require coding on the HPC. Please take a look at the other sections for help with this.

Working with the HPC

You will need to work on the HPC for any large-scale analyses. You can learn more about the HPC and how to access it:

There are folders on the HPC for your Research Network domain or Discovery forum. You should use your relevant folder as your working directory. These are also accessible from the desktop:

GWAS workflow

The prebuilt GWAS workflow allows you to carry out a GWAS analysis. The documentation includes an example that you can run to try out the pipeline yourself:

To work with the GWAS pipeline, you will need to build a cohort, which you can do using LabKey. You will need to familiarise yourself with the clinical data we have available and with the LabKey API, which you will use to access it. We also have a tutorial on cohort building to work through:

Now you can create a cohort for your phenotype(s) of interest. These can be as complex as you need them to be, including sub-categories of phenotypes and modifiers.
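
As a rough illustration of combining criteria, the sketch below builds a cohort from phenotype rows that have already been fetched, using Python sets. The row layout (`participant_id`, `hpo_id`) and the participant IDs are assumptions made for this example and do not describe the schema of any particular LabKey table.

```python
from collections import defaultdict


def build_cohort(rows, include_all=(), include_any=(), exclude=()):
    """Return the set of participant IDs that match the given term criteria.

    rows        -- iterable of dicts with 'participant_id' and 'hpo_id' keys
                   (an assumed layout, for illustration only)
    include_all -- terms a participant must have all of
    include_any -- if given, a participant must have at least one of these
    exclude     -- terms a participant must have none of
    """
    terms_by_participant = defaultdict(set)
    for row in rows:
        terms_by_participant[row["participant_id"]].add(row["hpo_id"])

    cohort = set()
    for participant, terms in terms_by_participant.items():
        if not set(include_all) <= terms:
            continue  # missing a required term
        if include_any and not (set(include_any) & terms):
            continue  # none of the optional terms present
        if set(exclude) & terms:
            continue  # has an excluded term
        cohort.add(participant)
    return cohort


# Hypothetical rows: P1 has both terms, P2 only the first, P3 only the second.
example_rows = [
    {"participant_id": "P1", "hpo_id": "HP:0001250"},
    {"participant_id": "P1", "hpo_id": "HP:0001249"},
    {"participant_id": "P2", "hpo_id": "HP:0001250"},
    {"participant_id": "P3", "hpo_id": "HP:0001249"},
]
cohort = build_cohort(
    example_rows, include_all=["HP:0001250"], exclude=["HP:0001249"]
)
print(sorted(cohort))  # ['P2']
```

Keeping each criterion as a separate argument makes it straightforward to add sub-categories or modifiers later without rewriting the query that fetched the rows.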

We provide support for coding in Python or R in the RE. Interactive coding tools such as RStudio and Jupyter notebooks are available on the HPC:

Use this cohort to run the GWAS pipeline. This will provide you with summary statistics and Manhattan plots. You can further analyse these data using Python or R, or with LibreOffice Calc.
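
For follow-up analysis in Python, a first step is often to pull out the variants that pass the conventional genome-wide significance threshold of 5e-8. The sketch below assumes a tab-separated summary statistics file with a p-value column named `P`; the column names and example rows are hypothetical, so check them against the files the pipeline actually produces.

```python
import csv
import io

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold


def significant_hits(handle, p_column="P", threshold=GENOME_WIDE_P):
    """Yield rows of a tab-separated summary statistics file with p < threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or p_column not in reader.fieldnames:
        raise ValueError(f"Column {p_column!r} not found in header")
    for row in reader:
        try:
            p_value = float(row[p_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric p-value
        if p_value < threshold:
            yield row


# Hypothetical example data; real files will have more columns and rows.
example = io.StringIO(
    "CHROM\tPOS\tID\tP\n"
    "chr1\t12345\tvar1\t3.2e-9\n"
    "chr2\t67890\tvar2\t0.04\n"
    "chr3\t13579\tvar3\tNA\n"
)
hits = list(significant_hits(example))
print([row["ID"] for row in hits])  # ['var1']
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open(path, newline="")`.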

Find VCFs

If you prefer to work with the VCF files directly, you can find information about our gVCFs and aggregate VCFs:

You can find out more about the file structure where these are located and your own working directories here:

Use tools on the HPC

You will find tools such as BCFtools installed on the HPC, which you can use to explore the VCFs.
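
As one possible way to explore a VCF from Python, the sketch below builds a `bcftools query` command that prints selected fields for a single region, and provides a small helper to run it. The file name and region are placeholders, `-r` needs the VCF to be compressed and indexed, and how BCFtools is made available in your session on the HPC is not covered here.

```python
import shlex
import subprocess


def bcftools_query_command(vcf_path, region, fields=("CHROM", "POS", "REF", "ALT")):
    """Build a `bcftools query` command printing the chosen fields for one region."""
    # bcftools itself expands the \t and \n escapes in the format string,
    # so they are passed through as literal backslash sequences.
    fmt = "\\t".join(f"%{name}" for name in fields) + "\\n"
    return ["bcftools", "query", "-r", region, "-f", fmt, vcf_path]


def run(command):
    """Run the command and return its standard output as text."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


# Placeholder path and region, for illustration only.
cmd = bcftools_query_command("example.vcf.gz", "chr1:100000-200000")
print(shlex.join(cmd))  # shows the equivalent shell command
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems with the format string. Calling `run(cmd)` only works where `bcftools` is on the `PATH` and the file exists.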

Filter for consented samples

To ensure you are working only with consented samples, you may need to carry out some filtering steps on your VCFs. There are details of how to do this with the aggregate VCFs.

If you are working with the gVCFs, you will need to use LabKey and the current data version to filter.
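
Whichever source the consent list comes from, the filtering step amounts to comparing the sample IDs in the VCF header with the consented IDs. The sketch below shows that comparison in plain Python. It assumes that the consent list uses the same identifiers as the VCF sample columns, which you should verify for your data; the tiny VCF and the sample IDs are made up for the example.

```python
import gzip
import os
import tempfile


def vcf_sample_ids(path):
    """Return the sample IDs from the #CHROM header line of a VCF (plain or gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    raise ValueError(f"No #CHROM header line found in {path}")


def split_by_consent(sample_ids, consented_ids):
    """Split sample IDs into (keep, drop) lists, preserving VCF column order."""
    consented = set(consented_ids)
    keep = [s for s in sample_ids if s in consented]
    drop = [s for s in sample_ids if s not in consented]
    return keep, drop


# Made-up header-only VCF with three samples.
with tempfile.TemporaryDirectory() as tmp:
    vcf_path = os.path.join(tmp, "tiny.vcf")
    with open(vcf_path, "w") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n")
    samples = vcf_sample_ids(vcf_path)

keep, drop = split_by_consent(samples, {"S1", "S3", "S9"})
print(keep, drop)  # ['S1', 'S3'] ['S2']
```

The `keep` list can be written one ID per line to a text file and passed to `bcftools view -S` to subset the VCF.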

Phenotypes associated with participants

You can also use LabKey to map participants to phenotypes, including HPO terms associated with rare disease, ICD10 codes in medical history and the disease participants were recruited for. We have tutorials on using the LabKey API to build cohorts based on phenotypes and fetching medical history for participants:
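
The general shape of such a query through the LabKey Python API might look like the sketch below. The table and column names (`phenotype_table`, `participant_id`, `hpo_id`) are hypothetical placeholders, and the server domain, container path, context path and schema are left as parameters because they depend on your environment and data release; the tutorials give the actual values.

```python
def phenotype_sql(table, id_column, term_column, terms):
    """Build a LabKey SQL query selecting rows whose term is in `terms`."""
    if not terms:
        raise ValueError("At least one term is required")
    # Double any single quotes so each term is a valid SQL string literal.
    quoted = ", ".join("'" + term.replace("'", "''") + "'" for term in terms)
    return (
        f"SELECT {id_column}, {term_column} "
        f"FROM {table} "
        f"WHERE {term_column} IN ({quoted})"
    )


def fetch_rows(domain, container_path, context_path, schema, sql):
    """Run the query via the LabKey Python API and return the result rows."""
    # Imported here so the SQL helper can be used without the package installed.
    from labkey.api_wrapper import APIWrapper

    api = APIWrapper(domain, container_path, context_path, use_ssl=True)
    result = api.query.execute_sql(schema, sql)
    return result["rows"]


# Hypothetical table and column names, for illustration only.
sql = phenotype_sql("phenotype_table", "participant_id", "hpo_id", ["HP:0001250"])
print(sql)
```

`fetch_rows` returns a list of dictionaries, one per row, which can be passed to a cohort-building step or loaded into a data frame.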

Create or import pipelines

You can analyse and combine these data in any way you choose, using any of the programming languages provided on the HPC. We also provide conda environments for working with Python and R libraries.

If you have your own pipelines written as containers, you can use Singularity to bring them into the RE.

Compile text and figures

You can use LO Calc to create figures and tables. You can also write any notes in LO Writer.


The only way to export the results of your analysis from the RE is through Airlock. You should include any notes you may have made by hand. It is your responsibility to ensure your data conforms to the Airlock rules and does not contain any identifying data.