Warning

Past training sessions may include information that is no longer true, in either the presentation or the Q&A. Please double check against the relevant documentation pages.

Building rare disease cohorts with matching controls, May 2026¶

Building a cohort is a vital first step in many kinds of genomics studies, such as GWAS, survival analysis and identifying cancer characteristics. The vast array of phenotypic data available in the Genomics England Research Environment (GEL RE), both recruited disease and electronic health records, is a great resource for cohort building and verification.

This training session will go over some of the ways you can build cohorts in the GEL RE: Participant Explorer for no-code creation and the Labkey API for programmatic construction and verification. Using both methods, we will show how you can pull out the genomic file locations, or the participant identifiers to use with variant aggregation files. During the session, we will discuss the tables in the database which contain phenotypic data, using ICD-10 and HPO codes for diagnoses in the primary and secondary tables, plus continuous measurements in rare disease. We will also look at how you can build cohorts matched on sex and ethnicity.
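To make the idea of a matched cohort concrete, here is a minimal sketch in plain Python that picks, for each case, a fixed number of controls with the same sex and ethnicity, never reusing a control. The participant records, field names and category values are invented for illustration and are not the actual Labkey columns; the function `match_controls` is likewise hypothetical and not part of any Genomics England tooling.

```python
import random
from collections import defaultdict


def match_controls(cases, controls, per_case=1, seed=0):
    """Return {case_id: [control_ids]} matched exactly on sex and ethnicity."""
    rng = random.Random(seed)

    # Pool the available controls by their (sex, ethnicity) stratum.
    pools = defaultdict(list)
    for control in controls:
        pools[(control["sex"], control["ethnicity"])].append(control["id"])
    for pool in pools.values():
        rng.shuffle(pool)

    matched = {}
    for case in cases:
        pool = pools[(case["sex"], case["ethnicity"])]
        # Take controls off the pool so each one is used at most once.
        matched[case["id"]] = [pool.pop() for _ in range(min(per_case, len(pool)))]
    return matched


# Invented example records (assumption: not real participant data or columns).
cases = [
    {"id": "case1", "sex": "Female", "ethnicity": "A"},
    {"id": "case2", "sex": "Male", "ethnicity": "B"},
]
controls = [
    {"id": "ctrl1", "sex": "Female", "ethnicity": "A"},
    {"id": "ctrl2", "sex": "Female", "ethnicity": "A"},
    {"id": "ctrl3", "sex": "Male", "ethnicity": "B"},
    {"id": "ctrl4", "sex": "Male", "ethnicity": "C"},
]

matched = match_controls(cases, controls, per_case=1)
print(matched)
```

A case with too few eligible controls simply receives a shorter list, so it is worth checking the list lengths before using the cohort in an analysis.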

Timetable¶

13.30 Introduction and admin
13.35 Parameters and considerations for building a cohort
13.45 No-code cohort building with Participant Explorer
13.55 Labkey tables for cohort building in rare disease
14.05 Building cohorts programmatically in Python and R
14.15 Creating a matched cohort
14.25 Getting genomic filepaths for your cohort
14.35 Using your cohort with aggregate VCFs and bcftools
14.45 Getting help and questions

Learning objectives¶

After this training you will know:

Where to find phenotypic and covariate data for building cohorts in the Genomics England Research Environment
How to create cohorts using the Participant Explorer no-code interface
How to use the Labkey API to create and verify cohorts with Python or R

Target audience¶

This training is aimed at researchers:

working with the Genomics England Research Environment
working in rare disease genomics
who can programme in Python and/or R (a small segment of the training is suitable for non-programmers)

Date¶

13th May 2026

Materials¶

You can access the redacted slides and video below. All sensitive data has been censored. You can access and copy code from the Jupyter and R notebooks used in the training at:

/gel_data_resources/example_scripts/workshop_scripts/rare_disease_cohorts_2026

Slides¶

View the slides

Video¶

Give us feedback on this tutorial

Q&A

Will a recording of this training be shared? Thank you

As Emily has mentioned, a redacted recording will be made available after the session. It can be accessed here: https://re-docs.genomicsengland.co.uk/upcoming/#past-training-sessions

Thank you and sorry I missed this earlier.

Can you please briefly explain the normalized part of the record again, if possible?

The medical records come from the NHS. Clinicians fill in many fields with free text, not all of which are shared with us. Because of this there can be some ambiguity in the language used, so the diagnosis information is converted into standard codes, such as SNOMED, HPO or ICD-10, which ensures that the terms used mean the same across all participants.

Thank you!

I have a specific question about transcriptomics_file_paths_and_types in Labkey. What is the best way to pull a table of all the genome file paths for these cases? The Labkey table only gives the RNA data location, but I need the WGS data first. I tried CloudOS, but it doesn't seem to have the extension data in it, only the pilot.

The genome files will be the VCF files called from the RNAseq data that is provided. We generally do not recommend working from the alignments, as the calling has already been performed.

Ignore the above. So there aren't separate WGS genomes for the transcriptomics data?

Each participant will have WGS data that is listed in the genome_file_paths_and_types table. You can filter this table by participant ID to get the filepaths that you need.

So download participants from the transcriptomics table and search the genomics table with them? Would the filepaths how-to guide be best for that, do you think?

The best way of doing this will be to use the API in the same way that Emily is showing here. You will need to use SQL JOIN clauses to filter one table by another. For additional information on writing SQL queries, I suggest looking at the W3Schools pages on this language: https://www.w3schools.com/sql/
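As a rough illustration of filtering one table by another with a JOIN, the sketch below uses Python's built-in sqlite3 module with two toy in-memory tables standing in for the Labkey tables. The table names come from the question above; the column names (participant_id, file_path) and every row are invented assumptions, so check the data dictionary for the real columns. A SELECT statement of this shape should be adaptable to a query sent through the Labkey API, though Labkey SQL may differ from SQLite in details.

```python
import sqlite3

# Toy stand-ins for the two Labkey tables; rows and column names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transcriptomics_file_paths_and_types (participant_id INTEGER, file_path TEXT);
    CREATE TABLE genome_file_paths_and_types (participant_id INTEGER, file_path TEXT);
    INSERT INTO transcriptomics_file_paths_and_types VALUES
        (1001, '/example/rna/1001.bam'),
        (1002, '/example/rna/1002.bam');
    INSERT INTO genome_file_paths_and_types VALUES
        (1001, '/example/wgs/1001.vcf.gz'),
        (1002, '/example/wgs/1002.vcf.gz'),
        (1003, '/example/wgs/1003.vcf.gz');
""")

# Keep only genome rows whose participant also appears in the transcriptomics table.
join_sql = """
    SELECT DISTINCT g.participant_id, g.file_path
    FROM genome_file_paths_and_types AS g
    INNER JOIN transcriptomics_file_paths_and_types AS t
        ON g.participant_id = t.participant_id
    ORDER BY g.participant_id
"""

wgs_rows = conn.execute(join_sql).fetchall()
for participant_id, file_path in wgs_rows:
    print(participant_id, file_path)
```

Participant 1003 has no transcriptomics entry in the toy data, so the INNER JOIN drops that row from the result.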

Thank you very much really appreciate that.

Is there a 'dictionary' for all the tables? Their titles are not always intuitive.

There is. The data dictionary is available on the release page for the version of the data that you are using. The latest release will be here: https://re-docs.genomicsengland.co.uk/release19/

Do you support Claude Code?

There are no coding models available within the Research Environment. We also do not allow connections from the Research Environment to external AI models.

Do you have to set up the netrc file every time you log into the RE to use the Labkey API?

Fortunately you do not. This is a one-time setup that allows you to programmatically access the API. Full instructions are available here: https://re-docs.genomicsengland.co.uk/labkey_api_configuration/
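For orientation, a netrc file is a plain-text file of machine, login and password entries that API clients can read to authenticate without prompting. The sketch below writes an example entry to a temporary file and reads it back with Python's standard netrc module. The host name and credentials are placeholders, not the real Research Environment values, which are given in the linked instructions.

```python
import netrc
import os
import tempfile

# Placeholder values for illustration only (assumption): use the host and
# credentials from the Labkey API configuration instructions instead.
example_host = "labkey.example.org"
entry = f"machine {example_host}\nlogin my_username\npassword my_password\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    netrc_path = os.path.join(tmp_dir, "example_netrc")
    with open(netrc_path, "w") as handle:
        handle.write(entry)
    os.chmod(netrc_path, 0o600)  # restrict the file to its owner

    # authenticators() returns a (login, account, password) tuple for the host.
    login, _account, password = netrc.netrc(netrc_path).authenticators(example_host)

print(login)
```

Because the credentials are stored on disk, the file only needs to be written once and is then picked up on later logins.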

I am interested in a specific rare genetic syndrome. How could I find these patients most easily? It does not have a specific ICD-10 code. I started by searching the genetic data, but is there an easier way?

This is a question that has many possible avenues of exploration, so I suggest that it would be better handled within a service desk ticket. Please be aware that while we can provide some general principles, the service desk does not provide training or in-depth consultancy services.

Where should we store data when we exceed the data limit allocated to the home folder?

We would encourage you to work within your domain directories as these are accessible to the HPC.

Are the solved cases the ones with prioritized genes in the tiered data field? Do those fields include the same information?

Solved cases are those that have been returned a diagnosis. This information is included in the GMC exit questionnaire.

Where can I find the code for today's tutorial?

Answered live.