Skip to content

The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

Working with NHS GMS data within LabKey

As of the 15th of June 2022, Genomics England has started the process of releasing consented genomic data from the NHS Genomic Medicine Service (NHS GMS) within the Research Environment. The clinical and other relevant information relating to these participants will be contained within LabKey tables in much the same way that the main programme data is stored. For a summary of the data structure, we recommend that you consult the appropriate page and release notes on the current NHS GMS release.

Differences to 100kGP

Participant IDs within the NHS GMS data use a different format than found within the 100kGP.

Not all the same tables are included in the GMS release as in 100kGP, and many tables and field have been renamed. Please refer to the latest GMS data dictionary for details of the tables and fields.

It is not possible to cross-query the 100K Genomes and NHS GMS data in a single step. If the query you are working with will return data in both datasets you would need to recover these in two separate queries prior to combining the data within R or Python dataframes.

In the 100kGP releases the distinction between the Cancer and Rare Disease cohorts was included in the Participants table. This information can now be found in the referral table using the Clinical Indication Code column. Codes containing 'R' will refer to Rare Disease referrals and 'M' to Cancer referrals.

LabKey APIs

The LabKey interface will allow you to browse and filter the information located in a particular table. While this enables you to gain an insight into the information stored in that table, the major power of the LabKey database for cohort building lies in the ability to cross-query the tables. We would recommend the use of the APIs for this purpose. We support two APIs within the Research Environment, one for Python and one for R.

Examples

The APIs mentioned above will accept queries in a number of formats, however, we would recommend using the execute SQL functions to submit SQL queries of the data in the release. Using SQL will allow you to as more complex questions that will touch data across multiple tables, and will allow the returned data to be processed in a more effective manner.

The R version needed to access the API can be loaded in the Research Environment with the command module load R/4.0.2; on the HPC this command would be module load lang/R/4.0.2-foss-2019b.

The Python version needed to access the API can be loaded in the Research Environment with the command module load python/3.8.1 and in the HPC with the command module load lang/Python/3.7.4-GCCcore-8.3.0.

Import modules/libraries you need

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

1
2
3
library(tidyverse)
library(Rlabkey)
library(readr)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd

Helper function to access the LabKey API

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v18_2023-12-21
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}
def labkey_to_df(sql_query, database, maxrows):

ver = labkey.__version__

if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
  server_context = labkey.utils.create_server_context(
      domain = "labkey-embassy.gel.zone",
      container_path = database,
      context_path = "labkey",
      use_ssl = True
  )

  results =  labkey.query.execute_sql(
      server_context,
      schema_name = "lists",
      sql = sql_query,
      max_rows = maxrows
  )

if ver == '2.4.0':
  from labkey.api_wrapper import APIWrapper

  labkey_server = "labkey-embassy.gel.zone"
  project_name = database
  contextPath = "labkey"
  schema = 'lists'
  api = APIWrapper(
    labkey_server,
    project_name,
    contextPath,
    use_ssl=True
    )
  results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

return(pd.DataFrame(results['rows']))

To run my queries, I'll need to set up my database version:

version <- "nhs-gms/nhs-gms-release_v3_2024-03-18"
version = "nhs-gms/nhs-gms-release_v3_2024-03-18"

Querying the LabKey data

In the following example, we will extract the genomic features for Tier 1 and 2 variants uncovered within participants that present an "intellectual disability" phenotype.

sql_query <- "SELECT
    DISTINCT td.participant_id,
    td.genomic_feature_hgnc,
    td.genotype,
    td.segregation_pattern,
    td.penetrance,
    td.phenotype AS tiered_phenotype,
    pa.phenotype AS panel_phenotype,
    pa.panel_name,
    td.ensembl_id,
    td.tier,
    td.chromosome,
    td.position,
    td.reference,
    td.alternate,
    CONCAT_WS(
      '_',
      td.chromosome,
      td.position,
      td.reference,
      td.alternate
      ) AS tiered_chr_pos_ref_alt,
    e.score,
    e.hgvs
  FROM
    tiering_data td
  JOIN
    panels_applied pa ON
      pa.participant_id = td.participant_id AND
      pa.phenotype = td.phenotype
  JOIN
    exomiser e ON e.participant_id = td.participant_id
  WHERE
    td.tier in ('TIER1', 'TIER2') AND
    LOWER(pa.phenotype) = LOWER('intellectual disability') AND
    (
      td.chromosome = e.chromosome AND
      td.position = e.position AND
      td.reference = e.reference AND
      td.alternate = e.alternate
      );"
sql_query = "SELECT
    DISTINCT td.participant_id,
    td.genomic_feature_hgnc,
    td.genotype,
    td.segregation_pattern,
    td.penetrance,
    td.phenotype AS tiered_phenotype,
    pa.phenotype AS panel_phenotype,
    pa.panel_name,
    td.ensembl_id,
    td.tier,
    td.chromosome,
    td.position,
    td.reference,
    td.alternate,
    CONCAT_WS(
      '_',
      td.chromosome,
      td.position,
      td.reference,
      td.alternate
      ) AS tiered_chr_pos_ref_alt,
    e.score,
    e.hgvs
  FROM
    tiering_data td
  JOIN
    panels_applied pa ON
      pa.participant_id = td.participant_id AND
      pa.phenotype = td.phenotype
  JOIN
    exomiser e ON e.participant_id = td.participant_id
  WHERE
    td.tier in ('TIER1', 'TIER2') AND
    LOWER(pa.phenotype) = LOWER('intellectual disability') AND
    (
      td.chromosome = e.chromosome AND
      td.position = e.position AND
      td.reference = e.reference AND
      td.alternate = e.alternate
      );"

The query above provides examples of:

  • the SELECT, FROM, WHERE principles of the SQL language
  • joining tables in single and multiple columns
  • conditions within lists
  • case-folding
  • compound conditioning
  • output collation

The basics of the SQL syntax can be reviewed from the W3Schools.

For more in-depth reference material and LabKey Specific commands, we would recommend consulting the LabKey SQL guide.

The information in this example can be drawn from three LabKey tables (tiering_data, panels_applied and exomiser) within a single query. As part of this query, we hope to obtain information on the number of participants affected, the genes that have been scored as being in Tier 1 or 2 within this cohort, and the indications of the genomic changes reported by the exomiser analysis.

You can then run the SQL query with to helper function to return a dataframe:

query <- labkey_to_df(sql_query, version, 10000)
query = labkey_to_df(sql_query, version, 10000)