Skip to content

The HPC is changing

We will soon be switching to a new High Performance Cluster, called Double Helix. This will mean that some of the commands you use to connect to the HPC and call modules will change. We will inform you by email when you are switching over, allowing you to make the necessary changes to your scripts. Please check our HPC changeover notes for more details on what will change.

Labkey API

The LabKey API allows you to query the LabKey tables programmatically using SQL queries in a variety of programming languages.

To do very simple queries on the clinical/phenotype data you can use the LabKey application on the Research Environment desktop. However, in order to do more complex analyses across multiple tables, or query very large tables, it is recommended that you use the provided LabKey APIs.

The LabKey client libraries (APIs) provide secure, auditable, programmatic access to LabKey data and services and allow you to:

  • Analyse and visualise data stored in LabKey in a statistical tool such as R or Python
  • Perform routine, automated tasks in a programmatic way.
  • Query and manipulate data in a repeatable and consistent way.

Currently, LabKey supports working with the following programming languages/environments.

  • Rlabkey Package
  • Python API
  • JavaScript API
  • Java API
  • Perl API
  • SAS Macros

We highly recommend using either the Python LabKey API or the R LabKey API to query the Genomics England clinical/phenotype data.

Configuring access to the LabKey APIs

To get access, you will need to set up permissions with your username and password.

Configure access

Writing LabKey API scripts

Here you will find a guide to writing LabKey API scripts in Python and R. Use the toggles to switch between the languages. You can find further information on using the API in Python and R in the main LabKey documentation.

Loading the required version of R or Python

The LabKey API only works with certain versions of R and Python. You will need to load the correct versions to run your scripts, and for Python this differs between the desktop and the HPC, which uses conda environments.

module load R/4.0.2
module load python/3.8.1
source /resources/conda/miniconda3/bin/activate && conda activate py3pypirev2

You can use interactive coding environments to work with the API, please check our further documentation on:

Loading LabKey libraries/modules

You will need to load LabKey in your scripts, notebooks and interactive coding sessions. There are also a few other modules we recommend you load alongside these to help you manipulate the data.

library(tidyverse)
library(Rlabkey)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd

Helper function for using the API for SQL queries

The LabKey API works by enabling SQL queries of the LabKey tables. The following code is a helper function, labkey_to_df that connects to the server, executes an SQL query and returns the results as a dataframe. Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
      }
def labkey_to_df(sql_query, database, maxrows):

ver = labkey.__version__

if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
  server_context = labkey.utils.create_server_context(
      domain = "labkey-embassy.gel.zone",
      container_path = database,
      context_path = "labkey",
      use_ssl = True
  )

  results =  labkey.query.execute_sql(
      server_context,
      schema_name = "lists",
      sql = sql_query,
      max_rows = maxrows
  )

if ver == '2.4.0':
  from labkey.api_wrapper import APIWrapper

  labkey_server = "labkey-embassy.gel.zone"
  project_name = database
  contextPath = "labkey"
  schema = 'lists'
  api = APIWrapper(
    labkey_server,
    project_name,
    contextPath,
    use_ssl=True
    )
  results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

return(pd.DataFrame(results['rows']))

You need to give this function:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v18_2023-12-21
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Here is an example of how you can use this function:

sql <- "SELECT
    p.participant_id,
    p.programme,
    sr.lab_sample_id,
    sr.delivery_date,
  FROM
    participant p
  JOIN
    sequencing_report sr ON p.participant_id = sr.participant_id
  WHERE
    YEAR(sr.delivery_date) BETWEEN 2019 AND 2021;"

database <- "/main-programme/main-programme_v18_2023-12-21"

query <- labkey_to_df(sql, database, 100000)
sql = f"""
  SELECT
    p.participant_id,
    p.programme,
    sr.lab_sample_id,
    sr.delivery_date,
  FROM
    participant p
  JOIN
    sequencing_report sr ON p.participant_id = sr.participant_id
  WHERE
    YEAR(sr.delivery_date) BETWEEN 2019 AND 2021;"""

  database = "/main-programme/main-programme_v18_2023-12-21"

  query = labkey_to_df(sql, database, 100000)

Fetching a whole table

You can also fetch a whole table using the API, without using the helper function. Here is some example code to fetch the sequencing_report.

# Set the baseURL
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

sequencing_report <- labkey.selectRows(                             

    schemaName="lists",           # Do not change this
    colNameOpt="rname",           # Do not change this
    maxRows = 100000000,            # Do not change this
    folderPath="/main-programme/main-programme_v18_2023-12-21",    # This can be changed to different main programme releases    
    queryName="sequencing_report" # This can be changed to different table names

)
# Specify what we are connecting to, and what schema and tables we want  
labkey_server = "labkey-embassy.gel.zone" # The labkey server we are connecting to. This will not change  
project_name = "/main-programme/main-programme_v18_2023-12-21" # The data we want to access. This will change depending on your study  
context_path = "labkey" # This does not change  
schema_name = "lists" # The schema we are getting data from. This does not change  
query_name = "sequencing_report" # The table we want to get data from. This does change  

# Create an object that will let us connect to the LabKey databases. This does not change.  
server_context = labkey.utils.create_server_context(  
labkey_server, project_name, context_path, use_ssl=True  
)  

# The data are returned and stored in the variable results.  
results = labkey.query.select_rows(server_context, schema_name, query_name, max_rows=200000)  

# Data are returned as a dictionary, will all of the table information stored under the key "rows".  
# We make a dataframe of all of the table information using pandas.  
table_of_data = pd.DataFrame(results["rows"])

Fetching part of a single table

The following code outlines how to filter a table by columns and/or rows in order to return a subset of a table, selecting specific columns (colSelect) and filtering for specified row values (colFilter). This does not allow you to combine this with another table, so in most cases using an SQL query is preferable.

labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

participant <- labkey.selectRows(

    # Default selection
    schemaName="lists",       # Do not change this
    colNameOpt="rname",       # Do not change this
    maxRows = 100000000,        # Do not change this
    folderPath="/main-programme/main-programme_v18_2023-12-21",    # This can be changed to different main programme releases
    queryName="participant",  # This can be changed to different table names

    # Additional parameters
    colFilter=makeFilter(     # Make various filters to subset rows
      c("participant_type", "EQUAL", "Proband"),
      c("year_of_birth", "BETWEEN", "1990,2010")),
    colSelect=c("participant_id", "rare_diseases_family_id", "year_of_birth"),     # Choose to only select these columns
    colSort=("+year_of_birth") # Sort by these columns
# Specify what we are connecting to, and what schema and tables we want  
labkey_server = "labkey-embassy.gel.zone" # The labkey server we are connecting to. This will not change  
project_name = "/main-programme/main-programme_v18_2023-12-21" # The data we want to access. This will change depending on your study  
context_path = "labkey" # This does not change  
schema_name = "lists" # The schema we are getting data from. This does not change  

query_name = "participant" # The table we want to get data from. This does change  

columns = "participant_id, rare_diseases_family_id, year_of_birth"  

filter1 = labkey.query.QueryFilter("participant_type", "Proband")  
filter2 = labkey.query.QueryFilter("year_of_birth", "1990,2010", labkey.query.QueryFilter.Types.BETWEEN) # Return only reports with year of birth included in a range.

combined_filter = [filter1, filter2] # Combine the previous two filters.  

# Create an object that will let us connect to the LabKey databases. This does not change.  
server_context = labkey.utils.create_server_context(  
labkey_server, project_name, context_path, use_ssl=True  
)  

# Fetch and store only the columns that we want.  
selected_columns = labkey.query.select_rows(  
server_context, schema_name, query_name,  
columns = columns  
)  

# Filters can be used to fetch only the rows that we are interested in.  
selected_rows1 = labkey.query.select_rows(  
server_context, schema_name, query_name,  
filter_array = [filter1]  
)  
selected_rows2 = labkey.query.select_rows(#  
server_context, schema_name, query_name,  
filter_array = [filter2]  
)  

# Filters can also be combined to with each other and with column selection to do more powerful selection of data from a table.  
selected_data = labkey.query.select_rows(  
server_context, schema_name, query_name,  
filter_array = combined_filter,  
columns = columns  
)  

# Data can then be stored in a dataframe as normal.  
table_of_data = pd.DataFrame(selected_data["rows"])

Video tutorials