Skip to content

Accessing mental health data programmatically

Give us feedback on this tutorial

The advantage of querying medical history programmatically is that the underlying tables are exposed, allowing you to verify the data in bulk. This also allows you to do very complex queries. You can save and tweak scripts to re-use in new releases or for similar queries.

There are a huge number of tables that you can use to query mental health history. This tutorial will take you through some of the tables and give you some example scripts that you can use in python or R to access the data.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n query-tables r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y
conda activate query-tables

You can now launch an R notebook under the query-tables kernel. More details

Contents:

Import modules/libraries you need

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

library(tidyverse)
library(Rlabkey)

If you are using R version 3 or above, you should also add the following line:

labkey.setWafEncoding(FALSE)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd
1
2
3
4
library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)

Helper function to access the LabKey API

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}
    def labkey_to_df(sql_query, database, maxrows):

    ver = labkey.__version__

    if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
      server_context = labkey.utils.create_server_context(
          domain = "labkey-embassy.gel.zone",
          container_path = database,
          context_path = "labkey",
          use_ssl = True
      )

      results =  labkey.query.execute_sql(
          server_context,
          schema_name = "lists",
          sql = sql_query,
          max_rows = maxrows
      )

    if ver == '2.4.0':
      from labkey.api_wrapper import APIWrapper

      labkey_server = "labkey-embassy.gel.zone"
      project_name = database
      contextPath = "labkey"
      schema = 'lists'
      api = APIWrapper(
        labkey_server,
        project_name,
        contextPath,
        use_ssl=True
        )
      results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

    return(pd.DataFrame(results['rows']))
labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", database, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

To run my queries, I'll need to set up my database version:

version <- "/main-programme/main-programme_v19_2024-10-31"
version = "/main-programme/main-programme_v19_2024-10-31"
version <- "source_data_100kv16_covidv4"

Accessing mental health mhmd and mhldds tables

There are three sets of tables that cover different time periods:

  • mhmd (Mental Health Minimum Dataset) – 2011-2014
  • mhldds (Mental Health Learning Disabilities Dataset) – 2014-2016
  • mhsds (Mental Health Services Dataset) – 2016 onwards

mhmd and mhldds both follow similar schema, with record, event and episode tables, where a record is a superset of episodes which is a superset of events, linked by a spell_id. We'll start by querying the mhmd tables to find all details for our participant of interest.

In this and all following code we will refer to a string mh_part, which is the participant_id of the participant of interest. Assume that you have already defined this string in your code.

mhmd_sql <- paste("SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhmd_v4_record as r
    LEFT JOIN mhmd_v4_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhmd_v4_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = ", mh_part,
    sep = "")

mhmd_query <- labkey_to_df(mhmd_sql, version, 300000)
mhmd_sql = (f'''
    SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhmd_v4_record as r
    LEFT JOIN mhmd_v4_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhmd_v4_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = {mh_part}
    ''')
mhmd_query = labkey_to_df(mhmd_sql, version, 300000)
mhmd_sql <- paste("SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhmd_v4_record as r
    LEFT JOIN mhmd_v4_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhmd_v4_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = ", mh_part,
    sep = "")

mhmd_query <- labkey_to_df(mhmd_sql, version, 100000)

We can use a very similar query with mhldds, as most of the columns are the same. You may find you need to change the column selections for certain data types.

mhldds_sql <- paste("SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhldds_record as r
    LEFT JOIN mhldds_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhldds_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = ", mh_part,
    sep = "")

mhldds_query <- labkey_to_df(mhldds_sql, version, 50000)
mhldds_sql = (f'''
    SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhldds_record as r
    LEFT JOIN mhldds_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhldds_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = {mh_part}
    ''')
mhldds_query = labkey_to_df(mhldds_sql, version, 50000)
mhldds_sql <- paste("SELECT r.participant_id, r.mhd_mhmds_spell_id,
    ep.mhd_epitype, ep.mhd_epistart_date, ep.mhd_epiend_date,
    ev.mhd_eventtype, ev.mhd_eventdate
    FROM mhldds_record as r
    LEFT JOIN mhldds_episode as ep
        ON r.mhd_mhmds_spell_id = ep.mhd_mhmds_spell_id
    LEFT JOIN mhldds_event as ev
        ON r.mhd_mhmds_spell_id = ev.mhd_mhmds_spell_id
    WHERE r.participant_id = ", mh_part,
    sep = "")

mhldds_query <- labkey_to_df(mhldds_sql, version, 100000)

Accessing mental health mhsds tables

To query the more recent mhsds records we have to use a completely different method. We will query the curated tables, which are tied together by the recordnumber.

mhsds_sql <- paste("SELECT p.participant_id, p.recordnumber,
    p.startdategmpregistration, p.enddategmpregistration,
    i.startdatewardstay, i.enddatewardstay, i.treatfunccodemh,
    c.refertotreatperiodstartdate, c.refertotreatperiodenddate,
    a.startdatecareclust, a.enddatecareclust
    FROM mhsds_curated_participant as p
    FULL OUTER JOIN mhsds_curated_inpatient as i
        ON p.recordnumber = i.recordnumber
    FULL OUTER JOIN mhsds_curated_community as c
        ON p.recordnumber = c.recordnumber
    FULL OUTER JOIN  mhsds_curated_assessment_diagnoses_and_cluster as a
        ON p.recordnumber = a.recordnumber
    WHERE p.participant_id = ", mh_part,
    sep = "")

mhsds_query <- labkey_to_df(mhsds_sql, version, 100)
mhsds_sql = (f'''
    SELECT p.participant_id, p.recordnumber,
    p.startdategmpregistration, p.enddategmpregistration,
    i.startdatewardstay, i.enddatewardstay, i.treatfunccodemh,
    c.refertotreatperiodstartdate, c.refertotreatperiodenddate,
    a.startdatecareclust, a.enddatecareclust
    FROM mhsds_curated_participant as p
    FULL OUTER JOIN mhsds_curated_inpatient as i
        ON p.recordnumber = i.recordnumber
    FULL OUTER JOIN mhsds_curated_community as c
        ON p.recordnumber = c.recordnumber
    FULL OUTER JOIN  mhsds_curated_assessment_diagnoses_and_cluster as a
        ON p.recordnumber = a.recordnumber
    WHERE p.participant_id = {mh_part}
    ''')

mhsds_query = labkey_to_df(mhsds_sql, version, 100)
mhsds_sql <- paste("SELECT p.participant_id, p.recordnumber,
    p.startdategmpregistration, p.enddategmpregistration,
    i.startdatewardstay, i.enddatewardstay, i.treatfunccodemh,
    c.refertotreatperiodstartdate, c.refertotreatperiodenddate,
    a.startdatecareclust, a.enddatecareclust
    FROM mhsds_curated_participant as p
    FULL OUTER JOIN mhsds_curated_inpatient as i
        ON p.recordnumber = i.recordnumber
    FULL OUTER JOIN mhsds_curated_community as c
        ON p.recordnumber = c.recordnumber
    FULL OUTER JOIN  mhsds_curated_assessment_diagnoses_and_cluster as a
        ON p.recordnumber = a.recordnumber
    WHERE p.participant_id = ", mh_part,
    sep = "")

mhsds_query <- labkey_to_df(mhsds_sql, version, 100000)