Skip to content

Accessing medical history data programmatically

Give us feedback on this tutorial

There are a huge number of tables that you can use to query medical history. This tutorial will take you through some of the tables and give you some example scripts that you can use in python or R to access the data, or using R in an interactive session on CloudOS, to access the data.

The advantage of querying medical history programmatically is that the underlying tables are exposed, allowing you to verify the data in bulk. This also allows you to do very complex queries. You can save and tweak scripts to re-use in new releases or for similar queries.

Tables to query

To access medical history, you might query the following tables:

Table Full name Details
hes_apc Hospital episode statistics - Admitted patient care Details of any overnight stays in hospital
hes_op Hospital episode statistics - Outpatients Details of any planned day appointments in hospital
hes_ae Hospital episode statistics - Accident and emergency Details of unplanned walk-in/ambulance visits to hospital
hes_cc Hospital episode statistics - Critical care Details of any time spent on life support
ecds Emergency care dataset Details of unplanned walk-in/ambulance visits to hospital
did Diagnostic imaging metadata Details of any medical imaging

The tables for medical history are extensive and were designed for medical record keeping, not for scientific research. They may have a lot of fields not relevant to research, or that are confusing. We recommend referring to the latest data dictionary for full details of these tables.

For your convenience, concatenations of some of the diagnosis and treatment columns, can make them easier to query. All the example code makes use of these concatenated columns, rather than the source data.

Diagnosis and treatment columns
Table Concatenated Column Name Source Columns Description Codes used
ecds diagnosis_code_all diagnosis_code_01 - diagnosis_code_12 Diagnoses SNOMED CT
ecds diagnosis_qualifier_all diagnosis_qualifier_01 - diagnosis_qualifier_12 Diagnosis qualifiers SNOMNED CT
ecds investigation_code_all investigation_code_01 - investigation_code_12 Investigations SNOMED CT
ecds treatment_code_all treatment_code_01 - treatment_code_12 Treatments SNOMED CT
hes_apc diag_all diag_01 - diag_20 Diagnoses ICD-10
hes_apc opertn_all opertn_01 - opertn_24 Treatments OPCS
hes_ae diag_all diag_01 - diag_12 Diagnoses ICD-10
hes_ae diag2_all diag2_01 - diag2_12 Diagnoses in two characters Refer to the data dictionary
hes_ae diaga_all diaga_01 - diaga_12 Diagnosis anatomical area Refer to the data dictionary
hes_ae diags_all diags_01 - diags_12 Diagnosis anatomical site Left, Right, Bilateral, 8 not applicable
hes_ae invest_all invest_01 - invest_12 Investigations 6an
hes_ae invest2_all invest2_01 - invest2_12 Investigations Refer to the data dictionary
hes_ae treat2_all treat2_01 - treat2_12 Treatment two-letter codes Refer to the data dictionary
hes_ae treat_all treat_01 - treat_12 Treatments 6an
hes_op diag_all diag_01 - diag_12 Diagnoses ICD-10
hes_op opertn_all opertn_01 - opertn_24 Treatments OPCS
Links between tables

Some of the tables have fields which link them together, allowing you to see how medical events relate to each other. However many of the tables can only be linked through participant IDs and corresponding dates.

Tables Column(s) linking them Details
hes_apc and hes_ae aekey and epikey Where an accident and emergency visit results in the participant being admitted to an overnight stay
hes_apc and hes_cc susrecid Where a participant spends some time on life-support during an overnight stay
Dates in tables

Many of the tables have multiple date fields, signifying different events. We have highlighted that fields we think are the most useful.

Table Field Meaning Most useful date in the table?
did did_date1 The date of referral
did did_date2 The date the referral was received
did did_date3 The date of imaging yes
did did_date4 The date of the report
hes_ae appdate Record updates
hes_ae arrivaldate the date of arrival yes
hes_ae cdsextdate Record updates
hes_ae subdate Record updates
hes_ae suslddate Record updates
hes_apc apcend_1 - apcend_9 End date of periods of augmented care
hes_apc apcstar_1 - apcstar_9 Start date of periods of augmented care
hes_apc admidate Date of admission yes
hes_apc anasdate Date of first antenatal assessment
hes_apc cdsextdate Record updates
hes_apc disdate Date of discharge yes
hes_apc disreadydate Date the participant was medically ready for discharge
hes_apc elecdate Date of decision to admit
hes_apc epiend Date of end of care under a particular consultant
hes_apc epistart Date of start of care under a particular consultant
hes_apc opdate_01 - opdate_24 Date of corresponding numbered treatment in columns oper_01 - oper_24
hes_apc rtt_per_end End date of a referral treatment period
hes_apc rtt_per_start Start date of a referral treatment period
hes_apc subdate Record updates
hes_apc suslddate Record updates
hes_cc ccdisdate Date of discharge yes
hes_cc ccstartdate Date of start yes
hes_op apptdate Date of appointment yes
hes_op dnadate Date of missed appointment
hes_op reqdate Date referral received
hes_op rtt_per_end End date of a referral treatment period
hes_op rtt_per_start Start date of a referral treatment period
hes_op subdate Record updates
hes_op suslddate Record updates
ecds arrival_date Date of arrival yes
ecds assessment_date Date of assessment yes
ecds conclusion_date Date of conclusion of treatment in accident and emergency
ecds departure_date Date of leaving accident and emergency yes
ecds injury_date Date of injury yes
ecds investigation_date_01 - investigation_date_12 Date of corresponding numbered investigation in columns investigation_code_01 - investigation_code_12
ecds period_end_date End date of referral to treatment
ecds period_start_date Start date of referral to treatment
ecds referral_assessment_date_1 - referral_assessment_date_4 Date of corresponding numbered referral in columns referred_to_service_1 - referred_to_service_4
ecds service_request_date_1 - service_request_date_4 Date of corresponding numbered referral in columns referred_to_service_1 - referred_to_service_4
ecds treatment_date_01 - treatment_date_10 Date of corresponding numbered treatment in columns treatment_code_01 - treatment_code_10

Setting up access

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n query-tables r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y
conda activate query-tables

You can now launch an R notebook under the query-tables kernel. More details

Contents:

Import modules/libraries you need

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

library(tidyverse)
library(Rlabkey)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd
1
2
3
4
library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)

Helper function to access the LabKey API

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v18_2023-12-21
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}
def labkey_to_df(sql_query, database, maxrows):

ver = labkey.__version__

if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
  server_context = labkey.utils.create_server_context(
      domain = "labkey-embassy.gel.zone",
      container_path = database,
      context_path = "labkey",
      use_ssl = True
  )

  results =  labkey.query.execute_sql(
      server_context,
      schema_name = "lists",
      sql = sql_query,
      max_rows = maxrows
  )

if ver == '2.4.0':
  from labkey.api_wrapper import APIWrapper

  labkey_server = "labkey-embassy.gel.zone"
  project_name = database
  contextPath = "labkey"
  schema = 'lists'
  api = APIWrapper(
    labkey_server,
    project_name,
    contextPath,
    use_ssl=True
    )
  results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

return(pd.DataFrame(results['rows']))
labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", database, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

To run my queries, I'll need to set up my database version:

version <- "/main-programme/main-programme_v18_2023-12-21"
version = "/main-programme/main-programme_v18_2023-12-21"
version <- "source_data_100kv16_covidv4"

tidyup function for making diagnosis and treatment columns more readable

Many of the queries presented in this tutorial will pull up columns of concatenated data, such as diag_all and opertn_all. These are presented in the form:

code1|code2|code3||||||||

In this example above, code1 etc represent an ICD10 or opcs code. Each vertical pipe | represents one of the columns that were concatenated together; where these columns were empty, there are no spaces between the pipes.

The following function uses a regex to tidy up these to comma separated values, deleting any empty fields. It takes a dataframe as input and returns the tidied-up dataframe. You can use this any time you want to view the output of these analyses.

1
2
3
4
5
tidyup <- function(df){
  df[] <- lapply(df, gsub, pattern='\\|{2,}', replacement='')
  df[] <- lapply(df, gsub, pattern='\\|', replacement=', ')
  data.frame(df)
}
1
2
3
def tidyup(dataframe):
    dataframe.replace('\|\|+', '', regex=True, inplace = True)
    dataframe.replace('\|', ', ', regex=True, inplace = True)
1
2
3
4
5
tidyup <- function(df){
  df[] <- lapply(df, gsub, pattern='\\|{2,}', replacement='')
  df[] <- lapply(df, gsub, pattern='\\|', replacement=', ')
  data.frame(df)
}

Get details from a single hes_* table or ecds for a participant

We're going to find all diagnoses for an individual from a hes_* table. In this case we're going to query hes_ae. If you want to query hes_apc or hes_op, you would need to change the date field to one(s) available in those tables; you might also want to add further diagnosis and treatment columns.

In this and all following code we will refer to a string part_id, which is the participant_id of the participant of interest. Assume that you have already defined this string in your code.

1
2
3
4
5
hes_sql <- paste("SELECT participant_id, arrivaldate, diag_all  
    FROM hes_ae
    WHERE participant_id = ", part_id, sep="")

hes_query <- labkey_to_df(hes_sql, version, 1000)
1
2
3
4
5
6
7
8
hes_sql = (f'''
    SELECT participant_id, arrivaldate, diag_all
    FROM hes_ae
    WHERE participant_id = {part_id}
    ''')

hes_query = labkey_to_df(hes_sql, version, 1000)
hes_query
1
2
3
4
5
hes_sql <- paste("SELECT participant_id, arrivaldate, diag_all  
    FROM hes_ae
    WHERE participant_id = ", part_id, sep="")

hes_query <- labkey_to_df(hes_sql, version, 100000)

The table ecds covers emergency admissions after 2017. The ecds table has different column headings, so we have to alter our queries accordingly. The following query gets all the diagnoses, treatments and investigations for a participant, with the date of arrival.

1
2
3
4
5
ecds_sql <- paste("SELECT participant_id, seen_date, diagnosis_code_all, investigation_code_all, treatment_code_all ",
    "FROM ecds ",
    "WHERE participant_id = ", part_id, sep="")

ecds_query <- labkey_to_df(ecds_sql, version, 1000)
1
2
3
4
5
6
7
ecds_sql = (f'''
    SELECT participant_id, seen_date, diagnosis_code_all, investigation_code_all, treatment_code_all
    FROM ecds
    WHERE participant_id = {part_id}
    ''')

ecds_query = labkey_to_df(ecds_sql, version, 1000)
1
2
3
4
5
ecds_sql <- paste("SELECT participant_id, seen_date, diagnosis_code_all, investigation_code_all, treatment_code_all ",
    "FROM ecds ",
    "WHERE participant_id = ", part_id, sep="")

ecds_query <- labkey_to_df(ecds_sql, version, 100000)

Match accident and emergency episodes to overnight admissions

Where accident and emergency admissions resulted in the participant being admitted for overnight care, the entries in the hes_ae and hes_apc tables are linked by the aekey and epikey. The following query finds all diagnoses and operations the participant received in both A&E and after admission to overnight care. Note that in this code we rename the column headers, since many of these are the same between hes_ae and hes_apc.

ae_apc_sql <- paste("SELECT ae.participant_id,
  ae.arrivaldate as ae_arrival,
  ae.diag_all as ae_diag,
  apc.admidate as apc_arrival,
  apc.disdate as apc_discharge,
  apc.opertn_all as apc_operation,
  apc.diag_all as apc_diag
    FROM hes_ae as ae, hes_apc as apc
    WHERE ae.epikey = apc.epikey
    AND ae.participant_id = ", part_id, sep = "")

ae_apc_query <- labkey_to_df(ae_apc_sql, version, 1000)
ae_apc_sql = (f'''
    SELECT ae.participant_id,
    ae.arrivaldate as ae_arrival,
    ae.diag_all as ae_diag,
    apc.admidate as apc_arrival,
    apc.disdate as apc_discharge,
    apc.opertn_all as apc_operation,
    apc.diag_all as apc_diag
    FROM hes_ae as ae, hes_apc as apc
    WHERE ae.epikey = apc.epikey
    AND ae.participant_id =  {part_id}
    ''')

ae_apc_query = labkey_to_df(ae_apc_sql, version, 1000)
ae_apc_sql <- paste("SELECT ae.participant_id,
  ae.arrivaldate as ae_arrival,
  ae.diag_all as ae_diag,
  apc.admidate as apc_arrival,
  apc.disdate as apc_discharge,
  apc.opertn_all as apc_operation,
  apc.diag_all as apc_diag
    FROM hes_ae as ae, hes_apc as apc
    WHERE ae.epikey = apc.epikey
    AND ae.participant_id = ", part_id, sep = "")

ae_apc_query <- labkey_to_df(ae_apc_sql, version, 100000)

Find critical care periods within an overnight stay

Critical care periods in the hes_cc table all fall within periods of overnight care in hes_apc. These are linked together by the susrecid column.

apc_cc_sql <- paste("SELECT apc.participant_id,
  apc.admidate as apc_arrival,
  apc.disdate as apc_discharge,
  apc.opertn_all as apc_operation,
  apc.diag_all as apc_diag,
  cc.ccstartdate as cc_start,
  cc.ccdisdate as cc_discharge
    FROM hes_apc as apc, hes_cc as cc
    WHERE apc.susrecid = cc.susrecid
    AND apc.participant_id = ", part_id,
  "ORDER BY apc.admidate", sep = "")

apc_cc_query <- labkey_to_df(apc_cc_sql, version, 1000)
apc_cc_sql = (f'''
    SELECT apc.participant_id,
    apc.admidate as apc_arrival,
    apc.disdate as apc_discharge,
    apc.opertn_all as apc_operation,
    apc.diag_all as apc_diag,
    cc.ccstartdate as cc_start,
    cc.ccdisdate as cc_discharge
    FROM hes_apc as apc, hes_cc as cc
    WHERE apc.susrecid = cc.susrecid
    AND apc.participant_id = {part_id}
    ORDER BY apc.admidate
    ''')

apc_cc_query = labkey_to_df(apc_cc_sql, version, 1000)
apc_cc_sql <- paste("SELECT apc.participant_id,
  apc.admidate as apc_arrival,
  apc.disdate as apc_discharge,
  apc.opertn_all as apc_operation,
  apc.diag_all as apc_diag,
  cc.ccstartdate as cc_start,
  cc.ccdisdate as cc_discharge
    FROM hes_apc as apc, hes_cc as cc
    WHERE apc.susrecid = cc.susrecid
    AND apc.participant_id = ", part_id,
  "ORDER BY apc.admidate", sep = "")

apc_cc_query <- labkey_to_df(apc_cc_sql, version, 100000)

Pull out time periods from medical history tables

Other tables, such as the hes_op table of Outpatients day appointments at a hospital, and did table of diagnostic imaging, do not have explicit links to other tables. These can only be linked to other medical history tables by the dates of events.

The following code pulls out the start and end dates of periods of admitted patient care where a diagnosis was made (in this case R06, abnormalities of breathing). We will pull dates out of the table apc_cc_query that we created above.

1
2
3
4
5
6
7
8
9
diag <- 'R06'
episodes <- apc_cc_query %>% filter(grepl(diag, apc_diag))

for(i in 1:nrow(episodes)) {
  start_date <- episodes[i, "apc_arrival"]
  end_date <- episodes[i, "apc_discharge"]

  # carry out a query using the start and end date
}  
1
2
3
4
5
6
7
8
diag = 'R06'
episodes = apc_cc_query[apc_cc_query['apc_diag'].str.contains(diag)]

for i in range(len(episodes)):
  start_date = episodes.iloc[i]['apc_arrival']
  end_date = episodes.iloc[i]['apc_discharge']

  # carry out a query using the start and end date
1
2
3
4
5
6
7
8
9
diag <- 'R06'
episodes <- apc_cc_query %>% filter(grepl(diag, apc_diag))

for(i in 1:nrow(episodes)) {
  start_date <- episodes[i, "apc_arrival"]
  end_date <- episodes[i, "apc_discharge"]

  # carry out a query using the start and end date
}

Alternatively, you may wish to take a start or end date from one event and find all other events within a period of time before or after that event.

1
2
3
4
5
6
7
8
9
diag <- 'R06'
episodes <- apc_cc_query %>% filter(grepl(diag, apc_diag))

for(i in 1:nrow(episodes)) {
  start_date <- episodes[i, "apc_discharge"]
  end_date <- as.Date(start_date) + 90

  # carry out a query using the start and end date
}  
1
2
3
4
5
6
7
8
diag = 'R06'
episodes = apc_cc_query[apc_cc_query['apc_diag'].str.contains(diag)]

for i in range(len(episodes)):
  start_date = episodes.iloc[i]['discharge']
  end_date = datetime.strptime(start_date, '%Y-%m-%d') + pd.DateOffset(days = 90)

  # carry out a query using the start and end date
1
2
3
4
5
6
7
8
9
diag <- 'R06'
episodes <- apc_cc_query %>% filter(grepl(diag, apc_diag))

for(i in 1:nrow(episodes)) {
  start_date <- episodes[i, "apc_discharge"]
  end_date <- as.Date(start_date) + 90

  # carry out a query using the start and end date
}

Find events and episodes within a time period

We can use the dates we have pulled out to find all events in other tables that occurred within that time period. In the following query, we will query the did table of diagnostic imaging for rows where the imaging date did_date3 falls within the hospital stay.

To match up rows from did to rows from our hes_* tables, we will create our own index called did_index.

row_counter <- 1
did_combined <- data.frame()

for(i in 1:nrow(episodes)) {

  start_date <- episodes[i, "apc_arrival"]
  end_date <- episodes[i, "apc_discharge"]

  episodes$did_index[i] <- row_counter

  did_sql <- paste("SELECT participant_id,
    did_date3, ic_region_desc, ic_snomedct_desc, ic_sub_sys_desc, ic_nicip_desc
    FROM did
    WHERE participant_id = ", part_id,
    " AND did_date3 >= '", start_date,
    "' AND did_date3 <= '", end_date,
    "'", sep = "")

  did_query <- labkey_to_df(did_sql, version, 1000)
  did_query$did_index <- row_counter
  did_combined <- rbind(did_combined, did_query)
  row_counter <- row_counter + 1
}

episodes <- merge(episodes, did_combined, by = c("did_index", "participant_id"))
row_counter = 1
did_combined = pd.DataFrame()

for i in range(len(episodes)):
    start_date = episodes.iloc[i]['apc_arrival']
    end_date = episodes.iloc[i]['apc_discharge']

    episodes.loc[episodes.index[i], 'did_index'] = row_counter

    did_sql = (f'''
        SELECT participant_id,
        did_date3, ic_region_desc, ic_snomedct_desc, ic_sub_sys_desc, ic_nicip_desc
        FROM did
        WHERE participant_id = {part_id}
        AND did_date3 >= '{start_date}'
        AND did_date3 <= '{end_date}'
        ''')

    did_query = labkey_to_df(did_sql, version, 1000)
    did_query['did_index'] = row_counter
    did_combined = pd.concat([did_query, did_combined])
    row_counter += 1

episodes = pd.merge(episodes, did_combined, on = ['did_index', 'participant_id'])
row_counter <- 1
did_combined <- data.frame()

for(i in 1:nrow(episodes)) {

  start_date <- episodes[i, "apc_arrival"]
  end_date <- episodes[i, "apc_discharge"]

  episodes$did_index[i] <- row_counter

  did_sql <- paste("SELECT participant_id,
    did_date3, ic_region_desc, ic_snomedct_desc, ic_sub_sys_desc, ic_nicip_desc
    FROM did
    WHERE participant_id = ", part_id,
    " AND did_date3 >= '", start_date,
    "' AND did_date3 <= '", end_date,
    "'", sep = "")

  did_query <- labkey_to_df(did_sql, version, 100000)
  did_query$did_index <- row_counter
  did_combined <- rbind(did_combined, did_query)
  row_counter <- row_counter + 1
}

episodes <- merge(episodes, did_combined, by = c("did_index", "participant_id"))