The LabKey API allows you to query the LabKey tables programmatically using SQL queries in a variety of programming languages.
To do very simple queries on the clinical/phenotype data you can use the LabKey application on the Research Environment desktop. However, in order to do more complex analyses across multiple tables, or query very large tables, it is recommended that you use the provided LabKey APIs.
The LabKey client libraries (APIs) provide secure, auditable, programmatic access to LabKey data and services and allow you to:
Analyse and visualise data stored in LabKey in a statistical tool such as R or Python
Perform routine, automated tasks in a programmatic way.
Query and manipulate data in a repeatable and consistent way.
Currently, LabKey supports working with the following programming languages/environments.
Rlabkey Package
Python API
JavaScript API
Java API
Perl API
SAS Macros
We highly recommend using either the Python LabKey API or the R LabKey API to query the Genomics England clinical/phenotype data.
Here you will find a guide to writing LabKey API scripts in Python and R. Use the toggles to switch between the languages. You can find further information on using the API in Python and R in the main LabKey documentation.
The LabKey API only works with certain versions of R and Python. You will need to load the correct versions to run your scripts, and for Python this differs between the desktop and the HPC, which uses conda environments.
You will need to load LabKey in your scripts, notebooks and interactive coding sessions. There are also a few other modules we recommend you load alongside these to help you manipulate the data.
Helper function for using the API for SQL queries¶
The LabKey API works by enabling SQL queries of the LabKey tables. The following code is a helper function, labkey_to_df that connects to the server, executes an SQL query and returns the results as a dataframe. Feel free to include this function in all your scripts and invoke it every time you want to query the data.
sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v18_2023-12-21
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.
Here is an example of how you can use this function:
sql<-"SELECT p.participant_id, p.programme, sr.lab_sample_id, sr.delivery_date, FROM participant p JOIN sequencing_report sr ON p.participant_id = sr.participant_id WHERE YEAR(sr.delivery_date) BETWEEN 2019 AND 2021;"database<-"/main-programme/main-programme_v18_2023-12-21"query<-labkey_to_df(sql,database,100000)
sql=f""" SELECT p.participant_id, p.programme, sr.lab_sample_id, sr.delivery_date, FROM participant p JOIN sequencing_report sr ON p.participant_id = sr.participant_id WHERE YEAR(sr.delivery_date) BETWEEN 2019 AND 2021;"""database="/main-programme/main-programme_v18_2023-12-21"query=labkey_to_df(sql,database,100000)
# Set the baseURLlabkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")sequencing_report<-labkey.selectRows(schemaName="lists",# Do not change thiscolNameOpt="rname",# Do not change thismaxRows=100000000,# Do not change thisfolderPath="/main-programme/main-programme_v18_2023-12-21",# This can be changed to different main programme releases queryName="sequencing_report"# This can be changed to different table names)
# Specify what we are connecting to, and what schema and tables we want labkey_server="labkey-embassy.gel.zone"# The labkey server we are connecting to. This will not change project_name="/main-programme/main-programme_v18_2023-12-21"# The data we want to access. This will change depending on your study context_path="labkey"# This does not change schema_name="lists"# The schema we are getting data from. This does not change query_name="sequencing_report"# The table we want to get data from. This does change # Create an object that will let us connect to the LabKey databases. This does not change. server_context=labkey.utils.create_server_context(labkey_server,project_name,context_path,use_ssl=True)# The data are returned and stored in the variable results. results=labkey.query.select_rows(server_context,schema_name,query_name,max_rows=200000)# Data are returned as a dictionary, will all of the table information stored under the key "rows". # We make a dataframe of all of the table information using pandas. table_of_data=pd.DataFrame(results["rows"])
The following code outlines how to filter a table by columns and/or rows in order to return a subset of a table, selecting specific columns (colSelect) and filtering for specified row values (colFilter). This does not allow you to combine this with another table, so in most cases using an SQL query is preferable.
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")participant<-labkey.selectRows(# Default selectionschemaName="lists",# Do not change thiscolNameOpt="rname",# Do not change thismaxRows=100000000,# Do not change thisfolderPath="/main-programme/main-programme_v18_2023-12-21",# This can be changed to different main programme releasesqueryName="participant",# This can be changed to different table names# Additional parameterscolFilter=makeFilter(# Make various filters to subset rowsc("participant_type","EQUAL","Proband"),c("year_of_birth","BETWEEN","1990,2010")),colSelect=c("participant_id","rare_diseases_family_id","year_of_birth"),# Choose to only select these columnscolSort=("+year_of_birth")# Sort by these columns
# Specify what we are connecting to, and what schema and tables we want labkey_server="labkey-embassy.gel.zone"# The labkey server we are connecting to. This will not change project_name="/main-programme/main-programme_v18_2023-12-21"# The data we want to access. This will change depending on your study context_path="labkey"# This does not change schema_name="lists"# The schema we are getting data from. This does not change query_name="participant"# The table we want to get data from. This does change columns="participant_id, rare_diseases_family_id, year_of_birth"filter1=labkey.query.QueryFilter("participant_type","Proband")filter2=labkey.query.QueryFilter("year_of_birth","1990,2010",labkey.query.QueryFilter.Types.BETWEEN)# Return only reports with year of birth included in a range.combined_filter=[filter1,filter2]# Combine the previous two filters. # Create an object that will let us connect to the LabKey databases. This does not change. server_context=labkey.utils.create_server_context(labkey_server,project_name,context_path,use_ssl=True)# Fetch and store only the columns that we want. selected_columns=labkey.query.select_rows(server_context,schema_name,query_name,columns=columns)# Filters can be used to fetch only the rows that we are interested in. selected_rows1=labkey.query.select_rows(server_context,schema_name,query_name,filter_array=[filter1])selected_rows2=labkey.query.select_rows(# server_context,schema_name,query_name,filter_array=[filter2])# Filters can also be combined to with each other and with column selection to do more powerful selection of data from a table. selected_data=labkey.query.select_rows(server_context,schema_name,query_name,filter_array=combined_filter,columns=columns)# Data can then be stored in a dataframe as normal. table_of_data=pd.DataFrame(selected_data["rows"])
You must filter participants in your analysis to ensure they all have active consent for research. To do this, use the programme_consent_status column of the participant table in the 100kGP. Here is example SQL using a participant_id list called list.
consent_sql<-paste("SELECT participant_id, programme_consent_status FROM participant WHERE programme_consent_status = 'Consenting' AND participant_id IN (",list,")",sep="")
consent_sql=(f''' SELECT participant_id, programme_consent_status FROM participant WHERE programme_consent_status = 'Consenting' AND participant_id IN {list}''')