AggV2 code book - phenotype queries¶

Details of all the samples included in aggV2 are listed in the LabKey table aggregate_gvcf_sample_stats. You can use this table to link the samples to their phenotypes in other LabKey tables, using either the LabKey desktop application and the APIs.

Overview¶

The LabKey table aggregate_gvcf_sample_stats, containing sample-level information on all participants within aggV2, can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs for more information.

The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.

Examples of how to query the aggregate_gvcf_sample_stats table and integrate with the phenotypic/clinical data of these participants using the R LabKey API (Rlabkey package) are listed below. We make use of the SQL-like syntax to query the tables. You can use RStudio or R in the terminal to run these examples and to do further downstream processing of the data.

aggregate_gvcf_sample_stats

The sample IDs in the aggV2 VCFs are the 'plate-keys' (e.g. LP3000448-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the aggregate_gvcf_sample_stats table. There is one sample, one participant per row in this table.

LabKey API¶

Setting up access¶

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n r-tables r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y
conda activate r-tables

You can now launch an R notebook under the r-tables kernel. More details

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required python packages:

conda create -n python-tables psycopg2 ipykernel sqlalchemy pandas numpy -y
conda activate python-tables

You can now launch a Jupyter notebook under the python-tables kernel.

Import modules/libraries you need ¶

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

library(tidyverse)
library(Rlabkey)
library(readr)

If you are using R version 3 or above, you should also add the following line:

labkey.setWafEncoding(FALSE)

import numpy as np
import functools
import labkey
import pandas as pd

library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)
library(readr)

import numpy as np
import functools
import psycopg2
from sqlalchemy import create_engine, event, text

Helper function to access the LabKey API ¶

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}

def labkey_to_df(sql_query, database, maxrows):

    ver = labkey.__version__

    if ver < '2.4.0':
        server_context = labkey.utils.create_server_context(
            domain = "labkey-embassy.gel.zone",
            container_path = database,
            context_path = "labkey",
            use_ssl = True
        )

        results =  labkey.query.execute_sql(
            server_context,
            schema_name = "lists",
            sql = sql_query,
            max_rows = maxrows
        )

    if ver >= '2.4.0':
        from labkey.api_wrapper import APIWrapper

        labkey_server = "labkey-embassy.gel.zone"
        project_name = database
        contextPath = "labkey"
        schema = 'lists'
        api = APIWrapper(
            labkey_server,
            project_name,
            contextPath,
            use_ssl=True
            )
        results = api.query.execute_sql(
            sql=sql_query,
            schema_name=schema,
            max_rows=maxrows,
            waf_encode_sql=False
            )

    return(pd.DataFrame(results['rows']))

labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", database, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

def query_to_df(sql_query, version):
    database = "gel_clinical_cb_sql_pro"
    host = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
    port = 5432
    password = 'anXReTz36Q5r'
    user = 'jupyter_notebook'
    engine = create_engine(f'''postgresql://{user}:{password}@{host}:{port}/{database}''')

    @event.listens_for(engine, "connect", insert=True)
    def set_search_path(dbapi_connection, connection_record):
        existing_autocommit = dbapi_connection.autocommit
        dbapi_connection.autocommit = True
        cursor = dbapi_connection.cursor()
        cursor.execute(f"SET SESSION search_path={version}")
        cursor.close()
        dbapi_connection.autocommit = existing_autocommit

    with engine.connect() as connection:
        result = connection.execute(text(sql_query))
        return(pd.DataFrame(result))

To run my queries, I'll need to set up my database version:

100kGP or main programmeNHS GMS

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

version <- "/main-programme/main-programme_v19_2024-10-31"

version = "/main-programme/main-programme_v19_2024-10-31"

version <- "source_data_100kv16_covidv4"

version = "source_data_100kv16_covidv4"

RE LabKey API RRE LabKey API Python

version <- "nhs-gms/nhs-gms-release_v5_2025-08-28"

version = "nhs-gms/nhs-gms-release_v5_2025-08-28"

Importing the `aggregate_gvcf_sample_stats` table¶

This example show how to fetch the entire aggregate_gvcf_sample_stats table and save as an object in your session. The table can then be post-filtered.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

whole_agg_sql <- "SELECT * from aggregate_gvcf_sample_stats"
whole_agg_query <- labkey_to_df(whole_agg_sql, version, 1000000)

whole_agg_sql = "SELECT * from aggregate_gvcf_sample_stats"
whole_agg_query = labkey_to_df(whole_agg_sql, version, 1000000)

whole_agg_sql <- "SELECT * from aggregate_gvcf_sample_stats"
whole_agg_query <- labkey_to_df(whole_agg_sql, version, 1000000)

whole_agg_sql = "SELECT * from aggregate_gvcf_sample_stats"
whole_agg_query = query_to_df(whole_agg_sql, version)

Filtering tables and selecting columns¶

Tables can be filtered for certain attributes and specified columns selected in labkey using SQL. In this example, we can filter the aggregate_gvcf_sample_stats table for participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We can then select only the participant ID and platekey columns.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

source <- "BLOOD"
source_sql <- paste("SELECT participant_id, platekey
    FROM aggregate_gvcf_sample_stats
    WHERE type = 'rare disease germline'
    AND sample_source = '", source, "'", sep = "")
source_query <- labkey_to_df(source_sql, version, 100000)

source = "BLOOD"

source_sql = (f'''
SELECT participant_id, platekey
  FROM aggregate_gvcf_sample_stats
  WHERE type = 'rare disease germline'
    AND sample_source = '{source}'
    ''')
source_query = labkey_to_df(source_sql, version, 10000)

source <- "BLOOD"
source_sql <- paste("SELECT participant_id, platekey
    FROM aggregate_gvcf_sample_stats
    WHERE type = 'rare disease germline'
    AND sample_source = '", source, "'", sep = "")
source_query <- labkey_to_df(source_sql, version, 100000)

source = "BLOOD"

source_sql = (f'''
SELECT participant_id, platekey
  FROM aggregate_gvcf_sample_stats
  WHERE type = 'rare disease germline'
    AND sample_source = '{source}'
    ''')
source_query = query_to_df(source_sql, version)

Joining multiple tables together to create a cohort¶

The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We then inner-join these results together with the rare_diseases_participant_phenotype table to identify participants who are positive for the HPO Term: HP:0000365. We pull back a table of participant ID and plate-key of participants who satisfy the criteria.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

source <- "BLOOD"
phenotype <- "HP:0000365"
type <- "rare disease germline"

source_phenotype_sql <- paste("SELECT agg.participant_id, agg.platekey
   FROM aggregate_gvcf_sample_stats
   AS agg
   INNER JOIN rare_diseases_participant_phenotype
   AS phe
   ON agg.participant_id = phe.participant_id
   WHERE agg.type = '", type, "'
   AND agg.sample_type = '", source, "'
   AND phe.hpo_id = '", phenotype, "'
   AND phe.hpo_present = 'Yes'", sep = "")
source_phenotype_query <- labkey_to_df(source_phenotype_sql, version, 100000)

source = "BLOOD"
phenotype = "HP:0000365"
type = "rare disease germline"

source_phenotype_sql = (f'''
SELECT agg.participant_id, agg.platekey
   FROM aggregate_gvcf_sample_stats
   AS agg
   INNER JOIN rare_diseases_participant_phenotype
   AS phe
   ON agg.participant_id = phe.participant_id
   WHERE agg.type = '{type}'
   AND agg.sample_type = '{source}'
   AND phe.hpo_id = '{phenotype}'
   AND phe.hpo_present = 'Yes'
  ''')
source_phenotype_query = labkey_to_df(source_phenotype_sql, version, 10000)

source <- "BLOOD"
phenotype <- "HP:0000365"
type <- "rare disease germline"

source_phenotype_sql <- paste("SELECT agg.participant_id, agg.platekey
     FROM aggregate_gvcf_sample_stats
     AS agg
     INNER JOIN rare_diseases_participant_phenotype
     AS phe
     ON agg.participant_id = phe.participant_id
     WHERE agg.type = '", type, "'
     AND agg.sample_type = '", source, "'
     AND phe.hpo_id = '", phenotype, "'
     AND phe.hpo_present = 'Yes'", sep = "")
source_phenotype_query <- labkey_to_df(source_phenotype_sql, version, 100000)

source = "BLOOD"
phenotype = "HP:0000365"
type = "rare disease germline"

source_phenotype_sql = (f'''
SELECT agg.participant_id, agg.platekey
   FROM aggregate_gvcf_sample_stats
   AS agg
   INNER JOIN rare_diseases_participant_phenotype
   AS phe
   ON agg.participant_id = phe.participant_id
   WHERE agg.type = '{type}'
   AND agg.sample_type = '{source}'
   AND phe.hpo_id = '{phenotype}'
   AND phe.hpo_present = 'Yes'
  ''')
source_phenotype_query = query_to_df(source_phenotype_sql, version)

AggV2 code book - phenotype queries¶

Overview¶

LabKey API¶

Setting up access¶

Import modules/libraries you need ¶

Helper function to access the LabKey API ¶

Importing the aggregate_gvcf_sample_stats table¶

Filtering tables and selecting columns¶

Joining multiple tables together to create a cohort¶

Importing the `aggregate_gvcf_sample_stats` table¶