Finding participants with prioritised variants programmatically¶

Give us feedback on this tutorial

Not all genetic variants can be found using LabKey. The only variants you can see in LabKey are those that have been prioritised as likely causal for the disorder studied. For rare disease, variants are prioritised using tiering and Exomiser, whereas cancer variants are prioritised using tiering, in the tables tiering_data, exomiser and cancer_tier_and_domain_variants respectively for 100k. Rare disease tiering_data and exomiser are also available for NHS GMS genomes.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n r-tables r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y
conda activate r-tables

You can now launch an R notebook under the r-tables kernel. More details

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required python packages:

conda create -n python-tables psycopg2 ipykernel pandas numpy -y
conda activate python-tables

You can now launch a Jupyter notebook under the python-tables kernel.

Contents:¶

Import modules/libraries you need
Helper function to access the LabKey API
Querying the tiering_data table
Querying the exomiser table
Querying the cancer_tier_and_domain_variants table
Querying the NHS GMS tiering_data table

Import modules/libraries you need ¶

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

library(tidyverse)
library(Rlabkey)
library(readr)

If you are using R version 3 or above, you should also add the following line:

labkey.setWafEncoding(FALSE)

import numpy as np
import functools
import labkey
import pandas as pd

library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)
library(readr)

import numpy as np
import functools
import psycopg2

Helper function to access the LabKey API ¶

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}

    def labkey_to_df(sql_query, database, maxrows):

    ver = labkey.__version__

    if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
      server_context = labkey.utils.create_server_context(
          domain = "labkey-embassy.gel.zone",
          container_path = database,
          context_path = "labkey",
          use_ssl = True
      )

      results =  labkey.query.execute_sql(
          server_context,
          schema_name = "lists",
          sql = sql_query,
          max_rows = maxrows
      )

    if ver >= '2.4.0':
      from labkey.api_wrapper import APIWrapper

      labkey_server = "labkey-embassy.gel.zone"
      project_name = database
      contextPath = "labkey"
      schema = 'lists'
      api = APIWrapper(
        labkey_server,
        project_name,
        contextPath,
        use_ssl=True
        )
      results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

    return(pd.DataFrame(results['rows']))

labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", database, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

def query_to_df(sql_query, database):
    connection = psycopg2.connect(
        dbname = "gel_clinical_cb_sql_pro",
        options = f"-c search_path=source_data_100kv16_covidv4",
        host = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com",
        port = 5432,
        password = 'anXReTz36Q5r',
        user = 'jupyter_notebook'
    )
    return(pd.read_sql_query(sql_query, connection))

To run my queries, I'll need to set up my database version:

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

version <- "/main-programme/main-programme_v19_2024-10-31"

version = "/main-programme/main-programme_v19_2024-10-31"

version <- "source_data_100kv16_covidv4"

version = "source_data_100kv16_covidv4"

Querying the tiering_data table ¶

We can query the rare disease tiering_data table by a genomic locus or by a gene. The following query finds all participants with prioritised variants in a gene of interest, getting the variant loci and reference and alternative alleles. Since this is a gene-based query, it will find all rows where the gene name is listed, which will include participant genomes aligned to GRCh37 and GRCh38.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

gene <- "NIPBL"
tiering_sql <- paste("SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype ",
  "FROM tiering_data ",
  "WHERE genomic_feature_hgnc='", gene, "'", sep="")

tiering_query <- labkey_to_df(tiering_sql, version, 10000)

gene = "NIPBL"

gene_tiering_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '{gene}'
    ''')
gene_tiering_query = labkey_to_df(gene_tiering_sql, version, 10000)

gene <- "NIPBL"
tiering_sql <- paste("SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype ",
  "FROM tiering_data ",
  "WHERE genomic_feature_hgnc='", gene, "'", sep="")

tiering_query <- labkey_to_df(tiering_sql, version, 10000)

gene = "NIPBL"

gene_tiering_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '{gene}'
    ''')
gene_tiering_query = labkey_to_df(gene_tiering_sql, version)

We can also query for a genomic locus. In this case we must include the genome assembly as an argument. For completeness, you may wish to also query for the remapped coordinate on GRCh37.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

chromosome <- "5"
coordinate <- "37064560"
assembly <- "GRCh38"

locus_tiering_sql <- paste("SELECT participant_id, phenotype, genomic_feature_hgnc, genotype ",
    "FROM tiering_data ",
    "WHERE chromosome = '", chromosome,
      "' AND position = '", coordinate,
      "' AND assembly = '", assembly, "'", sep="")

locus_tiering_query <- labkey_to_df(locus_tiering_sql, version, 10000)

chromosome = "5"
coordinate = "37064560"
assembly = "GRCh38"

locus_tiering_sql = (f'''
    SELECT participant_id, phenotype, genomic_feature_hgnc, genotype
    FROM tiering_data
    WHERE chromosome = '{chromosome}'
      AND position = '{coordinate}'
      AND assembly = '{assembly}'
    ''')

locus_tiering_query = labkey_to_df(locus_tiering_sql, version, 10000)

chromosome <- "5"
coordinate <- "37064560"
assembly <- "GRCh38"

locus_tiering_sql <- paste("SELECT participant_id, phenotype, genomic_feature_hgnc, genotype ",
    "FROM tiering_data ",
    "WHERE chromosome = '", chromosome,
      "' AND position = '", coordinate,
      "' AND assembly = '", assembly, "'", sep="")

locus_tiering_query <- labkey_to_df(locus_tiering_sql, version, 10000)

chromosome = "5"
coordinate = "37064560"
assembly = "GRCh38"

locus_tiering_sql = (f'''
    SELECT participant_id, phenotype, genomic_feature_hgnc, genotype
    FROM tiering_data
    WHERE chromosome = '{chromosome}'
      AND position = '{coordinate}'
      AND assembly = '{assembly}'
    ''')

locus_tiering_query = labkey_to_df(locus_tiering_sql, version)

Querying the exomiser table ¶

The exomiser table allows you to search for participants by locus or gene, just like the tiering table. However, it also contains HGVS notation for variants in the form:

gene_name:ENST0000000####:c.99A>T:p.(Gly33Glu)

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

gene_name <- "MUM1L1"
aa_change <- "Leu7Gln"

exomiser_hgvs_sql <- paste( "SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype ",
    "FROM exomiser ",
    " WHERE hgvs like '%", gene_name,
    "%' AND hgvs like '%", aa_change, "%'",
    sep="")

exomiser_hgvs_query <- labkey_to_df(exomiser_hgvs_sql, version, 10000)

gene_name = "MUM1L1"
aa_change = "Leu7Gln"

exomiser_hgvs_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype
    FROM exomiser
    WHERE hgvs like '%{gene_name}%'
      AND hgvs like '%{aa_change}%'
    ''')

exomiser_hgvs_query = labkey_to_df(exomiser_hgvs_sql, version, 10000)

gene_name <- "MUM1L1"
aa_change <- "Leu7Gln"

exomiser_hgvs_sql <- paste( "SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype ",
    "FROM exomiser ",
    " WHERE hgvs like '%", gene_name,
    "%' AND hgvs like '%", aa_change, "%'",
    sep="")

exomiser_hgvs_query <- labkey_to_df(exomiser_hgvs_sql, version, 10000)

gene_name = "MUM1L1"
aa_change = "Leu7Gln"

exomiser_hgvs_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype
    FROM exomiser
    WHERE hgvs like '%{gene_name}%'
      AND hgvs like '%{aa_change}%'
    ''')

exomiser_hgvs_query = labkey_to_df(exomiser_hgvs_sql, version)

Querying the cancer_tier_and_domain_variants table ¶

Now we'll query the cancer_tier_and_domain_variants. This has a similar structure to the rare disease tiering table, so we can query it for a gene and/or region in the same way.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

cancer_gene <- "TP53"

cancer_gene_tiering_sql = paste( "SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '", cancer_gene, "'
    ", sep = "")
cancer_gene_tiering_query <- labkey_to_df(cancer_gene_tiering_sql, version, 10000)

cancer_gene = "TP53"

cancer_gene_tiering_sql = (f'''
    SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '{cancer_gene}'
    ''')
cancer_gene_tiering_query = labkey_to_df(cancer_gene_tiering_sql, version, 10000)

cancer_gene <- "TP53"

cancer_gene_tiering_sql = paste( "SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '", cancer_gene, "'
    ", sep = "")
cancer_gene_tiering_query <- labkey_to_df(cancer_gene_tiering_sql, version, 10000)

cancer_gene = "TP53"

cancer_gene_tiering_sql = (f'''
    SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '{cancer_gene}'
    ''')
cancer_gene_tiering_query = labkey_to_df(cancer_gene_tiering_sql, version)

Since all cancer genomes are aligned to GRCh38, querying by a locus does not require the assembly argument.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

cancer_chromosome <- "17"
cancer_coordinate <- "7675166"

cancer_locus_tiering_sql <- paste("SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '", cancer_chromosome, "'
      AND pos = '", cancer_coordinate, "'
    ", sep = "")

cancer_locus_tiering_query <- labkey_to_df(cancer_locus_tiering_sql, version, 10000)

cancer_chromosome = "17"
cancer_coordinate = "7675166"

cancer_locus_tiering_sql = (f'''
    SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '{cancer_chromosome}'
      AND pos = '{cancer_coordinate}'
    ''')

cancer_locus_tiering_query = labkey_to_df(cancer_locus_tiering_sql, version, 10000)

cancer_chromosome <- "17"
cancer_coordinate <- "7675166"

cancer_locus_tiering_sql <- paste("SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '", cancer_chromosome, "'
      AND pos = '", cancer_coordinate, "'
    ", sep = "")

cancer_locus_tiering_query <- labkey_to_df(cancer_locus_tiering_sql, version, 10000)

cancer_chromosome = "17"
cancer_coordinate = "7675166"

cancer_locus_tiering_sql = (f'''
    SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '{cancer_chromosome}'
      AND pos = '{cancer_coordinate}'
    ''')

cancer_locus_tiering_query = labkey_to_df(cancer_locus_tiering_sql, version)

Querying the NHS GMS tiering_data table ¶

Again, the structure is very similar to that in 100k, but we don't need to fetch the assembly as all genomes are aligned to GRCh38. We have to make a new version object for the NHS GMS database.

RE LabKey API RRE LabKey API PythonCloudOS interactive session RCloudOS interactive session Python

gms_version <- "nhs-gms/nhs-gms-release_v2_2023-02-28"

gms_gene_tiering_sql <- paste("SELECT participant_id, phenotype, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '", gene, "'
    ", sep = "")

gms_gene_tiering_query <- labkey_to_df(gms_gene_tiering_sql, gms_version, 10000)

gms_version = "nhs-gms/nhs-gms-release_v2_2023-02-28"

gms_gene_tiering_sql = (f'''
    SELECT participant_id, phenotype, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '{gene}'
    ''')

gms_gene_tiering_query = labkey_to_df(gms_gene_tiering_sql, gms_version, 10000)

NHS GMS data is not currently accessible in CloudOS.

Finding participants with prioritised variants programmatically¶

Contents:¶

Import modules/libraries you need ¶

Helper function to access the LabKey API ¶

Querying the tiering_data table¶

Querying the exomiser table¶

Querying the cancer_tier_and_domain_variants table¶

Querying the NHS GMS tiering_data table¶

Querying the tiering_data table ¶

Querying the exomiser table ¶

Querying the cancer_tier_and_domain_variants table ¶

Querying the NHS GMS tiering_data table ¶