Skip to content

Finding participants with prioritised variants programmatically

Give us feedback on this tutorial

Not all genetic variants can be found using LabKey. The only variants you can see in LabKey are those that have been prioritised as likely causal for the disorder studied. For rare disease, variants are prioritised using tiering and Exomiser, whereas cancer variants are prioritised using tiering, in the tables tiering_data, exomiser and cancer_tier_and_domain_variants respectively for 100k. Rare disease tiering_data and exomiser are also available for NHS GMS genomes.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n query-tables r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y

You can now launch an R notebook under the query-tables kernel. More details

Contents:

Import modules/libraries you need

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

1
2
3
library(tidyverse)
library(Rlabkey)
library(readr)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd
1
2
3
4
5
library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)
library(readr)

Helper function to access the LabKey API

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v18_2023-12-21
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}
def labkey_to_df(sql_query, database, maxrows):

ver = labkey.__version__

if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
  server_context = labkey.utils.create_server_context(
      domain = "labkey-embassy.gel.zone",
      container_path = database,
      context_path = "labkey",
      use_ssl = True
  )

  results =  labkey.query.execute_sql(
      server_context,
      schema_name = "lists",
      sql = sql_query,
      max_rows = maxrows
  )

if ver == '2.4.0':
  from labkey.api_wrapper import APIWrapper

  labkey_server = "labkey-embassy.gel.zone"
  project_name = database
  contextPath = "labkey"
  schema = 'lists'
  api = APIWrapper(
    labkey_server,
    project_name,
    contextPath,
    use_ssl=True
    )
  results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

return(pd.DataFrame(results['rows']))
labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", database, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

To run my queries, I'll need to set up my database version:

version <- "/main-programme/main-programme_v18_2023-12-21"
version = "/main-programme/main-programme_v18_2023-12-21"
version <- "source_data_100kv16_covidv4"

Querying the tiering_data table

We can query the rare disease tiering_data table by a genomic locus or by a gene. The following query finds all participants with prioritised variants in a gene of interest, getting the variant loci and reference and alternative alleles. Since this is a gene-based query, it will find all rows where the gene name is listed, which will include participant genomes aligned to GRCh37 and GRCh38.

1
2
3
4
5
6
gene <- "NIPBL"
tiering_sql <- paste("SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype ",
  "FROM tiering_data ",
  "WHERE genomic_feature_hgnc='", gene, "'", sep="")

tiering_query <- labkey_to_df(tiering_sql, version, 10000)
1
2
3
4
5
6
7
8
gene = "NIPBL"

gene_tiering_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '{gene}'
    ''')
gene_tiering_query = labkey_to_df(gene_tiering_sql, version, 10000)
1
2
3
4
5
6
gene <- "NIPBL"
tiering_sql <- paste("SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype ",
  "FROM tiering_data ",
  "WHERE genomic_feature_hgnc='", gene, "'", sep="")

tiering_query <- labkey_to_df(tiering_sql, version, 10000)

We can also query for a genomic locus. In this case we must include the genome assembly as an argument. For completeness, you may wish to also query for the remapped coordinate on GRCh37.

chromosome <- "5"
coordinate <- "37064560"
assembly <- "GRCh38"

locus_tiering_sql <- paste("SELECT participant_id, phenotype, genomic_feature_hgnc, genotype ",
    "FROM tiering_data ",
    "WHERE chromosome = '", chromosome,
      "' AND position = '", coordinate,
      "' AND assembly = '", assembly, "'", sep="")

locus_tiering_query <- labkey_to_df(locus_tiering_sql, version, 10000)
chromosome = "5"
coordinate = "37064560"
assembly = "GRCh38"

locus_tiering_sql = (f'''
    SELECT participant_id, phenotype, genomic_feature_hgnc, genotype
    FROM tiering_data
    WHERE chromosome = '{chromosome}'
      AND position = '{coordinate}'
      AND assembly = '{assembly}'
    ''')

locus_tiering_query = labkey_to_df(locus_tiering_sql, version, 10000)
chromosome <- "5"
coordinate <- "37064560"
assembly <- "GRCh38"

locus_tiering_sql <- paste("SELECT participant_id, phenotype, genomic_feature_hgnc, genotype ",
    "FROM tiering_data ",
    "WHERE chromosome = '", chromosome,
      "' AND position = '", coordinate,
      "' AND assembly = '", assembly, "'", sep="")

locus_tiering_query <- labkey_to_df(locus_tiering_sql, version, 10000)

Querying the exomiser table

The exomiser table allows you to search for participants by locus or gene, just like the tiering table. However, it also contains HGVS notation for variants in the form:

gene_name:ENST0000000####:c.99A>T:p.(Gly33Glu)

gene_name <- "MUM1L1"
aa_change <- "Leu7Gln"

exomiser_hgvs_sql <- paste( "SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype ",
    "FROM exomiser ",
    " WHERE hgvs like '%", gene_name,
    "%' AND hgvs like '%", aa_change, "%'",
    sep="")

exomiser_hgvs_query <- labkey_to_df(exomiser_hgvs_sql, version, 10000)
gene_name = "MUM1L1"
aa_change = "Leu7Gln"

exomiser_hgvs_sql = (f'''
    SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype
    FROM exomiser
    WHERE hgvs like '%{gene_name}%'
      AND hgvs like '%{aa_change}%'
    ''')

exomiser_hgvs_query = labkey_to_df(exomiser_hgvs_sql, version, 10000)
gene_name <- "MUM1L1"
aa_change <- "Leu7Gln"

exomiser_hgvs_sql <- paste( "SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype ",
    "FROM exomiser ",
    " WHERE hgvs like '%", gene_name,
    "%' AND hgvs like '%", aa_change, "%'",
    sep="")

exomiser_hgvs_query <- labkey_to_df(exomiser_hgvs_sql, version, 10000)

Querying the cancer_tier_and_domain_variants table

Now we'll query the cancer_tier_and_domain_variants. This has a similar structure to the rare disease tiering table, so we can query it for a gene and/or region in the same way.

1
2
3
4
5
6
7
cancer_gene <- "TP53"

cancer_gene_tiering_sql = paste( "SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '", cancer_gene, "'
    ", sep = "")
cancer_gene_tiering_query <- labkey_to_df(cancer_gene_tiering_sql, version, 10000)
1
2
3
4
5
6
7
8
cancer_gene = "TP53"

cancer_gene_tiering_sql = (f'''
    SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '{cancer_gene}'
    ''')
cancer_gene_tiering_query = labkey_to_df(cancer_gene_tiering_sql, version, 10000)
1
2
3
4
5
6
7
cancer_gene <- "TP53"

cancer_gene_tiering_sql = paste( "SELECT participant_id, disease_type, chr, pos, ref, alt
    FROM cancer_tier_and_domain_variants
    WHERE gene = '", cancer_gene, "'
    ", sep = "")
cancer_gene_tiering_query <- labkey_to_df(cancer_gene_tiering_sql, version, 10000)

Since all cancer genomes are aligned to GRCh38, querying by a locus does not require the assembly argument.

cancer_chromosome <- "17"
cancer_coordinate <- "7675166"

cancer_locus_tiering_sql <- paste("SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '", cancer_chromosome, "'
      AND pos = '", cancer_coordinate, "'
    ", sep = "")

cancer_locus_tiering_query <- labkey_to_df(cancer_locus_tiering_sql, version, 10000)
cancer_chromosome = "17"
cancer_coordinate = "7675166"

cancer_locus_tiering_sql = (f'''
    SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '{cancer_chromosome}'
      AND pos = '{cancer_coordinate}'
    ''')

cancer_locus_tiering_query = labkey_to_df(cancer_locus_tiering_sql, version, 10000)
cancer_chromosome <- "17"
cancer_coordinate <- "7675166"

cancer_locus_tiering_sql <- paste("SELECT participant_id, disease_type, gene
    FROM cancer_tier_and_domain_variants
    WHERE chr = '", cancer_chromosome, "'
      AND pos = '", cancer_coordinate, "'
    ", sep = "")

cancer_locus_tiering_query <- labkey_to_df(cancer_locus_tiering_sql, version, 10000)

Querying the NHS GMS tiering_data table

Again, the structure is very similar to that in 100k, but we don't need to fetch the assembly as all genomes are aligned to GRCh38. We have to make a new version object for the NHS GMS database.

1
2
3
4
5
6
7
8
gms_version <- "nhs-gms/nhs-gms-release_v2_2023-02-28"

gms_gene_tiering_sql <- paste("SELECT participant_id, phenotype, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '", gene, "'
    ", sep = "")

gms_gene_tiering_query <- labkey_to_df(gms_gene_tiering_sql, gms_version, 10000)
1
2
3
4
5
6
7
8
9
gms_version = "nhs-gms/nhs-gms-release_v2_2023-02-28"

gms_gene_tiering_sql = (f'''
    SELECT participant_id, phenotype, chromosome, position, reference, alternate, genotype
    FROM tiering_data
    WHERE genomic_feature_hgnc = '{gene}'
    ''')

gms_gene_tiering_query = labkey_to_df(gms_gene_tiering_sql, gms_version, 10000)

NHS GMS data is not currently accessible in CloudOS.