
Working with NHS GMS data within LabKey

NHS GMS data in LabKey

On 15 June 2022, Genomics England began releasing consented genomic data from the NHS Genomic Medicine Service (NHS GMS) within the Research Environment. The clinical and other relevant information relating to these participants is contained within LabKey tables, in much the same way as the main programme data. For a summary of the data structure, we recommend consulting the appropriate page and the release notes for the current NHS GMS release.

Participant IDs

Participant IDs within the NHS GMS data use a different format from those found within the 100,000 Genomes Project. Due to the underlying infrastructure used by our implementation of LabKey, it is not possible to cross-query the 100,000 Genomes and NHS GMS data in a single step. If your query would return data from both datasets, you will need to retrieve the results in two separate queries before combining them within R or Python dataframes, as sketched below. In the Main Programme releases, the distinction between the Cancer and Rare Disease cohorts was included in the Participants table. In the first release of the NHS GMS tables, this information can instead be found in the referral table, in the Clinical Indication Code column: codes containing 'R' refer to Rare Disease referrals and those containing 'M' to Cancer referrals. We will improve clarity on this aspect in future releases.
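
As an illustration of this two-step approach, the sketch below assumes that both result sets have already been retrieved as pandas dataframes using the LabKey API examples later on this page; the contents of the dataframes are placeholder values, not real release data.

import pandas as pd

# Placeholder dataframes standing in for two separate LabKey queries:
# one against a 100,000 Genomes release folder and one against an NHS GMS release folder.
main_programme_df = pd.DataFrame({
    "participant_id": ["example_100k_id_1", "example_100k_id_2"],
    "source": ["100k", "100k"],
})
nhs_gms_df = pd.DataFrame({
    "participant_id": ["example_gms_id_1"],
    "source": ["nhs-gms"],
})

# The two result sets can only be combined after they have been returned by separate queries
combined = pd.concat([main_programme_df, nhs_gms_df], ignore_index=True)
print(combined)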

LabKey APIs

The LabKey interface will allow you to browse and filter the information located in a particular table. While this enables you to gain an insight into the information stored in that table, the major power of the LabKey database for cohort building lies in the ability to cross-query the tables. We would recommend the use of the APIs for this purpose. We support two APIs within the Research Environment, one for Python and one for R.

Examples

The APIs mentioned above accept queries in a number of formats; however, we recommend using the execute SQL functions to submit SQL queries against the data in the release. Using SQL allows you to ask more complex questions that touch data across multiple tables, and allows the returned data to be processed in a more effective manner.

In the following example, we will extract the genomic features for Tier 1 and 2 variants uncovered within participants that present an "intellectual disability" phenotype.

SELECT
  DISTINCT td.participant_id,
  td.genomic_feature_hgnc,
  td.genotype,
  td.segregation_pattern,
  td.penetrance,
  td.phenotype AS tiered_phenotype,
  pa.phenotype AS panel_phenotype,
  pa.panel_name,
  td.ensembl_id,
  td.tier,
  td.chromosome,
  td.position,
  td.reference,
  td.alternate,
  CONCAT_WS(
    '_',
    td.chromosome,
    td.position,
    td.reference,
    td.alternate
    ) AS tiered_chr_pos_ref_alt,
  e.score,
  e.hgvs
FROM
  tiering_data td
JOIN
  panels_applied pa ON
    pa.participant_id = td.participant_id AND
    pa.phenotype = td.phenotype
JOIN
  exomiser e ON e.participant_id = td.participant_id
WHERE
  td.tier IN ('TIER1', 'TIER2') AND
  LOWER(pa.phenotype) = LOWER('intellectual disability') AND
  (
    td.chromosome = e.chromosome AND
    td.position = e.position AND
    td.reference = e.reference AND
    td.alternate = e.alternate
    );
The query above provides examples of:
  • the SELECT, FROM and WHERE principles of the SQL language
  • joining tables on single and multiple columns
  • conditions within lists
  • case-folding
  • compound conditions
  • output collation with CONCAT_WS

The basics of SQL syntax can be reviewed at W3Schools.

For more in-depth reference material and LabKey-specific commands, we recommend consulting the LabKey SQL guide.

Python API

The library needed to access the API can be loaded in the Research Environment with the command module load python/3.8.1, and on the HPC with the command module load lang/Python/3.7.4-GCCcore-8.3.0.

The following script will allow you to query the initial release of the NHS GMS data from LabKey using the v1.x Python API included in the modules stated above. Note that the syntax is slightly different from the code autogenerated by the LabKey user interface:

import pandas as pd
import labkey


# Define LabKey constants
labkey_server = "labkey-embassy.gel.zone"                # Do not change this
project_name = "nhs-gms/nhs-gms-release_v2_2023-02-28"   # This can be changed to different NHS GMS releases
context_path = "labkey"                                  # Do not change this
schema_name = "lists"                                    # Do not change this


# Create an object that will help connect to the LabKey databases. This does not change
server_context = labkey.utils.create_server_context(
    labkey_server,
    project_name,
    context_path,
    use_ssl = True
)


# Create the querying function
def run_query(sql_str):
    """
    Runs an SQL query
    input:
        - sql_str: str; an SQL query string. This can be either written out long
        form or generated by a function
    output:
        - data: pd.DataFrame
    """
    results = labkey.query.execute_sql(
        server_context = server_context,
        schema_name = schema_name,
        sql = sql_str,
        max_rows = 10000000 # Only needed if your query will exceed the default number of rows returned
    )

    data = pd.DataFrame(results["rows"])

    return data

The information in this example is drawn from three LabKey tables (tiering_data, panels_applied and exomiser) within a single query. From this query, we hope to obtain the number of participants affected, the genes scored as Tier 1 or 2 within this cohort, and the Exomiser scores and HGVS annotations for the corresponding genomic changes.

The variable sql_str needs to be a correctly formatted SQL query, such as the example in the previous section. The query is then run with the command:

gms_nhs_data = run_query(sql_str)
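
Once the dataframe has been returned, a quick way to summarise the cohort described above is to count the distinct participants and tiered genes. This is a minimal sketch using two of the columns returned by the example query:

# Summarise the returned dataframe using columns from the example query above
n_participants = gms_nhs_data['participant_id'].nunique()
tiered_genes = sorted(gms_nhs_data['genomic_feature_hgnc'].unique())
print(f"{n_participants} participants, {len(tiered_genes)} genes in Tier 1 or 2")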

As the returned dataframe can display the columns in an unexpected order, you can reset the order of the columns using pandas. For the above example query this would be:

gms_nhs_data = gms_nhs_data[[
    'participant_id',
    'genomic_feature_hgnc',
    'genotype',
    'segregation_pattern',
    'penetrance',
    'tiered_phenotype',
    'panel_phenotype',
    'panel_name',
    'ensembl_id',
    'tier',
    'chromosome',
    'position',
    'reference',
    'alternate',
    'tiered_chr_pos_ref_alt',
    'score',
    'hgvs'
]]

Rlabkey API library

As with the 100,000 Genomes Project, the same query mentioned above can also be performed using the Rlabkey library. This library is centrally installed in the system library of all versions of R that we currently support, both in the Research Environment and on the HPC. For example, in the Research Environment you can access R v4.0.2 with the terminal command module load R/4.0.2; on the HPC this command would be module load lang/R/4.0.2-foss-2019b.

The query can be saved as a string variable in the same way as shown in the Python example above (here we call it sql_query):

library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

mysql <- labkey.executeSql(
    schemaName="lists",                                                 # Do not change this
    colNameOpt = "rname",                                               # Do not change this
    maxRows = 100000000,                                                # Do not change this
    folderPath="nhs-gms/nhs-gms-release_v2_2023-02-28",                 # change as needed to conform to your authorised release
    sql = sql_query                                                     # This can be changed to your query of choice
)

One major difference between the R and Python APIs is that the R dataframe does not need to be rearranged once the data has been recovered from the LabKey database.

These scripts are reproduced in full below, first for R and then for Python:

library(Rlabkey)

# Set the baseURL
labkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")

sql_query <- "
SELECT
    DISTINCT td.participant_id,
    td.genomic_feature_hgnc,
    td.genotype,
    td.segregation_pattern,
    td.penetrance,
    td.phenotype AS tiered_phenotype,
    pa.phenotype AS panel_phenotype,
    pa.panel_name,
    td.ensembl_id,
    td.tier,
    td.chromosome,
    td.position,
    td.reference,
    td.alternate,
    CONCAT_WS(
        '_',
        td.chromosome,
        td.position,
        td.reference,
        td.alternate
        ) AS tiered_chr_pos_ref_alt,
    e.score,
    e.hgvs
FROM
    tiering_data td
JOIN
    panels_applied pa ON
        pa.participant_id = td.participant_id AND
        pa.phenotype = td.phenotype
JOIN
    exomiser e ON e.participant_id = td.participant_id
WHERE
    td.tier IN ('TIER1','TIER2') AND
    LOWER(pa.phenotype) = LOWER('intellectual disability') AND
    (
        td.chromosome = e.chromosome AND
        td.position = e.position AND
        td.reference = e.reference AND
        td.alternate = e.alternate
    ); "

mysql <- labkey.executeSql(
    schemaName="lists",                                                 # Do not change this
    colNameOpt = "rname",                                               # Do not change this
    maxRows = 100000000,                                                # Do not change this
    folderPath="nhs-gms/nhs-gms-release_v2_2023-02-28",                 # This can be changed to different NHS GMS releases
    sql = sql_query                                                     # This can be changed to your query of choice
)

# Python equivalent of the R script above
import pandas as pd
import labkey


# Define LabKey constants
labkey_server = "labkey-embassy.gel.zone"                # Do not change this
project_name = "nhs-gms/nhs-gms-release_v2_2023-02-28"   # This can be changed to different NHS GMS releases
context_path = "labkey"                                  # Do not change this
schema_name = "lists"                                    # Do not change this


# Create an object that will help connect to the LabKey databases. This does not change
server_context = labkey.utils.create_server_context(
    labkey_server,
    project_name,
    context_path,
    use_ssl = True
)

sql_query = """
SELECT
    DISTINCT td.participant_id,
    td.genomic_feature_hgnc,
    td.genotype,
    td.segregation_pattern,
    td.penetrance,
    td.phenotype AS tiered_phenotype,
    pa.phenotype AS panel_phenotype,
    pa.panel_name,
    td.ensembl_id,
    td.tier,
    td.chromosome,
    td.position,
    td.reference,
    td.alternate,
    CONCAT_WS(
        '_',
        td.chromosome,
        td.position,
        td.reference,
        td.alternate
        ) AS tiered_chr_pos_ref_alt,
    e.score,
    e.hgvs
FROM
    tiering_data td
JOIN
    panels_applied pa ON
        pa.participant_id = td.participant_id AND
        pa.phenotype = td.phenotype
JOIN
    exomiser e ON e.participant_id = td.participant_id
WHERE
    td.tier IN ('TIER1','TIER2') AND
    LOWER(pa.phenotype) = LOWER('intellectual disability') AND
    (
        td.chromosome = e.chromosome AND
        td.position = e.position AND
        td.reference = e.reference AND
        td.alternate = e.alternate
    );
"""

# Create the querying function
def run_query(sql_str):
    """
    Runs an SQL query
    input:
        - sql_str: str; an SQL query string. This can be either written out long
        form or generated by a function
    output:
        - data: pd.DataFrame
    """
    results = labkey.query.execute_sql(
        server_context = server_context,
        schema_name = schema_name,
        sql = sql_str,
        max_rows = 10000000 # Only needed if your query will exceed the default number of rows returned
    )

    data = pd.DataFrame(results["rows"])

    return data

nhs_gms_data = run_query(sql_query)

Last update: November 17, 2023