As of the 15th of June 2022, Genomics England has started the process of releasing consented genomic data from the NHS Genomic Medicine Service (NHS GMS) within the Research Environment. The clinical and other relevant information relating to these participants will be contained within LabKey tables in much the same way that the main programme data is stored. For a summary of the data structure, we recommend that you consult the appropriate page and release notes on the current NHS GMS release.
Participant IDs within the NHS GMS data use a different format than found within the 100,000 genomes project. Due to the underlying infrastructure used by our implementation of LabKey it is not possible to cross-query the 100K Genomes and NHS GMS data in a single step. If the query you are working with will return data in both datasets you would need to recover these in two separate queries prior to combining the data within R or Python dataframes. In the Main Programme releases the distinction between the Cancer and Rare Disease cohorts was included in the Participants table. Within the first release of the NHS GMS tables, this information can now be found in the referral table using the Clinical Indication Code column. Codes containing 'R' will refer to Rare Disease referrals and 'M' to Cancer referrals. We will improve clarity on this aspect in future releases.
The LabKey interface will allow you to browse and filter the information located in a particular table. While this enables you to gain an insight into the information stored in that table, the major power of the LabKey database for cohort building lies in the ability to cross-query the tables. We would recommend the use of the APIs for this purpose. We support two APIs within the Research Environment, one for Python and one for R.
The APIs mentioned above will accept queries in a number of formats, however, we would recommend using the execute SQL functions to submit SQL queries of the data in the release. Using SQL will allow you to as more complex questions that will touch data across multiple tables, and will allow the returned data to be processed in a more effective manner.
In the following example, we will extract the genomic features for Tier 1 and 2 variants uncovered within participants that present an "intellectual disability" phenotype.
The library needed to access the API can be loaded in the Research Environment with the command module load python/3.8.1 and in the HPC with the command module load lang/Python/3.7.4-GCCcore-8.3.0.
The following script will allow you to query the initial release of the NHS GMS data from LabKey using the v1.x python API that is included in the modules stated above, the syntax is slightly different from the code autogenerated by the LabKey user interface:
importpandasaspdimportlabkey# Define LabKey constantslabkey_server="labkey-embassy.gel.zone"# Do not change thisproject_name="nhs-gms/nhs-gms-release_v2_2023-02-28"# This can be changed to different NHS GMS releasescontext_path="labkey"# Do not change thisschema_name="lists"# Do not change this# Create and object that will help connect to the LabKey databases. This does not changeserver_context=labkey.utils.create_server_context(labkey_server,project_name,context_path,use_ssl=True)# Create the querying functiondefrun_query(sql_str):""" Runs an SQL query input: - sql_str: str; an SQL query string. This can be either written out long form or generated by a function output: - data: pd.DataFrame """results=labkey.query.execute_sql(server_context=server_context,schema_name=schema_name,sql=sql_str,max_rows=10000000# Only needed if your query will exceed the default number of rows returned)data=pd.DataFrame(results["rows"])returndata
The information in this example can be drawn from three LabKey tables (tiering_data, panels_applied and exomiser) within a single query. As part of this query, we hope to obtain information on the number of participants affected, the genes that have been scored as being in Tier 1 or 2 within this cohort, and the indications of the genomic changes reported by the exomiser analysis.
The variable sql_str will need to be a correctly formatted SQL query such as described in the example in the previous section. The query would be run with the command:
gms_nhs_data=run_query(sql_str)
As the returned dataframe can display the columns in an unexpected order, you can rest the order of the columns using Pandas. For the above example query this would be:
As with the 100,000 genomes project, the same query mentioned above can also be performed using the RLabKey library. This library is centrally installed within all the versions of R that we currently support. You will be able to access the Rlabkey library within the system library for all supported versions of R in the Research Environment and in the HPC. For example, in the Research Environment, you could access R v.4.0.2 with the terminal command module load R/4.0.2; on the HPC this command would be module load lang/R/4.0.2-foss-2019b.
The query used can be saved as a variable in the same way as shown in the python example above:
library(Rlabkey)# Set the baseURLlabkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")mysql<-labkey.executeSql(schemaName="lists",# Do not change thiscolNameOpt="rname",# Do not change thismaxRows=100000000,# Do not change thisfolderPath="nhs-gms/nhs-gms-release_v2_2023-02-28",# change as needed to conform to your authorised releasesql=sql_str# This can be changed to your query of choice)
One major difference between the R and Python APIs is that the R dataframe does not need to be rearranged once the data has been recovered from the LabKey database.
library(Rlabkey)# Set the baseURLlabkey.setDefaults(baseUrl="https://labkey-embassy.gel.zone/labkey/")sql_query<-"SELECT DISTINCT td.participant_id, td.genomic_feature_hgnc, td.genotype, td.segregation_pattern, td.penetrance, td.phenotype AS tiered_phenotype, pa.phenotype AS panel_phenotype, pa.panel_name, td.ensembl_id, td.tier, td.chromosome, td.position, td.reference, td.alternate, CONCAT_WS( '_', td.chromosome, td.position, td.reference, td.alternate ) AS tiered_chr_pos_ref_alt, e.score, e.hgvsFROM tiering_data tdJOIN panels_applied pa ON pa.participant_id = td.participant_id AND pa.phenotype = td.phenotypeJOIN exomiser e on e.participant_id = td.participant_idWHERE td.tier in ('TIER1','TIER2') AND LOWER(pa.phenotype) = LOWER('intellectual disability') AND ( td.chromosome = e.chromosome AND td.position = e.position AND td.reference = e.reference AND td.alternate = e.alternate ); "mysql<-labkey.executeSql(schemaName="lists",# Do not change thiscolNameOpt="rname",# Do not change thismaxRows=100000000,# Do not change thisfolderPath="nhs-gms/nhs-gms-release_v2_2023-02-28",# This can be changed to different NHS GMS releasessql=sql_str# This can be changed to your query of choice)
importpandasaspdimportlabkey# Define LabKey constantslabkey_server="labkey-embassy.gel.zone"# Do not change thisproject_name="nhs-gms/nhs-gms-release_v2_2023-02-28"# This can be changed to different NHS GMS releasescontext_path="labkey"# Do not change thisschema_name="lists"# Do not change this# Create and object that will help connect to the LabKey databases. This does not changeserver_context=labkey.utils.create_server_context(labkey_server,project_name,context_path,use_ssl=True)sql_query="""SELECT DISTINCT td.participant_id, td.genomic_feature_hgnc, td.genotype, td.segregation_pattern, td.penetrance, td.phenotype AS tiered_phenotype, pa.phenotype AS panel_phenotype, pa.panel_name, td.ensembl_id, td.tier, td.chromosome, td.position, td.reference, td.alternate, CONCAT_WS( '_', td.chromosome, td.position, td.reference, td.alternate ) AS tiered_chr_pos_ref_alt, e.score, e.hgvsFROM tiering_data tdJOIN panels_applied pa ON pa.participant_id = td.participant_id AND pa.phenotype = td.phenotypeJOIN exomiser e on e.participant_id = td.participant_idWHERE td.tier in ('TIER1','TIER2') AND LOWER(pa.phenotype) = LOWER('intellectual disability') AND ( td.chromosome = e.chromosome AND td.position = e.position AND td.reference = e.reference AND td.alternate = e.alternate );"""# Create the querying functiondefrun_query(sql_str):""" Runs an SQL query input: - sql_str: str; an SQL query string. This can be either written out long form or generated by a function output: - data: pd.DataFrame """results=labkey.query.execute_sql(server_context=server_context,schema_name=schema_name,sql=sql_str,max_rows=10000000# Only needed if your query will exceed the default number of rows returned)data=pd.DataFrame(results["rows"])returndatanhs_gms_data=run_query(sql_query)