Not all genetic variants can be found using LabKey. The only variants you can see in LabKey are those that have been prioritised as likely causal for the disorder studied. For rare disease, variants are prioritised using tiering and Exomiser, whereas cancer variants are prioritised using cancer tiering, in the tables tiering_data, exomiser and cancer_tier_and_domain_variants respectively for 100k and NHS GMS.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:
LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:
We have here a helper function called query_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:
sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.
Feel free to include this function in all your scripts and invoke it every time you want to query the data.
baseURL change
We have updated the baseURL for accessing LabKey. If you are already using the LabKey API, you will find that this code is not compatible with your existing .netrc file. Please update this file to the new version. The new version contains the configuration for both the new baseURL and the old one, so you will not need to retroactively update old scripts.
# Helper functions to check labkey version and compare it to required versions for different # API callsdefget_labkey_version()->str|None:""" Helper functiont to recover the version of the labkey package in your enviroment Args:None Returns: str: Version number as string """try:ver=importlib.metadata.version("labkey")returnverexceptInvalidVersion:print("Invalid version format for labkey.")returnNonedefis_at_least_version(required_version:str)->bool|None:""" Helper function to check if labkey version is at least a specified version Args: required_cersion (str): the minimum version that is needed to select query functionality Returns: Bool: does the labkey package meet the requirement """current_version=get_labkey_version()ifcurrent_versionisNone:returnFalsetry:returncurrent_version>=required_versionexceptInvalidVersion:print("Invalid version format for comparison.")returnFalsedefis_one_of_versions(allowed_versions:List[str])->bool:""" Helper function to check if the labkey package one of a list of specifically supported versions Args:allowed_versions (List) """current_version=get_labkey_version()ifcurrent_versionisNone:returnFalsereturncurrent_versioninallowed_versions# Main function to execute labkey sql query and return results as a pandas dataframedefquery_to_df(sql_query,database)->DataFrame|str|Exception:"""generate an pandas dataframe from labkey sql query Args: sql_query (str): SQL query to execute database (str): LabKey project name Returns: pd.DataFrame: DataFrame containing the results of the SQL query msg (str): An error message that will let you know if your version of the package is supported Exception: A generic error generated if there is an issue with your enviroment or implementation """try:ifis_one_of_versions(['1.2.0','1.4.0','1.4.1']):server_context=labkey.utils.create_server_context(server="labkey.prod.aws.gel.ac",project=database,use_ssl=True)results=labkey.api.query.execute_sql(server_context,sql=sql_query,schema_name="lists",max_rows=100000000)returnDataFrame(results['rows'])elifis_at_least_version('2.4.0'):fromlabkey.api_wrapperimportAPIWrappercontext_path="labkey"api=APIWrapper(domain="labkey.prod.aws.gel.ac",container_path=database,context_path=context_path,use_ssl=True)results=api.query.execute_sql(sql=sql_query,schema_name="lists",max_rows=100000000,waf_encode_sql=False)returnDataFrame(results['rows'])else:msg=f"labkey version {get_labkey_version()} not supported, please update to version 1.2.0 or higher"returnmsgexceptExceptionase:print(f"Exception occurred:: {e}")traceback.print_exc()raise
We can query the rare disease tiering_data table by a genomic locus or by a gene. The following query finds all participants with prioritised variants in a gene of interest, getting the variant loci and reference and alternative alleles. Since this is a gene-based query, it will find all rows where the gene name is listed, which will include participant genomes aligned to GRCh37 and GRCh38.
gene="NIPBL"gene_tiering_sql=(f''' SELECT participant_id, phenotype, assembly, chromosome, position, reference, alternate, genotype FROM tiering_data WHERE genomic_feature_hgnc = '{gene}' ''')gene_tiering_query=query_to_df(gene_tiering_sql,version)
We can also query for a genomic locus. In this case we must include the genome assembly as an argument. For completeness, you may wish to also query for the remapped coordinate on GRCh37.
chromosome="5"coordinate="37064560"assembly="GRCh38"locus_tiering_sql=(f''' SELECT participant_id, phenotype, genomic_feature_hgnc, genotype FROM tiering_data WHERE chromosome = '{chromosome}' AND position = '{coordinate}' AND assembly = '{assembly}' ''')locus_tiering_query=labkey_to_df(locus_tiering_sql,version,10000)
chromosome="5"coordinate="37064560"assembly="GRCh38"locus_tiering_sql=(f''' SELECT participant_id, phenotype, genomic_feature_hgnc, genotype FROM tiering_data WHERE chromosome = '{chromosome}' AND position = '{coordinate}' AND assembly = '{assembly}' ''')locus_tiering_query=query_to_df(locus_tiering_sql,version)
chromosome="5"coordinate="37064560"locus_tiering_sql=(f''' SELECT participant_id, phenotype, genomic_feature_hgnc, genotype FROM tiering_data WHERE chromosome = '{chromosome}' AND position = '{coordinate}' ''')locus_tiering_query=labkey_to_df(locus_tiering_sql,version,10000)
The exomiser table allows you to search for participants by locus or gene, just like the tiering table. However, it also contains HGVS notation for variants in the form:
gene_name="MUM1L1"aa_change="Leu7Gln"exomiser_hgvs_sql=(f''' SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype FROM exomiser WHERE hgvs like '%{gene_name}%' AND hgvs like '%{aa_change}%' ''')exomiser_hgvs_query=labkey_to_df(exomiser_hgvs_sql,version,10000)
gene_name="MUM1L1"aa_change="Leu7Gln"exomiser_hgvs_sql=(f''' SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype FROM exomiser WHERE hgvs like '%{gene_name}%' AND hgvs like '%{aa_change}%' ''')exomiser_hgvs_query=query_to_df(exomiser_hgvs_sql,version)
gene_name<-"MUM1L1"aa_change<-"Leu7Gln"exomiser_hgvs_sql<-paste("SELECT participant_id, phenotype, chromosome, position, hgvs, reference, alternate, genotype ","FROM exomiser "," WHERE hgvs like '%",gene_name,"%' AND hgvs like '%",aa_change,"%'",sep="")exomiser_hgvs_query<-labkey_to_df(exomiser_hgvs_sql,version,10000)
gene_name="MUM1L1"aa_change="Leu7Gln"exomiser_hgvs_sql=(f''' SELECT participant_id, phenotype, chromosome, position, hgvs, reference, alternate, genotype FROM exomiser WHERE hgvs like '%{gene_name}%' AND hgvs like '%{aa_change}%' ''')exomiser_hgvs_query=labkey_to_df(exomiser_hgvs_sql,version,10000)
Querying the cancer_tier_and_domain_variants table¶
Now we'll query the cancer_tier_and_domain_variants. This has a similar structure to the rare disease tiering table, so we can query it for a gene and/or region in the same way.
cancer_gene<-"TP53"cancer_gene_tiering_sql=paste("SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '",cancer_gene,"' ",sep="")cancer_gene_tiering_query<-labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_gene="TP53"cancer_gene_tiering_sql=(f''' SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '{cancer_gene}' ''')cancer_gene_tiering_query=labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_gene<-"TP53"cancer_gene_tiering_sql=paste("SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '",cancer_gene,"' ",sep="")cancer_gene_tiering_query<-labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_gene="TP53"cancer_gene_tiering_sql=(f''' SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '{cancer_gene}' ''')cancer_gene_tiering_query=query_to_df(cancer_gene_tiering_sql,version)
Since all cancer genomes are aligned to GRCh38, querying by a locus does not require the assembly argument.
cancer_chromosome<-"17"cancer_coordinate<-"7675166"cancer_locus_tiering_sql<-paste("SELECT participant_id, disease_type, gene FROM cancer_tier_and_domain_variants WHERE chr = '",cancer_chromosome,"' AND pos = '",cancer_coordinate,"' ",sep="")cancer_locus_tiering_query<-labkey_to_df(cancer_locus_tiering_sql,version,10000)
cancer_chromosome<-"17"cancer_coordinate<-"7675166"cancer_locus_tiering_sql<-paste("SELECT participant_id, disease_type, gene FROM cancer_tier_and_domain_variants WHERE chr = '",cancer_chromosome,"' AND pos = '",cancer_coordinate,"' ",sep="")cancer_locus_tiering_query<-labkey_to_df(cancer_locus_tiering_sql,version,10000)
cancer_gene<-"TP53"cancer_gene_tiering_sql=paste("SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '",cancer_gene,"' ",sep="")cancer_gene_tiering_query<-labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_chromosome<-"17"cancer_coordinate<-"7675166"cancer_locus_tiering_sql<-paste("SELECT participant_id, disease_type, gene FROM cancer_tier_and_domain_variants WHERE chr = '",cancer_chromosome,"' AND pos = '",cancer_coordinate,"' ",sep="")cancer_locus_tiering_query<-labkey_to_df(cancer_locus_tiering_sql,version,10000)