Not all genetic variants can be found using LabKey. The only variants you can see in LabKey are those that have been prioritised as likely causal for the disorder studied. For rare disease, variants are prioritised using tiering and Exomiser, whereas cancer variants are prioritised using tiering, in the tables tiering_data, exomiser and cancer_tier_and_domain_variants respectively for 100k. Rare disease tiering_data and exomiser are also available for NHS GMS genomes.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:
LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:
We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:
sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.
Feel free to include this function in all your scripts and invoke it every time you want to query the data.
We can query the rare disease tiering_data table by a genomic locus or by a gene. The following query finds all participants with prioritised variants in a gene of interest, getting the variant loci and reference and alternative alleles. Since this is a gene-based query, it will find all rows where the gene name is listed, which will include participant genomes aligned to GRCh37 and GRCh38.
We can also query for a genomic locus. In this case we must include the genome assembly as an argument. For completeness, you may wish to also query for the remapped coordinate on GRCh37.
chromosome="5"coordinate="37064560"assembly="GRCh38"locus_tiering_sql=(f''' SELECT participant_id, phenotype, genomic_feature_hgnc, genotype FROM tiering_data WHERE chromosome = '{chromosome}' AND position = '{coordinate}' AND assembly = '{assembly}' ''')locus_tiering_query=labkey_to_df(locus_tiering_sql,version,10000)
The exomiser table allows you to search for participants by locus or gene, just like the tiering table. However, it also contains HGVS notation for variants in the form:
gene_name="MUM1L1"aa_change="Leu7Gln"exomiser_hgvs_sql=(f''' SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype FROM exomiser WHERE hgvs like '%{gene_name}%' AND hgvs like '%{aa_change}%' ''')exomiser_hgvs_query=labkey_to_df(exomiser_hgvs_sql,version,10000)
gene_name<-"MUM1L1"aa_change<-"Leu7Gln"exomiser_hgvs_sql<-paste("SELECT participant_id, phenotype, assembly, chromosome, position, hgvs, reference, alternate, genotype ","FROM exomiser "," WHERE hgvs like '%",gene_name,"%' AND hgvs like '%",aa_change,"%'",sep="")exomiser_hgvs_query<-labkey_to_df(exomiser_hgvs_sql,version,10000)
Querying the cancer_tier_and_domain_variants table¶
Now we'll query the cancer_tier_and_domain_variants. This has a similar structure to the rare disease tiering table, so we can query it for a gene and/or region in the same way.
cancer_gene<-"TP53"cancer_gene_tiering_sql=paste("SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '",cancer_gene,"' ",sep="")cancer_gene_tiering_query<-labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_gene="TP53"cancer_gene_tiering_sql=(f''' SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '{cancer_gene}' ''')cancer_gene_tiering_query=labkey_to_df(cancer_gene_tiering_sql,version,10000)
cancer_gene<-"TP53"cancer_gene_tiering_sql=paste("SELECT participant_id, disease_type, chr, pos, ref, alt FROM cancer_tier_and_domain_variants WHERE gene = '",cancer_gene,"' ",sep="")cancer_gene_tiering_query<-labkey_to_df(cancer_gene_tiering_sql,version,10000)
Since all cancer genomes are aligned to GRCh38, querying by a locus does not require the assembly argument.
cancer_chromosome<-"17"cancer_coordinate<-"7675166"cancer_locus_tiering_sql<-paste("SELECT participant_id, disease_type, gene FROM cancer_tier_and_domain_variants WHERE chr = '",cancer_chromosome,"' AND pos = '",cancer_coordinate,"' ",sep="")cancer_locus_tiering_query<-labkey_to_df(cancer_locus_tiering_sql,version,10000)
cancer_chromosome<-"17"cancer_coordinate<-"7675166"cancer_locus_tiering_sql<-paste("SELECT participant_id, disease_type, gene FROM cancer_tier_and_domain_variants WHERE chr = '",cancer_chromosome,"' AND pos = '",cancer_coordinate,"' ",sep="")cancer_locus_tiering_query<-labkey_to_df(cancer_locus_tiering_sql,version,10000)
Again, the structure is very similar to that in 100k, but we don't need to fetch the assembly as all genomes are aligned to GRCh38. We have to make a new version object for the NHS GMS database.