Details of all the samples included in somAgg are listed in the LabKey table cancer_analysis. You can use this table to link the samples to their phenotypes in other LabKey tables, using either the LabKey desktop application and the APIs.
The sample IDs in the somAgg VCFs are the 'tumour_sample_platekey' (e.g. LP1234567-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the cancer_analysis table. There is one sample per row in this table.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:
LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:
We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:
sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100000000, but if the output contains exactly the default value, then we suggest increasing this value.
Feel free to include this function in all your scripts and invoke it every time you want to query the data.
Tables can be filtered for certain attributes and specified columns selected in labkey using SQL. In this example, we can filter the aggregate_gvcf_sample_stats table for participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We can then select only the participant ID and platekey columns.
disease<-"COLORECTAL"prep<-"FF"source_sql<-paste("SELECT participant_id, tumour_sample_platekey FROM aggregate_gvcf_sample_stats WHERE disease_type = '",disease,"' AND preparation_method = '",prep,"'",sep="")source_query<-labkey_to_df(source_sql,version,100000)
disease="COLORECTAL"prep="FF"source_sql=(f'''SELECT participant_id, tumour_sample_platekey FROM aggregate_gvcf_sample_stats WHERE disease_type = '{disease}' AND preparation_method = '{prep}' ''')source_query=labkey_to_df(source_sql,version,10000)
disease<-"COLORECTAL"prep<-"FF"source_sql<-paste("SELECT participant_id, tumour_sample_platekey FROM aggregate_gvcf_sample_stats WHERE disease_type = '",disease,"' AND preparation_method = '",prep,"'",sep="")source_query<-labkey_to_df(source_sql,version,100000)
disease="COLORECTAL"prep="FF"source_sql=(f'''SELECT participant_id, tumour_sample_platekey FROM aggregate_gvcf_sample_stats WHERE disease_type = '{disease}' AND preparation_method = '{prep}' ''')source_query=query_to_df(source_sql,version)
Joining multiple tables together to create a cohort¶
The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We then inner-join these results together with the cancer_staging_consolidated table to identify the staging of cancer at WGS. We pull pack a table of participant ID and plate-key of participants who satisfy the criteria.
disease<-"COLORECTAL"prep<-"FF"stage_not<-c('?','6','A','B','U','X','<NA>')source_phenotype_sql<-paste("SELECT ca.participant_id, ca.tumour_sample_platekey, av.stage_best FROM cancer_analysis AS ca INNER JOIN av_tumour AS av ON ca.participant_id = av.participant_id WHERE disease_type = '",disease,"' AND preparation_method = '",prep,"' AND stage_best NOT IN ('",paste(stage_not,collapse="', '")"source_phenotype_query<-labkey_to_df(source_phenotype_sql,version,100000)
disease="COLORECTAL"prep="FF"stage_not<-['?','6','A','B','U','X','<NA>']source_phenotype_sql=(f''' SELECT ca.participant_id, ca.tumour_sample_platekey, av.stage_best FROM cancer_analysis AS ca INNER JOIN av_tumour AS av ON ca.participant_id = av.participant_id WHERE disease_type = '{disease}' AND preparation_method = '{prep}' AND stage_best NOT IN {*stage_not} ''')source_phenotype_query=labkey_to_df(source_phenotype_sql,version,10000)
disease<-"COLORECTAL"prep<-"FF"stage_not<-c('?','6','A','B','U','X','<NA>')source_phenotype_sql<-paste("SELECT ca.participant_id, ca.tumour_sample_platekey, av.stage_best FROM cancer_analysis AS ca INNER JOIN av_tumour AS av ON ca.participant_id = av.participant_id WHERE disease_type = '",disease,"' AND preparation_method = '",prep,"' AND stage_best NOT IN ('",paste(stage_not,collapse="', '")"source_phenotype_query<-labkey_to_df(source_phenotype_sql,version,100000)
disease="COLORECTAL"prep="FF"stage_not<-['?','6','A','B','U','X','<NA>']source_phenotype_sql=(f''' SELECT ca.participant_id, ca.tumour_sample_platekey, av.stage_best FROM cancer_analysis AS ca INNER JOIN av_tumour AS av ON ca.participant_id = av.participant_id WHERE disease_type = '{disease}' AND preparation_method = '{prep}' AND stage_best NOT IN {*stage_not} ''')source_phenotype_query=querßy_to_df(source_phenotype_sql,version)
cancer_staging_consolidated contains other columns with staging information that can be used for sample where stage_best is not available. Also, interval_min can be used to set a maximum time allowed between sample collection and string information collected, which may not coincide.