Skip to content

somAgg code book phenotype queries

Due to a probable bug in BCFtools, site QC statistics for Chrom X are incorrect. We advise avoiding the use of FILTER and INFO field data until this can be corrected. All genotype data and the related VEP functional data are unaffected.

Overview

The LabKey table cancer_analysis, containing sample-level information on all participants within somAgg, can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs for more information. 

The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.

Examples of how to query the cancer_analysis table and integrate with the phenotypic/clinical data of these participants using the R LabKey API (Rlabkey package) are listed here. We make use of the SQL like syntax to query the tables. You can use RStudio or R in the terminal to run these examples and to do further downstream processing of the data. 

The sample IDs in the somAgg VCFs are the 'tumour_sample_platekey' (e.g. LP3000448-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the cancer_analysis table. There is one sample per row in this table. 

LabKey API

Importing the cancer_analysis table

This example show how to fetch the entire cancer_analysis table and save as an object in your R session. The table can then be post-filtered in R. 

library(Rlabkey)

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

ca <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT * FROM cancer_analysis"
)

Filtering tables and selecting columns

Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the cancer_analysis table for participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We can then select only the participant ID and platekey columns. 

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

my_table <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT participant_id, tumour_sample_platekey
         FROM aggregate_gvcf_sample_stats
         WHERE disease_type = 'COLORECTAL'
         AND preparation_method = 'FF'
)

Joining multiple tables together to create a cohort

The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We then inner-join these results together with the cancer_staging_consolidated table to identify the staging of cancer at WGS. We pull pack a table of participant ID and plate-key of participants who satisfy the criteria. 

library(Rlabkey)

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

my_table <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT ca.participant_id, ca.tumour_sample_platekey, av.stage_best
         FROM cancer_analysis
         AS ca
         INNER JOIN av_tumour
         AS av
         ON ca.participant_id = av.participant_id
         WHERE disease_type = 'COLORECTAL'
         AND preparation_method = 'FF'
     AND stage_best NOT IN ('?', '6', 'A', 'B', 'U', 'X', '<NA>')"
)
  • Note that cancer_staging consolidate contains other columns with staging information that can be used for sample where stage_best is not available. Also, interval_min can be used to set a maximum time allowed between sample collection and string information collected, which may not coincide.

Last update: November 27, 2023