somAgg code book phenotype queries¶
Due to a probable bug in BCFtools, site QC statistics for Chrom X are incorrect. We advise avoiding the use of FILTER and INFO field data until this can be corrected. All genotype data and the related VEP functional data are unaffected.
Overview¶
The LabKey table cancer_analysis, containing sample-level information on all participants within somAgg, can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs for more information.
The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.
Examples of how to query the cancer_analysis table and integrate with the phenotypic/clinical data of these participants using the R LabKey API (Rlabkey package) are listed here. We make use of the SQL like syntax to query the tables. You can use RStudio or R in the terminal to run these examples and to do further downstream processing of the data.
The sample IDs in the somAgg VCFs are the 'tumour_sample_platekey' (e.g. LP3000448-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the cancer_analysis table. There is one sample per row in this table.
LabKey API¶
Importing the cancer_analysis table¶
This example show how to fetch the entire cancer_analysis table and save as an object in your R session. The table can then be post-filtered in R.
Filtering tables and selecting columns¶
Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the cancer_analysis table for participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We can then select only the participant ID and platekey columns.
Joining multiple tables together to create a cohort¶
The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in somAgg with colorectal cancer with WGS from fresh-frozen samples. We then inner-join these results together with the cancer_staging_consolidated table to identify the staging of cancer at WGS. We pull pack a table of participant ID and plate-key of participants who satisfy the criteria.
- Note that cancer_staging consolidate contains other columns with staging information that can be used for sample where stage_best is not available. Also, interval_min can be used to set a maximum time allowed between sample collection and string information collected, which may not coincide.