Skip to content

AggV2 code book - phenotype queries

Due to a probable bug in BCFtools, site QC statistics for Chrom X are incorrect. We advise avoiding the use of FILTER and INFO field data until this can be corrected. All genotype data and the related VEP functional data are unaffected.

Overview

The LabKey table aggregate_gvcf_sample_stats, containing sample-level information on all participants within aggV2, can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs for more information.

The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.

Examples of how to query the aggregate_gvcf_sample_stats table and integrate with the phenotypic/clinical data of these participants using the R LabKey API (Rlabkey package) are listed below. We make use of the SQL-like syntax to query the tables. You can use RStudio or R in the terminal to run these examples and to do further downstream processing of the data.

aggregate_gvcf_sample_stats

The sample IDs in the aggV2 VCFs are the 'plate-keys' (e.g. LP3000448-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the aggregate_gvcf_sample_stats table. There is one sample, one participant per row in this table.

LabKey API

Importing the aggregate_gvcf_sample_stats table

This example show how to fetch the entire aggregate_gvcf_sample_stats table and save as an object in your R session. The table can then be post-filtered in R.

library(Rlabkey)

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

aggregate_gvcf_sample_stats <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT * FROM aggregate_gvcf_sample_stats"
)

Filtering tables and selecting columns

Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the aggregate_gvcf_sample_stats table for participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We can then select only the participant ID and platekey columns.

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

my_table <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT participant_id, platekey
         FROM aggregate_gvcf_sample_stats
         WHERE type = 'rare disease germline'
         AND sample_source = 'BLOOD'"
)

Joining multiple tables together to create a cohort

The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We then inner-join these results together with the rare_diseases_participant_phenotype table to identify participants who are positive for the HPO Term: HP:0000365. We pull back a table of participant ID and plate-key of participants who satisfy the criteria.

library(Rlabkey)

labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

main_programme <- "/main-programme/main-programme_v17_2023-03-30"

my_table <- labkey.executeSql(                           
  schemaName ="lists", colNameOpt ="rname", maxRows = 100000000, folderPath = main_programme,
  sql = "SELECT agg.participant_id, agg.platekey
         FROM aggregate_gvcf_sample_stats
         AS agg
         INNER JOIN rare_diseases_participant_phenotype
         AS phe
         ON agg.participant_id = phe.participant_id
         WHERE agg.type = 'rare disease germline'
         AND agg.sample_type = 'BLOOD'
         AND phe.hpo_id = 'HP:0000365'
         AND phe.hpo_present = 'Yes'"
)

Last update: November 3, 2023