AggV2 code book - phenotype queries¶
Due to a probable bug in BCFtools, site QC statistics for Chrom X are incorrect. We advise avoiding the use of FILTER and INFO field data until this can be corrected. All genotype data and the related VEP functional data are unaffected.
The LabKey table
aggregate_gvcf_sample_stats, containing sample-level information on all participants within aggV2, can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs for more information.
The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.
Examples of how to query the
aggregate_gvcf_sample_stats table and integrate with the phenotypic/clinical data of these participants using the R LabKey API (Rlabkey package) are listed below. We make use of the SQL-like syntax to query the tables. You can use RStudio or R in the terminal to run these examples and to do further downstream processing of the data.
The sample IDs in the aggV2 VCFs are the 'plate-keys' (e.g. LP3000448-DNA_A09) - not the participant IDs. In order to link between plate-key and participant ID - use the
aggregate_gvcf_sample_stats table. There is one sample, one participant per row in this table.
This example show how to fetch the entire
aggregate_gvcf_sample_stats table and save as an object in your R session. The table can then be post-filtered in R.
Filtering tables and selecting columns¶
Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the
aggregate_gvcf_sample_stats table for participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We can then select only the participant ID and platekey columns.
Joining multiple tables together to create a cohort¶
The utility of the LabKey APIs comes when joining multiple tables together. In this example, we perform the query above to find participants in aggV2 from the rare disease programme for which the WGS was from a blood sample. We then inner-join these results together with the
rare_diseases_participant_phenotype table to identify participants who are positive for the HPO Term: HP:0000365. We pull back a table of participant ID and plate-key of participants who satisfy the criteria.