Skip to content

Building rare disease cohorts programmatically

Give us feedback on this tutorial

There are a huge number of tables that you can use to build rare disease cohorts. This tutorial will take you through some of the tables and give you some example scripts that you can use in python or R, or using R in an interactive session on CloudOS, to access the data.

The advantage of building cohorts programmatically is that the underlying tables are exposed, allowing you to verify the data in bulk. This also allows you to do very complex queries. You can save and tweak scripts to re-use in new releases or for similar queries.

Tables to query

To build cohorts, you might query the following tables:

Table Fields Details
rare_diseases_participant_disease normalised_specific_disease, normalised_disease_sub_group, normalised_disease_group Normalised terms for the recruited disease
rare_disease_participant_phenotype hpo_term, hpo_present The table contains all the HPO terms that were assessed, make sure you also filter by whether those terms were present.
hes_op, hes_apc, hes_ae diag_all All diagnoses from hospital episode statistics as ICD10 codes
hes_op, hes_apc opertn_all All treatments from hospital episode statistics as OPCS codes
gmc_exit_questionnaire case_solved_family Find families with solved or unsolved cases
submitted_diagnostic discovery n/a Find participants whose cases have been solved by other researchers
mortality icd10_multiple_cause_all Cause of death as ICD10 codes
cancer_analysis disease_type Exclude participants from the control with related cancers
participant_summary yob, genetically_inferred_ancestry_thr, participant_karyotyped_sex Age, ethnicity and sex
aggregate_gvcf_sample_stats sample_source, sample_preparation_method, sample_library_type Filter by sampling details
genome_file_paths_and_types platekey, filename, file_path, file_sub_type Get paths for genomic files

Setting up access

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.

To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:

conda create -n omop-source r-glue r-tidyverse r-data.table r-dbi r-rpostgres r-irkernel -y

You can now launch an R notebook under the omop-source kernel. More details

Contents:

Import modules/libraries you need

LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:

1
2
3
library(tidyverse)
library(Rlabkey)
library(readr)
1
2
3
4
import numpy as np
import functools
import labkey
import pandas as pd
1
2
3
4
5
library(tidyverse)
library(data.table)
library(glue)
library(RPostgres)
library(readr)

Helper function to access the LabKey API

We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:

  • sql_query: an SQL query to access the LabKey tables
  • database: the version of the database you want to access for example /main-programme/main-programme_v17_2023-03-30
  • maxrows: maximum number of rows you want to return. This parameter defaults to 100,000,000, but if the output contains exactly the default value, then we suggest increasing this value.

Feel free to include this function in all your scripts and invoke it every time you want to query the data.

labkey_to_df <- function(sql_query, database, maxrows){

    labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")

    labkey.executeSql(folderPath = database,
                      schemaName = "lists",
                      colNameOpt = "rname",
                      sql = sql_query,
                      maxRows = maxrows) %>%
        mutate(across(everything(), as.character))
}
def labkey_to_df(sql_query, database, maxrows):

ver = labkey.__version__

if ver == '1.2.0' or ver == '1.4.0' or ver == '1.4.1':
  server_context = labkey.utils.create_server_context(
      domain = "labkey-embassy.gel.zone",
      container_path = database,
      context_path = "labkey",
      use_ssl = True
  )

  results =  labkey.query.execute_sql(
      server_context,
      schema_name = "lists",
      sql = sql_query,
      max_rows = maxrows
  )

if ver == '2.4.0':
  from labkey.api_wrapper import APIWrapper

  labkey_server = "labkey-embassy.gel.zone"
  project_name = database
  contextPath = "labkey"
  schema = 'lists'
  api = APIWrapper(
    labkey_server,
    project_name,
    contextPath,
    use_ssl=True
    )
  results = api.query.execute_sql(sql=sql_query, schema_name=schema, max_rows = maxrows)

return(pd.DataFrame(results['rows']))
labkey_to_df <- function(sql_query, database, maxrows){
  DBNAME = "gel_clinical_cb_sql_pro"
  HOST = "clinical-cb-sql-pro.cfe5cdx3wlef.eu-west-2.rds.amazonaws.com"
  PORT = 5432
  PASSWORD = 'anXReTz36Q5r'
  USER = 'jupyter_notebook'

  connection <- DBI::dbConnect(
      RPostgres::Postgres(),
      dbname = DBNAME,
      host = HOST,
      port = PORT,
      password = PASSWORD,
      user = USER,
      options = paste("-c search_path=", version, sep="")
      )

  dbGetQuery(connection, sql_query, n = maxrows)
  }

To run my queries, I'll need to set up my database version:

version <- "/main-programme/main-programme_v17_2023-03-30"
version = "/main-programme/main-programme_v17_2023-03-30"
version <- "source_data_100kv16_covidv4"

Case cohort

Recruited disease

We can find participants recruited for a particular disease in the rare_diseases_participant_disease table. Here we define an SQL query. We will use as our example ADPKD (autosomal dominant polycystic kidney disease).

We need to start by searching for the normalised_specific_disease "Cystic kidney disease".

disease <- "Cystic kidney disease"

recruited_sql <- paste("SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_specific_disease = '",
    disease, "'", sep="")

recruited_query <- labkey_to_df(recruited_sql, version, 100000)
disease = "Cystic kidney disease"

recruited_sql = (f'''
    SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_specific_disease = '{disease}'
    ''')
recruited_query = labkey_to_df(recruited_sql, version, 10000)
disease <- "Cystic kidney disease"

recruited_sql <- paste("SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_specific_disease = '",
    disease, "'", sep="")

recruited_query <- labkey_to_df(recruited_sql, version, 100000)

We will create a list of the participants found with this recruited disease:

HPO terms

We can also find participants who have HPO terms linked to our phenotype of interest. You should independently research a relevant list of HPO terms for your phenotype.

For ADPKD we will query the rare_diseases_participant_phenotype table for the following HPO terms:

  • Polycystic kidney dysplasia (HP:0000113)
  • Multiple renal cysts (HP:0005562)

Make sure you include the column hpo_present in any query, as this table contains all HPO terms that were checked in the participant, not just those that are present.

1
2
3
4
5
6
7
8
9
hpo_codes <- c("HP:0000113", "HP:0005562")

hpo_sql <- paste("SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN ('", paste(hpo_codes, collapse = "', '"),
    "') AND hpo_present = 'Yes'",
    sep = "")

hpo_query <- labkey_to_df(hpo_sql, version, 10000)
hpo_codes = ["HP:0000113", "HP:0005562"]

hpo_sql = (f'''
    SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN {*hpo_codes,}
    AND hpo_present = 'Yes'
    ''')

hpo_query = labkey_to_df(hpo_sql, version, 10000)
1
2
3
4
5
6
7
8
9
hpo_codes <- c("HP:0000113", "HP:0005562")

hpo_sql <- paste("SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN ('", paste(hpo_codes, collapse = "', '"),
    "') AND hpo_present = 'Yes'",
    sep = "")

hpo_query <- labkey_to_df(hpo_sql, version, 100000)

ICD10 codes

Lastly, we will search the hospital episode statistics to identify participants with diagnoses in their medical records that indicate your phenotype. Again, you should research the list of ICD10 codes independently.

For ADPKD we will search for the following codes:

  • N28.1 (cyst of kidney)
  • Q61 (cystic kidney disease) 
  • Q61.2 (polycystic kidney, autosomal dominant)

There are three tables of hospital episode statistics with ICD-10 in the GEL data: - hes_apc - hes_op - hes_ae

(There is also the hes_cc table, but this does not include diagnoses.)

icd_codes <- c('N281', 'Q61', 'Q612')
hes_tables = c("apc", "op", "ae")

icd_concat <- data.frame()

diag_statement <- paste((icd_codes), collapse = "%' OR diag_all LIKE '%")

for (hes_table in hes_tables) {
    sqlstr = paste(
        "SELECT participant_id, diag_all
        FROM hes_", hes_table,
        " WHERE diag_all LIKE '%", diag_statement, "%'", sep = "")
    icd_query <- labkey_to_df(sqlstr, version, 1000000)
    if (nrow(icd_query) > 0){
        icd_concat = rbind(icd_query, icd_concat)
    }
  }
icd_codes = ['N281', 'Q61', 'Q612']
hes_tables = ["apc", "op", "ae"]

icd_concat = pd.DataFrame()

diag_statement = "%' OR diag_all LIKE '%".join(icd_codes)

for hes_table in hes_tables:
    sqlstr = (
        f'''
        SELECT participant_id, diag_all
        FROM hes_{hes_table}
        WHERE diag_all LIKE '%{diag_statement}%'
        ''')
    icd_query = labkey_to_df(sqlstr, version, 1000000)
    if not icd_query.empty:
        icd_concat = pd.concat([icd_query, icd_concat])
icd_codes <- c('N281', 'Q61', 'Q612')
hes_tables = c("apc", "op", "ae")

icd_concat <- data.frame()

diag_statement <- paste((icd_codes), collapse = "%' OR diag_all LIKE '%")

for (hes_table in hes_tables) {
    sqlstr = paste(
        "SELECT participant_id, diag_all
        FROM hes_", hes_table,
        " WHERE diag_all LIKE '%", diag_statement, "%'", sep = "")
    icd_query <- labkey_to_df(sqlstr, version, 100000)
    if (nrow(icd_query) > 0){
        icd_concat = rbind(icd_query, icd_concat)
    }
  }

You can create a complete cohort by pulling out the lists of participants from the three queries above, combining the lists together and removing all duplicates.

Unsolved cases

We can find cases that have been solved, and this has been approved by the GLHs, in the gmc_exit_questionnaire table.

The following SQL query uses a list of participants (rd_participants) and checks to see if the case is solved. The code has been written to see if case_solved_family is not yes as opposed to checking if it is no, as this will bring back families where the case is partially solved or if this is unknown. Alternatively, you might want solved cases only for eligibility to participate in a clinical trial.

1
2
3
4
5
6
7
8
9
solved <- "yes"

gmc_sql <- paste(
    "SELECT participant_id, case_solved_family
    FROM gmc_exit_questionnaire
    WHERE participant_id in (", participants,
    ") AND case_solved_family != '", solved, "'", sep = "")

gmc_query <- labkey_to_df(gmc_sql, version, 1000)
solved = "yes"

gmc_sql = (f'''
    SELECT participant_id, case_solved_family
    FROM gmc_exit_questionnaire
    WHERE participant_id in {*participants,}
    AND case_solved_family != '{solved}'
    ''')

gmc_query = labkey_to_df(gmc_sql, version, 1000)
1
2
3
4
5
6
7
8
9
solved <- "yes"

gmc_sql <- paste(
    "SELECT participant_id, case_solved_family
    FROM gmc_exit_questionnaire
    WHERE participant_id in (", participants,
    ") AND case_solved_family != '", solved, "'", sep = "")

gmc_query <- labkey_to_df(gmc_sql, version, 100000)

Cases that have been solved by RE researchers, but this solution has not yet been approved by the GLHs, can be found in the submitted_diagnostic_discovery table. We will now query our list to ensure that none appear in this table.

1
2
3
4
5
6
7
dd_sql <- paste("SELECT participant_id
    FROM submitted_diagnostic_discovery
    WHERE participant_id IN (",
    toString(gmc_query$participant_id),
    ")", sep ="")

dd_query <- labkey_to_df(dd_sql, version, 1000)
1
2
3
4
5
6
7
dd_sql = (f'''
    SELECT participant_id
    FROM submitted_diagnostic_discovery
    WHERE participant_id IN {*gmc_query['participant_id'].tolist(),}
    ''')

dd_query = labkey_to_df(dd_sql, version, 1000)
1
2
3
4
5
6
7
dd_sql <- paste("SELECT participant_id
    FROM submitted_diagnostic_discovery
    WHERE participant_id IN (",
    toString(gmc_query$participant_id),
    ")", sep ="")

dd_query <- labkey_to_df(dd_sql, version, 100000)

We can now generate a list of participants who match our cohort query and do not yet have a genetic diagnosis.

cohort <- as.list((gmc_query %>% filter(!participant_id %in% dd_query$participant_id))$participant_id)
cohort = [i for i in gmc_query['participant_id'].tolist() if i not in dd_query['participant_id'].tolist()]
cohort <- as.list((gmc_query %>% filter(!participant_id %in% dd_query$participant_id))$participant_id)

Control cohort

NOT phenotype

To build our control cohort, we need to exclude all participants who have the phenotype we're interested in. It is best to expand our criteria beyond that that we used to create cohort so that we exclude all phenotypes related to the phenotype of interest. We're going to query many of the same tables as before but at a higher level:

  • For recruited disease, we will query a higher level, the disease_group in rare_diseases_participant_disease for an umbrella term.
  • To exclude participants by HPO terms, we will use the terms we built our cohort with, plus additional more general terms.
  • To exclude participants by events in their medical records, we will exclude:
    • all participants with an expanded list of ICD10-codes in their diagnoses in the hes tables
    • All participants who have received treatments relating to the disease of interest
    • All participants with related causes of death, who may never have been diagnosed with the disease of interest in their lifetime
  • We will also exclude all participants with kidney cancer, by searching the cancer tables for "RENAL"

First we will make a list of all participants who DO have these diagnoses, codes and causes of death.

NOT recuited disease

Starting with recruited rare disease, where we will query normalised_disease_group for a general term: "Renal and urinary tract disorders".

not_disease <- "Renal and urinary tract disorders"

recruited_not_sql <- paste("SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_disease_group = '",
    not_disease, "'", sep="")

recruited_not_query <- labkey_to_df(recruited_not_sql, version, 100000)
recruited_not_list <- as.list( recruited_not_query$participant_id)
not_disease = "Renal and urinary tract disorders"

recruited_not_sql = (f'''
    SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_disease_group = '{not_disease}'
    ''')

recruited_not_query = labkey_to_df(recruited_not_sql, version, 100000)
recruited_not_list = recruited_not_query['participant_id'].tolist()
not_disease <- "Renal and urinary tract disorders"

recruited_not_sql <- paste("SELECT participant_id,
        normalised_specific_disease,
        normalised_disease_sub_group,
        normalised_disease_group
    FROM rare_diseases_participant_disease
    WHERE normalised_disease_group = '",
    not_disease, "'", sep="")

recruited_not_query <- labkey_to_df(recruited_not_sql, version, 100000)
recruited_not_list <- as.list( recruited_not_query$participant_id)

NOT HPO phenotypes

Now we search the rare_diseases_participant_phenotype table for the expanded list of HPO terms:

  • Polycystic kidney dysplasia (HP:0000113)
  • Multicystic kidney dysplasia (HP:0000003)
  • Multiple renal cysts (HP:0005562)
  • Abnormality of the kidney (HP:0000077)
hpo_not_codes <- c("HP:0000113", "HP:0000003", "HP:0005562", "HP:0000077")

hpo_not_sql <- paste("SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN ('", paste(hpo_not_codes, collapse = "', '"),
    "') AND hpo_present = 'Yes'",
    sep = "")

hpo_not_query <- labkey_to_df(hpo_not_sql, version, 100000)
hpo_not_list <- as.list( hpo_not_query$participant_id)
hpo_not_codes = ["HP:0000113", "HP:0000003", "HP:0005562", "HP:0000077"]

hpo_not_sql = (f'''
    SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN {*hpo_not_codes,}
    AND hpo_present = 'Yes'
    ''')

hpo_not_query = labkey_to_df(hpo_not_sql, version, 100000)
hpo_not_list = hpo_not_query['participant_id'].tolist()
hpo_not_codes <- c("HP:0000113", "HP:0000003", "HP:0005562", "HP:0000077")

hpo_not_sql <- paste("SELECT participant_id, hpo_id, hpo_term
    FROM rare_diseases_participant_phenotype
    WHERE hpo_id IN ('", paste(hpo_not_codes, collapse = "', '"),
    "') AND hpo_present = 'Yes'",
    sep = "")

hpo_not_query <- labkey_to_df(hpo_not_sql, version, 100000)
hpo_not_list <- as.list( hpo_not_query$participant_id)

NOT diagnoses, treatments or causes of death in secondary clinical data

We will search for diagnoses in the secondary clinical data, again using an expanded list of terms:

  • N19 Unspecified renal failure
  • Y84.1 Kidney dialysis
  • N18.5 Chronic kidney disease, stage 5
  • N28.1 (cyst of kidney)
  • Q61 (cystic kidney disease) 
  • Q61.2 (polycystic kidney, autosomal dominant) 
  • Q61.3 (polycystic kidney, unspecified) 
  • Q61.8 (other cystic kidney diseases)
  • Q61.9 (cystic kidney disease, unspecified)
icd_not_codes <- c('N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619', 'N19', 'Y841', 'N185')
hes_tables = c("apc", "op", "ae")

icd_not_concat <- data.frame()

diag_not_statement <- paste((icd_not_codes), collapse = "%' OR diag_all LIKE '%")

for (hes_table in hes_tables) {
    sqlstr = paste(
        "SELECT participant_id, diag_all
        FROM hes_", hes_table,
        " WHERE diag_all LIKE '%", diag_not_statement, "%'", sep = "")
    icd_query <- labkey_to_df(sqlstr, version, 1000000)
    if (nrow(icd_query) > 0){
        icd_not_concat <- rbind(icd_query, icd_not_concat)
    }
  }

icd_not_list <- as.list( icd_not_concat$participant_id)
icd_not_codes = ['N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619', 'N19', 'Y841', 'N185']

hes_tables = ["apc", "op", "ae"]

icd_not_concat = pd.DataFrame()

diag_not_statement = "%' OR diag_all LIKE '%".join(icd_not_codes)

for hes_table in hes_tables:
    sqlstr = (
        f'''
        SELECT participant_id, diag_all
        FROM hes_{hes_table}
        WHERE diag_all LIKE '%{diag_not_statement}%'
        ''')
    icd_query = labkey_to_df(sqlstr, version, 1000000)
    if not icd_query.empty:
        icd_not_concat = pd.concat([icd_query, icd_not_concat])

icd_not_list = icd_not_concat['participant_id'].tolist()
icd_not_codes <- c('N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619', 'N19', 'Y841', 'N185')
hes_tables = c("apc", "op", "ae")

icd_not_concat <- data.frame()

diag_not_statement <- paste((icd_not_codes), collapse = "%' OR diag_all LIKE '%")

for (hes_table in hes_tables) {
    sqlstr = paste(
        "SELECT participant_id, diag_all
        FROM hes_", hes_table,
        " WHERE diag_all LIKE '%", diag_not_statement, "%'", sep = "")
    icd_query <- labkey_to_df(sqlstr, version, 100000)
    if (nrow(icd_query) > 0){
        icd_not_concat <- rbind(icd_query, icd_not_concat)
    }
  }

icd_not_list <- as.list( icd_not_concat$participant_id)

We will search the same hes tables for treatments relating to kidney disease:

  • M01 Transplantation of kidney
  • M01.1 Autotransplantation of kidney
  • M01.2 Allotransplantation of kidney from live donor
  • M01.3 Allotransplantation of kidney from cadaver NEC
  • M01.4 Allotransplantation of kidney from cadaver heart beating
  • M01.5 Allotransplantation of kidney from cadaver heart non-beating
  • M01.8 Other specified transplantation of kidney
  • M01.9 Unspecified transplantation of kidney
  • X40.1 Renal dialysis
not_opcs <- c('M01', 'M011', 'M012', 'M013', 'M014', 'M015', 'M018', 'M019', 'X401')
opcs_hes = c("apc", "op")

opcs_not_concat <- data.frame()
opcs_not_statement <- paste((not_opcs), collapse = "%' OR opertn_all LIKE '%")

for (hes_table in opcs_hes) {
    sqlstr = paste(
        "SELECT participant_id, opertn_all
        FROM hes_", hes_table,
        " WHERE opertn_all LIKE '%", opcs_not_statement, "%'", sep = "")
    opcs_query <- labkey_to_df(sqlstr, version, 1000000)
    if (nrow(opcs_query) > 0){
        opcs_not_concat = rbind(opcs_query, opcs_not_concat)
    }
  }

opcs_not_list <- as.list( opcs_not_concat$participant_id)
not_opcs = ['M01', 'M011', 'M012', 'M013', 'M014', 'M015', 'M018', 'M019', 'X401']
opcs_hes = ["apc", "op"]

opcs_not_statement = "%' OR opertn_all LIKE '%".join(not_opcs)
opcs_not_concat = pd.DataFrame()

for hes_table in opcs_hes:
    sqlstr = (
        f'''
        SELECT participant_id, opertn_all
        FROM hes_{hes_table}
        WHERE opertn_all LIKE '%{opcs_not_statement}%'
        ''')
    opcs_query = labkey_to_df(sqlstr, version, 1000000)
    if not opcs_query.empty:
        opcs_not_concat = pd.concat([opcs_query, opcs_not_concat])

opcs_not_list = opcs_not_concat['participant_id'].tolist()
not_opcs <- c('M01', 'M011', 'M012', 'M013', 'M014', 'M015', 'M018', 'M019', 'X401')
opcs_hes = c("apc", "op")

opcs_not_concat <- data.frame()
opcs_not_statement <- paste((not_opcs), collapse = "%' OR opertn_all LIKE '%")

for (hes_table in opcs_hes) {
    sqlstr = paste(
        "SELECT participant_id, opertn_all
        FROM hes_", hes_table,
        " WHERE opertn_all LIKE '%", opcs_not_statement, "%'", sep = "")
    opcs_query <- labkey_to_df(sqlstr, version, 100000)
    if (nrow(opcs_query) > 0){
        opcs_not_concat = rbind(opcs_query, opcs_not_concat)
    }
  }

opcs_not_list <- as.list( opcs_not_concat$participant_id)

We also want to exclude participants who have died of kidney related diseases, so we will use the mortality table to exclude:

  • N19 Unspecified renal failure
  • N28.1 Cyst of kidney
  • Q61 Cystic kidney disease
  • Q61.2 Polycystic kidney, autosomal dominant
  • Q61.3 Polycystic kidney, unspecified
  • Q61.8 Other cystic kidney diseases
  • Q61.9 Cystic kidney disease, unspecified
not_death_icd = c('N19', 'N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619')

mortality_not_statement <- paste((not_death_icd), collapse = "%' OR icd10_multiple_cause_all LIKE '%")

mortality_not_sql <- paste(
        "SELECT participant_id, icd10_multiple_cause_all
        FROM mortality
        WHERE icd10_multiple_cause_all LIKE '%", mortality_not_statement, "%'", sep = "")

mortality_not_query <- labkey_to_df(mortality_not_sql, version, 100000)
mortality_not_list <- as.list( mortality_not_query$participant_id)
length(mortality_not_list)
not_death_icd = ['N19', 'N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619']

mortality_not_statement = "%' OR icd10_multiple_cause_all LIKE '%".join(not_death_icd)
mortality_not_sql = (
    f'''
    SELECT participant_id, icd10_multiple_cause_all
    FROM mortality
    WHERE icd10_multiple_cause_all LIKE '%{mortality_not_statement}%'
    ''')
mortality_not_query = labkey_to_df(mortality_not_sql, version, 10000)
mortality_not_list = mortality_not_query['participant_id'].tolist()
not_death_icd = c('N19', 'N281', 'Q61', 'Q612', 'Q613', 'Q618', 'Q619')

mortality_not_statement <- paste((not_death_icd), collapse = "%' OR icd10_multiple_cause_all LIKE '%")

mortality_not_sql <- paste(
        "SELECT participant_id, icd10_multiple_cause_all
        FROM mortality
        WHERE icd10_multiple_cause_all LIKE '%", mortality_not_statement, "%'", sep = "")

mortality_not_query <- labkey_to_df(mortality_not_sql, version, 100000)
mortality_not_list <- as.list( mortality_not_query$participant_id)
length(mortality_not_list)

We will exclude all cancer participants with related cancers.

1
2
3
4
5
6
7
cancer_not_type <- "RENAL"
cancer_not_sql <- paste("SELECT   participant_id, disease_type, study_abbreviation ",
 "FROM cancer_analysis ",
 "WHERE disease_type='", cancer_not_type, "'", sep="")

cancer_not_query <- labkey_to_df(cancer_not_sql, version, 100000)
cancer_not_list <- as.list( cancer_not_query$participant_id)
1
2
3
4
5
6
7
8
cancer_not_type = "RENAL"
cancer_not_sql = (f'''
    SELECT participant_id, disease_type, study_abbreviation
    FROM cancer_analysis
    WHERE disease_type='{cancer_not_type}'
    ''')
cancer_not_query = labkey_to_df(cancer_not_sql, version, 10000)
cancer_not_list = cancer_not_query['participant_id'].tolist()
1
2
3
4
5
6
7
cancer_not_type <- "RENAL"
cancer_not_sql <- paste("SELECT   participant_id, disease_type, study_abbreviation ",
 "FROM cancer_analysis ",
 "WHERE disease_type='", cancer_not_type, "'", sep="")

cancer_not_query <- labkey_to_df(cancer_not_sql, version, 100000)
cancer_not_list <- as.list( cancer_not_query$participant_id)

Combine and exclude

You can now combine all these into a single list. Then pull out all the participants from the participant table and remove your exclusion list to give only the participants who do NOT match the criteria. Note that you must pull the whole participant list then excluded afterwards, rather than including your exclusion list in your SQL statement, as this SQL would be too long to be processed.

Match demographics

You can match on age and sex. To do this, you can the year of birth for the case and control cohorts, and calculate their mean years of birth. You can also pull out the phenotyped and genotyped sex.

case_sql <- paste("SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN (", toString(cohort), ")",
    sep = "")
case_query <- unique(labkey_to_df(case_sql, version, 10000))
case_yob_mean = mean(strtoi(case_query$yob))

control_sql <- paste("SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN (", toString(control), ")",
    sep = "")
control_query <- unique(labkey_to_df(control_sql, version, 10000))
control_yob_mean = mean(strtoi(control_query$yob))

cat("case mean = ", case_yob_mean, "\ncontrol mean = ", control_yob_mean)

cat("case XX = ", sum(case_query$participant_karyotyped_sex == "XX")/nrow(case_query),
    "\ncase Female = ", sum(case_query$participant_phenotyped_sex == "Female")/nrow(case_query),
    "\ncase XY = ", sum(case_query$participant_karyotyped_sex == "XY")/nrow(case_query),
    "\ncase Male = ", sum(case_query$participant_phenotyped_sex == "Male")/nrow(case_query),
    "\ncontrol XX = ", sum(control_query$participant_karyotyped_sex == "XX")/nrow(control_query),
    "\ncontrol Female = ", sum(control_query$participant_phenotyped_sex == "Female")/nrow(control_query),
    "\ncontrol XY = ", sum(control_query$participant_karyotyped_sex == "XY")/nrow(control_query),
    "\ncontrol Male = ", sum(control_query$participant_phenotyped_sex == "Male")/nrow(control_query))
from statistics import mean

case_sql = (f'''
    SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN {*cohort,}
    ''')
case_query = labkey_to_df(case_sql, version, 10000).drop_duplicates()
case_yob_mean = mean(case_query['yob'].tolist())

control_sql = (f'''
    SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN {*control,}
    ''')
control_query = labkey_to_df(control_sql, version, 100000).drop_duplicates()
control_yob_mean = mean(control_query['yob'].tolist())

print("case mean = ", case_yob_mean,
      "\ncontrol mean = ", control_yob_mean)

print("case XX = ", case_query['participant_karyotyped_sex'].value_counts()['XX']/len(case_query),
     "\ncase Female = ", case_query['participant_phenotyped_sex'].value_counts()['Female']/len(case_query),
     "\ncase XY = ", case_query['participant_karyotyped_sex'].value_counts()['XY']/len(case_query),
     "\ncase Male = ", case_query['participant_phenotyped_sex'].value_counts()['Male']/len(case_query),
     "\ncontrol XX = ", control_query['participant_karyotyped_sex'].value_counts()['XX']/len(control_query),
     "\ncontrol Female = ", control_query['participant_phenotyped_sex'].value_counts()['Female']/len(control_query),
     "\ncontrol XY = ", control_query['participant_karyotyped_sex'].value_counts()['XY']/len(control_query),
     "\ncontrol Male = ", control_query['participant_phenotyped_sex'].value_counts()['Male']/len(control_query))
case_sql <- paste("SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM key_columns
    WHERE participant_id IN (", toString(cohort), ")",
    sep = "")
case_query <- unique(labkey_to_df(case_sql, version, 100000))
case_yob_mean = mean(strtoi(case_query$yob))

control_sql <- paste("SELECT participant_id, yob,
    participant_phenotyped_sex, participant_karyotyped_sex
    FROM key_columns
    WHERE participant_id IN (", toString(control), ")",
    sep = "")
control_query <- unique(labkey_to_df(control_sql, version, 100000))
control_yob_mean = mean(strtoi(control_query$yob))

cat("case mean = ", case_yob_mean, "\ncontrol mean = ", control_yob_mean)

cat("case XX = ", sum(case_query$participant_karyotyped_sex == "XX")/nrow(case_query),
    "\ncase Female = ", sum(case_query$participant_phenotyped_sex == "Female")/nrow(case_query),
    "\ncase XY = ", sum(case_query$participant_karyotyped_sex == "XY")/nrow(case_query),
    "\ncase Male = ", sum(case_query$participant_phenotyped_sex == "Male")/nrow(case_query),
    "\ncontrol XX = ", sum(control_query$participant_karyotyped_sex == "XX")/nrow(control_query),
    "\ncontrol Female = ", sum(control_query$participant_phenotyped_sex == "Female")/nrow(control_query),
    "\ncontrol XY = ", sum(control_query$participant_karyotyped_sex == "XY")/nrow(control_query),
    "\ncontrol Male = ", sum(control_query$participant_phenotyped_sex == "Male")/nrow(control_query))

Compare the mean years of birth and sex ratios, to see if they are suitably close to continue

To ensure that the cohort is ethnically matched, we will filter to include only a single ethnicity for both case and control. In this case we will use European ethnicity. There are three columns on ethnicity: genetically_inferred_ancestry_thr which gives the ancestry that the participant has ≥ 0.8 similarity to, genetically_inferred_ancestry_nothr the ancestry the participant has the highest similarity and genetically_inferred_ancestry_value the value of the highest similarity.

Stated and genomic ethnicity do not necessarily agree, so we recommend using the genomic ethnicity for cohort building.

eth <- "European"

case_ethnicity_sql <- paste("SELECT participant_id, genetically_inferred_ancestry_thr
    FROM participant_summary
    WHERE genetically_inferred_ancestry_thr = '", eth, "'
    AND participant_id IN (", toString(cohort), ")", sep ="")
case_ethnicity_query <- unique(labkey_to_df(case_ethnicity_sql, version, 10000))
filtered_case <- case_ethnicity_query$participant_id

control_ethnicity_sql <- paste("SELECT participant_id, genetically_inferred_ancestry_thr
    FROM participant_summary
    WHERE genetically_inferred_ancestry_thr = '", eth, "'
    AND participant_id IN (", toString(control), ")", sep ="")
control_ethnicity_query <- unique(labkey_to_df(control_ethnicity_sql, version, 100000))
filtered_control <- control_ethnicity_query$participant_id
eth = "European"

case_ethnicity_sql = (f'''
    SELECT participant_id, genetically_inferred_ancestry_thr
    FROM participant_summary
    WHERE genetically_inferred_ancestry_thr = '{eth}'
    AND participant_id IN {*cohort,}
    ''')
case_ethnicity_query = labkey_to_df(case_ethnicity_sql, version, 10000).drop_duplicates()
filtered_case = case_ethnicity_query['participant_id'].tolist()

control_ethnicity_sql = (f'''
    SELECT participant_id, genetically_inferred_ancestry_thr
    FROM participant_summary
    WHERE genetically_inferred_ancestry_thr = '{eth}'
    AND participant_id IN {*control,}
    ''')
control_ethnicity_query = labkey_to_df(control_ethnicity_sql, version, 100000).drop_duplicates()
filtered_control = control_ethnicity_query['participant_id'].tolist()
eth <- "European"

case_ethnicity_sql <- paste("SELECT participant_id, genetically_inferred_ancestry_thr
    FROM key_columns
    WHERE genetically_inferred_ancestry_thr = '", eth, "'
    AND participant_id IN (", toString(cohort), ")", sep ="")
case_ethnicity_query <- unique(labkey_to_df(case_ethnicity_sql, version, 100000))
filtered_case <- case_ethnicity_query$participant_id

control_ethnicity_sql <- paste("SELECT participant_id, genetically_inferred_ancestry_thr
    FROM key_columns
    WHERE genetically_inferred_ancestry_thr = '", eth, "'
    AND participant_id IN (", toString(control), ")", sep ="")
control_ethnicity_query <- unique(labkey_to_df(control_ethnicity_sql, version, 100000))
filtered_control <- control_ethnicity_query$participant_id

General inclusion criteria

We're now going to filter both case and control cohorts by some general criteria. First we will join the two cohorts to a single dataframe with case or control added as column.

1
2
3
4
case_ethnicity_query$case <- "case"
control_ethnicity_query$case <- "control"

case_control_table <- subset( rbind(case_ethnicity_query, control_ethnicity_query), select = c(participant_id, case))
1
2
3
4
5
6
case_df = pd.DataFrame(filtered_case, columns = ['participant_id'])
case_df['case'] = "case"
control_df = pd.DataFrame(filtered_control, columns = ['participant_id'])
control_df['case'] = "control"

case_control_table = pd.concat([case_df, control_df]).drop_duplicates()
1
2
3
4
case_ethnicity_query$case <- "case"
control_ethnicity_query$case <- "control"

case_control_table <- subset( rbind(case_ethnicity_query, control_ethnicity_query), select = c(participant_id, case))

We will first do some QC filtering. We want to include only participants who are XX or XY and whose phenotyped sex matches their genetic sex. We will include a filter for XX or XY in our SQL query, then drop any rows with XX Male or XY Female.

sex_sql <- paste("SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN (",  toString(as.list(case_control_table$participant_id)),
    ") AND (participant_karyotyped_sex = 'XX'
    OR participant_karyotyped_sex = 'XY')", sep= "")

sex_query <- labkey_to_df(sex_sql, version, 100000)

female_query <- subset (sex_query, participant_phenotyped_sex == 'Female' & participant_karyotyped_sex == 'XX')
male_query <- subset (sex_query, participant_phenotyped_sex == 'Male' & participant_karyotyped_sex == 'XY')
sex_list <- c(as.list(female_query$participant_id), as.list(male_query$participant_id))

case_control_table <- filter(case_control_table, participant_id %in% sex_list)
sex_sql = (f'''
    SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex
    FROM participant_summary
    WHERE participant_id IN {*case_control_table['participant_id'].tolist(),}
    AND (participant_karyotyped_sex = 'XX'
    OR participant_karyotyped_sex = 'XY')
    ''')
sex_query = labkey_to_df(sex_sql, version, 100000).drop_duplicates()

sex_query = sex_query.drop(sex_query[ (sex_query['participant_phenotyped_sex'] == 'Male') & (sex_query['participant_karyotyped_sex'] == 'XX') ].index)
sex_query = sex_query.drop(sex_query[ (sex_query['participant_phenotyped_sex'] == 'Female') & (sex_query['participant_karyotyped_sex'] == 'XY') ].index)

case_control_table = case_control_table.loc[
    case_control_table['participant_id'].isin
    (sex_query['participant_id'].tolist())]
sex_sql <- paste("SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex
    FROM key_columns
    WHERE participant_id IN (",  toString(as.list(case_control_table$participant_id)),
    ") AND (participant_karyotyped_sex = 'XX'
    OR participant_karyotyped_sex = 'XY')", sep= "")

sex_query <- labkey_to_df(sex_sql, version, 100000)

female_query <- subset (sex_query, participant_phenotyped_sex == 'Female' & participant_karyotyped_sex == 'XX')
male_query <- subset (sex_query, participant_phenotyped_sex == 'Male' & participant_karyotyped_sex == 'XY')
sex_list <- c(as.list(female_query$participant_id), as.list(male_query$participant_id))

case_control_table <- filter(case_control_table, participant_id %in% sex_list)

We will further filter by sample extraction methods. We are interested in samples taken as blood, using EDTA extraction and with PCR-free sequencing.

We will do this using the aggregate_gvcf_sample_stats table. This has the added benefit that we will also filter to only include participants included in the AggV2 aggregate VCF files, which may be very useful if you want to use AggV2, which also means all participants have their genomes aligned to GRCh38.

1
2
3
4
5
6
7
8
aggregate_sql = paste("SELECT participant_id, platekey
    FROM aggregate_gvcf_sample_stats
    WHERE participant_id IN(",  toString(case_control_table$participant_id),
    ") AND sample_source = 'BLOOD'
    AND sample_preparation_method = 'EDTA'
    AND sample_library_type = 'TruSeq PCR-Free High Throughput'", sep = "")
aggregate_query <- labkey_to_df(aggregate_sql, version, 100000)
case_control_table <- unique(merge(case_control_table, aggregate_query))
aggregate_sql = (f'''
    SELECT participant_id, platekey
    FROM aggregate_gvcf_sample_stats
    WHERE participant_id IN {*case_control_table['participant_id'].tolist(),}
    AND sample_source = 'BLOOD'
    AND sample_preparation_method = 'EDTA'
    AND sample_library_type = 'TruSeq PCR-Free High Throughput'
    ''')
aggregate_query = labkey_to_df(aggregate_sql, version, 100000)
case_control_table = aggregate_query.merge(case_control_table).drop_duplicates()
1
2
3
4
5
6
7
8
aggregate_sql = paste("SELECT participant_id, platekey
    FROM aggregate_gvcf_sample_stats
    WHERE participant_id IN(",  toString(case_control_table$participant_id),
    ") AND sample_source = 'BLOOD'
    AND sample_preparation_method = 'EDTA'
    AND sample_library_type = 'TruSeq PCR-Free High Throughput'", sep = "")
aggregate_query <- labkey_to_df(aggregate_sql, version, 100000)
case_control_table <- unique(merge(case_control_table, aggregate_query))

We need to check the relatedness between members of our cohort. To do this we can load the relatedness files that were calculated for the aggregated gVCFs.

relatedness_table = read_tsv('/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/PCs_relatedness/relatedness/GEL_aggV2_MAF5_mp10_0.0442.kin0')
relatedness_table = pd.read_csv('/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/PCs_relatedness/relatedness/GEL_aggV2_MAF5_mp10_0.0442.kin0',sep = '\t')
relatedness_table

You will need to mount the following file to your Jupyter session: GEL data resources > aggregations > gel_mainProgramme > aggV2 > genomic > additional_data > PCs_relatedness > relatedness > GEL_aggV2_MAF5_mp10_0.0442.kin0

relatedness_table = read_tsv('mounted-data/GEL_aggV2_MAF5_mp10_0.0442.kin0')    

We're going to find all the monozygotic twins in the table by pulling out all of those where kinship is greater than 0.345.

mz_twins_table <- relatedness_table %>% filter(KINSHIP > 0.354)
mz_twins_table = relatedness_table[ (relatedness_table['KINSHIP'] > 0.354) ]
mz_twins_table <- relatedness_table %>% filter(KINSHIP > 0.354)

Now we need to identify if any of these twins are in our cohort. If they are, we need to determine if either or both of the twins are affected by the phenotype of interest. If one of them is, we will keep that twin in our case cohort and discard the other. If both are affected, we will discard one at random.

twins_list <- split((mz_twins_table %>% select('IID1','IID2')), seq(nrow(mz_twins_table %>% select('IID1','IID2'))))

case_control_list <- as.list(case_control_table$platekey)

for (twins in twins_list) {
  if (twins$IID1 %in% case_control_list) {
    row <- case_control_table[case_control_table$platekey == twins$IID1 ,]
    if (row$case == 'case'){
      case_control_table <- case_control_table[case_control_table$platekey != twins$IID2,]

    }
  } else if (twins$IID2 %in% case_control_list){
    row <- case_control_table[case_control_table$platekey == twins$IID2 ,]
    if (row$case == 'case'){
      case_control_table <- case_control_table[case_control_table$platekey != twins$IID1,]
    }
  }
}
twins_list = mz_twins_table[['IID1', 'IID2']].values.tolist()

case_control_list = case_control_table['platekey'].tolist()

for twins in twins_list:
    if twins[0] in case_control_list:
        if (case_control_table.query('platekey == @twins[0]')['case']).values == 'case':
            case_control_table = case_control_table.drop(case_control_table[case_control_table['platekey'] == twins[1]].index)
            print (twins[1])
    elif twins[1] in case_control_list:
        if (case_control_table.query('platekey == @twins[1]')['case']).values == 'case':
            case_control_table = case_control_table.drop(case_control_table[case_control_table['platekey'] == twins[0]].index)
            print(twins[0])
twins_list <- split((mz_twins_table %>% select('IID1','IID2')), seq(nrow(mz_twins_table %>% select('IID1','IID2'))))

case_control_list <- as.list(case_control_table$platekey)

for (twins in twins_list) {
  if (twins$IID1 %in% case_control_list) {
    row <- case_control_table[case_control_table$platekey == twins$IID1 ,]
    if (row$case == 'case'){
      case_control_table <- case_control_table[case_control_table$platekey != twins$IID2,]

    }
  } else if (twins$IID2 %in% case_control_list){
    row <- case_control_table[case_control_table$platekey == twins$IID2 ,]
    if (row$case == 'case'){
      case_control_table <- case_control_table[case_control_table$platekey != twins$IID1,]
    }
  }
}

Filepaths

For many kinds of downstream analysis, you may wish to work with the genome files for that individual, such as the genomic VCFs. If you're working with the aggregate VCFs, you will need the platekeys.

In this query, we're going to get the platekey and filepath of the gVCF for these participants.

filetype <- "Genomic VCF"

path_sql <- paste("SELECT participant_id,
      platekey, filename, file_path
    FROM genome_file_paths_and_types
    WHERE participant_id IN (", toString(case_control_table$participant_id), ")
    AND file_sub_type = '", filetype, "'" , sep="")

path_query <- labkey_to_df(path_sql, version, 100000)
case_control_table_paths <- merge(case_control_table, path_query)
filetype = "Genomic VCF"

path_sql = (f'''
    SELECT participant_id, platekey, filename, file_path
    FROM genome_file_paths_and_types
    WHERE participant_id IN {*case_control_table['participant_id'].tolist(),}
    AND file_sub_type = '{filetype}'
''')

path_query = labkey_to_df(path_sql, version, 100000)
case_control_paths = path_query.merge(case_control_table)
filetype <- "Genomic VCF"

path_sql <- paste("SELECT participant_id,
      platekey, filename, file_path
    FROM genome_file_paths_and_types
    WHERE participant_id IN (", toString(case_control_table$participant_id), ")
    AND file_sub_type = '", filetype, "'" , sep="")

path_query <- labkey_to_df(path_sql, version, 100000)
case_control_table_paths <- merge(case_control_table, path_query)

Last update: November 17, 2023