There are a huge number of tables that you can use to build rare disease cohorts. This tutorial will take you through some of the tables and give you some example scripts that you can use in python or R, or using R in an interactive session on CloudOS, to access the data.
The advantage of building cohorts programmatically is that the underlying tables are exposed, allowing you to verify the data in bulk. This also allows you to do very complex queries. You can save and tweak scripts to re-use in new releases or for similar queries.
Common disease cohorts
To build common disease cohorts programmatically, we recommend you follow the steps here, omitting the steps on recruited disease, HPO terms and unsolved cases.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
Before you get started with using the LabKey APIs, you need to ensure you have it set up correctly, following this document. We recommend running your LabKey API scripts on the HPC, as detailed here, as they can be quite large.
To query our tables using CloudOS, you will need to set up an interactive session. You will need to use conda to install the required R packages:
LabKey has an underlying SQL database. To access it you will need to have the labkey module/library loaded. You will also be working with tables and dataframes. Here you can see the modules/libraries you should load in your scripts:
We have here a helper function called labkey_to_df that you can use to access the LabKey API in the RE and return data as a dataframe. You need to give it:
sql_query: an SQL query to access the LabKey tables
database: the version of the database you want to access for example /main-programme/main-programme_v19_2024-10-31
maxrows: maximum number of rows you want to return. This parameter defaults to 100,000,000, but if the output contains exactly the default value, then we suggest increasing this value.
Feel free to include this function in all your scripts and invoke it every time you want to query the data.
We can find participants recruited for a particular disease in the rare_diseases_participant_disease table. Here we define an SQL query. We will use as our example ADPKD (autosomal dominant polycystic kidney disease).
We need to start by searching for the normalised_specific_disease "Cystic kidney disease".
disease<-"Cystic kidney disease"recruited_sql<-paste("SELECT participant_id, normalised_specific_disease, normalised_disease_sub_group, normalised_disease_group FROM rare_diseases_participant_disease WHERE normalised_specific_disease = '",disease,"'",sep="")recruited_query<-labkey_to_df(recruited_sql,version,100000)
We can find participants recruited for a particular disease in the referral table. Here we define an SQL query. We will use as our example ADPKD (autosomal dominant polycystic kidney disease).
We need to start by searching for the clinical_indication_full_name "Cystic renal disease".
However this table does not include the participant_id, so we will need to link this to the referral_participant table with the referral_id. We will also filter this table by referral_participant_is_proband column to ensure we get probands and not family members.
gms_disease<-"Cystic renal disease"gms_recruited_sql<-paste("SELECT r.referral_id, rp.participant_id, r.clinical_indication_code, r.clinical_indication_full_name FROM referral as r, referral_participant as rp WHERE r.clinical_indication_full_name = '",gms_disease,"' AND rp.referral_participant_is_proband = TRUE AND r.referral_id = rp.referral_id",sep="")gms_recruited_query<-labkey_to_df(gms_recruited_sql,gms_version,10000)gms_recruited_query
gms_disease="Cystic renal disease"gms_recruited_sql=(f''' SELECT r.referral_id, rp.participant_id, r.clinical_indication_code, r.clinical_indication_full_name FROM referral as r, referral_participant as rp WHERE r.clinical_indication_full_name = '{gms_disease}' AND rp.referral_participant_is_proband = TRUE AND r.referral_id = rp.referral_id ''')gms_recruited_query=labkey_to_df(gms_recruited_sql,gms_version,10000)
We will create a list of the participants found with this recruited disease.
We can also find participants who have HPO terms linked to our phenotype of interest. You should independently research a relevant list of HPO terms for your phenotype.
For ADPKD we will query the rare_diseases_participant_phenotype table for the following HPO terms:
Polycystic kidney dysplasia (HP:0000113)
Multiple renal cysts (HP:0005562)
Make sure you include the column hpo_present in any query, as this table contains all HPO terms that were checked in the participant, not just those that are present.
hpo_codes<-c("HP:0000113","HP:0005562")hpo_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN ('",paste(hpo_codes,collapse="', '"),"') AND hpo_present = 'Yes'",sep="")hpo_query<-labkey_to_df(hpo_sql,version,10000)
hpo_codes=["HP:0000113","HP:0005562"]hpo_sql=(f''' SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN {*hpo_codes,} AND hpo_present = 'Yes' ''')hpo_query=labkey_to_df(hpo_sql,version,10000)
hpo_codes<-c("HP:0000113","HP:0005562")hpo_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN ('",paste(hpo_codes,collapse="', '"),"') AND hpo_present = 'Yes'",sep="")hpo_query<-labkey_to_df(hpo_sql,version,100000)
For ADPKD we will query the observation table for the following HPO terms:
Polycystic kidney dysplasia (HP:0000113)
Multiple renal cysts (HP:0005562)
Make sure you include the column value_code in any query, as this table contains all HPO terms that were checked in the participant, not just those that are present.
hpo_codes<-c("HP:0000113","HP:0005562")gms_hpo_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM observation WHERE normalised_hpo_id IN ('",paste(hpo_codes,collapse="', '"),"') AND value_code = 'present'",sep="")gms_hpo_query<-labkey_to_df(gms_hpo_sql,gms_version,10000)
hpo_codes=["HP:0000113","HP:0005562"]gms_hpo_sql=(f''' SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM observation WHERE normalised_hpo_id IN {*hpo_codes,} AND value_code = 'present' ''')gms_hpo_query=labkey_to_df(gms_hpo_sql,gms_version,10000)
Lastly, we will search the hospital episode statistics to identify participants with diagnoses in their medical records that indicate your phenotype. Again, you should research the list of ICD10 codes independently.
For ADPKD we will search for the following codes:
N28.1 (cyst of kidney)
Q61 (cystic kidney disease)
Q61.2 (polycystic kidney, autosomal dominant)
There are three tables of hospital episode statistics with ICD-10 in the GEL data:
- hes_apc
- hes_op
- hes_ae
(There is also the hes_cc table, but this does not include diagnoses.)
icd_codes<-c('N281','Q61','Q612')hes_tables=c("apc","op","ae")icd_concat<-data.frame()diag_statement<-paste((icd_codes),collapse="%' OR diag_all LIKE '%")for (hes_tableinhes_tables){sqlstr=paste("SELECT participant_id, diag_all FROM hes_",hes_table," WHERE diag_all LIKE '%",diag_statement,"%'",sep="")icd_query<-labkey_to_df(sqlstr,version,1000000)if (nrow(icd_query)>0){icd_concat=rbind(icd_query,icd_concat)}}
icd_codes=['N281','Q61','Q612']hes_tables=["apc","op","ae"]icd_concat=pd.DataFrame()diag_statement="%' OR diag_all LIKE '%".join(icd_codes)forhes_tableinhes_tables:sqlstr=(f''' SELECT participant_id, diag_all FROM hes_{hes_table} WHERE diag_all LIKE '%{diag_statement}%' ''')icd_query=labkey_to_df(sqlstr,version,1000000)ifnoticd_query.empty:icd_concat=pd.concat([icd_query,icd_concat])
icd_codes<-c('N281','Q61','Q612')hes_tables=c("apc","op","ae")icd_concat<-data.frame()diag_statement<-paste((icd_codes),collapse="%' OR diag_all LIKE '%")for (hes_tableinhes_tables){sqlstr=paste("SELECT participant_id, diag_all FROM hes_",hes_table," WHERE diag_all LIKE '%",diag_statement,"%'",sep="")icd_query<-labkey_to_df(sqlstr,version,100000)if (nrow(icd_query)>0){icd_concat=rbind(icd_query,icd_concat)}}
Longitudinal medical history is not available for NHS GMS.
You can create a complete cohort by pulling out the lists of participants from the three queries above, combining the lists together and removing all duplicates.
We can find cases that have been solved, and this has been approved by the GLHs, in the gmc_exit_questionnaire table.
The following SQL query uses a list of participants (rd_participants) and checks to see if the case is solved. The code has been written to see if case_solved_family is not yes as opposed to checking if it is no, as this will bring back families where the case is partially solved or if this is unknown. Alternatively, you might want solved cases only for eligibility to participate in a clinical trial.
solved<-"!= 'yes'"gmc_sql<-paste("SELECT participant_id, case_solved_family FROM gmc_exit_questionnaire WHERE participant_id in (",participants,") AND case_solved_family ",solved,sep="")gmc_query<-labkey_to_df(gmc_sql,version,1000)
solved="!= 'yes'"gmc_sql=(f''' SELECT participant_id, case_solved_family FROM gmc_exit_questionnaire WHERE participant_id in {*participants,} AND case_solved_family '{solved}' ''')gmc_query=labkey_to_df(gmc_sql,version,1000)
solved<-"yes"gmc_sql<-paste("SELECT participant_id, case_solved_family FROM gmc_exit_questionnaire WHERE participant_id in (",participants,") AND case_solved_family != '",solved,"'",sep="")gmc_query<-labkey_to_df(gmc_sql,version,100000)
Cases that have been solved by RE researchers, but this solution has not yet been approved by the GLHs, can be found in the submitted_diagnostic_discovery table. We will now query our list to ensure that none appear in this table.
dd_sql<-paste("SELECT participant_id FROM submitted_diagnostic_discovery WHERE participant_id IN (",toString(gmc_query$participant_id),")",sep="")dd_query<-labkey_to_df(dd_sql,version,1000)
dd_sql=(f''' SELECT participant_id FROM submitted_diagnostic_discovery WHERE participant_id IN {*gmc_query['participant_id'].tolist(),} ''')dd_query=labkey_to_df(dd_sql,version,1000)
dd_sql<-paste("SELECT participant_id FROM submitted_diagnostic_discovery WHERE participant_id IN (",toString(gmc_query$participant_id),")",sep="")dd_query<-labkey_to_df(dd_sql,version,100000)
We can now generate a list of participants who match our cohort query and do not yet have a genetic diagnosis.
solved<-"!= 'yes'"gmc_sql<-paste("SELECT participant_id, case_solved_family FROM report_outcome_questionnaire WHERE participant_id in ('",paste(gms_participants,collapse="', '")"') AND case_solved_family ",solved,sep="")gmc_query<-labkey_to_df(gmc_sql,version,1000)
solved="!= 'yes'"gmc_sql=(f''' SELECT participant_id, case_solved_family FROM report_outcome_questionnaire WHERE participant_id in {*participants,} AND case_solved_family '{solved}' ''')gmc_query=labkey_to_df(gmc_sql,version,1000)
To build our control cohort, we need to exclude all participants who have the phenotype we're interested in. It is best to expand our criteria beyond that that we used to create cohort so that we exclude all phenotypes related to the phenotype of interest. We're going to query many of the same tables as before but at a higher level:
For recruited disease, we will query a higher level, the disease_group in rare_diseases_participant_disease for an umbrella term.
To exclude participants by HPO terms, we will use the terms we built our cohort with, plus additional more general terms.
To exclude participants by events in their medical records, we will exclude:
all participants with an expanded list of ICD10-codes in their diagnoses in the hes tables
All participants who have received treatments relating to the disease of interest
All participants with related causes of death, who may never have been diagnosed with the disease of interest in their lifetime
We will also exclude all participants with kidney cancer, by searching the cancer tables for "RENAL"
For recruited disease, we will exclude all participants with the clinical_indication.
To exclude participants by HPO terms, we will use the terms we built our cohort with, plus additional more general terms.
We will also exclude all participants with kidney cancer, by searching the cancer tables for "Renal"
First we will make a list of all participants who DO have these diagnoses, codes and causes of death.
not_disease<-"Renal and urinary tract disorders"recruited_not_sql<-paste("SELECT participant_id, normalised_specific_disease, normalised_disease_sub_group, normalised_disease_group FROM rare_diseases_participant_disease WHERE normalised_disease_group = '",not_disease,"'",sep="")recruited_not_query<-labkey_to_df(recruited_not_sql,version,100000)recruited_not_list<-as.list(recruited_not_query$participant_id)
not_disease<-"Renal and urinary tract disorders"recruited_not_sql<-paste("SELECT participant_id, normalised_specific_disease, normalised_disease_sub_group, normalised_disease_group FROM rare_diseases_participant_disease WHERE normalised_disease_group = '",not_disease,"'",sep="")recruited_not_query<-labkey_to_df(recruited_not_sql,version,100000)recruited_not_list<-as.list(recruited_not_query$participant_id)
We will use our list from our first query of the referral and referral_particapant tables.
hpo_not_codes<-c("HP:0000113","HP:0000003","HP:0005562","HP:0000077")hpo_not_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN ('",paste(hpo_not_codes,collapse="', '"),"') AND hpo_present = 'Yes'",sep="")hpo_not_query<-labkey_to_df(hpo_not_sql,version,100000)hpo_not_list<-as.list(hpo_not_query$participant_id)
hpo_not_codes=["HP:0000113","HP:0000003","HP:0005562","HP:0000077"]hpo_not_sql=(f''' SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN {*hpo_not_codes,} AND hpo_present = 'Yes' ''')hpo_not_query=labkey_to_df(hpo_not_sql,version,100000)hpo_not_list=hpo_not_query['participant_id'].tolist()
hpo_not_codes<-c("HP:0000113","HP:0000003","HP:0005562","HP:0000077")hpo_not_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM rare_diseases_participant_phenotype WHERE normalised_hpo_id IN ('",paste(hpo_not_codes,collapse="', '"),"') AND hpo_present = 'Yes'",sep="")hpo_not_query<-labkey_to_df(hpo_not_sql,version,100000)hpo_not_list<-as.list(hpo_not_query$participant_id)
Now we search the observation table for the expanded list of HPO terms:
hpo_not_codes<-c("HP:0000113","HP:0000003","HP:0005562","HP:0000077")gms_hpo_not_sql<-paste("SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM observation WHERE normalised_hpo_id IN ('",paste(hpo_not_codes,collapse="', '"),"') AND value_code = 'present'",sep="")gms_hpo_not_query<-labkey_to_df(gms_hpo_not_sql,gms_version,100000)gms_hpo_not_list=as.list(gms_hpo_not_query$participant_id)
hpo_not_codes=["HP:0000113","HP:0000003","HP:0005562","HP:0000077"]gms_hpo_not_sql=(f''' SELECT participant_id, normalised_hpo_id, normalised_hpo_term FROM observation WHERE normalised_hpo_id IN {*hpo_not_codes,} AND value_code = 'present' ''')gms_hpo_not_query=labkey_to_df(gms_hpo_not_sql,gms_version,100000)gms_hpo_not_list=gms_hpo_not_query['participant_id'].tolist()len(gms_hpo_not_list)
NOT diagnoses, treatments or causes of death in secondary clinical data¶
We will search for diagnoses in the secondary clinical data, again using an expanded list of terms:
icd_not_codes<-c('N281','Q61','Q612','Q613','Q618','Q619','N19','Y841','N185')hes_tables=c("apc","op","ae")icd_not_concat<-data.frame()diag_not_statement<-paste((icd_not_codes),collapse="%' OR diag_all LIKE '%")for (hes_tableinhes_tables){sqlstr=paste("SELECT participant_id, diag_all FROM hes_",hes_table," WHERE diag_all LIKE '%",diag_not_statement,"%'",sep="")icd_query<-labkey_to_df(sqlstr,version,1000000)if (nrow(icd_query)>0){icd_not_concat<-rbind(icd_query,icd_not_concat)}}icd_not_list<-as.list(icd_not_concat$participant_id)
icd_not_codes=['N281','Q61','Q612','Q613','Q618','Q619','N19','Y841','N185']hes_tables=["apc","op","ae"]icd_not_concat=pd.DataFrame()diag_not_statement="%' OR diag_all LIKE '%".join(icd_not_codes)forhes_tableinhes_tables:sqlstr=(f''' SELECT participant_id, diag_all FROM hes_{hes_table} WHERE diag_all LIKE '%{diag_not_statement}%' ''')icd_query=labkey_to_df(sqlstr,version,1000000)ifnoticd_query.empty:icd_not_concat=pd.concat([icd_query,icd_not_concat])icd_not_list=icd_not_concat['participant_id'].tolist()
icd_not_codes<-c('N281','Q61','Q612','Q613','Q618','Q619','N19','Y841','N185')hes_tables=c("apc","op","ae")icd_not_concat<-data.frame()diag_not_statement<-paste((icd_not_codes),collapse="%' OR diag_all LIKE '%")for (hes_tableinhes_tables){sqlstr=paste("SELECT participant_id, diag_all FROM hes_",hes_table," WHERE diag_all LIKE '%",diag_not_statement,"%'",sep="")icd_query<-labkey_to_df(sqlstr,version,100000)if (nrow(icd_query)>0){icd_not_concat<-rbind(icd_query,icd_not_concat)}}icd_not_list<-as.list(icd_not_concat$participant_id)
We will search the same hes tables for treatments relating to kidney disease:
M01 Transplantation of kidney
M01.1 Autotransplantation of kidney
M01.2 Allotransplantation of kidney from live donor
M01.3 Allotransplantation of kidney from cadaver NEC
M01.4 Allotransplantation of kidney from cadaver heart beating
M01.5 Allotransplantation of kidney from cadaver heart non-beating
not_opcs<-c('M01','M011','M012','M013','M014','M015','M018','M019','X401')opcs_hes=c("apc","op")opcs_not_concat<-data.frame()opcs_not_statement<-paste((not_opcs),collapse="%' OR opertn_all LIKE '%")for (hes_tableinopcs_hes){sqlstr=paste("SELECT participant_id, opertn_all FROM hes_",hes_table," WHERE opertn_all LIKE '%",opcs_not_statement,"%'",sep="")opcs_query<-labkey_to_df(sqlstr,version,1000000)if (nrow(opcs_query)>0){opcs_not_concat=rbind(opcs_query,opcs_not_concat)}}opcs_not_list<-as.list(opcs_not_concat$participant_id)
not_opcs=['M01','M011','M012','M013','M014','M015','M018','M019','X401']opcs_hes=["apc","op"]opcs_not_statement="%' OR opertn_all LIKE '%".join(not_opcs)opcs_not_concat=pd.DataFrame()forhes_tableinopcs_hes:sqlstr=(f''' SELECT participant_id, opertn_all FROM hes_{hes_table} WHERE opertn_all LIKE '%{opcs_not_statement}%' ''')opcs_query=labkey_to_df(sqlstr,version,1000000)ifnotopcs_query.empty:opcs_not_concat=pd.concat([opcs_query,opcs_not_concat])opcs_not_list=opcs_not_concat['participant_id'].tolist()
not_opcs<-c('M01','M011','M012','M013','M014','M015','M018','M019','X401')opcs_hes=c("apc","op")opcs_not_concat<-data.frame()opcs_not_statement<-paste((not_opcs),collapse="%' OR opertn_all LIKE '%")for (hes_tableinopcs_hes){sqlstr=paste("SELECT participant_id, opertn_all FROM hes_",hes_table," WHERE opertn_all LIKE '%",opcs_not_statement,"%'",sep="")opcs_query<-labkey_to_df(sqlstr,version,100000)if (nrow(opcs_query)>0){opcs_not_concat=rbind(opcs_query,opcs_not_concat)}}opcs_not_list<-as.list(opcs_not_concat$participant_id)
We also want to exclude participants who have died of kidney related diseases, so we will use the mortality table to exclude:
not_death_icd=c('N19','N281','Q61','Q612','Q613','Q618','Q619')mortality_not_statement<-paste((not_death_icd),collapse="%' OR icd10_multiple_cause_all LIKE '%")mortality_not_sql<-paste("SELECT participant_id, icd10_multiple_cause_all FROM mortality WHERE icd10_multiple_cause_all LIKE '%",mortality_not_statement,"%'",sep="")mortality_not_query<-labkey_to_df(mortality_not_sql,version,100000)mortality_not_list<-as.list(mortality_not_query$participant_id)length(mortality_not_list)
not_death_icd=['N19','N281','Q61','Q612','Q613','Q618','Q619']mortality_not_statement="%' OR icd10_multiple_cause_all LIKE '%".join(not_death_icd)mortality_not_sql=(f''' SELECT participant_id, icd10_multiple_cause_all FROM mortality WHERE icd10_multiple_cause_all LIKE '%{mortality_not_statement}%' ''')mortality_not_query=labkey_to_df(mortality_not_sql,version,10000)mortality_not_list=mortality_not_query['participant_id'].tolist()
not_death_icd=c('N19','N281','Q61','Q612','Q613','Q618','Q619')mortality_not_statement<-paste((not_death_icd),collapse="%' OR icd10_multiple_cause_all LIKE '%")mortality_not_sql<-paste("SELECT participant_id, icd10_multiple_cause_all FROM mortality WHERE icd10_multiple_cause_all LIKE '%",mortality_not_statement,"%'",sep="")mortality_not_query<-labkey_to_df(mortality_not_sql,version,100000)mortality_not_list<-as.list(mortality_not_query$participant_id)length(mortality_not_list)
Longitudinal medical history is not available for NHS GMS.
cancer_not_type="RENAL"cancer_not_sql=(f''' SELECT participant_id, disease_type, study_abbreviation FROM cancer_analysis WHERE disease_type='{cancer_not_type}' ''')cancer_not_query=labkey_to_df(cancer_not_sql,version,10000)cancer_not_list=cancer_not_query['participant_id'].tolist()
gms_cancer_not_type<-"Renal"gms_cancer_not_sql<-paste("SELECT participant_id, clinical_indication_full_name FROM cancer_analysis WHERE clinical_indication_full_name LIKE '%",gms_cancer_not_type,"%'",sep="")gms_cancer_not_query<-labkey_to_df(gms_cancer_not_sql,gms_version,10000)gms_cancer_not_list=as.list(gms_cancer_not_query$participant_id)
gms_cancer_not_type="Renal"gms_cancer_not_sql=(f''' SELECT participant_id, clinical_indication_full_name FROM cancer_analysis WHERE clinical_indication_full_name LIKE '%{gms_cancer_not_type}%' ''')gms_cancer_not_query=labkey_to_df(gms_cancer_not_sql,gms_version,10000)gms_cancer_not_list=gms_cancer_not_query['participant_id'].tolist()
You can now combine all these into a single list. Then pull out all the participants from the participant table and remove your exclusion list to give only the participants who do NOT match the criteria. Note that you must pull the whole participant list then excluded afterwards, rather than including your exclusion list in your SQL statement, as this SQL would be too long to be processed.
You can match on age and sex. To do this, you can the year of birth for the case and control cohorts, and calculate their mean years of birth. You can also pull out the phenotyped and genotyped sex.
case_sql<-paste("SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN (",toString(cohort),")",sep="")case_query<-unique(labkey_to_df(case_sql,version,10000))case_yob_mean=mean(strtoi(case_query$yob))control_sql<-paste("SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN (",toString(control),")",sep="")control_query<-unique(labkey_to_df(control_sql,version,10000))control_yob_mean=mean(strtoi(control_query$yob))cat("case mean = ",case_yob_mean,"\ncontrol mean = ",control_yob_mean)cat("case XX = ",sum(case_query$participant_karyotyped_sex=="XX")/nrow(case_query),"\ncase Female = ",sum(case_query$participant_phenotyped_sex=="Female")/nrow(case_query),"\ncase XY = ",sum(case_query$participant_karyotyped_sex=="XY")/nrow(case_query),"\ncase Male = ",sum(case_query$participant_phenotyped_sex=="Male")/nrow(case_query),"\ncontrol XX = ",sum(control_query$participant_karyotyped_sex=="XX")/nrow(control_query),"\ncontrol Female = ",sum(control_query$participant_phenotyped_sex=="Female")/nrow(control_query),"\ncontrol XY = ",sum(control_query$participant_karyotyped_sex=="XY")/nrow(control_query),"\ncontrol Male = ",sum(control_query$participant_phenotyped_sex=="Male")/nrow(control_query))
fromstatisticsimportmeancase_sql=(f''' SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN {*cohort,} ''')case_query=labkey_to_df(case_sql,version,10000).drop_duplicates()case_yob_mean=mean(case_query['yob'].tolist())control_sql=(f''' SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN {*control,} ''')control_query=labkey_to_df(control_sql,version,100000).drop_duplicates()control_yob_mean=mean(control_query['yob'].tolist())print("case mean = ",case_yob_mean,"\ncontrol mean = ",control_yob_mean)print("case XX = ",case_query['participant_karyotyped_sex'].value_counts()['XX']/len(case_query),"\ncase Female = ",case_query['participant_phenotyped_sex'].value_counts()['Female']/len(case_query),"\ncase XY = ",case_query['participant_karyotyped_sex'].value_counts()['XY']/len(case_query),"\ncase Male = ",case_query['participant_phenotyped_sex'].value_counts()['Male']/len(case_query),"\ncontrol XX = ",control_query['participant_karyotyped_sex'].value_counts()['XX']/len(control_query),"\ncontrol Female = ",control_query['participant_phenotyped_sex'].value_counts()['Female']/len(control_query),"\ncontrol XY = ",control_query['participant_karyotyped_sex'].value_counts()['XY']/len(control_query),"\ncontrol Male = ",control_query['participant_phenotyped_sex'].value_counts()['Male']/len(control_query))
case_sql<-paste("SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM key_columns WHERE participant_id IN (",toString(cohort),")",sep="")case_query<-unique(labkey_to_df(case_sql,version,100000))case_yob_mean=mean(strtoi(case_query$yob))control_sql<-paste("SELECT participant_id, yob, participant_phenotyped_sex, participant_karyotyped_sex FROM key_columns WHERE participant_id IN (",toString(control),")",sep="")control_query<-unique(labkey_to_df(control_sql,version,100000))control_yob_mean=mean(strtoi(control_query$yob))cat("case mean = ",case_yob_mean,"\ncontrol mean = ",control_yob_mean)cat("case XX = ",sum(case_query$participant_karyotyped_sex=="XX")/nrow(case_query),"\ncase Female = ",sum(case_query$participant_phenotyped_sex=="Female")/nrow(case_query),"\ncase XY = ",sum(case_query$participant_karyotyped_sex=="XY")/nrow(case_query),"\ncase Male = ",sum(case_query$participant_phenotyped_sex=="Male")/nrow(case_query),"\ncontrol XX = ",sum(control_query$participant_karyotyped_sex=="XX")/nrow(control_query),"\ncontrol Female = ",sum(control_query$participant_phenotyped_sex=="Female")/nrow(control_query),"\ncontrol XY = ",sum(control_query$participant_karyotyped_sex=="XY")/nrow(control_query),"\ncontrol Male = ",sum(control_query$participant_phenotyped_sex=="Male")/nrow(control_query))
gms_case_sql<-paste("SELECT participant_id, participant_year_of_birth, administrative_gender FROM participant WHERE participant_id IN ('",paste(gms_case_cohort,collapse="', '"),"')",sep="")gms_case_query<-unique(labkey_to_df(gms_case_sql,gms_version,10000))gms_case_yob_mean=mean(strtoi(gms_case_query$participant_year_of_birth))gms_control_sql<-paste("SELECT participant_id, participant_year_of_birth, administrative_gender FROM participant WHERE participant_id IN ('",paste(gms_control,collapse="', '"),"')",sep="")gms_control_query<-unique(labkey_to_df(gms_control_sql,gms_version,10000))gms_control_yob_mean=mean(strtoi(gms_control_query$participant_year_of_birth))cat("case mean = ",gms_case_yob_mean,"\ncontrol mean = ",gms_control_yob_mean)cat("case Female = ",sum(gms_case_query$administrative_gender=="female")/nrow(gms_case_query),"\ncase Male = ",sum(gms_case_query$administrative_gender=="male")/nrow(gms_case_query),"\ncontrol Female = ",sum(gms_control_query$administrative_gender=="female")/nrow(gms_control_query),"\ncontrol Male = ",sum(gms_control_query$administrative_gender=="male")/nrow(gms_control_query))
fromstatisticsimportmeangms_case_sql=(f''' SELECT participant_id, participant_year_of_birth, administrative_gender FROM participant WHERE participant_id IN {*gms_case_cohort,} ''')gms_case_query=labkey_to_df(gms_case_sql,gms_version,10000).drop_duplicates()gms_case_yob_mean=mean(gms_case_query['participant_year_of_birth'].tolist())gms_control_sql=(f''' SELECT participant_id, participant_year_of_birth, administrative_gender FROM participant WHERE participant_id IN {*gms_control,} ''')gms_control_query=labkey_to_df(gms_control_sql,gms_version,100000).drop_duplicates()gms_control_yob_mean=mean(gms_control_query['participant_year_of_birth'].tolist())print("case mean = ",gms_case_yob_mean,"\ncontrol mean = ",gms_control_yob_mean)print("case Female = ",gms_case_query['administrative_gender'].value_counts()['female']/len(gms_case_query),"\ncase Male = ",gms_case_query['administrative_gender'].value_counts()['male']/len(gms_case_query),"\ncontrol Female = ",gms_control_query['administrative_gender'].value_counts()['female']/len(gms_control_query),"\ncontrol Male = ",gms_control_query['administrative_gender'].value_counts()['male']/len(gms_control_query))
Compare the mean years of birth and sex ratios, to see if they are suitably close to continue
To ensure that the cohort is ethnically matched, we will filter to include only a single ethnicity for both case and control. In this case we will use European ethnicity.
There are three columns on ethnicity: genetically_inferred_ancestry_thr which gives the ancestry that the participant has ≥ 0.8 similarity to, genetically_inferred_ancestry_nothr the ancestry the participant has the highest similarity and genetically_inferred_ancestry_value the value of the highest similarity.
Stated and genomic ethnicity do not necessarily agree, so we recommend using the genomic ethnicity for cohort building.
eth<-"European"case_ethnicity_sql<-paste("SELECT participant_id, genetically_inferred_ancestry_thr FROM participant_summary WHERE genetically_inferred_ancestry_thr = '",eth,"' AND participant_id IN (",toString(cohort),")",sep="")case_ethnicity_query<-unique(labkey_to_df(case_ethnicity_sql,version,10000))filtered_case<-case_ethnicity_query$participant_idcontrol_ethnicity_sql<-paste("SELECT participant_id, genetically_inferred_ancestry_thr FROM participant_summary WHERE genetically_inferred_ancestry_thr = '",eth,"' AND participant_id IN (",toString(control),")",sep="")control_ethnicity_query<-unique(labkey_to_df(control_ethnicity_sql,version,100000))filtered_control<-control_ethnicity_query$participant_id
eth="European"case_ethnicity_sql=(f''' SELECT participant_id, genetically_inferred_ancestry_thr FROM participant_summary WHERE genetically_inferred_ancestry_thr = '{eth}' AND participant_id IN {*cohort,} ''')case_ethnicity_query=labkey_to_df(case_ethnicity_sql,version,10000).drop_duplicates()filtered_case=case_ethnicity_query['participant_id'].tolist()control_ethnicity_sql=(f''' SELECT participant_id, genetically_inferred_ancestry_thr FROM participant_summary WHERE genetically_inferred_ancestry_thr = '{eth}' AND participant_id IN {*control,} ''')control_ethnicity_query=labkey_to_df(control_ethnicity_sql,version,100000).drop_duplicates()filtered_control=control_ethnicity_query['participant_id'].tolist()
eth<-"European"case_ethnicity_sql<-paste("SELECT participant_id, genetically_inferred_ancestry_thr FROM key_columns WHERE genetically_inferred_ancestry_thr = '",eth,"' AND participant_id IN (",toString(cohort),")",sep="")case_ethnicity_query<-unique(labkey_to_df(case_ethnicity_sql,version,100000))filtered_case<-case_ethnicity_query$participant_idcontrol_ethnicity_sql<-paste("SELECT participant_id, genetically_inferred_ancestry_thr FROM key_columns WHERE genetically_inferred_ancestry_thr = '",eth,"' AND participant_id IN (",toString(control),")",sep="")control_ethnicity_query<-unique(labkey_to_df(control_ethnicity_sql,version,100000))filtered_control<-control_ethnicity_query$participant_id
We only have stated ethnicity for NHS GMS, not genetic ethnicity.
We're now going to filter both case and control cohorts by some general criteria. First we will join the two cohorts to a single dataframe with case or control added as column.
We will first do some QC filtering. We want to include only participants who are XX or XY and whose phenotyped sex matches their genetic sex. We will include a filter for XX or XY in our SQL query, then drop any rows with XX Male or XY Female.
sex_sql<-paste("SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN (",toString(as.list(case_control_table$participant_id)),") AND (participant_karyotyped_sex = 'XX' OR participant_karyotyped_sex = 'XY')",sep="")sex_query<-labkey_to_df(sex_sql,version,100000)female_query<-subset (sex_query,participant_phenotyped_sex=='Female'&participant_karyotyped_sex=='XX')male_query<-subset (sex_query,participant_phenotyped_sex=='Male'&participant_karyotyped_sex=='XY')sex_list<-c(as.list(female_query$participant_id),as.list(male_query$participant_id))case_control_table<-filter(case_control_table,participant_id%in%sex_list)
sex_sql=(f''' SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex FROM participant_summary WHERE participant_id IN {*case_control_table['participant_id'].tolist(),} AND (participant_karyotyped_sex = 'XX' OR participant_karyotyped_sex = 'XY') ''')sex_query=labkey_to_df(sex_sql,version,100000).drop_duplicates()sex_query=sex_query.drop(sex_query[(sex_query['participant_phenotyped_sex']=='Male')&(sex_query['participant_karyotyped_sex']=='XX')].index)sex_query=sex_query.drop(sex_query[(sex_query['participant_phenotyped_sex']=='Female')&(sex_query['participant_karyotyped_sex']=='XY')].index)case_control_table=case_control_table.loc[case_control_table['participant_id'].isin(sex_query['participant_id'].tolist())]
sex_sql<-paste("SELECT participant_id, participant_phenotyped_sex, participant_karyotyped_sex FROM key_columns WHERE participant_id IN (",toString(as.list(case_control_table$participant_id)),") AND (participant_karyotyped_sex = 'XX' OR participant_karyotyped_sex = 'XY')",sep="")sex_query<-labkey_to_df(sex_sql,version,100000)female_query<-subset (sex_query,participant_phenotyped_sex=='Female'&participant_karyotyped_sex=='XX')male_query<-subset (sex_query,participant_phenotyped_sex=='Male'&participant_karyotyped_sex=='XY')sex_list<-c(as.list(female_query$participant_id),as.list(male_query$participant_id))case_control_table<-filter(case_control_table,participant_id%in%sex_list)
We will further filter by sample extraction methods. We are interested in samples taken as blood, using EDTA extraction and with PCR-free sequencing.
We will do this using the aggregate_gvcf_sample_stats table. This has the added benefit that we will also filter to only include participants included in the AggV2 aggregate VCF files, which may be very useful if you want to use AggV2, which also means all participants have their genomes aligned to GRCh38.
aggregate_sql=paste("SELECT participant_id, platekey FROM aggregate_gvcf_sample_stats WHERE participant_id IN(",toString(case_control_table$participant_id),") AND sample_source = 'BLOOD' AND sample_preparation_method = 'EDTA' AND sample_library_type = 'TruSeq PCR-Free High Throughput'",sep="")aggregate_query<-labkey_to_df(aggregate_sql,version,100000)case_control_table<-unique(merge(case_control_table,aggregate_query))
aggregate_sql=(f''' SELECT participant_id, platekey FROM aggregate_gvcf_sample_stats WHERE participant_id IN {*case_control_table['participant_id'].tolist(),} AND sample_source = 'BLOOD' AND sample_preparation_method = 'EDTA' AND sample_library_type = 'TruSeq PCR-Free High Throughput' ''')aggregate_query=labkey_to_df(aggregate_sql,version,100000)case_control_table=aggregate_query.merge(case_control_table).drop_duplicates()
aggregate_sql=paste("SELECT participant_id, platekey FROM aggregate_gvcf_sample_stats WHERE participant_id IN(",toString(case_control_table$participant_id),") AND sample_source = 'BLOOD' AND sample_preparation_method = 'EDTA' AND sample_library_type = 'TruSeq PCR-Free High Throughput'",sep="")aggregate_query<-labkey_to_df(aggregate_sql,version,100000)case_control_table<-unique(merge(case_control_table,aggregate_query))
We need to check the relatedness between members of our cohort. To do this we can load the relatedness files that were calculated for the aggregated gVCFs.
You will need to mount the following file to your Jupyter session: GEL data resources > aggregations > gel_mainProgramme > aggV2 > genomic > additional_data > PCs_relatedness > relatedness > GEL_aggV2_MAF5_mp10_0.0442.kin0
Now we need to identify if any of these twins are in our cohort. If they are, we need to determine if either or both of the twins are affected by the phenotype of interest. If one of them is, we will keep that twin in our case cohort and discard the other. If both are affected, we will discard one at random.
You must filter participants in your analysis to ensure they all have active consent for research. To do this, use the programme_consent_status column of the participant table in the 100kGP. Here is example SQL using a participant_id list called list.
consent_sql<-paste("SELECT participant_id, programme_consent_status FROM participant WHERE programme_consent_status = 'Consenting' AND participant_id IN (",list,")",sep="")consent_query<-labkey_to_df(consent_sql,version,100000)
consent_sql=(f''' SELECT participant_id, programme_consent_status FROM participant WHERE programme_consent_status = 'Consenting' AND participant_id IN {list}''')consent_query=labkey_to_df(consent_sql,version,100000)
consent_sql<-paste("SELECT participant_id, programme_consent_status FROM participant WHERE programme_consent_status = 'Consenting' AND participant_id IN (",list,")",sep="")consent_query<-labkey_to_df(consent_sql,version,100000)
We do not currently have participants who have withdrawn consent in NHS GMS.
For many kinds of downstream analysis, you may wish to work with the genome files for that individual, such as the genomic VCFs. If you're working with the aggregate VCFs, you will need the platekeys.
In this query, we're going to get the platekey and filepath of the gVCF for these participants.
filetype<-"Genomic VCF"path_sql<-paste("SELECT participant_id, platekey, filename, file_path FROM genome_file_paths_and_types WHERE participant_id IN (",toString(case_control_table$participant_id),") AND file_sub_type = '",filetype,"'",sep="")path_query<-labkey_to_df(path_sql,version,100000)case_control_table_paths<-merge(case_control_table,path_query)
filetype="Genomic VCF"path_sql=(f''' SELECT participant_id, platekey, filename, file_path FROM genome_file_paths_and_types WHERE participant_id IN {*case_control_table['participant_id'].tolist(),} AND file_sub_type = '{filetype}'''')path_query=labkey_to_df(path_sql,version,100000)case_control_paths=path_query.merge(case_control_table)
filetype<-"Genomic VCF"path_sql<-paste("SELECT participant_id, platekey, filename, file_path FROM genome_file_paths_and_types WHERE participant_id IN (",toString(case_control_table$participant_id),") AND file_sub_type = '",filetype,"'",sep="")path_query<-labkey_to_df(path_sql,version,100000)case_control_table_paths<-merge(case_control_table,path_query)
filetype<-"Genomic VCF"gms_path_sql<-paste("SELECT participant_id, platekey, filename, path FROM genome_file_paths_and_types WHERE participant_id IN ('",paste(gms_case_control_table$participant_id,collapse="', '"),"') AND file_sub_type = '",filetype,"'",sep="")gms_path_query<-labkey_to_df(gms_path_sql,gms_version,100000)gms_case_control_table_paths<-merge(gms_case_control_table,gms_path_query)
filetype="Genomic VCF"gms_path_sql=(f''' SELECT participant_id, platekey, filename, path FROM genome_file_paths_and_types WHERE participant_id IN {*gms_case_control_table['participant_id'].tolist(),} AND file_sub_type = '{filetype}'''')gms_path_query=labkey_to_df(gms_path_sql,gms_version,100000)gms_case_control_paths=gms_path_query.merge(gms_case_control_table)
Both the GWAS and AVT workflows require a phenofile. This is a space or tab separated text file with the sample information, with headers. We can create this from our case/control cohort that we have created here.
The phenoFile must include a minimum of three columns: platekey, sex specification (0 for male, 1 for female) and phenotype specification (0 for control, 1 for cases). It can optionally also include other columns to specify other covariates, such as age and principal components.
In the following example, we will use the case_control_table we created earlier for our cohort, the year of birth query to get age, and the aggregate_gvcf_sample_stats table to get sex and covariate information.