Searching the Data discovery portal¶
There are several methods to create a cohort or to select participants that meet your criteria:
Search with the drop-down filters¶
Use the drop-down filters, where available, on the left-hand side of a dashboard:
At the top of the screen you'll see your selection criteria highlighted in blue:
To remove a selection criterion, place your cursor anywhere on the blue highlighted text. The text changes to a selection of icons; select the remove filter icon:
Search by viewing and interacting with the data using graphs and charts¶
Click on any part of the graphs and charts on the dashboards to focus on your area of interest. The other graphs and charts update automatically in response to your selection.
General note: For text fields, Elasticsearch maintains two references to each field, e.g. hes_diagnoses.diagnosis and hes_diagnoses.diagnosis.keyword.
The keyword version of the field is treated as a single token even when it is made up of multiple words. This allows Elasticsearch to provide the autocompletion functionality. Most of the time, the keyword version is the one to use in Add Filter or the Search Bar.
The non-keyword version is more amenable to wildcard searches where only part of the term is known in advance.
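The difference between the two versions can be pictured with a small Python sketch. It is only an approximation of how a text field and a keyword field are indexed, not the portal's actual analyser; the diagnosis value and the helper names (`analyze_text`, `analyze_keyword`, `wildcard_matches`) are hypothetical.

```python
import fnmatch
import re


def analyze_text(value: str) -> list[str]:
    """Roughly mimic a text field: lowercase, then split into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def analyze_keyword(value: str) -> list[str]:
    """A keyword field keeps the whole value as one unmodified token."""
    return [value]


def wildcard_matches(pattern: str, tokens: list[str]) -> bool:
    """True if any single token matches the wildcard pattern."""
    return any(fnmatch.fnmatchcase(token, pattern) for token in tokens)


diagnosis = "Sensorineural hearing loss, bilateral"  # hypothetical value

text_tokens = analyze_text(diagnosis)
keyword_tokens = analyze_keyword(diagnosis)

# A partial term finds a word inside the text version...
print(wildcard_matches("senso*", text_tokens))
# ...but not in the keyword version, which is one long, case-sensitive token.
print(wildcard_matches("senso*", keyword_tokens))
# The keyword version suits exact matches on the complete value.
print(diagnosis in keyword_tokens)
```

In this sketch the partial term only matches the tokenised version, which is why the non-keyword field is the better choice when just part of a term is known.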
The keyword fields iteratively narrow and autocomplete as you type. This relies on repeated queries to the Elasticsearch backend, so a slower, measured typing pace will sometimes provide a better experience. See ‘Search by entering a valid search term in the Search Bar’.
Search by entering a valid search term in the Search Bar¶
Type a valid search term into the Search Bar. As you enter the first few characters, the system attempts to autocomplete your search term with suitable matches:
You can also search for participants with tiered variants on a particular gene by entering the gene symbol in the Search Bar, for example "TP53". Or, find participants by entering the participant ID (replace the generic participant IDs in these examples with real ones):
e.g. 111nnnnnn
111nnnnnn or 211nnnnnn
111nnnnnn and 211nnnnnn
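If you have a longer list of participant IDs, typing the `or` expression by hand becomes error-prone. The sketch below builds the search term from a list; the helper name `any_of` is hypothetical, and the IDs are the same generic placeholders used above.

```python
def any_of(participant_ids: list[str]) -> str:
    """Join participant IDs with 'or' for pasting into the Search Bar."""
    cleaned = [pid.strip() for pid in participant_ids if pid.strip()]
    if not cleaned:
        raise ValueError("at least one participant ID is required")
    return " or ".join(cleaned)


# Placeholder IDs; blank entries and stray spaces are ignored.
query = any_of(["111nnnnnn", " 211nnnnnn ", ""])
print(query)
```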
Build a query using Add Filter and Logic Operators¶
- Place your cursor on the Add Filter option, then either type in your filter criteria or select them from the drop-down menu.
- Select an operator.
- Enter a value and press Save. The graphs and charts update automatically to reflect your selection.
Queries are supported if they are based on the available data attributes, i.e. the list of filters on the dashboard and the interactive graphs and charts. See Section 4.8, Advanced Queries.
Guided search using the Search Bar¶
Build a guided query using the Search Bar:
- Enter search criteria into the Search Bar.
- Select an operator.
- Add a search term selected from the drop-down list. Gradually build a complex query with this method. The graphs and charts update automatically to reflect your search criteria.
- This method of searching allows you to filter for items in the dataset that are not visible on the dashboard. For example, the stated gender filter has four options:
However, if you know of other ways of defining gender, you can search on them using the Add a Filter + option. For example, to find the gender of a participant as defined by karyotype, enter 'Ka'.
Select participant_karyotypic_sex.keyword, "is", and the value you're interested in from the drop-down list, then select Save:
The system updates the dashboard with your selection. You can then output the participants matching your search criteria to a CSV file.
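Conceptually, the "is" filter followed by a CSV export behaves like the Python sketch below. The participant records, the value "XY" and the helper names are hypothetical and only illustrate the idea; the portal performs these steps for you.

```python
import csv
import io

# Hypothetical participant records; IDs and values are placeholders.
participants = [
    {"participant_id": "111000001", "participant_karyotypic_sex": "XY"},
    {"participant_id": "111000002", "participant_karyotypic_sex": "XX"},
    {"participant_id": "111000003", "participant_karyotypic_sex": "XY"},
]


def filter_is(records, field, value):
    """Keep records whose field exactly equals value (the "is" operator)."""
    return [record for record in records if record.get(field) == value]


def to_csv(records, fieldnames):
    """Serialise the selected records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


selected = filter_is(participants, "participant_karyotypic_sex", "XY")
csv_text = to_csv(selected, ["participant_id", "participant_karyotypic_sex"])
print(csv_text)
```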
Queries are supported if they are based on the available data attributes, i.e. the list of filters on the dashboard and the interactive graphs and charts. See Section 4.8, Advanced Queries.
Fuzzy searching¶
Elasticsearch provides support for fuzzy queries. To make use of this functionality:
- Go to the "Options" link in the Search Bar.
- De-select “Turn on query features”. Once the dashboard has reset, you'll be able to use fuzzy queries such as the two below:
Example 1. Enter: "autosomal deafness"~3
Fuzziness in word order returns matches for Autosomal dominant deafness that would not otherwise be matched.
Example 2. Enter: ehler~
Fuzzy search returns matches for Ehlers-Danlos that would not otherwise be matched.
Note that spacing and quotes are important in fuzzy search.
Please remember to re-select the "Turn on query features" option afterwards; otherwise, your use of the filters and other search methods will return incorrect results.
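To give a feel for why the two examples match, here is a simplified Python sketch of the two ideas involved: edit distance between single terms, and the number of words allowed between the terms of a phrase. It uses plain Levenshtein distance and an in-order word gap, which are simplifications of what the search engine does, and it assumes the indexed term is `ehlers`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def within_slop(first: str, second: str, tokens: list[str], slop: int) -> bool:
    """Simplified proximity check: `second` follows `first` with at most
    `slop` other words in between."""
    first_positions = [i for i, token in enumerate(tokens) if token == first]
    second_positions = [i for i, token in enumerate(tokens) if token == second]
    return any(0 < q - p <= slop + 1
               for p in first_positions for q in second_positions)


# One missing letter separates the search term from the assumed indexed term.
print(edit_distance("ehler", "ehlers"))

tokens = "autosomal dominant deafness".split()
print(within_slop("autosomal", "deafness", tokens, slop=3))  # gap allowed
print(within_slop("autosomal", "deafness", tokens, slop=0))  # exact phrase only
```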
Not in the top 20.¶
Searching for ICD, OPCS, HPO and Gene codes that are not visible in the top 20 charts.
You can use the “type to search” drop-downs to find other items such as Diagnoses, Procedures and HPO terms. Additionally, a “type to search” drop-down for genes with tiered variants is available on the Rare Disease dashboard.
You can use the Search Bar for alternative searches. For example, in the Search Bar, start typing diagnosis and select hes_diagnoses.diagnosis.keyword from the drop-down menu immediately below the Search Bar.
For these codes, it’s best to start by entering a double quote before the code to help the autocomplete cope with special characters such as ‘.’
- To search for a term such as Sensorineural, use the non-keyword version, e.g. hes_diagnoses.diagnosis : Senso* (no quotes) or hes_diagnoses.diagnosis : senso*
- To search Procedures outside the top 20, start with procedure to find hes_procedures.procedure.keyword.
- To search for HPO terms, start with hpo to find rare_diseases_participant_phenotypes.hpo_term.keyword.
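The two query shapes described above (a quoted code on the keyword field, and an unquoted wildcard on the non-keyword field) can be assembled as plain strings. The helper names below are hypothetical and the code "X00.0" is a placeholder, not a real entry from the dataset.

```python
def keyword_query(field: str, code: str) -> str:
    """Exact match on the keyword version of a field; the code is quoted."""
    if '"' in code:
        raise ValueError("the code itself must not contain a double quote")
    return f'{field}.keyword : "{code}"'


def prefix_query(field: str, prefix: str) -> str:
    """Wildcard match on the non-keyword version; no quotes around the term."""
    if not prefix or " " in prefix:
        raise ValueError("use a single, non-empty word fragment")
    return f"{field} : {prefix}*"


exact = keyword_query("hes_diagnoses.diagnosis", "X00.0")  # placeholder code
partial = prefix_query("hes_diagnoses.diagnosis", "senso")
print(exact)
print(partial)
```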
Advanced queries¶
Considerations when using the Search Bar and Add Filter function¶
There are several considerations to take into account when creating your own queries outside of the provided drop-down filters, graphs and charts. These limitations may apply when you use Add Filter or the Search Bar to create your own queries.
Queries in the Data Discovery Portal do not work in the same way as queries in SQL.
Filters are applied independently on the full data set and the output is the intersection of these separate filters.
This is best illustrated through a worked example:
Question: I would like a count of all participants that have tiered variants in gene XXX with genotype 'alternate homozygous'.
Filters applied: variants.gene_symbol.keyword : "XXX" and variants.genotype.keyword : "alternate_homozygous"
Results Returned
- As the Filters are applied independently on the full data set, a subset of individuals with a variant gene of XXX are returned and intersected with the subset of individuals with a genotype of alternate_homozygous.
- The resultant dataset may include individuals that have:
- Individual 1: Gene: XXX + Genotype: alternate_homozygous (the desired match)
- Individual 2: Gene: XXX + Genotype: alternate_homozygous; Gene: YYY + Genotype: heterozygous (additional valid match)
- Individual 3: Gene: XXX + Genotype: heterozygous; Gene: YYY + Genotype: alternate_homozygous (no match within the variant)
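The worked example can be reproduced with a few lines of Python. This is a sketch of the behaviour described here, not the portal's implementation: the three individuals are those listed above, "Individual 4" is an extra hypothetical participant added to show a non-match, and the function names are illustrative.

```python
# Each participant has a list of (gene, genotype) variant sub-documents.
participants = {
    "Individual 1": [("XXX", "alternate_homozygous")],
    "Individual 2": [("XXX", "alternate_homozygous"), ("YYY", "heterozygous")],
    "Individual 3": [("XXX", "heterozygous"), ("YYY", "alternate_homozygous")],
    "Individual 4": [("YYY", "alternate_homozygous")],  # hypothetical extra
}


def independent_filters(variants, gene, genotype):
    """Each filter is tested against the whole participant document."""
    has_gene = any(g == gene for g, _ in variants)
    has_genotype = any(t == genotype for _, t in variants)
    return has_gene and has_genotype


def same_variant(variants, gene, genotype):
    """Both conditions must hold within a single variant sub-document."""
    return any(g == gene and t == genotype for g, t in variants)


portal_result = [name for name, variants in participants.items()
                 if independent_filters(variants, "XXX", "alternate_homozygous")]
intended_result = [name for name, variants in participants.items()
                   if same_variant(variants, "XXX", "alternate_homozygous")]

print(portal_result)    # includes Individual 3
print(intended_result)  # excludes Individual 3
```

The difference between the two lists is exactly the kind of participant (Individual 3) that needs to be checked for and removed after export.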
Affected queries¶
- Not affected:
- Queries that include only one filter.
- Queries that include multiple fields where the individual can only have one permissible value per field (e.g. Year of Birth, Ethnic Category)
- Affected:
- Queries to match two or more fields within the same sub-document (see Section 5.0 Structure of JSON). Examples of sub-documents with multiple field combinations are Variants/Gene, Rare Disease/Age of Onset, Cancer Disease/Cancer Disease sub-type, HES diagnoses/diagnosis date, HES procedures/procedure date.
Understanding the query and filter logic¶
Queries in Data Discovery Portal are applied at the participant document* level i.e. each filter (clause) is run against the entire document for a match.
For the example below, note that each participant document may have zero or many nested 'variants' sub-documents**.
Example:
variants.gene_symbol.keyword : "XXX" and variants.genotype.keyword : "alternate_homozygous"
- The 'XXX' filter is applied, returning all participant documents that contain this gene.
- Next, the 'alternate_homozygous' filter is applied to all of the documents returned in step 1.
- As the participant JSON documents for some of the participants with variants on 'XXX' also have variants with genotype 'alternate_homozygous' (though on unrelated genes), these are returned in the count.
Another way of reading the query is to say: Find the participant documents where variants.gene_symbol.keyword having value "XXX" exists and within that set of documents, find participant documents where variants.genotype.keyword having value alternate_homozygous exists and return the matching participant documents (or count of them).
The query logic in Data Discovery/Elasticsearch does not make the assertion that the results to be returned must only include those where there is an alternate_homozygous variant in the XXX gene. Instead, it returns a count of the participant json documents where XXX and alternate_homozygous exist (though there is no relationship between the two).
*The participant json document contains all the data for the participant that can be queried. The structure of the json doc is given in the help.
**The json document contains single value attributes such as participant_id, participant_stated_gender and year_of_birth which are at the top level of the json structure i.e. top-level attributes.
The document also contains several nested sub document arrays for multi-valued fields e.g. rare_diseases_participant_diseases, cancer_participant_diseases, hes_diagnoses, hes_procedures and variants. These are collections of one or many sub-documents with potentially multiple fields per sub-document. It is important to note that Elasticsearch treats values within these sub-documents independently of each other i.e. you may not query the values of multiple attributes within the sub-document in the same way you can query the top-level attributes.
The table below illustrates conceptually how the queries work within JSON documents:
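One way to picture this behaviour is to flatten a participant document so that each sub-document field becomes a single list of values. The Python sketch below assumes the index treats sub-document arrays in the way the text describes; the participant values are placeholders and `flatten` is a hypothetical helper.

```python
from collections import defaultdict


def flatten(document: dict) -> dict:
    """Collapse arrays of sub-documents into one list of values per field."""
    flat = defaultdict(list)
    for key, value in document.items():
        if isinstance(value, list):
            for sub_document in value:
                for sub_key, sub_value in sub_document.items():
                    flat[f"{key}.{sub_key}"].append(sub_value)
        else:
            flat[key].append(value)
    return dict(flat)


participant = {
    "participant_id": "111000003",  # placeholder
    "year_of_birth": 1980,          # placeholder
    "variants": [
        {"gene_symbol": "XXX", "genotype": "heterozygous"},
        {"gene_symbol": "YYY", "genotype": "alternate_homozygous"},
    ],
}

flat = flatten(participant)
print(flat["variants.gene_symbol"])
print(flat["variants.genotype"])

# Once flattened, the link between a gene and its own genotype is lost,
# so both filters are satisfied even though no single variant has both values.
matches = ("XXX" in flat["variants.gene_symbol"]
           and "alternate_homozygous" in flat["variants.genotype"])
print(matches)
```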
How to reformat exported CSV files¶
- Navigate to LibreOffice.
- Open the Calc spreadsheet application.
- Open the CSV file containing your cohort.
- Your file appears in the import screen (Fig 1). Click OK to display your file.
Fig 1:
- The participant ID, Genome Build and Plate details, separated by commas, appear in a single column (Fig 2).
Fig 2:
- Highlight column A, which displays the participant ID, genome build and sample ID (Fig 3).
- Select the Data option from the menu bar at the top of the screen (Fig 3).
- Scroll down to the 'Text to Columns' option and select it (Fig 3).
Fig 3:
- The Text to Columns screen appears (Fig 4).
Fig 4:
- Check the Comma separator box and click OK (Fig 5).
Fig 5:
- A warning message appears asking whether you want to overwrite the existing data in the single column. Click Yes (Fig 6).
Fig 6:
- Your CSV file is reformatted (Fig 7).
Fig 7:
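If you prefer to script this step, the same split can be sketched in Python with the standard `csv` module. This assumes each exported record arrives as one quoted cell containing comma-separated values; the sample rows, plate names and the helper `split_single_column` are hypothetical.

```python
import csv
import io


def split_single_column(rows):
    """Split each single-cell row on commas into separate, trimmed cells."""
    return [[cell.strip() for cell in row[0].split(",")] for row in rows if row]


# Hypothetical export: every record is one quoted cell.
raw = '"111nnnnnn,GRCh38,plate-01"\n"211nnnnnn,GRCh37,plate-02"\n'

rows = list(csv.reader(io.StringIO(raw)))
table = split_single_column(rows)

# Write the reformatted rows back out as ordinary CSV text.
output = io.StringIO()
csv.writer(output, lineterminator="\n").writerows(table)
print(output.getvalue())
```

To work on a real export, replace the `io.StringIO` objects with files opened using `open(..., newline="")`.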
Elasticsearch capabilities¶
Visit the links below for Elasticsearch capabilities. Note that these are intended for users with coding experience: