100kGP Rare disease-specific clinical data

Some tables in LabKey contain data specific to rare disease participants. Some of these tables include data only on rare disease probands; others contain data for their relatives too. The names of all rare disease tables are prefixed with rare_diseases_.

CloudOS files

Names of the equivalent file in CloudOS are stated after the table name in brackets.

Primary and secondary data tables

Primary clinical data were collected when participants were enrolled in the programme; tables are tagged with .

Secondary clinical data were obtained from third parties such as NHSE; tables are tagged with .

The central table is rare_disease_analysis (gel_rare_disease_analysis_100k.tsv), which contains the latest sample for each participant that has been sequenced, had variants called and successfully passed through our interpretation pipeline. Samples are uniquely identified by their platekey number. Note that one participant may have more than one sample, one for genome build GRCh37 and one for GRCh38.
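As a rough illustration of working with one-sample-per-build data, the sketch below groups platekeys by participant and optionally restricts them to a single genome build. The column names (participant_id, plate_key, genome_build) and the toy rows are assumptions made for this example and may not match the real table; check the actual headers before adapting it.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for gel_rare_disease_analysis_100k.tsv.
# Column names and values are assumptions for illustration only.
TOY_TSV = (
    "participant_id\tplate_key\tgenome_build\n"
    "P1\tPLATE-A\tGRCh37\n"
    "P1\tPLATE-B\tGRCh38\n"
    "P2\tPLATE-C\tGRCh38\n"
)


def load_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def platekeys_by_participant(rows, build=None):
    """Map each participant to their platekeys, optionally for one genome build."""
    result = defaultdict(list)
    for row in rows:
        if build is None or row["genome_build"] == build:
            result[row["participant_id"]].append(row["plate_key"])
    return dict(result)


rows = load_rows(io.StringIO(TOY_TSV))
all_samples = platekeys_by_participant(rows)
grch38_samples = platekeys_by_participant(rows, build="GRCh38")
print(all_samples)
print(grch38_samples)
```

To use a real export, replace io.StringIO(TOY_TSV) with an open file handle for the TSV.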

Rare disease phenotypes

Rare disease phenotypes are described in a number of tables.

Rare disease phenotype tables

  • rare_diseases_participant_disease (gel_rare_participant_disease_100k.tsv) describes the probands' rare diseases. The disease data are as for rare_diseases_pedigree_member, with the addition of a date of diagnosis.
  • rare_diseases_participant_phenotype (gel_rare_participant_phenotype_100k.tsv) describes the probands' phenotype(s). Each phenotypic abnormality is recorded as an HPO term and whether that term is present or not, together with the age of onset, the severity of manifestation, the spatial pattern in the body and whether it is progressive or not. See details below on how HPO terms are assigned.
  • rare_diseases_early_childhood_observation (gel_rare_disease_childhood_100k.tsv) contains, for rare disease participants in the 100,000 Genomes Project, measurements and milestones provided by the GMCs, related to childhood development.
  • rare_diseases_gen_measurement (gel_rare_disease_general_measurement_100k.tsv) contains, for rare disease participants in the 100,000 Genomes Project, general measurements relevant to the disease, alongside the date that the measurements were taken on.
  • rare_diseases_imaging (gel_rare_disease_imaging_100k.tsv) contains, for rare disease participants in the 100,000 Genomes Project, various data and measurements from past scans, alongside the date of the scans.
  • rare_diseases_invest_blood_laboratory_test_report (gel_rare_disease_blood_test_results_100k.tsv) contains, for a proportion of rare disease participants in the 100,000 Genomes Project, the results of any blood tests carried out. Over 400 blood values are recorded alongside type and technique of testing and the status of the participating patient in the care pathway.
  • rare_diseases_invest_genetic (gel_rare_disease_genetic_test_100k.tsv) contains, for a proportion of rare disease participants in the 100,000 Genomes Project, information on any genetic tests carried out. Data characterising the genetic investigation is recorded alongside records of the sample tissue source and the type of testing laboratory.
  • rare_diseases_invest_genetic_test_result (gel_rare_disease_genetic_test_result_100k.tsv) contains, for a proportion of rare disease participants in the 100,000 Genomes Project, the results of any genetic tests carried out. Following on from the rare_diseases_invest_genetic table, a summary of the results is presented and contextualised by testing method and scope.

HPO terms in the rare_diseases_participant_phenotype table

Rare disease participants are phenotyped using the Human Phenotype Ontology (HPO).

The ontology comprises over 10,000 terms for describing phenotypic abnormalities and also contains over 50,000 annotations of HPO terms to hereditary diseases. The terms collectively form a relational network (linked by is-a relationships), such that each term is a more specific, or more limited, instance of its parent term, e.g. abnormality of the foot is-a abnormality of the lower limbs.
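The is-a network can be pictured as a mapping from each term to its parent terms. The sketch below walks such a mapping to collect every ancestor of a term. The graph is a toy built around the example in the text: the term labels beyond the foot and lower limbs pair are illustrative assumptions, not entries taken from the real ontology.

```python
# Toy is-a graph: child term -> list of parent terms.
# Labels are illustrative; this is not an extract of the real HPO.
IS_A = {
    "Abnormality of the foot": ["Abnormality of the lower limbs"],
    "Abnormality of the lower limbs": ["Abnormality of the limbs"],
    "Abnormality of the limbs": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term, is_a):
    """Return the set of all terms reachable from `term` via is-a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            # A term may have several parents, so keep walking each branch.
            stack.extend(is_a.get(parent, []))
    return seen


foot_ancestors = ancestors("Abnormality of the foot", IS_A)
print(sorted(foot_ancestors))
```

Because a participant annotated with a specific term implicitly has all of its ancestors, a traversal like this is one way to match participants against a broader term.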

The presence or absence of particular HPO terms in each rare disease participant is given in the HPO dataset, along with modifiers that give further specifics on how that phenotype is manifested in that individual. These modifiers are the laterality, age of onset, progression, severity and spatial pattern. Clicking on the field name in LabKey and selecting Filter will give you an idea of the values present for each of these modifiers.
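Since a term can be recorded as either present or absent, it is usually worth filtering on the presence flag before counting phenotypes. The following is a minimal sketch, assuming hypothetical column names (participant_id, hpo_term, hpo_present, severity) and "Yes"/"No" values; the real field names and coding should be checked in LabKey first.

```python
import csv
import io

# Toy stand-in for the phenotype table.
# Column names and the "Yes"/"No" coding are assumptions for illustration.
TOY_TSV = (
    "participant_id\thpo_term\thpo_present\tseverity\n"
    "P1\tSeizures\tYes\tSevere\n"
    "P1\tMicrocephaly\tNo\t\n"
    "P2\tSeizures\tYes\tMild\n"
)


def present_terms(rows):
    """Return {participant_id: sorted list of HPO terms recorded as present}."""
    out = {}
    for row in rows:
        if row["hpo_present"] == "Yes":
            out.setdefault(row["participant_id"], []).append(row["hpo_term"])
    return {pid: sorted(terms) for pid, terms in out.items()}


def participants_with(rows, term):
    """Return sorted IDs of participants for whom `term` is recorded as present."""
    return sorted(
        {
            row["participant_id"]
            for row in rows
            if row["hpo_term"] == term and row["hpo_present"] == "Yes"
        }
    )


phenotype_rows = list(csv.DictReader(io.StringIO(TOY_TSV), delimiter="\t"))
present = present_terms(phenotype_rows)
seizure_participants = participants_with(phenotype_rows, "Seizures")
print(present)
print(seizure_participants)
```

Note that a term recorded as absent (Microcephaly for P1 in the toy data) is deliberately excluded rather than treated as missing.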

Genomics England has developed what we call 'disease data models', a list of HPO terms that define each disease, to aid the process of phenotyping at the GMC end. The list of rare disease data models can be found in the Rare Disease Data Models pdf.

In addition, for each participant the GMC can also enter additional HPO terms that are not necessarily listed in the clinical data model, if they think it is useful to drive the selection of additional panels. The above will typically comprise the list of HPO terms that have been specifically assessed for each participant.

Rare disease families

Rare Disease data are presented at the level of rare disease families (families of probands), rare disease pedigrees and participants:

  • Participants are individuals who have consented to be a part of the project with the expectation that a sample of their DNA will be obtained and their genome sequenced. Participants can be probands or relatives.
  • Probands are the affected individuals who initiated their family's participation in the programme, and for whom most of the analyses are done.
  • Relatives are other participants that may or may not be affected.
  • Pedigree members are extended members of the proband's family, which will include some participants (relatives) as well as a number of other individuals who will have no contact with the project, have not consented, but for whom a small amount of data are recorded to allow a full picture of the proband's extended family to be gathered.

Rare diseases families tables

  • rare_diseases_family (gel_rare_disease_family_100k.tsv) describes the families of rare disease probands participating in the 100,000 Genomes Project, making family members participants of the Project. It includes the family group type, the status of the family's pre-interpretation clinical review and the settings that were chosen for the interpretation pipeline at the clinical review.
  • rare_diseases_pedigree (gel_rare_pedigree_100k.tsv) describes the Rare Disease participants, linking pedigrees to probands and their family members.
  • rare_diseases_pedigree_member (gel_rare_pedigree_member_100k.tsv) describes the Rare Disease pedigree members, similar to the data about each individual participant in the common data view. It includes some additional data, such as the age of onset of predominant clinical features, data on links to other family members, as well as data collected only for phenotypes.
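The proband and relative roles described above lend themselves to a simple per-family grouping. The sketch below shows one way to do this, assuming hypothetical field names (participant_id, family_id, participant_type) and toy values; the actual tables may name and code these fields differently.

```python
from collections import defaultdict

# Toy records; field names and values are assumptions for illustration.
PARTICIPANTS = [
    {"participant_id": "P1", "family_id": "F1", "participant_type": "Proband"},
    {"participant_id": "P2", "family_id": "F1", "participant_type": "Relative"},
    {"participant_id": "P3", "family_id": "F1", "participant_type": "Relative"},
    {"participant_id": "P4", "family_id": "F2", "participant_type": "Proband"},
]


def group_families(records):
    """Group participants by family, separating the proband from relatives."""
    families = defaultdict(lambda: {"proband": None, "relatives": []})
    for rec in records:
        family = families[rec["family_id"]]
        if rec["participant_type"] == "Proband":
            family["proband"] = rec["participant_id"]
        else:
            family["relatives"].append(rec["participant_id"])
    return dict(families)


families = group_families(PARTICIPANTS)
for family_id, members in sorted(families.items()):
    print(family_id, members)
```

Pedigree members who are not participants would need a separate flag; they are left out of this toy example.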

Rare disease bioinformatic analyses

These tables contain data from and information about Genomics England interpretation pipelines.

  • denovo_cohort_information (gel_denovo_cohort_information_100k.tsv) contains cohort information for all participants included in the de novo variant (DNV) research dataset. Attributes within this table include: participant ID, sex, affection status, family ID, pedigree ID, and the path to each family's multi-sample VCF with flagged DNVs.
  • denovo_flagged_variants (gel_denovo_flagged_variants_100k.tsv) contains all variants that pass base_filter for all trios within the DNV dataset. The table does not include variants that fail the base_filter due to size restrictions, but these can be found in the annotated multi-sample VCFs. This table includes all flags from the DNV annotation pipeline for each variant.
  • exomiser (gel_exomiser_100k.tsv) contains, for each participant of the Rare Disease programme, the results of the Exomiser variant prioritisation framework.
  • gmc_exit_questionnaire (gel_genomic_medical_centre_exit_questionnaire_100k.tsv) contains, for each family with a closed case, information extracted from the GMC exit questionnaire. In the questionnaire, the Genomic Medicine Centres report back on the variants reported to them by Genomics England: to what extent a family's presenting case can be explained by the combined variants reported to them (including any segregation testing performed); their confidence in the identification and pathogenicity of each variant; and the clinical validity of each variant or variant pair in general and its clinical utility in the specific case. Only the most recent update is shown, and only one questionnaire per report.
  • tiered_variants_frequency (gel_tiered_variants_frequency_100k.tsv) contains, for each pathogenic variant found in the Genomics England rare disease database, information on the variant consequence, as well as annotation results from gnomAD and 1000 Genomes.
  • tiered_data (mild.tsv) describes, for each rare disease participant of the 100,000 Genomes Project who has been through the Genomics England interpretation pipeline and each tiered variant found for each of these participants, the consequences of the variant and some other genetic information. More information on tiering here.

Last update: November 20, 2023