100kGP Release V3 (30/04/2018)¶
Quick View tables¶
Data views that bring together data from several LabKey tables for convenient access:
Name of Table / Data View | Description |
---|---|
family_members | Data for each participant with available genomes describing which family members also have available genomes and where to find the latest, quality-controlled version |
rare_disease_analysis | Data for all rare disease participants including sex, ethnicity, disease recruited for, build of latest genome, QC status of latest genome, path to latest genomes and whether tiering data are available |
cancer_analysis | Data for all Cancer participants including sex, ethnicity, disease recruited for, build of latest genome, QC status of latest genome and path to latest genomes |
Family members¶
For each Participant ID:
- Participant type (proband/relative), biological relationship to the proband, phenotypic sex, Rare Diseases family ID
- Intake status (QC failed/passed, awaiting delivery, etc.)
- Genome availability, genome build (reference genomes version 37/38), file path to genome on storage and latest genome delivery (Y/N)
Rare Disease Analysis¶
For each Participant ID:
- Participant type (proband/relative), phenotypic sex, ethnic category, Rare Diseases family ID
- Normalised specific disease, tiering availability for proband (Y/N)
- Genome availability, genome build (reference genomes version 37/38), file path to genome on storage and latest genome delivery (Y/N)
Cancer Analysis¶
For each Participant ID:
- Phenotypic sex, cancer disease type and subtype, ethnic category
- Sample type, intake status (QC failed/passed, awaiting delivery, etc.)
- Genome availability, genome build (reference genomes version 37/38), file path to genome on storage and latest genome delivery (Y/N)
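
The genome fields shared by the three quick views (build, QC / intake status, latest delivery, file path) lend themselves to a simple filter. A minimal sketch, assuming hypothetical column names (`participant_id`, `genome_build`, `qc_status`, `latest_delivery`, `path`) and made-up values; the actual LabKey column names and value encodings may differ:

```python
# Hypothetical rows shaped like a quick-view table. Column names, values and
# paths are assumptions for illustration, not the actual LabKey schema.
rows = [
    {"participant_id": "P1", "genome_build": "37", "qc_status": "passed",
     "latest_delivery": "N", "path": "/example/P1/v1"},
    {"participant_id": "P1", "genome_build": "38", "qc_status": "passed",
     "latest_delivery": "Y", "path": "/example/P1/v2"},
    {"participant_id": "P2", "genome_build": "38", "qc_status": "failed",
     "latest_delivery": "Y", "path": "/example/P2/v1"},
    {"participant_id": "P3", "genome_build": "38", "qc_status": "passed",
     "latest_delivery": "N", "path": "/example/P3/v1"},
]


def usable_genome_paths(rows, build):
    """Return {participant_id: path} for QC-passed, latest-delivery genomes on one build."""
    return {
        r["participant_id"]: r["path"]
        for r in rows
        if r["genome_build"] == build
        and r["qc_status"] == "passed"
        and r["latest_delivery"] == "Y"
    }


print(usable_genome_paths(rows, "38"))
```

Only the second row satisfies all three conditions in this toy data, so a single path is returned for build 38.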
Common tables¶
Data views that are common to both the rare disease and the cancer domains. These data pertain to sample handling, genome sequencing, participant data and domain assignment.
Data Relating to Participants:
Name of Table / Data View | Description |
---|---|
participant | Data on each individual participant in the 100,000 Genomes Project: * personal information (such as relatives or ethnicity) * points of contact with the Project (e.g. handling Genomic Medicine Centre or Trust) * a record of the status of their clinical review |
sequencing_report | For each participant in the 100,000 Genomes Project: * data describing the sequencing of their genome(s) and associated output, as well as the sample type that the sequence is from |
domain_assignment | For each participant in the 100,000 Genomes Project: * data describing the disease type to which they were recruited and the GECIP domain to which their genome has been assigned for the purposes of administering the publication moratorium |
tiering | Data describing the variants identified and their pathogenicity for each participant who has been through the Genomics England interpretation pipeline |
genome_file_paths_and_types | Data that specifies the genomic files and their folder locations for a given participant. |
Participant¶
Participant ID | Genomics England case reference (identifier for a Genomics England case, which is a family in the Rare Diseases programme and a participant in the Cancer programme) |
Variations on Genomics England Case Programme ID | * which programme the proband was recruited for – 1 for Cancer or 2 for Rare Diseases * programme consent status (“false” if consent fully withdrawn or, in the case of primary findings feedback consent, if fully OR partially withdrawn) |
Handling GMC (or its Local Delivery Partner) and GMC Trust by name and code | * the registering GMC and Local Delivery Partner * the registration date (the date of reported clinical event or observation, e.g. biopsy taken) |
Names and versions of used information sheets, consent and withdrawal form(s) | * consent given (yes/no) * date of consent / withdrawal of consent (same as registration date, i.e. also refers to date of clinical event / observation) * whether or not participant wants additional reproductive or health-related findings * full / partial withdrawal (Withdrawal Option) |
Stated gender (at time of registration) | phenotypic sex (at birth), karyotypic sex; ethnicity (as specified by participant using 16+1 categories as per 2001 census); year of birth |
Mother’s and father’s ethnic category | * in supplied enumerations or “other” * other ancestry relevant to clinical findings (e.g. Ashkenazi) * consanguinity |
Type of participant | * proband or participating relative * family identifier assigned to the proband and their relatives (unique to this duo / trio within each GMC and links related participants) * participant’s biological relationship to the proband (e.g. full sib / father / mother / monozygous twin, etc.) * the total number of brothers and sisters the participant has |
Penetrance | * indication that the disease is not fully penetrant in this family * mother / father / number of brothers / sisters affected by same condition as proband |
Participant’s medical review date and QC status | * “passed and suitable / not suitable for diagnostic interpretation” * “query to Genomics England / GMC” or “no QC” |
Genome File Paths and Types¶
For every participant (marked by Participant ID):
- All the different file types and subtypes received from Illumina (various types of VCF, CSVs, report PDF, genotyping report, BAM, various index files – e.g. BAM index, genotyping VCF index), with respective:
- Path and full path to the file and the name of the file
- Delivery date, delivery ID and platekey (see above definitions)
Sequencing Report¶
- Participant is marked by Participant ID and has corresponding delivery ID. For each one, BAM file size (BAM size) and creation date (BAM date); date the genome was delivered (Delivery Date); version of the Illumina pipeline used to process and analyse sequence data, i.e. v2 or v4 (Delivery Version); and the genome build (reference genome version 37 / 38)
- Also a sequence number, QC / intake status (passed / upload failed)
- Also sample type (CA tumour, CA germline, Rare Diseases) and platekey (Plate ID x Well ID – see those entries)
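
Because a participant can have more than one delivery, a common first step is to keep only the most recent delivery per participant and sample type. A minimal sketch, assuming hypothetical field names (`participant_id`, `sample_type`, `delivery_id`, `delivery_date`) and invented values; it is not the logic used to build the quick views:

```python
from datetime import date

# Hypothetical sequencing_report rows; field names and values are assumptions.
reports = [
    {"participant_id": "P1", "sample_type": "Rare Diseases",
     "delivery_id": "D1", "delivery_date": date(2017, 3, 1)},
    {"participant_id": "P1", "sample_type": "Rare Diseases",
     "delivery_id": "D2", "delivery_date": date(2018, 1, 15)},
    {"participant_id": "P2", "sample_type": "CA germline",
     "delivery_id": "D3", "delivery_date": date(2017, 11, 2)},
    {"participant_id": "P2", "sample_type": "CA tumour",
     "delivery_id": "D4", "delivery_date": date(2017, 11, 9)},
]


def latest_delivery(reports):
    """Keep the row with the most recent delivery date per (participant, sample type)."""
    latest = {}
    for r in reports:
        key = (r["participant_id"], r["sample_type"])
        if key not in latest or r["delivery_date"] > latest[key]["delivery_date"]:
            latest[key] = r
    return latest


latest = latest_delivery(reports)
for key, row in sorted(latest.items()):
    print(key, row["delivery_id"])
```

Keying on sample type as well as participant keeps tumour and germline deliveries for the same cancer participant separate.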
Domain Assignment¶
- Participant is marked by Participant ID and has corresponding disease type to which they were recruited (Recruited Disease Type); the GECIP domain to which the genome has been assigned (Domain); whether the participant’s data is, at dataset issue, under GECIP moratorium; and the date at which the moratorium ends
- Also has Pedigree Family ID, the Genomics England family identifier assigned to a proband and their relatives
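
The moratorium end date can be compared against any date of interest to see which genomes are still restricted. A minimal sketch, assuming hypothetical field names (`domain`, `moratorium_end`), invented domain labels, and that the moratorium still applies on the end date itself; the table's own moratorium flag refers to the dataset issue date and should be preferred where it applies:

```python
from collections import Counter
from datetime import date

# Hypothetical domain_assignment rows; names, domains and dates are assumptions.
assignments = [
    {"participant_id": "P1", "domain": "Domain A", "moratorium_end": date(2018, 12, 31)},
    {"participant_id": "P2", "domain": "Domain A", "moratorium_end": date(2018, 3, 1)},
    {"participant_id": "P3", "domain": "Domain B", "moratorium_end": date(2019, 6, 30)},
]


def under_moratorium(row, on_date):
    """True while on_date is on or before the end date (inclusive end is an assumption)."""
    return on_date <= row["moratorium_end"]


def moratorium_counts(rows, on_date):
    """Count participants per domain whose data are still under moratorium on on_date."""
    return dict(Counter(r["domain"] for r in rows if under_moratorium(r, on_date)))


counts = moratorium_counts(assignments, date(2018, 4, 30))
print(counts)
```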
Tiering¶
For every participant (marked by Participant ID):
- Biological relationship to the proband, phenotype to report variants for; proband’s father / mother affected (Y/N) and number of full brothers / sisters affected; penetrance
- Panel name (of the gene panel used from Panel App) and panel version. Several different panels can apply to one participant (e.g. 9 for Intellectual Disability, including the Intellectual Disability panel itself)
=> Several different rows, one per panel, but the participant-linked values (above) are the same in each row
- For each variant / gene per panel (i.e. more rows): reference allele sequence and alternate allele sequence (as per vcf), chromosome and position; genotype (zygosity as per vcf e.g. alternate homozygous, heterozygous), genomic feature type (gene, transcript, regulatory region), type of consequence (e.g. missense variation, splice region variant, stop gained…) and mode of inheritance (e.g. biallelic or monoallelic paternally / maternally imprinted or x-linked bi-/monoallelic)
- Tier (1/2/3) and event justification (why the variant would be reported, e.g. Tier + effect on protein + publications + de novo / simple recessive segregation filter…)
- Also the ensembl ID of the genomic features (transcript used), the variant ID in dbSNP (per variant, if applicable), other gene IDs (HGNC), a report event ID (unique for each report event, i.e. per variant, within each report) and group of variants (for each group of variants that together could explain phenotype using inheritance mode)
Each participant is also linked back to Family ID and Sample ID
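
Since the table holds one row per variant per panel, the same variant can appear several times for one participant. A minimal sketch of counting distinct variants per participant and tier, assuming hypothetical field names (`chromosome`, `position`, `reference`, `alternate`, `tier`) and invented values:

```python
from collections import defaultdict

# Hypothetical tiering rows; field names and values are assumptions.
tiering_rows = [
    {"participant_id": "P1", "panel_name": "Panel A", "chromosome": "1",
     "position": 1000, "reference": "A", "alternate": "G", "tier": 1},
    {"participant_id": "P1", "panel_name": "Panel B", "chromosome": "1",
     "position": 1000, "reference": "A", "alternate": "G", "tier": 1},
    {"participant_id": "P1", "panel_name": "Panel A", "chromosome": "2",
     "position": 500, "reference": "C", "alternate": "T", "tier": 3},
    {"participant_id": "P2", "panel_name": "Panel A", "chromosome": "X",
     "position": 42, "reference": "G", "alternate": "GA", "tier": 2},
]


def variants_per_tier(rows):
    """Count distinct (chrom, pos, ref, alt) variants per (participant, tier)."""
    seen = defaultdict(set)
    for r in rows:
        variant = (r["chromosome"], r["position"], r["reference"], r["alternate"])
        seen[(r["participant_id"], r["tier"])].add(variant)
    return {key: len(variants) for key, variants in seen.items()}


tier_counts = variants_per_tier(tiering_rows)
print(tier_counts)
```

The variant reported under both Panel A and Panel B for P1 is counted once. A variant assigned different tiers under different panels would be counted once under each tier.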
Data relating to samples:
Name of table / data view | Description |
---|---|
clinic_sample | Data describing the taking and handling of participant samples at the Genomic Medicine Centres, i.e. in the clinic, as well as the type of samples obtained. Because of the complexities of handling and managing tumour tissues samples in a clinical setting, there are many fields that are cancer-specific. |
clinic_sample_quality_check_result | Data describing the quality control of obtaining and handling participant samples at the Genomic Medicine Centres, i.e. in the clinic. |
laboratory_sample | Data describing the handling of samples at the biorepository and in preparation for sequencing, as well as the type of sample. |
Clinic sample¶
Each sample is marked by the Clinic Sample ID used for sample collection at the GMC clinic. The ID can differ between clinics, but within each clinic it identifies samples and must be unique. Each sample is linked to a Participant ID.
For each sample:
clinic ID (ODS code) to mark clinic that took it | * codes for the GMC (and its Local Delivery Partner) / GMC Trust * date and time for sample taken * reason no sample was sent (sample not taken; tumour ineligible; cellularity < 40% neoplastic cells; insufficient tumour post neoadjuvant; insufficient DNA; no cancer diagnosis; FFPE not well fixed / processed; necrosis > 20%; other). For blood, the only reasons for not sending a sample are that no tumour was sent or that the germline was successfully sequenced previously, as blood should go together with tumour |
Tumour ID | * unique to each tumour even from same patient * tumour content |
Cancer type and subtype | * pre-invasive in situ elements (if present) * any other previous cancer treatment * diagnostic codes for the cancer in structured clinical vocabulary – ICD-O-3 or Snomed CT (clinical terms) / RT (precursor) |
Size and type of tumour sampled for sequencing (primary, recurrent, metastatic recurrent at same site, metastatic) | * tissue source of sample (e.g. different types of biopsy, resection, bone marrow aspirate) * topographical tumour site (ICD or Snomed) |
Excision margin | * by how far margin was clear of tumour, or unknown * macro-dissection for enrichment (yes / no) and comments |
Descriptors of the sample, e.g. fixation date / time start and end (with comments) | * number of biopsies / cores / blocks / scrolls / sections * gauge of biopsy / core diameter / scroll or section thickness * snap freezing time and date / time in formalin / fixing time (processing schedule – e.g. overnight, rapid run, etc.) * type of fixative (for pre-processing) and prolonged sample storage method; tumour sample type for FFPE that DNA was extracted from |
Whether Genomics England QCs passed | * the GMC (or its LDP) that did the QC, the laboratory used for sample processing (as with the Laboratory Sample table) * date and time when clinical QC test results were received by Genomics England |
Clinic Sample Quality Check Result¶
For each clinical sample (also linked to Participant ID):
- Type of QC test and corresponding value of test result (can be several – e.g. tumour content – medium; cellularity – medium; % necrosis – 0; Qubit – 30; Nanodrop OD 260/280 – 1.9; summary QC – pass)
- Can go to laboratory sample table to check for product type or QCs, then to QC table
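
With one row per sample per QC test, it is often easier to work with one record per sample. A minimal sketch of that reshaping, assuming hypothetical field names (`clinic_sample_id`, `test_type`, `result`) and invented values in the style of the examples above:

```python
from collections import defaultdict

# Hypothetical clinic_sample_quality_check_result rows; names are assumptions.
qc_rows = [
    {"clinic_sample_id": "S1", "test_type": "tumour content", "result": "medium"},
    {"clinic_sample_id": "S1", "test_type": "% necrosis", "result": "0"},
    {"clinic_sample_id": "S1", "test_type": "summary QC", "result": "pass"},
    {"clinic_sample_id": "S2", "test_type": "summary QC", "result": "fail"},
]


def pivot_qc(rows):
    """Turn long-format QC rows into {sample_id: {test_type: result}}."""
    wide = defaultdict(dict)
    for r in rows:
        # A repeated test type for the same sample overwrites the earlier value.
        wide[r["clinic_sample_id"]][r["test_type"]] = r["result"]
    return dict(wide)


qc_by_sample = pivot_qc(qc_rows)
passed = [s for s, tests in qc_by_sample.items() if tests.get("summary QC") == "pass"]
print(passed)
```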
Laboratory Sample¶
Data describing the handling of samples at the biorepository and in preparation for sequencing, as well as the type of sample.
For each sample:
Unique lab sample identifier (2D barcode on tube for dispatch from the lab to the biorepository) | * laboratory ID (could be any laboratory used for any type or stage of sample processing; usually a GMC laboratory, but could be a blood extraction facility for QC data) * GMC (or its Local Delivery Partner) and GMC Trust associated with the sample * Participant ID * Genomics England plate ID (sent from biorepository to Illumina) and plate well ID (well identifier / location of sample on the plate distributed from biorepository for sequencing to Illumina) * plate key (plate ID x well ID = unique ID for a processed well) * GMC rack ID and rack well (barcode on the rack containing the sample dispatched from the GMC, and position on the rack, A-H x 1-12) * consignment dispatch date (when sample was dispatched to biorepository) and whether and when sample was received at biorepository |
Sample source | * for bioinformatics interpretation, cancer sample type – blood, saliva, tissue, tumour, bone marrow, fibroblast * whether sample is tumour, germline or –omics; sample type / type description (specific enumerations, e.g. FF or FFPE for tumour, RNA blood for –omics, or constitutional DNA / DNA blood germline for germline) * whether sample is from DNA or –omics source (“is DNA sample type”) * the sample product (blood, DNA, plasma, serum, ctDNA plasma) and sample preparation method (EDTA, FF, FFPE, LIHEP, STRECK, PAXGENE, etc., i.e. Genomics England protocols used for sample handling and processing) |
Laboratory method | Genomics England protocol used for sample handling / processing |
Laboratory sample volume (in the laboratory sample tubes as dispatched) | volume of sample leaving biorepository to sequencer after processing |
DNA amount (sample yield / concentration) | * biorepository status of DNA quantity (green / red) * biorepository DNA concentration (final concentration sent to Illumina post biorepository processing) * biorepository degradation of DNA * sample quality (% of DNA that is >23kb length) and DNA Integrity value * DNA extraction protocol (for tumour sample FFPE, must distinguish between different ones) |
Sex of the participant associated with the sample | * sample concentration * QC status (passed / not passed sequencer’s criteria) * concentration and purity of FFPE samples (delta QC) * all as provided by Illumina |
Delta QC biorepository | * biorepository QC status (overall result from QC testing – pass / fail) * biorepository purity of sample measure (OD 260) * all as provided by biorepository |
Genomics England dates and times | for receiving of QC test results file from biorepository, Sample Metadata File from GMC and test results file from Illumina |
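
The plate key (plate ID x well ID) and the rack well range (A-H x 1-12) described above can be illustrated with a small helper. This is a sketch only: the underscore separator, the zero-padded column and the example plate ID are assumptions, not the documented platekey format.

```python
import re

# Rows A-H and columns 1-12, as described for rack wells above.
WELL_PATTERN = re.compile(r"^([A-H])(\d{1,2})$")


def parse_well(well_id):
    """Split a well such as 'A1' or 'H12' into (row letter, column number)."""
    match = WELL_PATTERN.match(well_id.strip().upper())
    if match is None:
        raise ValueError(f"not a valid well: {well_id!r}")
    row, column = match.group(1), int(match.group(2))
    if not 1 <= column <= 12:
        raise ValueError(f"column out of range 1-12: {well_id!r}")
    return row, column


def make_platekey(plate_id, well_id, sep="_"):
    """Combine plate ID and well ID into one key (separator and padding are assumptions)."""
    row, column = parse_well(well_id)
    return f"{plate_id}{sep}{row}{column:02d}"


print(make_platekey("PLATE0001", "a1"))
```

Normalising the well (upper case, zero-padded column) before joining avoids treating `A1` and `A01` as different wells on the same plate.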
Rare Disease tables¶
Rare Disease data are presented at the level of Rare Disease families (families of probands), Rare Disease pedigrees and participants. Participants are individuals who have consented to be a part of the project with the expectation that a sample of their DNA will be obtained and their genome sequenced. Pedigree members are extended members of the proband’s family, which will include some participants as well as a number of other individuals who will have no contact with the project, have not consented, but for whom a small amount of data are recorded to allow a full picture of the proband’s extended family to be gathered.
All Rare Disease table names begin with the prefix “rare_diseases_”.
Data at the Level of Rare Disease Families:
Name of Table / Data View | Description |
---|---|
rare_diseases_family | Data describing the families of rare disease probands participating in the 100,000 Genomes Project. It includes the family group type, the status of the family’s pre-interpretation clinical review and the settings that were chosen for the interpretation pipeline at the clinical review. |
Rare Diseases Family¶
Each family tagged by locally-allocated identifier assigned to a proband and their relatives. Should be unique to this duo / trio within a GMC and is the mechanism for linking patients.
For each family:
The family group type | * trio / duo with mother and / or father * singleton |
Family clinical review QC status codes and description | * whether family has a) passed medical review and then b) deemed suitable for diagnostic interpretation * GMC (or its LDP) at which the medical review happened and date and time of the participant medical review |
Monogenic likelihood | outdated setting because, as per eligibility criteria, all families should likely have a monogenic disease basis |
Non-penetrance suspected | condition is known to be non-penetrant or pedigree shows potential non-penetrance (if yes, bioinformatics pipeline runs in incomplete penetrance mode rather than complete penetrance mode) – only after this is signalled at medical review |
Data at the Level of Rare Diseases Pedigrees:
Name of Table / Data View | Description |
---|---|
rare_diseases_pedigree | Data describing the Rare Disease participants, linking pedigrees to probands and their family members. |
rare_diseases_pedigree_member | Data describing the Rare Disease pedigree members, similar to the data about each individual participant in the COMMON data view. It includes some additional data, such as the age of onset of predominant clinical features; data on links to other family members; as well as data collected only for Phenotips. |
Rare Diseases Pedigree¶
- Pedigree-Family ID (Genomics England family identifier assigned to the proband and their relatives; this is the Proband Participant ID) and Pedigree Proband Participant ID (Genomics England participant identifier)
- As for Common Participant table data for: consanguinity (population / relationship), ethnicity, karyotypic / phenotypic sex, biological relationship to proband, year of birth, ancestries
Rare Diseases Pedigree Member¶
For each Rare Diseases pedigree member (marked by Pedigree Member ID, enabling Genomics England identification of family relationships between pedigree members):
- Member ID of mother and father (to reconstruct pedigree); adopted status (into or out of family) and whether contact lost with family; alive / deceased / aborted / stillborn / miscarriage status; whether member is the proband (true / false); affection status (affected / unaffected / uncertain); also link to member ID (to identify family relationship between pedigree members) and Genomics England Super Family ID for the family
- Age of onset of predominant features (or neonatal), age at / date of death
- Which twin group the member belongs to (if applicable), whether monozygotic (if twin)
- Data collected for Phenotips: Reason for childlessness, gestational age, hereditary status of individual without children, whether family has been clinically evaluated, node number
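
The mother and father member IDs are enough to rebuild parent-child links and to find full siblings. A minimal sketch, assuming hypothetical field names (`member_id`, `mother_id`, `father_id`, `is_proband`) and an invented five-member pedigree:

```python
from collections import defaultdict

# Hypothetical pedigree_member rows; field names and IDs are assumptions.
members = [
    {"member_id": "M1", "mother_id": None, "father_id": None, "is_proband": False},
    {"member_id": "M2", "mother_id": None, "father_id": None, "is_proband": False},
    {"member_id": "M3", "mother_id": "M1", "father_id": "M2", "is_proband": True},
    {"member_id": "M4", "mother_id": "M1", "father_id": "M2", "is_proband": False},
    {"member_id": "M5", "mother_id": "M1", "father_id": None, "is_proband": False},
]


def children_by_parent(members):
    """Map each parent member ID to the list of their children's member IDs."""
    children = defaultdict(list)
    for m in members:
        for parent in (m["mother_id"], m["father_id"]):
            if parent is not None:
                children[parent].append(m["member_id"])
    return dict(children)


def full_siblings(members, member_id):
    """Members sharing both recorded parents with member_id (empty if a parent is unknown)."""
    by_id = {m["member_id"]: m for m in members}
    me = by_id[member_id]
    if me["mother_id"] is None or me["father_id"] is None:
        return []
    return [
        m["member_id"]
        for m in members
        if m["member_id"] != member_id
        and m["mother_id"] == me["mother_id"]
        and m["father_id"] == me["father_id"]
    ]


print(children_by_parent(members))
print(full_siblings(members, "M3"))
```

M5 shares only a recorded mother with the proband M3, so it is listed as a child of M1 but not as a full sibling.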
Data at the Level of Rare Disease Participants:
Name of Table / Data View | Description |
---|---|
rare_diseases_participant_disease | Data describing the Rare Disease participants’ rare diseases. This is as for pedigree_member_diseases_level_data, with the addition of a date of diagnosis |
rare_diseases_participant_phenotype | Data describing the Rare Disease participants’ phenotypes. For each Rare Disease participant in the 100,000 Genomes Project, there are data about whether a phenotypic abnormality as defined by an HPO term is present and what the HPO term is, as well as the age of onset, the severity of manifestation, the spatial pattern in the body and whether it is progressive or not. |
Rare Diseases Participant Disease¶
For each Participant ID:
- Disease group and sub-group (for the Genomics England Rare Diseases list); and normalised terms for those
- Age of onset
- Diagnosis date
Rare Diseases Participant Phenotype¶
For each Participant ID:
- Whether an HPO-defined phenotypic abnormality is present and its HPO term (and the HPO version number)
- Age of onset, progressive / non-progressive status and severity of manifestation (ranging from “borderline” to “profound” in five steps)
- Spatial pattern in body and laterality
Cancer tables¶
Cancer data are presented at the level of either the patient’s cancer diagnosis (“disease type”) or the tumour-specific sample details of participants in the Cancer arm of the 100,000 Genomes Project. All Cancer table names begin with the prefix “cancer_participant_”.
Data Relating to Cancer Participants:
Name of Table / Data View | Description |
---|---|
cancer_participant_disease | For each cancer participant in the 100,000 Genomes Project: * this table includes data about their cancer disease type and subtype. |
cancer_participant_tumour | For each cancer participant’s tumour in the 100,000 Genomes Project: * this table contains data that characterises the tumour, e.g. staging and grading * morphology and location * recurrence at time of enrolment * the basis of diagnosis. |
cancer_participant_tumour_metastatic_site | For each cancer participant in the 100,000 Genomes Project: * this table contains the site of their metastatic disease in the body (if applicable) at diagnosis. |
Cancer Participant Disease¶
For each Cancer participant (marked by Participant ID):
- Cancer disease type (breast, sarcoma, etc.) and subtype (e.g. ductal / lobular breast or Myxofibrosarcoma)
Cancer Participant Tumour¶
For every tumour:
Tumour ID (as in other tables) and Participant ID | |
Diagnosis ICD codes | * ICD or Snomed CT / RT codes for the morphology of the diagnosed cancer and the topographical site of the tumour (and Snomed version used) * number of lesions radiologically determined (liver / upper GI); tumour laterality (for cancer relating to paired organs) * portal invasion (whether portal vein affected) |
Basis of diagnosis | e.g. certain clinical / histology tests – see NHS Dictionary * grading (I-IV) and staging (TNM and integrated TNM version, or separately if no integrated TNM supplied) * staging for particular cancers: Dukes’ stage for colon cancer; AJCC for skin cancer; extranodal metastases (notes liver / brain / neck / lung involvement for testicular cancer, plus sub-stages for lung metastases); testicular cancer stage anatomical groupings; pancreatic cancer clinical staging to indicate resectability; International Neuroblastoma Risk Group; BCLC stage (based on anatomic and non-anatomic factors); Child-Pugh score for liver; FIGO; alongside classification versions for these * recurrence indicator (whether recurrence has been recorded that requires new care plan) * whether TACE was previously carried out |
Cancer Participant Tumour Metastatic Site¶
For each Cancer participant (marked by Participant ID) and tumour (marked by cancer participant tumour SK):
- The site of the metastatic disease, if any, at diagnosis, including “multiple” and “unknown”