Where and how to access de novo data
The DNV dataset is presented in two formats: two LabKey tables, and annotated multi-sample VCFs per family on the file system.
General use of DNV dataset
We recommend using DNVs that pass the
stringent_filter for general analysis, as these are more likely to represent true DNVs.
There are two LabKey tables containing the DNV dataset:
denovo_cohort_information and denovo_flagged_variants. These can be accessed from the LabKey Desktop application and are found within the 100kGP folder for release 9 onwards. The full column schema and column definitions can be found in the 100kGP Data Dictionary.
denovo_cohort_information LabKey table
This table contains the cohort meta-data for every participant within the DNV dataset. Each row comprises a unique participant per genome assembly (there are a very small number of families that exist on both the GRCh37 and GRCh38 cohorts). The columns detail the information used to perform Rare Disease Interpretation by Genomics England such as the:
affection_status (relative to the recruited disease of the proband), and
The trio_id column is a unique identifier for each trio within the DNV dataset. It is composed of three parts and follows the format:
familyid_interpretationid.trionumber. For example:
familyid is the unique identifier per family
interpretationid is the unique identifier for each Rare Disease Interpretation run
trionumber is a numeric identifier for each trio within a family. For families consisting of a single trio, this will always be .1; for multiplex families with more than one trio, this will be .1, .2, .3, etc.
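The three-part format above can be split programmatically. The following is a minimal sketch, not an official parser: it assumes that familyid contains no underscore and that trionumber is the digits after the final dot. The identifier used is made up for illustration.

```python
from typing import NamedTuple


class TrioId(NamedTuple):
    family_id: str
    interpretation_id: str
    trio_number: int


def parse_trio_id(trio_id: str) -> TrioId:
    """Split 'familyid_interpretationid.trionumber' into its three parts."""
    # The trio number follows the last '.', so split from the right.
    prefix, dot, trio_number = trio_id.rpartition(".")
    # Assumption: familyid has no underscore, so the first '_' is the separator.
    family_id, underscore, interpretation_id = prefix.partition("_")
    if not (dot and underscore and family_id and interpretation_id
            and trio_number.isdigit()):
        raise ValueError(f"unexpected trio_id format: {trio_id!r}")
    return TrioId(family_id, interpretation_id, int(trio_number))


# Made-up identifier, for illustration only.
example = parse_trio_id("FAM001_INT42.2")
```

A trio_number greater than 1 indicates that the family is multiplex, which matters later when interpreting the Bayes Factor.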
The additional columns in the
denovo_cohort_information LabKey table are defined as below:
||Flags whether the family contains nested trios. 0: the family is a simple trio; 1: the family contains nested trios (more than one trio).|
||The path to the annotated multi-sample VCF by family on the file-system. This VCF contains the flagged DNVs.|
||The total number of distinct variants that pass the base_filter (for offspring only - is blank for mother and father).|
||The total number of distinct variants that pass the stringent_filter (for offspring only - is blank for mother and father).|
Path to annotated multi-sample VCFs
The file-paths to the annotated multi-sample VCFs (in the column
vcf_path_flagged_denovo) are paths on the HPC environment. If you need to access these files on the Desktop environment, please add the
~ prefix, such as
denovo_flagged_variants LabKey table
This table comprises all putative DNVs that pass the
base_filter per trio. Each row comprises a unique variant per trio per assembly. The variant-level information is found in the columns:
assembly columns are included so that it is possible to join this table onto the
denovo_cohort_information table (which contains the cohort meta-data).
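As a rough illustration of such a join, the sketch below works on rows already exported from the two tables as lists of dictionaries. The key columns trio_id and assembly are an assumption made for this example, as are all of the sample values; check the 100kGP Data Dictionary for the actual shared columns.

```python
from collections import defaultdict

# Hypothetical join columns; confirm against the data dictionary.
JOIN_KEYS = ("trio_id", "assembly")


def join_variants_to_cohort(variants, cohort, keys=JOIN_KEYS):
    """Inner-join variant rows onto cohort rows that share the key columns."""
    cohort_by_key = defaultdict(list)
    for row in cohort:
        cohort_by_key[tuple(row[k] for k in keys)].append(row)
    joined = []
    for variant in variants:
        key = tuple(variant[k] for k in keys)
        for member in cohort_by_key.get(key, []):
            # Variant columns take precedence if a column name clashes.
            joined.append({**member, **variant})
    return joined


# Made-up rows, for illustration only.
cohort_rows = [
    {"trio_id": "FAM001_INT42.1", "assembly": "GRCh38", "member": "offspring"},
    {"trio_id": "FAM001_INT42.1", "assembly": "GRCh38", "member": "mother"},
    {"trio_id": "FAM002_INT43.1", "assembly": "GRCh37", "member": "offspring"},
]
variant_rows = [
    {"trio_id": "FAM001_INT42.1", "assembly": "GRCh38", "position": 12345},
]
joined_rows = join_variants_to_cohort(variant_rows, cohort_rows)
```

Because the cohort table holds one row per participant, a join on trio-level keys returns one row per variant and trio member; restrict the cohort rows first if only the offspring is of interest.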
DNVs in the denovo_flagged_variants LabKey table
Due to size restrictions, only variants that pass the
base_filter per trio are included in the LabKey table
denovo_flagged_variants. For all Mendelian inconsistencies per trio with flagged DNVs, please use the annotated multi-sample VCFs.
Updated stringent filters
Following data release version 9, the problematic genomic region flags of the de novo dataset were corrected to fix a bug in the annotations of the segmental duplication, simple repeat, and patch filters from UCSC. The fix was applied to the LabKey tables only. The flags in the LabKey tables
denovo_cohort_information are correct, but the issue remains in the annotated VCFs, which have not been updated.
All columns ending with the suffix
filter contain flags (coded as 0 or 1) indicating whether the variant within the trio has failed (0) or passed (1) the respective filter. Please see the above section for the list of filters and their pass criteria.
The genotypes for all members of the trio are included in columns
father_gt. The remaining columns contain the VCF FORMAT attributes (such as depth, genotype quality, genotype likelihood) from each sample within the trio. Again, the full column schema and column definitions can be found in the 100kGP data dictionary.
The denovo_flagged_variants LabKey table contains the column
bayes_factor. This is the output from the Platypus bayesiandenovofilter.py Python script. We decided not to include this metric as a filter criterion in the DNV annotation pipeline in this release. Users can use the Bayes Factor, although it is only available for non-duplicated variants from single (non-multiplex) trio families. For duplicated variants and variants from multiplex families, it is not possible to attribute the Bayes Factor (coded in the
INFO field) to the correct variant-sample combination.
Querying the LabKey tables
The two LabKey tables can be queried for families and DNVs of interest using either the LabKey graphical interface (desktop application) or the LabKey APIs. The Code Book section below shows examples of how to query the DNV dataset in LabKey. Note that the variants in the
denovo_flagged_variants table do not include genomic annotation by Ensembl VEP, as this would cause a large amount of row duplication (each variant has one annotation per transcript).
Flags in LabKey Table
For all filter columns within the LabKey table,
denovo_flagged_variants, a value of 0 indicates that the filter criteria has not been met (FAIL) and a value of 1 means that the filter criteria has been met (PASS).
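The 0/1 coding can be applied to exported rows as in the sketch below. It assumes each row is a dictionary keyed by column name and that the stringent filter column is named stringent_filter; the column example_region_filter is a made-up name used only to show the suffix matching.

```python
def failed_filters(row):
    """Return the names of filter columns whose flag is 0 (FAIL)."""
    return sorted(
        name for name, value in row.items()
        if name.endswith("filter") and int(value) == 0
    )


def passes(row, column="stringent_filter"):
    """Return True when the given filter column is flagged 1 (PASS)."""
    return int(row[column]) == 1


# Made-up row, for illustration only.
row = {
    "position": 12345,
    "base_filter": 1,
    "stringent_filter": 0,
    "example_region_filter": 0,  # hypothetical column name
}
failed = failed_filters(row)
keep = passes(row)
```

Listing the failed columns, rather than only testing stringent_filter, makes it easier to see why a putative DNV was excluded.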
Annotated multi-sample VCFs
The output of the DNV annotation pipeline is an annotated multi-sample VCF per
family containing all Mendelian inconsistencies with putative DNVs flagged in the FORMAT column and genomic annotation in the INFO column (Step 5 in the DNV pipeline).
The annotated multisample VCFs can be found in the filesystem separated by genome assembly under the folders:
The files themselves are named by family_id and interpretation_id with the suffix
.denovo.reheader.vcf.gz (for example:
These paths correspond to the file paths on the HPC environment. If you need to access these files on the Desktop environment, please add the
~ prefix, such as
Use the LabKey table
denovo_cohort_information to associate the annotated multi-sample VCF file-path (under the column
vcf_path_flagged_denovo) with a particular
Querying the annotated multi-sample VCFs
The annotated multi-sample VCFs by family (containing all Mendelian inconsistencies) have the FORMAT column populated with the results of the DNV annotation pipeline under the
DE_NOVO_FLAG attribute as shown below.
##FORMAT=<ID=DE_NOVO_FLAG,Number=.,Type=String,Description="Flag from the Genomics England De Novo Flagging Pipeline">
There is a certain order to how this field is populated:
- If a variant fails any
base_filters, it is marked as
base_fail without the annotation of the subsequent
- If the variant passes the
base_filter but fails a
stringent_filter, it is marked with a list of the
- If the variant passes all
stringent_filters, it is marked with
Note that only the offspring will have the flags populated in their
DE_NOVO_FLAG FORMAT fields. The mother and father are set to missing
. for this attribute.
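To make the layout concrete, the sketch below pulls the DE_NOVO_FLAG value for each sample out of a single tab-separated VCF record using plain string handling. The sample names and the record are synthetic, and for real files a dedicated tool such as bcftools is the better choice; this only illustrates where the flag sits.

```python
def de_novo_flags(header_line, record_line):
    """Map each sample name to its DE_NOVO_FLAG value (None when missing)."""
    # Sample names start at the tenth column of the #CHROM header line.
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    if "DE_NOVO_FLAG" not in format_keys:
        return {sample: None for sample in samples}
    index = format_keys.index("DE_NOVO_FLAG")
    flags = {}
    for sample, column in zip(samples, fields[9:]):
        values = column.split(":")
        # VCF allows trailing FORMAT values to be dropped, so guard the index.
        value = values[index] if index < len(values) else "."
        flags[sample] = None if value == "." else value
    return flags


# Synthetic header and record, for illustration only.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tproband\tmother\tfather"
record = "chr1\t12345\t.\tA\tG\t.\tPASS\t.\tGT:DE_NOVO_FLAG\t0/1:base_fail\t0/0:.\t0/0:."
flags = de_novo_flags(header, record)
```

In this synthetic record only the offspring carries a flag, matching the behaviour described above for the mother and father.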
Use the Code Book
Please use the Code Book section for example scripts on how to query the annotated multi-sample VCFs. We recommend using the command line tool, bcftools, to interrogate the annotated multi-sample VCFs.
||The variant fails a
||These flags indicate variants that pass the
For example if a variant fails just the
If a variant fails the
||The variant passes the
Annotated multi-sample VCF flagging
To create the annotated multi-sample VCFs (Step 5), all original FILTER field values from the Platypus joint-calling step (Step 3) are stripped and replaced with the flags from the DNV annotation pipeline.
We adopt an inclusive approach for researchers to analyse DNVs by
flagging likely DNVs and generally not
filtering any variants out. If necessary, you can make use of the additional attributes within the annotated multi-sample VCF to perform custom filtering. Please see the Code Book below for examples of how to do this. Below is a list of important attributes you can make use of:
|VCF Field||Attribute||Description||Used in filter|
||The FILTER column of the annotated multi-sample VCF has not been modified from the original Platypus VCF. As only PASS variants are included in the annotated multi-sample VCF from the Platypus bayesiandenovofilter.py script, the values in the FILTER field will always be set to 'PASS'.||Not used.|
||Number of reads covering variant location in this sample.||
||Number of reads containing variant in this sample.||
||Genotype log10-likelihoods for AA,AB and BB genotypes, where A = ref and B = variant.||Used to calculate Bayes Factor.|
||Genotype quality as phred score.||Not used in filters.|
||Goodness of fit value.||Not used in filters.|
||Bayes Factor for the de novo model calculated by the Platypus bayesiandenovofilter.py python script.||Not used in filters.|
||Flag of the
||Not used in filters.
1: not multidenovo
Adjusting for updated stringent filters
Previously there was a bug in the annotations of the segmental duplication, simple repeat, and patch filters from UCSC. As a result, we implemented a fix to these problematic region flags.
This fix has been applied to the LabKey tables only and is not reflected in the annotated VCFs. If you're working with the VCFs, we suggest two approaches:
- Only use the LabKey tables
denovo_cohort_information to filter for candidate DNVs. If you need additional genomic annotation such as VEP (which is in the annotated VCFs), then you can look up the variant(s) manually in the VCFs to pull out the annotation.
- Alternatively, use the annotated VCFs but be careful when filtering using the DE_NOVO_FLAG FORMAT field. For candidate DNVs, use bcftools to exclude any variants that fail the DE_NOVO_FLAG
base_filter as well as variants that contain the strings:
proximity in the DE_NOVO_FLAG field. The variants that remain will be candidate DNVs but will not be filtered for the problematic genomic regions flags: segmental duplication, simple repeat, and patch. From here the remaining variants can be manually cross-checked against the
denovo_flagged_variants LabKey table or intersected with a genomic regions BED file to check if they are in problematic regions. The corrected problematic region files used to update the LabKey tables can be found at the following location: