Where and how to access de novo data
The DNV dataset is presented in two formats: two LabKey tables, and annotated multi-sample VCFs (one per family) on the file system.
General use of DNV dataset
We recommend using DNVs that pass the `stringent_filter` for general analysis, as these are more likely to represent true DNVs.
LabKey tables
There are two LabKey tables containing the DNV dataset: `denovo_cohort_information` and `denovo_flagged_variants`. These can be accessed from the LabKey Desktop application and are found within the 100kGP folder for release 9 onwards. The full column schema and column definitions can be found in the 100kGP Data Dictionary.
denovo_cohort_information LabKey table
This table contains the cohort metadata for every participant within the DNV dataset. Each row comprises a unique participant per genome assembly (a very small number of families exist in both the GRCh37 and GRCh38 cohorts). The columns detail the information used by Genomics England to perform Rare Disease Interpretation, such as: `trio_id`, `family_id`, `plate_key`, `participant_id`, `relationship_to_offspring`, `pedigree_id`, `sex`, `affection_status` (relative to the recruited disease of the proband), and `assembly`.
The trio_id
The `trio_id` column is a unique identifier for each trio within the DNV dataset. It is composed of three parts and follows the format `familyid_interpretationid.trionumber`, for example `080690_10002.2`.
- The `familyid` is the unique identifier per family.
- The `interpretationid` is the unique identifier for each Rare Disease Interpretation run.
- The `trionumber` is a numeric identifier for each trio within a family. For families consisting of a single trio, this will always be `.1`; for multiplex families with more than one trio, this will be `.1`, `.2`, `.3`, etc.
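The three-part format described above can be split programmatically. The following is a minimal sketch, not part of the Genomics England tooling: the names `TrioId` and `parse_trio_id` are hypothetical, and it assumes that the last `.` and the last `_` in the identifier are the separators.

```python
from typing import NamedTuple


class TrioId(NamedTuple):
    """The three parts of a trio_id (hypothetical helper type)."""
    family_id: str
    interpretation_id: str
    trio_number: int


def parse_trio_id(trio_id: str) -> TrioId:
    """Split 'familyid_interpretationid.trionumber' into its parts."""
    # Split from the right: the trio number follows the last '.'.
    head, dot, trio_number = trio_id.rpartition(".")
    if not dot or not trio_number.isdigit():
        raise ValueError(f"no numeric trio number in {trio_id!r}")
    # The interpretation ID follows the last '_' of what remains.
    family_id, underscore, interpretation_id = head.rpartition("_")
    if not underscore or not family_id or not interpretation_id:
        raise ValueError(f"no family/interpretation ID in {trio_id!r}")
    # IDs are kept as strings so that leading zeros are preserved.
    return TrioId(family_id, interpretation_id, int(trio_number))


parsed = parse_trio_id("080690_10002.2")
print(parsed.family_id, parsed.interpretation_id, parsed.trio_number)
```

Keeping `family_id` as a string matters here, because converting `080690` to an integer would drop the leading zero and break joins against the LabKey tables.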
The additional columns in the `denovo_cohort_information` LabKey table are defined below:
Column | Description |
---|---|
is_multiplex family | Flags whether the family contains nested trios. 0: the family is a simple trio; 1: the family contains nested trios (more than one trio). |
vcf_path_flagged_denovo | The path to the annotated multi-sample VCF by family on the file system. This VCF contains the flagged DNVs. |
base_filter_total | The total number of distinct variants that pass the base_filter (for offspring only; blank for mother and father). |
stringent_filter_total | The total number of distinct variants that pass the stringent_filter (for offspring only; blank for mother and father). |
Path to annotated multi-sample VCFs
The file paths to the annotated multi-sample VCFs (in the column `vcf_path_flagged_denovo`) are the paths on the HPC environment. To access these files on the Desktop environment, add the `~` prefix, for example `~/gel_data_resources/..`.
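As a small illustration of the path convention above, the sketch below converts an HPC path into its Desktop form by placing `~` in front of it. The function name `to_desktop_path` is hypothetical; the directory used in the example is one of the flagged VCF folders listed elsewhere on this page.

```python
import os.path


def to_desktop_path(hpc_path: str) -> str:
    """Return the Desktop-environment form of an absolute HPC path."""
    if hpc_path.startswith("~"):
        return hpc_path  # already in Desktop form
    if not hpc_path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {hpc_path!r}")
    return "~" + hpc_path


hpc_dir = "/gel_data_resources/main_programme/denovo_variant_dataset/GRCh38/20200326/flagged_vcf/"
desktop_dir = to_desktop_path(hpc_dir)
print(desktop_dir)

# Python does not expand '~' on its own, so expand it before opening files.
print(os.path.expanduser(desktop_dir))
```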
denovo_flagged_variants LabKey table
This table comprises all putative DNVs that pass the `base_filter` per trio. Each row comprises a unique variant per trio per assembly. The variant-level information is found in the columns `chrom`, `position`, `reference`, and `alternate`. The `trio_id`, `family_id`, and `assembly` columns are included so that this table can be joined onto the `denovo_cohort_information` table (which contains the cohort metadata).
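The join described above can be sketched in plain Python once both tables have been exported as lists of row dictionaries. This is only an illustration: the function name `join_variants_to_cohort` is hypothetical, and the participant IDs, relationship labels, and variant in the example are invented placeholders rather than real data.

```python
from collections import defaultdict

# Key columns shared by both LabKey tables.
JOIN_KEYS = ("trio_id", "family_id", "assembly")


def join_variants_to_cohort(variants, cohort):
    """Inner-join variant rows onto cohort rows using the shared key columns."""
    by_key = defaultdict(list)
    for member in cohort:
        by_key[tuple(member[k] for k in JOIN_KEYS)].append(member)

    joined = []
    for variant in variants:
        key = tuple(variant[k] for k in JOIN_KEYS)
        # The cohort table has one row per participant, so each variant
        # matches every member of its trio.
        for member in by_key.get(key, []):
            joined.append({**member, **variant})
    return joined


# Placeholder rows for illustration only.
trio = {"trio_id": "080690_10002.2", "family_id": "080690", "assembly": "GRCh38"}
cohort = [
    {**trio, "participant_id": "P1", "relationship_to_offspring": "Offspring"},
    {**trio, "participant_id": "P2", "relationship_to_offspring": "Mother"},
    {**trio, "participant_id": "P3", "relationship_to_offspring": "Father"},
]
variants = [
    {**trio, "chrom": "chr1", "position": 12345, "reference": "A", "alternate": "G"},
]

rows = join_variants_to_cohort(variants, cohort)
print(len(rows))
```

Because the cohort table holds one row per participant, a variant joins to up to three rows. Restrict the cohort rows to the offspring first if one row per variant is wanted.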
DNVs in the denovo_flagged_variants LabKey table
Due to size restrictions, only variants that pass the `base_filter` per trio are included in the LabKey table `denovo_flagged_variants`. For all Mendelian inconsistencies per trio with flagged DNVs, please use the annotated multi-sample VCFs.
Updated stringent filters
Following data release version 9, a fix was applied to the problematic genomic region flags of the de novo dataset to correct a bug in the annotation of the segmental duplication, simple repeat, and patch filters from UCSC. The fix was made in the LabKey tables `denovo_flagged_variants` and `denovo_cohort_information`, whose flags are now correct. The annotated VCFs have not been updated, so the issue remains there.
All columns ending with the suffix `filter` contain flags (coded as 0 or 1) indicating whether the variant within the trio has failed (0) or passed (1) the respective filter. Please see the above section for the list of filters and their pass criteria.
The genotypes for all members of the trio are included in the columns `offspring_gt`, `mother_gt`, and `father_gt`. The remaining columns contain the VCF FORMAT attributes (such as depth, genotype quality, and genotype likelihood) for each sample within the trio. Again, the full column schema and column definitions can be found in the 100kGP Data Dictionary.
Bayes factor
The `denovo_flagged_variants` LabKey table contains the column `bayes_factor`. This is the output from the Platypus `bayesiandenovofilter.py` Python script. We decided not to include this metric as a filter criterion in the DNV annotation pipeline in this release. Users can use the Bayes Factor, although it is only available for non-duplicated variants from single (non-multiplex) trio families. For other variants, it is not possible to attribute the Bayes Factor (coded in the INFO field) to the correct variant-sample combination.
Querying the LabKey tables
The two LabKey tables can be queried for families and DNVs of interest using either the LabKey graphical interface (desktop application) or the LabKey APIs. The Code Book section below shows examples of how to query the DNV dataset in LabKey. Note that the variants in the `denovo_flagged_variants` table do not include genomic annotation by Ensembl VEP, as this causes a large amount of row duplication (each variant has one annotation per transcript).
Flags in LabKey Table
For all filter columns within the LabKey table `denovo_flagged_variants`, a value of 0 indicates that the filter criteria have not been met (FAIL) and a value of 1 means that the filter criteria have been met (PASS).
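The 0/1 coding can be applied to exported rows as in the sketch below. The helper names `passes` and `failed_filters` are hypothetical, and the column names in the example row (`base_filter`, `stringent_filter`, `abratio_filter`, `altreadparent_filter`) are assumptions based on the filter names used on this page; check the 100kGP Data Dictionary for the actual schema.

```python
def passes(row, filter_column):
    """True if the given filter column is coded 1 (PASS)."""
    return int(row[filter_column]) == 1


def failed_filters(row):
    """Names of all columns ending in 'filter' that are coded 0 (FAIL)."""
    failed = []
    for name, value in row.items():
        if not name.endswith("filter"):
            continue
        if value is None or value == "":
            continue  # skip blank cells rather than guessing
        if int(value) == 0:
            failed.append(name)
    return sorted(failed)


# Placeholder row; the column names are assumptions (see above).
row = {
    "chrom": "chr1",
    "position": 12345,
    "base_filter": 1,
    "stringent_filter": 0,
    "abratio_filter": 0,
    "altreadparent_filter": 1,
}

print(passes(row, "base_filter"))
print(failed_filters(row))
```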
Annotated multi-sample VCFs
The output of the DNV annotation pipeline is an annotated multi-sample VCF per family containing all Mendelian inconsistencies, with putative DNVs flagged in the FORMAT column and genomic annotation in the INFO column (Step 5 in the DNV pipeline).
The annotated multi-sample VCFs can be found on the file system, separated by genome assembly, under the folders:
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh37/20200326/flagged_vcf/
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh38/20200326/flagged_vcf/
The files themselves are named by `family_id` and `interpretation_id` with the suffix `.denovo.reheader.vcf.gz` (for example `00004-RTD_10001.denovo.reheader.vcf.gz`).
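The naming scheme above can be sketched as a small path builder. This is an illustration only: the function name `flagged_vcf_path` is hypothetical, the underscore between the two IDs is inferred from the example file name, and pairing the example family with GRCh38 is an arbitrary assumption. The `vcf_path_flagged_denovo` column in `denovo_cohort_information` remains the authoritative source for a trio's VCF path.

```python
import posixpath

# Folders as listed above, keyed by genome assembly.
FLAGGED_VCF_DIRS = {
    "GRCh37": "/gel_data_resources/main_programme/denovo_variant_dataset/GRCh37/20200326/flagged_vcf/",
    "GRCh38": "/gel_data_resources/main_programme/denovo_variant_dataset/GRCh38/20200326/flagged_vcf/",
}

VCF_SUFFIX = ".denovo.reheader.vcf.gz"


def flagged_vcf_path(assembly: str, family_id: str, interpretation_id: str) -> str:
    """Build the expected HPC path of a family's annotated multi-sample VCF."""
    if assembly not in FLAGGED_VCF_DIRS:
        raise ValueError(f"unknown assembly {assembly!r}")
    file_name = f"{family_id}_{interpretation_id}{VCF_SUFFIX}"
    return posixpath.join(FLAGGED_VCF_DIRS[assembly], file_name)


# The assembly for this example family is an assumption.
print(flagged_vcf_path("GRCh38", "00004-RTD", "10001"))
```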
These paths correspond to the file paths on the HPC environment. To access these files on the Desktop environment, add the `~` prefix, for example `~/gel_data_resources/..`.
Use the LabKey table `denovo_cohort_information` to associate the annotated multi-sample VCF file path (under the column `vcf_path_flagged_denovo`) with a particular `trio_id`.
Querying the annotated multi-sample VCFs
The annotated multi-sample VCFs by family (containing all Mendelian inconsistencies) have the FORMAT column populated with the results of the DNV annotation pipeline under the `DE_NOVO_FLAG` attribute, as shown below.
##FORMAT=<ID=DE_NOVO_FLAG,Number=.,Type=String,Description="Flag from the Genomics England De Novo Flagging Pipeline">
There is a certain order to how this field is populated:
- If a variant fails any of the `base_filters`, it is marked as `base_fail` without the annotation of the subsequent `stringent_filters`.
- If the variant passes the `base_filter` but fails a `stringent_filter`, it is marked with a list of the `stringent_filters` that fail.
- If the variant passes all `base_filters` and all `stringent_filters`, it is marked with `DENOVO`.
Note that only the offspring will have the flags populated in their `DE_NOVO_FLAG` FORMAT field. For the mother and father, this attribute is set to missing (`.`).
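The ordering rules above translate into a small decoder for a single `DE_NOVO_FLAG` value. This is a minimal sketch: the function name `interpret_de_novo_flag` and the status labels it returns are hypothetical, and it assumes the value has already been extracted from the sample's FORMAT field as a string.

```python
def interpret_de_novo_flag(value: str):
    """Return (status, failed_stringent_filters) for one DE_NOVO_FLAG value."""
    if value in ("", "."):
        # Parents (and any unflagged sample) carry a missing value.
        return ("missing", [])
    if value == "base_fail":
        # Stringent filters are not annotated when a base filter fails.
        return ("base_fail", [])
    if value == "DENOVO":
        return ("denovo", [])
    # Otherwise the value is a semicolon-separated list of failed stringent filters.
    failed = [name for name in value.split(";") if name]
    return ("stringent_fail", failed)


for example in (".", "base_fail", "altreadparent;abratio", "DENOVO"):
    print(example, interpret_de_novo_flag(example))
```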
Use the Code Book
Please use the Code Book section for example scripts on how to query the annotated multi-sample VCFs. We recommend using the command line tool, bcftools, to interrogate the annotated multi-sample VCFs.
Flag in FORMAT DE_NOVO_FLAG column | Description |
---|---|
base_fail | The variant fails a base_filter and/or a global_filter (PASS variants on chromosomes: 1-22, X, M). The exact base_filter that fails can be found in the LabKey table denovo_flagged_variants or queried from the VCF FORMAT attributes (we wanted to limit the number of flags in the FORMAT column). |
altreadparent; abratio; proximity; segmentalduplication; simplerepeat; patch | These flags indicate variants that pass the base_filter but fail one or more of the stringent_filters. Note that variants have to pass the base_filter in order to be considered for the stringent_filter. The particular stringent_filter(s) that fail are listed using a semicolon separator. For example, if a variant fails just the altreadparent filter, the FORMAT field will be marked as altreadparent. If a variant fails the altreadparent and the abratio filters, the FORMAT field will be marked as altreadparent;abratio. |
DENOVO | The variant passes the base_filter and stringent_filter. |
Annotated multi-sample VCF flagging
To create the annotated multi-sample VCFs (Step 5), all original FILTER field values from the Platypus joint-calling step (Step 3) are stripped and replaced with the flags from the DNV annotation pipeline.
We adopt an inclusive approach for researchers to analyse DNVs by flagging likely DNVs and generally not filtering any variants out. If necessary, you can make use of the additional attributes within the annotated multi-sample VCF to perform custom filtering. Please see the Code Book below for examples of how to do this. Below is a list of important attributes you can make use of:
VCF Field | Attribute | Description | Used in filter |
---|---|---|---|
FILTER | FILTER | The FILTER column of the annotated multi-sample VCF has not been modified from the original Platypus VCF. As only PASS variants are included in the annotated multi-sample VCF from the Platypus bayesiandenovofilter.py script, the values in the FILTER field will always be set to 'PASS'. | Not used. |
FORMAT | GT | Un-phased genotypes. | zygosity_filter. |
FORMAT | NR | Number of reads covering variant location in this sample. | mindepth_filter, maxdepth_filter, abratio_filter. |
FORMAT | NV | Number of reads containing variant in this sample. | altreadparent_filter, abratio_filter. |
FORMAT | GL | Genotype log10-likelihoods for AA, AB and BB genotypes, where A = ref and B = variant. | Used to calculate Bayes Factor. |
FORMAT | GQ | Genotype quality as phred score. | Not used in filters. |
FORMAT | GOF | Goodness of fit value. | Not used in filters. |
INFO | bayesFactor | Bayes Factor for the de novo model calculated by the Platypus bayesiandenovofilter.py Python script. | Not used in filters. |
INFO | multidenovo_filter | Flag of the multidenovo filter. This is only applicable for multiplex families containing nested trios as described above. | Not used in filters. 0: multidenovo; 1: not multidenovo |
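As an illustration of custom filtering with the `NR` and `NV` attributes from the table above, the sketch below computes the fraction of reads supporting the variant and applies some example cut-offs. The function names and every threshold value are hypothetical placeholders, not the pass criteria of the DNV annotation pipeline, and the per-sample values are assumed to have been parsed into integers already.

```python
from typing import Optional


def alt_read_fraction(nr: int, nv: int) -> Optional[float]:
    """Fraction of covering reads (NR) that contain the variant (NV)."""
    if nr <= 0:
        return None  # no coverage, so the fraction is undefined
    return nv / nr


def custom_keep(offspring, mother, father,
                min_depth=20, min_fraction=0.3, max_fraction=0.7,
                max_parent_alt_reads=1):
    """Example custom filter; all thresholds are placeholders."""
    fraction = alt_read_fraction(offspring["NR"], offspring["NV"])
    if fraction is None or offspring["NR"] < min_depth:
        return False
    if not (min_fraction <= fraction <= max_fraction):
        return False
    # Require little or no evidence for the variant in either parent.
    return all(parent["NV"] <= max_parent_alt_reads for parent in (mother, father))


print(custom_keep({"NR": 40, "NV": 18}, {"NR": 35, "NV": 0}, {"NR": 38, "NV": 0}))
```

Tune the thresholds to the analysis at hand; they are exposed as keyword arguments for that reason.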
Adjusting for updated stringent filters
Previously there was a bug in the annotation of the segmental duplication, simple repeat, and patch filters from UCSC. As a result, we implemented a fix to these problematic region flags.
This fix has only been carried out on the LabKey tables and is not reflected in the annotated VCFs. If you're working with the VCFs, we suggest two approaches:
- Only use the LabKey tables `denovo_flagged_variants` and `denovo_cohort_information` to filter for candidate DNVs. If you need additional genomic annotation such as VEP (which is in the annotated VCFs), you can look up the variant(s) manually in the VCFs to pull out the annotation.
- Alternatively, use the annotated VCFs but be careful when filtering using the `DE_NOVO_FLAG` FORMAT field. For candidate DNVs, use bcftools to exclude any variants whose `DE_NOVO_FLAG` indicates a `base_filter` failure, as well as variants that contain the strings `altreadparent`, `abratio`, or `proximity` in the `DE_NOVO_FLAG` field. The variants that remain will be candidate DNVs but will not have been filtered on the problematic genomic region flags: segmental duplication, simple repeat, and patch. The remaining variants can then be manually cross-checked against the `denovo_flagged_variants` LabKey table or intersected with a genomic regions BED file to check whether they fall in problematic regions. The corrected problematic region files used to update the LabKey tables can be found at the following location:
/gel_data_resources/main_programme/denovo_variant_dataset/LabKey_V9/
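The second approach can be sketched in Python for flag values and positions that have already been extracted from a VCF (for example with bcftools). The helper names below are hypothetical. The sketch assumes the region files are standard tab-separated BED (0-based, half-open intervals), that chromosome names match between the VCF and the BED file, and that the variant is a single-base substitution.

```python
# Flags that exclude a variant; the problematic-region flags are deliberately
# ignored because they are stale in the annotated VCFs.
EXCLUDE_FLAGS = {"base_fail", "altreadparent", "abratio", "proximity"}


def is_candidate(de_novo_flag: str) -> bool:
    """True if the offspring's DE_NOVO_FLAG carries none of the excluding flags."""
    if de_novo_flag in ("", "."):
        return False
    return not (set(de_novo_flag.split(";")) & EXCLUDE_FLAGS)


def parse_bed(lines):
    """Read BED lines into {chrom: [(start, end), ...]}."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.rstrip("\n").split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_regions(chrom: str, pos: int, regions) -> bool:
    """True if 1-based VCF position `pos` lies inside any BED interval."""
    # BED [start, end) is 0-based, so a 1-based position matches when start < pos <= end.
    return any(start < pos <= end for start, end in regions.get(chrom, []))


regions = parse_bed(["chr1\t100\t200\n"])  # placeholder interval
for flag, chrom, pos in [("DENOVO", "chr1", 150), ("patch", "chr1", 500), ("abratio;patch", "chr1", 500)]:
    print(flag, is_candidate(flag), in_regions(chrom, pos, regions))
```

A candidate that falls inside one of the corrected region files would then be treated as failing the corresponding problematic-region filter. For indels, compare the full reference span against the intervals rather than the start position alone.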