De novo data code book¶
This code book is to help you interrogate the DNV dataset with ease. We aim to update this with popular requests from feedback. Please make sure to submit all scripts to the High-Performance Compute cluster and to not run any scripts directly on the head-node (hpc-prod-grid-login-GECIP-01). Most jobs should be fine for submission to the short queue as they will take less than four hours to complete - or as an interactive job.
Querying the annotated multi-sample VCFs¶
The below code snippets assume that you are working in the HPC environment. Use the LabKey table _denovo_cohort_information _to fetch the file paths to the annotated multi-sample VCFs for your cohort of interest. Examples of how to do this are in the LabKey section below.
Working directory
Remember that you will need to change certain file paths to the inputs and outputs depending on your working location on the HPC. If you have any questions about the DNV dataset, please contact the Service Desk. |
Location of the annotated multi-sample VCFs¶
The annotated multi-sample VCFs are found in the directory below under flagged_vcf and split by the GRCh37 and GRCh38 cohorts. In the cohort folders, are the full file paths to all annotated multi-sample VCFs per cohort.
Basic filtering of the annotated multi-sample VCFs¶
The multi-sample VCFs can be queried using bcftools to include (-i) or exclude (-e) flagged variants that meet certain criteria. The flags are populated in the FORMAT column called DE_NOVO_FLAG as described in this document.
1. Include only stringent_filter pass DNVs in a single family, write results as compressed VCF:
2. Include only stringent_filter pass DNVs in a single family, write results as a tab-delimited text file of variants, sample IDs, and genotypes:
Output:
1 76390973 A T LP2000950-DNA_D07 1/0
1 11623896 G A LP2000950-DNA_D07 1/0
1 18103688 C T LP2000950-DNA_D07 0/1
1 23653370 A G LP2000950-DNA_D07 0/1
2 15659761 T C LP2000950-DNA_D07 1/0
2 20627434 C T LP2000950-DNA_D07 0/1
...
* data have been randomised
3. Include only base_filter pass DNVs in a single family that fail a single stringent_filter (eg simplerepeat), write results as compressed VCF:
- Include only base_filter pass DNVs in a single family that fail more than one stringent_filter (eg simplerepeat and abratio), write results as compressed VCF:
5. Exclude base_filter fail DNVs and exclude DNVs that fail a stringent filter (eg patch) in a single family, write results as compressed VCF:
- For families with nested trios (e.g. multiple offspring) and simple trios, you can print out the sample ID and also the de novo flag value for each sample. Note that the DE_NOVO_FLAG FORMAT attribute for the mother and father samples is always set to "." missing. You can filter these missing samples out by including the flag as shown below (-i 'DE_NOVO_FLAG!="."'):
Output:
2 128586362 T G LP3000128-DNA_E05 1/0 DENOVO
2 128586362 T G LP3000135-DNA_E01 0/0 base_fail
2 130215368 C A LP3000128-DNA_E05 0/0 base_fail
2 130215368 C A LP3000135-DNA_E01 1/0 altreadparent;abratio
...
* data have been randomised
- Example 6 shows the output in long-format (where the variant is duplicated by the number of samples). It is also possible to show the same data in wide-format - with each column being maintaining the same sample id and order in the VCF. Now one variant per line is shown:
Output:
2 2255217 A G 0/1 DENOVO 0/1 base_fail 0/0 . 0/0 .
2 2270459 AT A 1/0 base_fail 1/0 base_fail 0/0 . 0/0 .
2 2318407 C T 0/1 altreadparent;abratio 0/1 altreadparent;abratio 0/0 . 0/0 .
...
* data have been randomised
More advanced filtering of the annotated multi-sample VCFs¶
It is also possible to perform more advanced filtering by querying the attributes within the INFO field; such as the genomic annotation from VEP. Also, multiple filtering steps can be chained together in the same command to build complex queries.
There is a very useful plugin for bcftools called split-vep which we will make use of here to query and parse the genomic annotation from VEP (available for recent versions of bcftools). Please read their documentation for more information.
1. Include only variants in a single family for a gene of interest (eg ACTL6B), write results as a tab-delimited text file of variants with specified genomic annotation. This example will output variants annotated against all transcripts for gene of interest, there will be one annotation per line (for each transcript - using the -d flag). The annotation columns to be printed out can be manually selected (gnomAD_NFE, Consequence etc). It is also possible to query many genes at a time (such as: -i 'SYMBOL="ACTL6B" | SYMBOL="IL9R"'), or by applying a filter and using grep on the output:
Output:
7 100244267 C T ACTL6B ENST00000160382 missense_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000461605 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000485601 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000487125 non_coding_transcript_exon_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000487225 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000489904 downstream_gene_variant . rs1131695228
2. It is possible to firstly extract stringent_filter pass variants and pipe this into +split-vep then write results as a tab-delimited text file of variants with specified genomic annotation:
3. You can also perform more complex filtering such as scanning for all stringent_filter pass variants, that are missense or worse (-s and -S flags) and rare in gnomAD (use the -c flag for numeric conversion). Write results as a tab-delimited text file of variants with specified genomic annotation. The sample ID is also printed here so that it can be joined to other tables:
Please note that when specifying severity constraints using the -s flag, you should specify the severity scale explicitly using the -S flag. This is due to a known bug in bcftools 1.10.x (outdated default severity scale), which will be resolved in bcftools 1.11 - "VEP_severity_scale_2020.txt" is an example file for the current severity scale, which will be used in the relevant examples on this page:
??? note "VEP_severity_scale_2020.txt"
# Default consequence substrings ordered in ascending order by severity.
## Consequences with the same severity can be put on the same line in arbitrary order.
## See also https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
intergenic
feature_truncation feature_elongation
regulatory
TF_binding_site TFBS
downstream upstream
non_coding_transcript
intron NMD_transcript
non_coding_transcript_exon
5_prime_utr 3_prime_utr
coding_sequence mature_miRNA
stop_retained start_retained synonymous
incomplete_terminal_codon
splice_region
missense inframe protein_altering
transcript_amplification
exon_loss
disruptive
start_lost stop_lost stop_gained frameshift
splice_acceptor splice_donor
transcript_ablation
4. The Bayes Factor for each variant from the Platypus bayesiandenovofilter.py script can also be easily extracted from the INFO field if required, write results as a tab-delimited text file of variants:
- Likewise the multidenovo_filter can also be extracted from the INFO field
Iterating through the cohort of annotated multi-sample VCFs¶
There are a few ways to perform queries across the entire cohort to extract DNVs of interest using in a single command. This can be done using an array job; however due to the small size of the annotated multi-sample VCFs, it is also feasible to run a query in a single while loop.
We have deposited a text file for each cohort (GRCh37 and GRCh38) under the folders:
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh37/20200326/cohort/grch37_cohort_1762.tsv
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh38/20200326/cohort/grch38_cohort_10847.tsv
These file contains the full file paths to every family's annotated multi-sample VCF in the cohort. They can be used to iterate through all annotated multi-sample VCF and extract variants of interest. If you want to do this for only a subset of the cohort, then you can use the vcf_path_flagged_denovo _column in the _denovo_cohort_information LabKey table to fetch the file paths for your custom cohort.
1. In this example, all annotated multi-sample VCFs from the GRCh37 cohort will be interrogated. For each VCF in the list, cohorts_grch37_vcfs.tsv, it will first filter for stringent_filter pass DNVs only (-i 'DE_NOVO_FLAG="DENOVO"'), it will then print out specified genomic annotation that match the criteria: DNV is missense or worse and is in gene IL6. The sample id will also be printed out to join back onto other tables:
Performing custom filtering¶
You may want to perform DNV flagging based on custom filters, starting with all Mendelian inconsistencies for all families. This is possible by extracting the necessary fields from the VCF such as coverage, number of reads supporting the variant, and genotype. The below example shows how to extract these values from the VCF for each sample which can then be later processed in R/Python etc to develop the custom filters. The LabKey table denovo_cohort_information can then be used to associate the VCF with the correct trio. Please see the VCF header for a description of each attribute within the VCFs.
Querying the LabKey tables¶
The two LabKey tables denovo_cohort_information and _denovo_flagged_variants _can be queried though either the LabKey desktop application or through the LabKey APIs. Please see the documentation for the LabKey Desktop application and the APIs.
The desktop application allows you to browse the data and perform basic filtering on an individual table. Complete tables or filtered tables can then be exported as text or .xlsx files for downstream analysis. Alternatively, you can interrogate the tables using the LabKey APIs (written in multiple languages including R and Python). We recommend using the LabKey APIs as you can perform more complex and easily reproducible queries on the data.
Examples of how to query the denovo_cohort_information and denovo_flagged_variants table using the R LabKey API (Rlabkey package) are listed here. We make use of the SQL like syntax to query the tables. You can use RStudio or R in the terminal to run these examples.
Fetching entire tables¶
These examples show how to fetch an entire table and save as an object in your R session. Tables can then be post-filtered in R. Both tables are fetched in this example.
Filtering tables and selecting columns¶
- Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL. In this example, we can filter the denovo_flagged_variants _table for variants that pass the _stringent_filter _in a particular family. We then filter for a certain variant of interest, by chromosome and position. The _trio_id, chrom, position, reference, alternate columns are then returned.
- You can also perform a query across the entire cohort to find DNVs in a particular gene of interest. In this example, the denovo_flagged_variants _table if firstly filtered for the GRCh38 cohort and for a chromosomal range representing the coordinates of the gene. The variants are filtered for _stringent_filter _pass and the the _trio_id, chrom, position, reference, alternate columns are returned.
Joining multiple LabKey tables together¶
- The usefulness of the APIs comes when joining multiple tables together. In this example, we perform the query above to find stringent_filter pass DNVs in a gene of interest from the denovo_flagged_variants table. We then bind these results together with useful columns of the denovo_cohort_information table to fetch the offspring's sex, affection_status, and file path to the annotated multi-sample VCF.
- The remaining phenotype tables can then be joined onto the filtered DNV dataset created above. We now bind the participant's recruited disease using the normalised_specific_disease column from the rare_diseases_participant_disease table.
``` R linenums="1" library(Rlabkey)
labkey.setDefaults(baseUrl = "https://labkey-embassy.gel.zone/labkey/")
main_programme <- "/main-programme/main-programme_v17_2023-03-30"
denovo_cohort_information <- labkey.executeSql(
schemaName ="lists",
colNameOpt ="rname",
maxRows = 100000000,
folderPath = main_programme,
sql = "SELECT ci.participant_id, ci.trio_id, ci.family_id, ci.sex, ci.affection_status, ci.vcf_path_flagged_denovo,
rd.normalised_specific_disease,
fv.chrom, fv.position, fv.reference, fv.alternate
FROM denovo_flagged_variants
AS fv
LEFT JOIN denovo_cohort_information
AS ci
ON fv.trio_id = ci.trio_id
LEFT JOIN rare_diseases_participant_disease
AS rd
ON ci.participant_id = rd.participant_id
WHERE fv.assembly = 'GRCh38'
AND fv.stringent_filter = '1'
AND fv.chrom = '7'
AND fv.position >= '50303465'
AND fv.position <= '50405101'
AND ci.member = 'Offspring'"
)
```