De novo data code book¶
This code book is to help you interrogate the 100kGP DNV dataset. It includes querying the multisample VCFs and querying the relevant LabKey tables.
Submit all scripts to the High-Performance Compute cluster, either to the short or interactive queue.
Querying the annotated multi-sample VCFs¶
The below code snippets assume that you are working in the HPC environment. Use the LabKey table denovo_cohort_information
to fetch the file paths to the annotated multi-sample VCFs for your cohort of interest. Examples of how to do this are in the LabKey section below.
Working directory
Remember that you will need to change certain file paths to the inputs and outputs depending on your working location on the HPC.
Location of the annotated multi-sample VCFs¶
The annotated multi-sample VCFs are found in the directory below under flagged_vcf and split by the GRCh37 and GRCh38 cohorts. In the cohort folders, are the full file paths to all annotated multi-sample VCFs per cohort.
Basic filtering of the annotated multi-sample VCFs¶
You can query the multi-sample VCFs using bcftools to include (-i
) or exclude (-e
) variants based on flags in the FORMAT
column, labelled DE_NOVO_FLAG
.
Example 1: Include only stringent_filter
pass DNVs in a single family, write results as compressed VCF:
Example 2: Include only stringent_filter
pass DNVs in a single family, write results as a tab-delimited text file of variants, sample IDs, and genotypes:
Output:
1 76390973 A T LP2000950-DNA_D07 1/0
1 11623896 G A LP2000950-DNA_D07 1/0
1 18103688 C T LP2000950-DNA_D07 0/1
1 23653370 A G LP2000950-DNA_D07 0/1
2 15659761 T C LP2000950-DNA_D07 1/0
2 20627434 C T LP2000950-DNA_D07 0/1
...
* data have been randomised
Example 3: Include only base_filter
pass DNVs in a single family that fail a single stringent_filter
(eg simplerepeat
), write results as compressed VCF:
Example 4: Include only base_filter
pass DNVs in a single family that fail more than one stringent_filter
(eg simplerepeat
and abratio
), write results as compressed VCF:
Example 6: For families with nested trios (e.g. multiple offspring) and simple trios, you can print out the sample ID and also the de novo flag value for each sample. Note that the DE_NOVO_FLAG
FORMAT
attribute for the mother and father samples is always set to .
missing. You can filter these missing samples out by including the flag as shown below (-i 'DE_NOVO_FLAG!="."'
):
Output:
2 128586362 T G LP3000128-DNA_E05 1/0 DENOVO
2 128586362 T G LP3000135-DNA_E01 0/0 base_fail
2 130215368 C A LP3000128-DNA_E05 0/0 base_fail
2 130215368 C A LP3000135-DNA_E01 1/0 altreadparent;abratio
...
* data have been randomised
Example 7: Example 6 shows the output in long-format (where the variant is duplicated by the number of samples). It is also possible to show the same data in wide-format - with each column being maintaining the same sample id and order in the VCF. Now one variant per line is shown:
Output:
2 2255217 A G 0/1 DENOVO 0/1 base_fail 0/0 . 0/0 .
2 2270459 AT A 1/0 base_fail 1/0 base_fail 0/0 . 0/0 .
2 2318407 C T 0/1 altreadparent;abratio 0/1 altreadparent;abratio 0/0 . 0/0 .
...
* data have been randomised
More advanced filtering of the annotated multi-sample VCFs¶
It is also possible to perform more advanced filtering by querying the attributes within the INFO
field; such as the genomic annotation from VEP. Also, multiple filtering steps can be chained together in the same command to build complex queries.
We will use split-vep to query and parse the genomic annotation from VEP (available for recent versions of bcftools).
Example 1: Include only variants in a single family for a gene of interest (eg ACTL6B), write results as a tab-delimited text file of variants with specified genomic annotation. This example will output variants annotated against all transcripts for gene of interest, there will be one annotation per line (for each transcript - using the -d
flag). The annotation columns to be printed out can be manually selected (gnomAD_NFE, Consequence etc). It is also possible to query many genes at a time (such as: -i 'SYMBOL="ACTL6B" | SYMBOL="IL9R"'
), or by applying a filter and using grep on the output:
Output:
7 100244267 C T ACTL6B ENST00000160382 missense_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000461605 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000485601 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000487125 non_coding_transcript_exon_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000487225 downstream_gene_variant . rs1131695228
7 100244267 C T ACTL6B ENST00000489904 downstream_gene_variant . rs1131695228
Example 2: It is possible to firstly extract stringent_filter
pass variants and pipe this into +split-vep
then write results as a tab-delimited text file of variants with specified genomic annotation:
Example 3: You can also perform more complex filtering such as scanning for all stringent_filter
pass variants, that are missense or worse (-s
and -S
flags) and rare in gnomAD (use the -c
flag for numeric conversion). Write results as a tab-delimited text file of variants with specified genomic annotation. The sample ID is also printed here so that it can be joined to other tables:
When specifying severity constraints using the -s
flag, you should specify the severity scale explicitly using the -S
flag. This is due to a known bug in bcftools 1.10.x (outdated default severity scale), which will be resolved in bcftools 1.11 - "VEP_severity_scale_2020.txt" is an example file for the current severity scale, which will be used in the relevant examples on this page:
VEP_severity_scale_2020.txt
# Default consequence substrings ordered in ascending order by severity.
## Consequences with the same severity can be put on the same line in arbitrary order.
## See also https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
intergenic
feature_truncation feature_elongation
regulatory
TF_binding_site TFBS
downstream upstream
non_coding_transcript
intron NMD_transcript
non_coding_transcript_exon
5_prime_utr 3_prime_utr
coding_sequence mature_miRNA
stop_retained start_retained synonymous
incomplete_terminal_codon
splice_region
missense inframe protein_altering
transcript_amplification
exon_loss
disruptive
start_lost stop_lost stop_gained frameshift
splice_acceptor splice_donor
transcript_ablation
Example 4: The Bayes Factor for each variant from the Platypus bayesiandenovofilter.py script can also be easily extracted from the INFO field if required, write results as a tab-delimited text file of variants:
Example 5: Likewise the multidenovo_filter
can also be extracted from the INFO field
Iterating through the cohort of annotated multi-sample VCFs¶
There are a few ways to perform queries across the entire cohort to extract DNVs of interest using in a single command. This can be done using an array job; however due to the small size of the annotated multi-sample VCFs, it is also feasible to run a query in a single while loop.
We have deposited a text file for each cohort (GRCh37 and GRCh38) under the folders:
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh37/20200326/cohort/grch37_cohort_1762.tsv
/gel_data_resources/main_programme/denovo_variant_dataset/GRCh38/20200326/cohort/grch38_cohort_10847.tsv
These file contains the full file paths to every family's annotated multi-sample VCF in the cohort. They can be used to iterate through all annotated multi-sample VCF and extract variants of interest. If you want to do this for only a subset of the cohort, then you can use the vcf_path_flagged_denovo
column in the denovo_cohort_information
LabKey table to fetch the file paths for your custom cohort.
Example: In this example, all annotated multi-sample VCFs from the GRCh37 cohort will be interrogated. For each VCF in the list, cohorts_grch37_vcfs.tsv
, it will first filter for stringent_filter
pass DNVs only (-i 'DE_NOVO_FLAG="DENOVO"'
), it will then print out specified genomic annotation that match the criteria: DNV is missense or worse and is in gene IL6. The sample id will also be printed out to join back onto other tables:
Querying the LabKey tables¶
The two LabKey tables denovo_cohort_information
and denovo_flagged_variants
can be queried though either the LabKey desktop application or through the LabKey APIs.
For all queries we will use the labkey helper function labkey_to_df
which uses an SQL query, a database to access and the maximum number of rows in your output, which is defined in the LabKey documentation
Fetching entire tables¶
These examples show how to fetch an entire table and save as an object in your R session. Tables can then be post-filtered in R. Both tables are fetched in this example.
Filtering tables and selecting columns¶
Tables can be filtered for certain attributes and specified columns selected in Rlabkey using SQL.
Example 1: In this example, we can filter the denovo_flagged_variants
table for variants that pass the stringent_filter
in a particular family. We then filter for a certain variant of interest, by chromosome and position. The trio_id
, chrom
, position
, reference
, alternate
columns are then returned.
Example 2: You can also perform a query across the entire cohort to find DNVs in a particular gene of interest. In this example, the denovo_flagged_variants
table if firstly filtered for the GRCh38 cohort and for a chromosomal range representing the coordinates of the gene. The variants are filtered for stringent_filter
pass
and the trio_id
, chrom
, position
, reference
, alternate
columns are returned.
Joining multiple LabKey tables together¶
The usefulness of the APIs comes when joining multiple tables together.
Example 1: In this example, we perform the query above to find stringent_filter
pass
DNVs in a gene of interest from the denovo_flagged_variants
table. We then bind these results together with useful columns of the denovo_cohort_information
table to fetch the offspring's sex
, affection_status
, and file path to the annotated multi-sample VCF
.
Example 2: The remaining phenotype tables can then be joined onto the filtered DNV dataset created above. We now bind the participant's recruited disease using the normalised_specific_disease
column from the rare_diseases_participant_disease
table.