aggV2 Code Book - General Information
Due to a probable bug in BCFtools, site QC statistics for chromosome X are incorrect. We advise against using FILTER and INFO field data until this is corrected. All genotype data and the related VEP functional data are unaffected.
FILTER, INFO, and FORMAT fields in aggV2
Many queries of aggV2 can be made using the FILTER, INFO, and FORMAT tags within the VCFs (both the genotype VCFs and the functional annotation VCFs).
- The FILTER field shows the per-variant flag indicating which of a given set of filters the variant has passed or failed (such as median coverage).
- The INFO field shows the per-variant list of key-value pairs describing the variation (such as variant allele frequency and variant median depth).
- The FORMAT field shows an extensible list of fields describing the samples per variant (such as sample genotype and sample depth).
All tags for a given field can be extracted by using bcftools to view the header of a single chunk and then filtering for that field.
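The header lookup described above can be sketched as follows. This is a minimal sketch, not necessarily the original snippet: the function names `header_tags` and `vcf_header_tags` are hypothetical, and it assumes `bcftools` is available on your `PATH`. `bcftools view -h` prints only the VCF header, and the filter keeps the `##FILTER=`, `##INFO=`, or `##FORMAT=` definition lines.

```shell
# Keep only the header definition lines for one field (FILTER, INFO or FORMAT).
# Reads VCF header text on standard input. header_tags is a hypothetical helper name.
header_tags() {
    grep "^##$1="
}

# Print the tag definitions for one field from a single chunk.
# Assumes bcftools is installed; "view -h" prints the header only.
# Usage: vcf_header_tags <chunk.vcf.gz> <FILTER|INFO|FORMAT>
vcf_header_tags() {
    bcftools view -h "$1" | header_tags "$2"
}
```

Calling `vcf_header_tags` with the path to any one chunk and `INFO` as the second argument would list the INFO tag definitions. Because the chunks are split from one aggregate, a single chunk is normally enough to inspect.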
Identifying which chunk to use
aggV2 is split into 1,371 'chunks' across the genome. This applies to both the genotype VCFs and the functional annotation VCFs, and the chromosome, start, and stop in the chunk names are identical across the two data types.
It is often necessary to know which chunk(s) your gene(s), variant(s), or region(s) of interest are located in. The steps below help you do this.
Chunk Names
Chunks are named in the following format:
Genotype VCFs:
gel_mainProgramme_aggV2_chromosome_start_stop.vcf.gz
- for example -
gel_mainProgramme_aggV2_chr1_146620016_147701894.vcf.gz
Functional Annotation VCFs:
gel_mainProgramme_aggV2_chromosome_start_stop_VEPannot.vcf.gz
- for example -
gel_mainProgramme_aggV2_chr1_146620016_147701894_VEPannot.vcf.gz
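Because both file types share the same naming scheme, the chromosome, start, and stop can be recovered from either name. The sketch below illustrates this under the assumption that chromosome names contain no underscores; `parse_chunk_name` is a hypothetical helper, not part of aggV2.

```shell
# Split an aggV2 chunk file name into chromosome, start and stop (tab-separated).
# Works for genotype and functional annotation VCF names alike.
# Assumes the chromosome name itself contains no underscore.
# Defined with ( ) so its variables do not leak into the calling shell.
parse_chunk_name() (
    name=$(basename "$1")
    name=${name#gel_mainProgramme_aggV2_}   # drop the fixed prefix
    name=${name%.vcf.gz}                    # drop the extension
    name=${name%_VEPannot}                  # drop the annotation suffix, if present
    printf '%s\n' "$name" | tr '_' '\t'
)

parse_chunk_name gel_mainProgramme_aggV2_chr1_146620016_147701894_VEPannot.vcf.gz
```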
List of chunk names and aggV2 VCF files
The list of chunk names and the full file paths to both the genotype and functional annotation VCFs can be found at:
/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/chunk_names/aggV2_chunk_names.bed
Each of the 1,371 chunks is on a separate line and each line contains seven fields:
Column number | Description | Example |
---|---|---|
1 | Chromosome | chr1 |
2 | Chunk start | 1 |
3 | Chunk stop | 506426 |
4 | Chromosome, start, stop (format 1) | chr1_1_506426 |
5 | Chromosome, start, stop (format 2) | chr1:1-506426 |
6 | Full path to genotype VCF | /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr1_1_506426.vcf.gz |
7 | Full path to functional annotation VCF | /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP/gel_mainProgramme_aggV2_chr1_1_506426_VEPannot.vcf.gz |
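For a single position, the chunk list can be queried directly with `awk`. This is only a sketch: `chunk_for_position` is a hypothetical helper, and it assumes the file is tab-delimited and that the chunk start and stop are both inclusive.

```shell
# Default location of the chunk list, as given above.
CHUNK_BED=/gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/additional_data/chunk_names/aggV2_chunk_names.bed

# Print the genotype and functional annotation VCF paths (columns 6 and 7)
# for the chunk that contains a given position.
# Assumes tab-delimited input and inclusive chunk start and stop.
# Usage: chunk_for_position <chromosome> <position> [chunk_list.bed]
chunk_for_position() {
    awk -F'\t' -v chr="$1" -v pos="$2" \
        '$1 == chr && ($2 + 0) <= (pos + 0) && (pos + 0) <= ($3 + 0) { print $6 "\t" $7 }' \
        "${3:-$CHUNK_BED}"
}
```

For more than a handful of positions or for whole regions, the bedtools approach described in the following sections scales better.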
Create your own regions file
First, create a regions file of your gene(s), variant(s), or region(s) of interest. This must be a three- or four-column tab-delimited file of chromosome, start, and stop, with an optional fourth column holding an identifier (for example, a gene name). The file should have the .bed extension. There is no limit on the number of lines in this file.
Sort
Please pre-sort your data by chromosome and then by start position (sort -k1,1 -k2,2n in.bed > in.sorted.bed).
Example:
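A regions file of the form described above could be built and sorted as in this sketch. The IKZF2 coordinates are the example region used on this page; the two `placeholder` lines are made-up coordinates included only to show the sort, and the file names are hypothetical.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Four tab-separated columns: chromosome, start, stop, identifier.
# The two placeholder lines are invented purely for illustration.
printf 'chr2\t213005363\t213151603\tIKZF2\n'  > "$workdir/regions.bed"
printf 'chr1\t5000\t6000\tplaceholderB\n'    >> "$workdir/regions.bed"
printf 'chr1\t1000\t2000\tplaceholderA\n'    >> "$workdir/regions.bed"

# Sort by chromosome, then numerically by start position.
sort -k1,1 -k2,2n "$workdir/regions.bed" > "$workdir/regions.sorted.bed"

cat "$workdir/regions.sorted.bed"
```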
Intersect the two files
Now you can intersect the BED file of chunk names with your regions file using bedtools.
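The intersection could be run along these lines. This is a sketch rather than the original command: `find_chunks` is a hypothetical wrapper, and it assumes that `bedtools` is on your `PATH`, that the regions file has four columns, and that the chunk list has the seven columns described earlier. With `-wo`, bedtools writes each region line followed by the overlapping chunk line and the overlap length, so columns 10 and 11 of the combined output are the two VCF paths.

```shell
# Report, for each region, the genotype and functional annotation VCFs
# of the chunk(s) it overlaps.
# Assumes a 4-column regions file (-a) and the 7-column chunk list (-b):
# -wo output = 4 region columns + 7 chunk columns + 1 overlap length,
# so chunk columns 6 and 7 land in output columns 10 and 11.
# Usage: find_chunks <regions.bed> <aggV2_chunk_names.bed>
find_chunks() {
    bedtools intersect -wo -a "$1" -b "$2" | cut -f 1-4,10-11
}
```

With `-wo`, a region that spans a chunk boundary is reported once per overlapping chunk, and a region that overlaps no chunk is not reported. For a three-column regions file the `cut` field numbers would need shifting by one.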
This prints six-column, tab-delimited output with one line per entry in the regions file, in the following format:
Column number | Description | Example |
---|---|---|
1 | Chromosome | chr2 |
2 | Region start | 213005363 |
3 | Region stop | 213151603 |
4 | Region identifier | IKZF2 |
5 | Full path to genotype VCF | /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/genomic_data/gel_mainProgramme_aggV2_chr2_211052166_213676386.vcf.gz |
6 | Full path to functional annotation VCF | /gel_data_resources/main_programme/aggregation/aggregate_gVCF_strelka/aggV2/functional_annotation/VEP/gel_mainProgramme_aggV2_chr2_211052166_213676386_VEPannot.vcf.gz |
The full array of columns can also be printed by omitting the cut command.