Long-read genomic data from PacBio
This is a dataset of fewer than 100 rare disease samples from the 100,000 Genomes Project (100kGP), re-sequenced with Pacific Biosciences (PacBio) as an example dataset to demonstrate the utility of their HiFi technology.
Data processing
Samples were sequenced on 2-4 SMRT Cells per sample.
Data was processed by PacBio using a Snakemake workflow.
The following software versions were used:
- pbmm2 v1.9.0
- DeepVariant v1.4
- WhatsHap v1.0
- pbsv v2.8.0
- GLnexus v1.4.1
5mCpG pileups were generated using PacBio repeat catalogues.
Phasing
Phasing is based on small variants in reads rather than parent-proband comparisons. Single nucleotide variants and INDELs are identified with DeepVariant and the reads that contain these variants are classified into haplotype group 1 or 2 based on the occurrence of shared variants using WhatsHap.
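To see how reads were split between the two haplotype groups, the haplotype tags on the aligned reads can be tallied. The sketch below assumes that the haplotagged BAM carries the `HP` tag that WhatsHap's haplotag step normally writes, and that records have been converted to SAM text (for example with `samtools view`). The function name `count_haplotype_tags` and the example records are illustrative only and do not come from the dataset.

```python
from collections import Counter


def count_haplotype_tags(sam_lines):
    """Tally reads by HP tag value; reads without an HP tag count as 'untagged'."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        # Optional tags start at column 12 of a SAM record (TAG:TYPE:VALUE).
        haplotype = "untagged"
        for tag in fields[11:]:
            if tag.startswith("HP:i:"):
                haplotype = tag[len("HP:i:"):]
                break
        counts[haplotype] += 1
    return counts


# Made-up records for illustration, not real data.
example_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tHP:i:1\tPS:i:100",
    "read2\t0\tchr1\t120\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tHP:i:2\tPS:i:100",
    "read3\t0\tchr1\t150\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
]
print(count_haplotype_tags(example_sam))
```

Reads that overlap no informative variants would be expected to remain untagged, so a sizeable "untagged" count is not necessarily a sign of a problem.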
This read-based phasing is common in long-read datasets, and blocks of phasing can extend for considerable distances provided the reads are long enough to span repetitive genomic regions.
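The extent of the phase blocks can be estimated from the phased small-variant calls. This sketch assumes a single-sample VCF in which phased genotypes use `|` and carry a phase-set identifier in the `PS` FORMAT field, which is how WhatsHap writes phased output by default. The function name `phase_block_spans` and the example records are hypothetical.

```python
def phase_block_spans(vcf_lines):
    """Return {(chrom, phase_set): (first_pos, last_pos)} for phased genotypes."""
    spans = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        # Pair FORMAT keys (column 9) with the first sample's values (column 10).
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if "|" not in sample.get("GT", "") or sample.get("PS", ".") == ".":
            continue  # unphased genotype or no phase set assigned
        key = (chrom, sample["PS"])
        first, last = spans.get(key, (pos, pos))
        spans[key] = (min(first, pos), max(last, pos))
    return spans


# Made-up records for illustration, not real data.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:1000",
    "chr1\t5000\t.\tC\tT\t50\tPASS\t.\tGT:PS\t1|0:1000",
    "chr1\t9000\t.\tG\tA\t50\tPASS\t.\tGT\t0/1",
    "chr1\t20000\t.\tT\tC\t50\tPASS\t.\tGT:PS\t0|1:20000",
]
for (chrom, phase_set), (start, end) in sorted(phase_block_spans(example_vcf).items()):
    print(chrom, phase_set, start, end, end - start + 1)
```

For a compressed `.vcf.gz` file, the lines could be supplied by iterating over `gzip.open(path, "rt")` instead of a list.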
In this pipeline, de-novo assembly was additionally performed using only the reads that were classified into one haplotype group or the other.
Datasets in Genomics England
Data is made available via the LabKey table rare_disease_pacbio_pilot.
Example of the files and description
File type | Description |
---|---|
*.asm.bp.hap1.p_ctg.fasta.gz | De-novo assemblies - haplotype1 |
*.asm.bp.hap2.p_ctg.fasta.gz | De-novo assemblies - haplotype2 |
*.asm.GRCh38.bam | Alignment of de-novo assemblies to reference |
*.asm.GRCh38.bam.bai | Alignment of de-novo assemblies to reference - index file |
*.GRCh38.combined.denovo.bed | De-novo assemblies |
*.GRCh38.combined.denovo.bw | De-novo assemblies |
*.GRCh38.combined.denovo.mincov10.bed | De-novo assemblies |
*.GRCh38.combined.denovo.mincov10.bw | De-novo assemblies |
*.GRCh38.deepvariant.haplotagged.bam | Aligned hifi reads, phased on small variants |
*.GRCh38.deepvariant.haplotagged.bam.bai | Aligned hifi reads, phased on small variants - index file |
*.GRCh38.deepvariant.phased.vcf.gz | Phased small variant calls |
*.GRCh38.deepvariant.phased.vcf.gz.tbi | Phased small variant calls - index file |
*.GRCh38.hap1.denovo.bed | De-novo assemblies - haplotype1 |
*.GRCh38.hap1.denovo.bw | De-novo assemblies - haplotype1 |
*.GRCh38.hap1.denovo.mincov10.bed | De-novo assemblies - haplotype1 |
*.GRCh38.hap1.denovo.mincov10.bw | De-novo assemblies - haplotype1 |
*.GRCh38.hap2.denovo.bed | De-novo assemblies - haplotype2 |
*.GRCh38.hap2.denovo.bw | De-novo assemblies - haplotype2 |
*.GRCh38.hap2.denovo.mincov10.bed | De-novo assemblies - haplotype2 |
*.GRCh38.hap2.denovo.mincov10.bw | De-novo assemblies - haplotype2 |
*.GRCh38.pbsv.vcf.gz | Structural variant calls |
*.GRCh38.pbsv.vcf.gz.tbi | Structural variant calls - index file |
*.hifi_reads.bam | Unaligned consensus reads, the output of a single hifi sequencing run |
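
When working from a directory listing, file names can be mapped back to the descriptions in the table by matching on their suffixes. The following is a minimal sketch that covers only a subset of the patterns above; the `describe` helper and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase

# A subset of the file-type patterns and descriptions from the table above.
FILE_TYPES = [
    ("*.GRCh38.deepvariant.haplotagged.bam", "Aligned hifi reads, phased on small variants"),
    ("*.GRCh38.deepvariant.phased.vcf.gz", "Phased small variant calls"),
    ("*.GRCh38.pbsv.vcf.gz", "Structural variant calls"),
    ("*.asm.GRCh38.bam", "Alignment of de-novo assemblies to reference"),
    ("*.hifi_reads.bam", "Unaligned consensus reads, the output of a single hifi sequencing run"),
]


def describe(file_name):
    """Return the description for a file name, or None if no pattern matches."""
    for pattern, description in FILE_TYPES:
        # fnmatchcase is case-sensitive on every platform, unlike fnmatch.
        if fnmatchcase(file_name, pattern):
            return description
    return None


# Hypothetical file names for illustration.
for name in ["sampleA.GRCh38.pbsv.vcf.gz", "run1.hifi_reads.bam", "notes.txt"]:
    print(name, "->", describe(name))
```

Index files (`.bai`, `.tbi`) are deliberately left out of this subset, so they return `None` unless their patterns are added to `FILE_TYPES`.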