somAgg sample stats
This page summarises information about the 16,341 samples present in somAgg. All samples present in cancer_analysis in the Main Programme Release v12 have been included. These are all the somatic samples that have been successfully sequenced and interpreted.
Single sample sequencing and variant calling
SomAgg combines the somatic VCF files generated by Strelka and annotated with CellBase. Each tumour sample has a matched germline sample; both were deep whole-genome sequenced, at average coverages of 100x and 30x respectively. Only a few patients had more than one tumour sample sequenced.
Samples were prepared using an Illumina TruSeq DNA Nano, TruSeq DNA PCR-Free or FFPE library preparation kit and then sequenced on a HiSeq X, generating 150 bp paired-end reads. Illumina’s North Star pipeline (version 2.6.53.23) was used for primary WGS analysis. Read alignment against the human reference genome GRCh38-Decoy+EBV was performed with iSAAC Aligner (version iSAAC-03.16.02.19). Small variant calling together with tumour-normal subtraction was performed using Strelka2 (version 2.4.7).
Strelka FILTERs flag the following germline variant calls as NOT PASS. They are nonetheless included in the single-sample VCF files and in somAgg:
- All calls with a sample depth three times higher than the chromosomal mean
- Site genotype conflicts with proximal indel call. This is typically a heterozygous SNV call made inside of a heterozygous deletion
- Locus read evidence displays unbalanced phasing patterns
- Genotype call from variant caller not consistent with chromosome ploidy
- The fraction of basecalls filtered out at a site > 0.4
- Locus quality score < 14 for het or hom SNP
- Locus quality score < 6 for het, hom or het-alt indels
- Locus quality score < 30 for other small variant types or quality score is not calculated
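Because flagged calls stay in the files, downstream analyses usually need to decide explicitly whether to keep them. The following is a minimal sketch, assuming plain-text, tab-separated VCF records, of how records could be tallied by FILTER label and restricted to PASS. The function names and the example records are hypothetical, and the labels `FILTER_A` and `FILTER_B` are placeholders rather than the labels written by Strelka.

```python
from collections import Counter


def filter_counts(vcf_lines):
    """Count how often each FILTER label occurs across VCF records."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # FILTER is the 7th VCF column; multiple labels are ';'-separated.
        filter_field = line.rstrip("\n").split("\t")[6]
        counts.update(filter_field.split(";"))
    return counts


def passing_records(vcf_lines):
    """Keep only the records whose FILTER column is exactly PASS."""
    return [
        line
        for line in vcf_lines
        if line.strip()
        and not line.startswith("#")
        and line.rstrip("\n").split("\t")[6] == "PASS"
    ]


# Made-up records; the FILTER labels are placeholders, not taken from the files.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.",
    "chr1\t200\t.\tC\tT\t.\tFILTER_A\t.",
    "chr1\t300\t.\tG\tA\t.\tFILTER_A;FILTER_B\t.",
]

print(filter_counts(example))
print(len(passing_records(example)))
```

For real, compressed VCF files a dedicated VCF library or command-line tool is likely a better fit than line splitting; the sketch only shows where the PASS / NOT PASS information lives.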
Strelka FILTERs flag the following somatic variant calls as NOT PASS. They are nonetheless included in the single-sample VCF files and in somAgg:
- All calls with a normal sample depth three times higher than the chromosomal mean
- All calls where the site in the normal sample is not homozygous reference
- Somatic SNV calls with empirically fitted VQSR score < 2.75 (recalibrated quality score expressing the phred scaled probability of the somatic call being a false positive observation)
- Somatic indels where fraction of basecalls filtered out in a window extending 50 bases to either side of the indel call position is > 0.3
- Somatic indels with quality score < 30 (joint probability of the somatic variant and a homozygous reference normal genotype)
- All calls that overlap a LINE repeat region
Note that variants are not removed on the basis of low read count or frequency in the current version of the analysis pipeline.
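The somatic SNV threshold above is described as a phred-scaled probability. As a reminder of what that scale means, here is a small sketch of the standard phred conversion, P = 10^(-Q/10). The function names are hypothetical, and treating any particular Strelka score as mapping this directly onto an error probability is an assumption based on the description above.

```python
import math


def phred_to_error_probability(score):
    """Convert a phred-scaled score Q to a probability P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


def error_probability_to_phred(probability):
    """Convert a probability P (0 < P <= 1) to a phred-scaled score."""
    return -10 * math.log10(probability)


# A score of 30 corresponds to a 1 in 1,000 chance of error.
print(phred_to_error_probability(30))
# Lower scores correspond to higher error probabilities.
print(phred_to_error_probability(2.75))
```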
Single sample decomposition
The annotated small variant VCF files used as input have been decomposed. The annotated single-sample VCF files are generated from the somatic small variant VCF files (somatic_small_variants_vcf_path in cancer_analysis) produced by the variant calling pipeline, which comprises Strelka2 for calling and vt for decomposition. In other words, the somatic variant VCF files are the ones decomposed, so they contain no multi-allelic entries: each multi-allelic variant is represented by two or more bi-allelic records. vt performs the decomposition in three steps:
- Decompose variants of the same length.
vt decompose_blocksub -p {vcf_input} -o {vcf_output}
- Split records with multiple alternate alleles into multiple bi-allelic records. For example, a 1/2 genotype is split into 1/. and ./1. The -s (“smart”) option retains INFO and FORMAT fields of type A and R and decomposes them appropriately.
vt decompose -s {vcf_input} -o {vcf_output}
- Left-align indels and trim redundant bases. The “non-ambiguous” reference genome, which contains only A, T, G, C and N characters, is used.
vt normalize -n -w {window_size} -r {reference} {vcf_input} -o {vcf_output}
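To make the second step more concrete, the sketch below reproduces the genotype recoding from the 1/2 example above in plain Python. It is an illustration only and not vt's implementation: the function name is hypothetical, and keeping reference alleles as 0 while replacing other alternate alleles with a missing value is an assumption extrapolated from that single example.

```python
def split_multiallelic(ref, alts, genotype):
    """Return one (ref, alt, genotype) tuple per alternate allele.

    In each output record the allele of interest is recoded as 1,
    reference alleles stay 0, and any other allele becomes missing ('.').
    """
    separator = "|" if "|" in genotype else "/"
    called = genotype.split(separator)
    records = []
    for index, alt in enumerate(alts, start=1):
        recoded = []
        for allele in called:
            if allele == "0":
                recoded.append("0")
            elif allele == str(index):
                recoded.append("1")
            else:
                recoded.append(".")
        records.append((ref, alt, separator.join(recoded)))
    return records


# One record with two alternate alleles and a 1/2 genotype
# becomes two bi-allelic records.
for record in split_multiallelic("A", ["C", "T"], "1/2"):
    print(record)
# ('A', 'C', '1/.')
# ('A', 'T', './1')
```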
Genotype-level metrics
All 16,341 samples included in somAgg have successfully passed our internal sequencing and interpretation pipeline. These samples are listed in the LabKey table cancer_analysis. Some quality control statistics for these samples are provided below.
Sample Attribute | Description |
---|---|
Tumour Cross-Contamination | less than 5% |
Germline Cross-Contamination | less than 3% |
Median Fragment Size | greater than 279bp |
Excess of Chimeric Reads | mean of 0.3% |
Percentage of Mapped Reads | mean of 93.4% |
Percentage AT Dropout | mean of 3.1% |
Sample source and library preparation
The vast majority of samples (about 89%) were collected by surgical resection.
tissue_source | number_of_samples | percent_of_samples |
---|---|---|
SURGICAL RESECTION | 14602 | 89.36 |
NOT SPECIFIED | 521 | 3.19 |
USS GUIDED BIOPSY | 490 | 3.00 |
ENDOSCOPIC BIOPSY | 227 | 1.39 |
NON GUIDED BIOPSY | 136 | 0.83 |
BMA TUMOUR SORTED CELLS | 133 | 0.81 |
NON STANDARD BIOPSY | 85 | 0.52 |
CT GUIDED BIOPSY | 69 | 0.42 |
STEREOTACTICALLY GUIDED BIOPSY | 49 | 0.30 |
ENDOSCOPIC ULTRASOUND GUIDED FNA | 12 | 0.07 |
LAPAROSCOPIC EXCISION | 8 | 0.05 |
MRI GUIDED BIOPSY | 6 | 0.04 |
ENDOSCOPIC ULTRASOUND GUIDED BIOPSY | 3 | 0.02 |
In addition, the majority of somAgg samples (~92%) are fresh-frozen (FF), and ~88% were prepared with a PCR-free library.
library_type | preparation_method | number_of_samples | percent_of_samples |
---|---|---|---|
PCR-Free | FF | 13711 | 83.91 |
PCR | FF | 1285 | 7.86 |
PCR | FFPE | 602 | 3.68 |
PCR-Free | EDTA | 494 | 3.02 |
PCR-Free | ASPIRATE | 124 | 0.76 |
PCR-Free | CD128 SORTED CELLS | 70 | 0.43 |
PCR | CD128 SORTED CELLS | 25 | 0.15 |
PCR | EDTA | 19 | 0.12 |
PCR | ASPIRATE | 8 | 0.05 |
PCR-Free | FFPE | 3 | 0.02 |
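The approximate shares quoted above the table can be recomputed from its rows. The sketch below copies the counts from the table; the variable and function names are hypothetical.

```python
# (library_type, preparation_method, number_of_samples), copied from the table.
rows = [
    ("PCR-Free", "FF", 13711),
    ("PCR", "FF", 1285),
    ("PCR", "FFPE", 602),
    ("PCR-Free", "EDTA", 494),
    ("PCR-Free", "ASPIRATE", 124),
    ("PCR-Free", "CD128 SORTED CELLS", 70),
    ("PCR", "CD128 SORTED CELLS", 25),
    ("PCR", "EDTA", 19),
    ("PCR", "ASPIRATE", 8),
    ("PCR-Free", "FFPE", 3),
]

total = sum(count for _, _, count in rows)


def percent_where(predicate):
    """Percentage of all samples whose row satisfies the predicate."""
    matching = sum(count for row in rows if predicate(row) for count in [row[2]])
    return 100 * matching / total


ff_percent = percent_where(lambda row: row[1] == "FF")
pcr_free_percent = percent_where(lambda row: row[0] == "PCR-Free")

print(total)                       # 16341
print(round(ff_percent, 2))        # 91.77
print(round(pcr_free_percent, 2))  # 88.13
```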
As expected, FFPE samples show increased AT dropout, but overall the vast majority of samples have a good mapping rate.