somAgg sample stats

somAgg contains 16,341 samples. These are all the samples present in cancer_analysis in 100kGP Release v12, that is, all the somatic samples that were successfully sequenced and interpreted.

Single sample sequencing and variant calling

somAgg combines the single-sample somatic VCF files generated by Strelka and annotated with CellBase. Each tumour sample has a matched germline sample; both were deep whole-genome sequenced, with an average coverage of 100x for the tumour and 30x for the germline. A few patients had more than one tumour sample sequenced.

Step	Method
Preparation	Illumina TruSeq DNA Nano, TruSeq DNA PCR-Free or FFPE library preparation kit
Sequencing	HiSeq X generating 150 bp paired-end reads
Primary WGS analysis	Illumina’s North Star pipeline (version 2.6.53.23)
Read alignment	against human reference genome GRCh38-Decoy+EBV with iSAAC Aligner (version iSAAC-03.16.02.19)
Small variant calling together with tumour-normal subtraction	Strelka2 (version 2.4.7)

Strelka FILTERs flag the following germline variant calls as NOT PASS; they are nonetheless included in the single-sample VCF files and in somAgg:

All calls with a sample depth three times higher than the chromosomal mean
Site genotype conflicts with proximal indel call. This is typically a heterozygous SNV call made inside of a heterozygous deletion
Locus read evidence displays unbalanced phasing patterns
Genotype call from variant caller not consistent with chromosome ploidy
The fraction of basecalls filtered out at a site > 0.4
Locus quality score < 14 for het or hom SNP
Locus quality score < 6 for het, hom or het-alt indels
Locus quality score < 30 for other small variant types or quality score is not calculated
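
Because flagged calls are kept in the files, an analysis that wants only PASS calls has to drop the others itself. The following is a minimal sketch, assuming plain-text (uncompressed) VCF lines and the standard VCF layout in which the seventh column is FILTER. The function name pass_only, the example records and the filter label are hypothetical and are not part of any somAgg tooling.

```python
from typing import Iterable, Iterator


def pass_only(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and the records whose FILTER column is exactly PASS."""
    for line in vcf_lines:
        if line.startswith("#"):
            # Meta-information and column header lines are always kept.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # In the VCF layout, column 7 (index 6) holds FILTER.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


# Hypothetical records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t.\tHypotheticalFilter\t.\n",
]
kept = list(pass_only(example))
print(len(kept))
```

Here the two header lines and the single PASS record are kept, and the flagged record is dropped.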

Single sample decomposition

The annotated small variant VCF files used as input have been decomposed. They are generated from the somatic small variant VCF files (somatic_small_variants_vcf_path in cancer_analysis) produced by the variant calling pipeline, which comprises Strelka2 for calling and vt for decomposition. As a result, the input files contain no multi-allelic entries: each multi-allelic site is represented by two or more bi-allelic variants.
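
To make the bi-allelic representation concrete, the sketch below splits one multi-allelic call into bi-allelic records, following the convention described for vt decompose in this section (a 1/2 genotype becomes 1/. and ./1). It is an illustration only and not vt's implementation. The function name split_multiallelic is hypothetical, and the way reference (0) and missing (.) alleles are carried over unchanged is an assumption of this sketch.

```python
from typing import List, Tuple


def split_multiallelic(ref: str, alts: List[str], gt: str) -> List[Tuple[str, str, str]]:
    """Split one multi-allelic call into bi-allelic (REF, ALT, GT) records."""
    sep = "|" if "|" in gt else "/"
    alleles = gt.split(sep)
    records = []
    for alt_index, alt in enumerate(alts, start=1):
        new_alleles = []
        for allele in alleles:
            if allele in ("0", "."):
                # Assumption: reference and missing alleles are kept as they are.
                new_alleles.append(allele)
            elif int(allele) == alt_index:
                # The ALT allele of this record becomes allele 1.
                new_alleles.append("1")
            else:
                # Any other ALT allele is set to missing in this record.
                new_alleles.append(".")
        records.append((ref, alt, sep.join(new_alleles)))
    return records


print(split_multiallelic("A", ["C", "T"], "1/2"))
# [('A', 'C', '1/.'), ('A', 'T', './1')]
```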

The decomposition is done by vt in three steps:

Decompose block substitutions (variants whose REF and ALT alleles have the same length) into their constituent SNVs.

vt decompose_blocksub -p {vcf_input} -o {vcf_output}

Split records with multiple alternate alleles into multiple bi-allelic records; for example, a 1/2 genotype is split into 1/. and ./1. The -s (“smart”) option retains INFO and FORMAT fields of type A and R and decomposes them appropriately.

vt decompose -s {vcf_input} -o {vcf_output}

Left-align indels and trim redundant bases. The “non-ambiguous” reference genome is used; this file contains only the characters A, T, G, C and N.

vt normalize -n -w {window_size} -r {reference} {vcf_input} -o {vcf_output}
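
The three commands above run in sequence, with the output of each step used as the input of the next. The following is a minimal sketch of that chaining, assuming vt is available on PATH. The function names, the intermediate file names and the tmp_prefix parameter are hypothetical; the subcommands and flags are copied from the commands above, and the window size is left as a parameter because its value is not stated here.

```python
import subprocess
from typing import List


def build_vt_commands(vcf_input: str, vcf_output: str, reference: str,
                      window_size: int, tmp_prefix: str) -> List[List[str]]:
    """Build the three vt commands so that each step reads the previous output."""
    # Hypothetical intermediate file names.
    blocksub_vcf = f"{tmp_prefix}.blocksub.vcf"
    decomposed_vcf = f"{tmp_prefix}.decomposed.vcf"
    return [
        # Step 1: decompose variants of the same length.
        ["vt", "decompose_blocksub", "-p", vcf_input, "-o", blocksub_vcf],
        # Step 2: split multi-allelic records into bi-allelic records.
        ["vt", "decompose", "-s", blocksub_vcf, "-o", decomposed_vcf],
        # Step 3: left-align indels and trim redundant bases.
        ["vt", "normalize", "-n", "-w", str(window_size),
         "-r", reference, decomposed_vcf, "-o", vcf_output],
    ]


def run_vt_pipeline(commands: List[List[str]]) -> None:
    """Run the commands in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Building the commands separately from running them makes it possible to inspect or log the exact command lines before anything is executed.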

Genotype-level metrics

All 16,341 samples included in somAgg have successfully passed our internal sequencing and interpretation pipeline. These samples are listed in the LabKey table cancer_analysis. Some quality control statistics for these samples are provided below.

Sample Attribute	Description
Tumour Cross-Contamination	less than 5%
Germline Cross-Contamination	less than 3%
Median Fragment Size	greater than 279 bp
Excess of Chimeric Reads	mean of 0.3%
Percentage of Mapped Reads	mean of 93.4%
Percentage AT Dropout	mean of 3.1%

Sample source and library preparation

The vast majority of samples (89.36%) were collected by surgical resection.

tissue_source	number_of_samples	percent_of_samples
SURGICAL RESECTION	14602	89.36
NOT SPECIFIED	521	3.19
USS GUIDED BIOPSY	490	3.00
ENDOSCOPIC BIOPSY	227	1.39
NON GUIDED BIOPSY	136	0.83
BMA TUMOUR SORTED CELLS	133	0.81
NON STANDARD BIOPSY	85	0.52
CT GUIDED BIOPSY	69	0.42
STEREOTACTICALLY GUIDED BIOPSY	49	0.30
ENDOSCOPIC ULTRASOUND GUIDED FNA	12	0.07
LAPAROSCOPIC EXCISION	8	0.05
MRI GUIDED BIOPSY	6	0.04
ENDOSCOPIC ULTRASOUND GUIDED BIOPSY	3	0.02

In addition, the majority of somAgg samples are fresh-frozen (FF, ~92%) and were prepared with a PCR-free library (~88%).

library_type	preparation_method	number_of_samples	percent_of_samples
PCR-Free	FF	13711	83.91
PCR	FF	1285	7.86
PCR	FFPE	602	3.68
PCR-Free	EDTA	494	3.02
PCR-Free	ASPIRATE	124	0.76
PCR-Free	CD128 SORTED CELLS	70	0.43
PCR	CD128 SORTED CELLS	25	0.15
PCR	EDTA	19	0.12
PCR	ASPIRATE	8	0.05
PCR-Free	FFPE	3	0.02
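
The approximate FF and PCR-free shares can be recomputed directly from the counts in the table above. This is a small worked example; the variable names are illustrative and the counts are copied from the table.

```python
# (library_type, preparation_method, number_of_samples), copied from the table above.
rows = [
    ("PCR-Free", "FF", 13711),
    ("PCR", "FF", 1285),
    ("PCR", "FFPE", 602),
    ("PCR-Free", "EDTA", 494),
    ("PCR-Free", "ASPIRATE", 124),
    ("PCR-Free", "CD128 SORTED CELLS", 70),
    ("PCR", "CD128 SORTED CELLS", 25),
    ("PCR", "EDTA", 19),
    ("PCR", "ASPIRATE", 8),
    ("PCR-Free", "FFPE", 3),
]

total = sum(count for _, _, count in rows)
# Share of samples prepared as fresh-frozen (FF), over both library types.
ff_pct = 100 * sum(count for _, prep, count in rows if prep == "FF") / total
# Share of samples with a PCR-free library, over all preparation methods.
pcr_free_pct = 100 * sum(count for lib, _, count in rows if lib == "PCR-Free") / total

print(total, round(ff_pct, 2), round(pcr_free_pct, 2))
# 16341 91.77 88.13
```

The total matches the 16,341 samples in somAgg, and the two shares round to the ~92% and ~88% quoted above.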

As expected, AT dropout is higher for FFPE samples, but overall the vast majority of samples have a good mapping rate.

Tumour purity and coverage