Transcriptomics pilot data¶
The GEL Transcriptomics Pilot comprises RNA-sequencing of a subset (>5,000) of rare disease probands from the 100,000 Genomes Project who did not receive a genetic diagnosis through the Genomics England Interpretation Pipeline. We prioritised probands who were found to carry variants of unknown significance.
Priorities were based on:
- Variants highlighted through SpliceAI
- Autosomal recessive disorders with only a single pathogenic variant identified
- GMC-selected VUS AND contribution to phenotype partial / unknown AND variant type likely to affect RNA processing
- Based on outcome questionnaire and a call to clinicians
- VUS with a high Exomiser score AND variant likely to result in detectable abnormal RNA processing
- Disorder category ranking by Genomics England on the basis of likely monogenic cause (ranks 1-5) for participants from 1.1 AND no diagnosis in outcome questionnaire
- Call to GMCs / clinicians to propose cases based on strong phenotype for a monogenic disorder with no lead from WGS
- Review whether RNA sample is available or requirement for fresh RNA sample
Protocol¶
Sequencing¶
- We derived total RNA from peripheral whole blood prepared using QIAGEN PAXgene Blood RNA Kit.
- We analysed isolated total RNA on an Agilent Tapestation 4200 system for RNA integrity number (RIN) and DV200 quality check.
- We constructed libraries following the Illumina Stranded Total RNA Prep, Ligation with Ribo-Zero Plus kit protocol.
- We sequenced 95 samples per plate (one empty well) simultaneously on the NovaSeq 6000 using 2x100bp paired-end reads.
Data processing¶
We used the Illumina DRAGEN RNA Pipeline (v3.8.4) for read alignment. We mapped reads to the human genome reference GRCh38 (alt-aware with HLAs) with annotation-assisted alignment, duplicate marking, gene expression quantification, and gene fusion detection enabled. We used the GENCODE v32 GTF file for gene annotation. The Illumina DRAGEN aligner and gene expression quantifier are implementations of STAR and Salmon, respectively.
Path to the reference genome in the RE:
/public_data_resources/reference/GRCh38DeAlt_HLA/GRCh38_full_analysis_set_plus_decoy_hla.fa
Path to the GTF file in the RE:
/public_data_resources/GENCODE/v32/GRCh38/gencode.v32.annotation.gtf
Data available in the Genomics England Research Environment¶
DRAGEN output¶
The data delivered by Illumina from running the DRAGEN RNA Pipeline are available in /gel_data_resources, with individual deliveries subdivided by delivery date (see the example delivery below). You can generate lists of file paths of interest from the transcriptome_file_paths_and_types LabKey table by filtering for specific file types and participants.
Primary folder: /gel_data_resources/RNASeq_data/Rare_Disease/
For each sample the following output files are available:
/gel_data_resources/RNASeq_data/Rare_Disease/DELIVERY_DATE/DELIVERY_ID/
└── RNA_PLATEKEY
    ├── fastqs
    │   ├── RNA_PLATEKEY_SNN_L001_R1_001.fastq.gz
    │   ├── RNA_PLATEKEY_SNN_L001_R2_001.fastq.gz
    │   ├── ..
    │   ├── ..
    │   ├── RNA_PLATEKEY_SNN_L00N_R1_001.fastq.gz
    │   └── RNA_PLATEKEY_SNN_L00N_R2_001.fastq.gz
    ├── RNA_PLATEKEY.bam
    ├── RNA_PLATEKEY.bam.bai
    ├── RNA_PLATEKEY.fusion_candidates.final
    ├── RNA_PLATEKEY.quant.genes.sf
    ├── RNA_PLATEKEY.quant.sf
    ├── RNA_PLATEKEY.SJ.out.tab
    └── md5sum.txt
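As an illustration of the layout above, the following Python sketch builds the expected per-sample file paths from a delivery date, delivery ID and platekey. The function name `sample_files` and the placeholder arguments are hypothetical; the transcriptome_file_paths_and_types LabKey table remains the authoritative source of paths. FASTQ files are left out because their sample and lane numbers vary, and SJ.out.tab is left out because of the issue described in this section.

```python
from pathlib import PurePosixPath

# Primary folder as given in this document.
PRIMARY = PurePosixPath("/gel_data_resources/RNASeq_data/Rare_Disease")

# Per-sample output suffixes taken from the directory tree above.
# FASTQs and SJ.out.tab are deliberately not included (see lead-in).
SUFFIXES = (
    ".bam",
    ".bam.bai",
    ".fusion_candidates.final",
    ".quant.genes.sf",
    ".quant.sf",
)


def sample_files(delivery_date, delivery_id, platekey):
    """Return {suffix: path} for one sample, following the layout above."""
    sample_dir = PRIMARY / delivery_date / delivery_id / platekey
    return {suffix: sample_dir / f"{platekey}{suffix}" for suffix in SUFFIXES}


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    files = sample_files("DELIVERY_DATE", "DELIVERY_ID", "RNA_PLATEKEY")
    for suffix, path in files.items():
        print(suffix, path)
```

The sketch only constructs path strings; it does not check that the files exist in the Research Environment.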
Corrupted SJ.out.tab files
Due to an error in the Illumina DRAGEN RNA Pipeline v3.8.4, the RNA_PLATEKEY.SJ.out.tab files contain incorrect information in the intron motif column: motifs of types 4, 5 and 6 have been incorrectly assigned to types 1, 2 and 3. The paths to the affected files have been removed from the current data release, but the files are still available in the /RNA_PLATEKEY/ directories. Each /RNA_PLATEKEY/ directory contains a README.md file describing the issue in more detail.
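If you still want to use the junction coordinates and read counts from these files, one cautious approach is to read them while discarding the intron motif column. The sketch below assumes the files follow the nine-column, tab-separated layout of STAR's SJ.out.tab; that assumption, the column names and the helper `read_junctions` are illustrative and should be checked against the README.md in each sample directory.

```python
import csv
import io

# Column names follow the STAR SJ.out.tab convention (an assumption here).
SJ_COLUMNS = (
    "chrom", "intron_start", "intron_end", "strand", "intron_motif",
    "annotated", "unique_reads", "multimap_reads", "max_overhang",
)


def read_junctions(handle):
    """Yield junction records as dicts, omitting the unreliable motif column."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        record = dict(zip(SJ_COLUMNS, row))
        # Drop the column affected by the DRAGEN v3.8.4 issue.
        record.pop("intron_motif", None)
        for key in ("intron_start", "intron_end", "unique_reads", "multimap_reads"):
            record[key] = int(record[key])
        yield record


if __name__ == "__main__":
    # Made-up line for illustration only.
    example = io.StringIO("chr1\t14830\t14969\t2\t2\t1\t25\t3\t38\n")
    for junction in read_junctions(example):
        print(junction)
```

Dropping the column avoids relying on the mislabelled motif types; it does not attempt to recover the correct motif values.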
RNA-Seq QC output¶
We ran the sample-level RNA sequencing data through an internally developed pipeline, generating quality control metrics across the entire cohort. Evaluation included examining raw read quality, alignment quality, and whole genome DNA-RNA sample matching. Some sample-level output files from the tools below have been aggregated into simple TSV files (see below) to allow comparison across the full dataset.
The table below lists the software used by the pipeline to generate QC metrics and the output files that fed into the aggregated files. FastQC, RNA-SeQC 2 and RSeQC generated generic quality metrics, whereas Somalier assessed the relatedness between WGS and RNASeq data to confirm that each RNASeq sample matched its expected WGS sample.
Tool | Version | Original output file types |
---|---|---|
FastQC | 0.11.9 | summary.txt |
RNA-SeQC 2 | 2.4.2 | *.bam.metrics.tsv |
RSeQC | 5.0.1 | *.geneBodyCoverage.txt |
Somalier | 0.2.18 | *.pairs.tsv *.samples.tsv |
Sample-level and aggregated QC files output by the pipeline can be accessed at:
/gel_data_resources/RNASeq_data/qc_results/Rare_Disease/<main_programme_data_release_version>
Accompanying QC related Labkey table¶
A subset of the QC metrics are available in the Labkey table accompanying this dataset, namely rnaseq_qc_metrics. This table can be used to screen samples based on their quality control metrics before conducting any subsequent analysis. Below is a breakdown of the metrics included in the table and the software used to generate them. These metrics were considered the most likely to provide an insight into the quality of a given sample, and thus most valuable during sample screening.
Column name | Metric | Description | Software |
---|---|---|---|
participant_id | Participant ID | Genomics England participant identifier | N/A |
rnaseq_platekey | RNA-Seq platekey | Sample identifier for RNASeq data | N/A |
wgs_platekey | WGS platekey | Sample identifier for WGS data | N/A |
rna_folder_path | RNA (data) folder path | Path to the folder containing RNASeq data for a given sample | N/A |
rnaseqc_total_reads | Total Reads | Total input alignments | RNA-SeQC 2 |
rnaseqc_mapping_rate | Mapping rate | Proportion of Mapped Reads (Unique Mapping, Vendor QC Passed Reads that were mapped) to total reads | RNA-SeQC 2 |
rnaseqc_duplicate_rate_of_mapped | Duplicate rate of mapped | Proportion of Mapped Duplicate Reads to Mapped Reads | RNA-SeQC 2 |
rnaseqc_high_quality_rate | High quality rate | Proportion of Mapped Reads that passed the following criteria: aligned as proper pairs, mismatches (NM tag) at or below threshold (--base-mismatch 6), passed mapping quality (MAPQ) threshold (--mapping-quality 60) | RNA-SeQC 2 |
rnaseqc_exonic_rate | Exonic rate | Proportion of Exonic Reads (Mapped Reads for which all aligned segments unambiguously aligned to exons of the same gene) among Mapped Reads | RNA-SeQC 2 |
rnaseqc_intronic_rate | Intronic rate | Proportion of Intronic Reads (Mapped Reads for which all aligned segments unambiguously aligned to the same gene, without intersecting any of its exons) among Mapped Reads | RNA-SeQC 2 |
rnaseqc_intergenic_rate | Intergenic rate | Proportion of Intergenic Reads among Mapped Reads | RNA-SeQC 2 |
rnaseqc_rrna_rate | rRNA rate | Proportion of rRNA Reads among Mapped Reads | RNA-SeQC 2 |
rnaseqc_fragment_length_median | Fragment length median | Median insert size | RNA-SeQC 2 |
rnaseqc_median_3_prime_bias | Median 3' bias | Median 3' bias in read coverage | RNA-SeQC 2 |
rnaseqc_median_exon_cv | Median exon CV | Median coefficient of variation of exon coverage, computed excluding the first and last 500bp of genes, using High Quality Reads mapping fully and exclusively to the gene | RNA-SeQC 2 |
rnaseqc_genes_detected | Genes detected | Number of genes with at least five (default) High Quality Exonic Reads (--detection-threshold 5) | RNA-SeQC 2 |
rseqc_gene_body_coverage_skewness | Gene body coverage skewness | Fisher's coefficient of skewness of the gene body coverage calculated by RSeQC | RSeQC (further calculated with "moments") |
somalier_relatedness | Relatedness | A measure of the similarity of the genotypes at polymorphic loci between matched transcriptomic and WGS DNA samples | Somalier |
somalier_x_depth_mean | X depth mean | Mean depth of sites on X chromosome | Somalier |
somalier_y_depth_mean | Y depth mean | Mean depth of sites on Y chromosome | Somalier |
somalier_x_n | X n (sites) | Number of sites available on X chromosome in RNAseq sample. This is based on a maximum number of chrX sites present in /public_data_resources/somalier/v0.2.18/sites.*.vcf.gz | Somalier |
somalier_y_n | Y n (sites) | Number of sites available on Y chromosome in RNAseq sample. This is based on a maximum number of chrY sites present in /public_data_resources/somalier/v0.2.18/sites.*.vcf.gz | Somalier |
somalier_x_hom_ref | X hom ref | Number of chrX homozygote reference sites found in RNAseq sample | Somalier |
somalier_x_hom_alt | X hom alt | Number of chrX homozygote alternate sites found in RNAseq sample | Somalier |
somalier_x_het | X het | Number of chrX heterozygote sites found in RNAseq sample | Somalier |
illumina_rin | RIN | RNA Integrity Number as measured by an Agilent Tapestation 4200. Values were not used to remove low quality samples before sequencing. All samples were sequenced regardless of their quality in order to define RIN/DV200 thresholds to be used in future RNA-seq projects. | Illumina Provided |
illumina_rin | DV200 | DV200 (percentage of fragments >200 nt) as measured by an Agilent Tapestation 4200. Values were not used to remove low quality samples before sequencing. All samples were sequenced regardless of their quality in order to define RIN/DV200 thresholds to be used in future RNA-seq projects. | Illumina Provided |
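As a rough illustration of sample screening with these columns, the sketch below filters a tab-separated export of the rnaseq_qc_metrics table by minimum values. The export format, the helper `passing_samples` and, in particular, the cut-off values are assumptions made for the example; this document does not recommend specific thresholds.

```python
import csv
import io

# Illustrative cut-offs only; they are NOT recommended by this document.
THRESHOLDS = {
    "rnaseqc_mapping_rate": 0.9,
    "rnaseqc_high_quality_rate": 0.9,
}


def passing_samples(handle, thresholds=THRESHOLDS):
    """Return rnaseq_platekey values whose metrics meet every minimum."""
    keep = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if all(float(row[col]) >= minimum for col, minimum in thresholds.items()):
            keep.append(row["rnaseq_platekey"])
    return keep


if __name__ == "__main__":
    # Made-up rows standing in for a TSV export of the LabKey table.
    demo = io.StringIO(
        "rnaseq_platekey\trnaseqc_mapping_rate\trnaseqc_high_quality_rate\n"
        "SAMPLE_A\t0.97\t0.93\n"
        "SAMPLE_B\t0.64\t0.78\n"
    )
    print(passing_samples(demo))
```

Suitable thresholds depend on the downstream analysis, so any values used in practice should be chosen with reference to the cohort-wide distributions.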
Quality Assessment Summary¶
We provide a summarised overview of a subset of the QC metrics below. Two samples with an extremely high number of reads have been excluded from the figures below.
Summary results of the RNA-Seq QC pipeline, including alignment metrics. The same two samples with an extremely high number of reads have also been excluded from this table.
Metric | Min | Q1 | Median | Mean | Q3 | Max |
---|---|---|---|---|---|---|
rnaseqc_total_reads | 112,939,429 | 217,359,043 | 248,518,554 | 255,393,099 | 285,917,493 | 956,819,802 |
rnaseqc_duplicate_rate_of_mapped | 0.112 | 0.376 | 0.489 | 0.504 | 0.615 | 0.979 |
rnaseqc_exonic_rate | 0.172 | 0.27 | 0.289 | 0.293 | 0.311 | 0.694 |
rnaseqc_fragment_length_median | 145 | 176 | 186 | 186.878 | 196 | 264 |
rnaseqc_genes_detected | 10,731 | 24,943 | 25,999 | 25,685.400 | 26,818.750 | 41,243 |
rnaseqc_high_quality_rate | 0.777 | 0.924 | 0.928 | 0.925 | 0.93 | 0.94 |
rnaseqc_intergenic_rate | 0.023 | 0.049 | 0.051 | 0.051 | 0.053 | 0.256 |
rnaseqc_intronic_rate | 0.25 | 0.597 | 0.62 | 0.616 | 0.638 | 0.727 |
rnaseqc_mapping_rate | 0.64 | 0.953 | 0.966 | 0.96 | 0.975 | 0.995 |
rnaseqc_median_3_prime_bias | 0.336 | 0.47 | 0.491 | 0.489 | 0.507 | 0.643 |
rnaseqc_median_exon_cv | 0.27 | 0.328 | 0.348 | 0.355 | 0.373 | 0.751 |
rnaseqc_rrna_rate | 0 | 0 | 0 | 0 | 0 | 0.024 |
rseqc_gene_body_coverage_skewness | -2.141 | -1.158 | -1.083 | -1.091 | -1.014 | -0.494 |
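A summary with the same six columns can be recomputed for any metric, for example after applying your own sample exclusions. The sketch below uses Python's standard library; the quartile method used for the table above is not stated, so the "inclusive" method chosen here is an assumption and small differences in Q1 and Q3 are possible. The helper name `six_number_summary` is hypothetical.

```python
import statistics


def six_number_summary(values):
    """Return Min, Q1, Median, Mean, Q3 and Max for a list of numbers."""
    values = sorted(values)
    # "inclusive" treats the data as the full population of interest;
    # this choice of quartile method is an assumption.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Min": values[0],
        "Q1": q1,
        "Median": median,
        "Mean": statistics.fmean(values),
        "Q3": q3,
        "Max": values[-1],
    }


if __name__ == "__main__":
    # Made-up mapping rates for illustration only.
    print(six_number_summary([0.91, 0.95, 0.96, 0.97, 0.99]))
```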
Finally, the table below summarises the FastQC results, showing the percentage of samples with each status (FAIL, WARN, PASS) for ten FastQC modules.
Category | FAIL | WARN | PASS |
---|---|---|---|
Adapter Content | 0.5 | 1.2 | 98.3 |
Overrepresented sequences | 1.3 | 7.6 | 91 |
Per base N content | NA | 0.1 | 99.9 |
Per base sequence content | 9.3 | 90.7 | NA |
Per base sequence quality | 93.5 | NA | 6.5 |
Per sequence GC content | 53.3 | 44.4 | 2.3 |
Per sequence quality scores | NA | NA | 100 |
Per tile sequence quality | 0.6 | 0.4 | 98.9 |
Sequence Duplication Levels | 99.5 | 0.5 | NA |
Sequence Length Distribution | NA | 100 | NA |
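A table of this kind can be derived from FastQC summary.txt files, where each line holds a status, a module name and a file name separated by tabs. The sketch below tallies the percentage of summary files per status for each module. It counts per summary file; how the internal pipeline combined multiple FASTQ files per sample is not described here, so this is not a reproduction of the figures above, and the helper `tally_fastqc` is hypothetical.

```python
from collections import Counter, defaultdict


def tally_fastqc(summary_texts):
    """Return {module: {status: percentage}} across FastQC summary.txt contents."""
    counts = defaultdict(Counter)
    for text in summary_texts:
        for line in text.splitlines():
            parts = line.split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            status, module = parts[0], parts[1]
            counts[module][status] += 1
    result = {}
    for module, tally in counts.items():
        total = sum(tally.values())
        result[module] = {
            status: round(100 * n / total, 1) for status, n in tally.items()
        }
    return result


if __name__ == "__main__":
    # Made-up summary.txt contents for two FASTQ files.
    first = "PASS\tAdapter Content\ta.fastq.gz\nFAIL\tSequence Duplication Levels\ta.fastq.gz\n"
    second = "WARN\tAdapter Content\tb.fastq.gz\nFAIL\tSequence Duplication Levels\tb.fastq.gz\n"
    for module, percentages in tally_fastqc([first, second]).items():
        print(module, percentages)
```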
Help and support¶
Please reach out via the Genomics England Service Desk for any issues related to the RNA-Seq datasets and tables, including "RNASeq" in the title/description of your inquiry.