Transcriptomics data - 100kGP Extension
These are a further 2294 samples from 2286 probands:
- Three of these overlap with the pilot project
- For a further eight probands, two samples are provided in this cohort
Data available in the Genomics England Research Environment
DRAGEN output
The data delivered by Illumina from running the DRAGEN RNA Pipeline are available in /gel_data_resources, with individual deliveries subdivided by delivery date (see the example delivery below). You can generate lists of file paths of interest from the transcriptome_file_paths_and_types LabKey table by filtering for specific file types and participants.
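The filtering step can be sketched as follows, assuming the table has already been exported or queried into a list of row dictionaries. The column names `participant_id`, `file_type` and `file_path`, and the function name itself, are illustrative assumptions rather than the confirmed schema of transcriptome_file_paths_and_types, so check them against the table in LabKey before use.

```python
def select_file_paths(rows, file_type, participant_ids,
                      type_col="file_type", id_col="participant_id",
                      path_col="file_path"):
    """Return file paths for one file type and a set of participants.

    `rows` is an iterable of dicts, one per table row. The default column
    names are assumptions; pass the real ones if they differ.
    """
    # Compare IDs as strings so that integer and string IDs both match.
    wanted = {str(pid) for pid in participant_ids}
    return [
        row[path_col]
        for row in rows
        if row[type_col] == file_type and str(row[id_col]) in wanted
    ]
```

Keeping the column names as parameters means the same helper still works if the real table labels its columns differently.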
Primary folder: /gel_data_resources/RNASeq_data/Rare_Disease/
For each sample the following output files are available:
/gel_data_resources/RNASeq_data/Rare_Disease/DELIVERY_DATE/DELIVERY_ID/
└── RNA_PLATEKEY
    ├── fastqs
    │   ├── RNA_PLATEKEY_SNN_L001_R1_001.fastq.gz
    │   ├── RNA_PLATEKEY_SNN_L001_R2_001.fastq.gz
    │   ├── ..
    │   ├── ..
    │   ├── RNA_PLATEKEY_SNN_L00N_R1_001.fastq.gz
    │   └── RNA_PLATEKEY_SNN_L00N_R2_001.fastq.gz
    ├── RNA_PLATEKEY.bam
    ├── RNA_PLATEKEY.bam.bai
    ├── RNA_PLATEKEY.bam.md5sum
    ├── RNA_PLATEKEY.Chimeric.out.junction
    ├── RNA_PLATEKEY.fusion_candidates.features.csv
    ├── RNA_PLATEKEY.fusion_candidates.filter_info
    ├── RNA_PLATEKEY.fusion_candidates.final
    ├── RNA_PLATEKEY.fusion_candidates.preliminary
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz.md5sum
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz.tbi
    ├── RNA_PLATEKEY.fusion_metrics.csv
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz.md5sum
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz.tbi
    ├── RNA_PLATEKEY.insert-stats.tab
    ├── RNA_PLATEKEY.mapping_metrics.csv
    ├── RNA_PLATEKEY.metrics.json
    ├── RNA_PLATEKEY.quant.eq_classes.txt
    ├── RNA_PLATEKEY.quant.genes.sf
    ├── RNA_PLATEKEY.quant_metrics.csv
    ├── RNA_PLATEKEY.quant.sf
    ├── RNA_PLATEKEY.quant.transcript_coverage.txt
    ├── RNA_PLATEKEY.quant.transcript_fragment_lengths.txt
    ├── RNA_PLATEKEY.SJ.out.tab
    ├── RNA_PLATEKEY.SJ.saturation.txt
    ├── RNA_PLATEKEY.stats.json
    ├── RNA_PLATEKEY.time_metrics.csv
    ├── RNA_PLATEKEY.trimmer_metrics.csv
    ├── RNA_PLATEKEY.unfiltered.SJ.out.tab
    ├── RNA_PLATEKEY.vcf.gz
    ├── RNA_PLATEKEY.vcf.gz.md5sum
    ├── RNA_PLATEKEY.vcf.gz.tbi
    ├── RNA_PLATEKEY.vc_hethom_ratio_metrics.csv
    ├── RNA_PLATEKEY.vc_metrics.csv
    ├── RNA_PLATEKEY.wgs_contig_mean_cov.csv
    ├── RNA_PLATEKEY.wgs_coverage_metrics.csv
    ├── RNA_PLATEKEY.wgs_fine_hist.csv
    ├── RNA_PLATEKEY.wgs_hist.csv
    ├── RNA_PLATEKEY.wgs_overall_mean_cov.csv
    └── md5sum.txt
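Given the layout above, the paths to a sample's outputs can be built from the delivery date, delivery ID and platekey. The following is a minimal sketch: the function name and the default selection of suffixes are illustrative choices, and the three arguments stand for the DELIVERY_DATE, DELIVERY_ID and RNA_PLATEKEY placeholders shown in the tree.

```python
from pathlib import PurePosixPath

# Primary folder as given in this documentation.
PRIMARY_FOLDER = PurePosixPath("/gel_data_resources/RNASeq_data/Rare_Disease")

# A few of the per-sample file suffixes listed in the tree above.
DEFAULT_SUFFIXES = ("bam", "quant.sf", "quant.genes.sf", "SJ.out.tab")


def sample_output_paths(delivery_date, delivery_id, platekey,
                        suffixes=DEFAULT_SUFFIXES):
    """Map each suffix to the expected path of that file for one sample."""
    sample_dir = PRIMARY_FOLDER / delivery_date / delivery_id / platekey
    return {suffix: sample_dir / f"{platekey}.{suffix}" for suffix in suffixes}
```

This only constructs paths and does not check that the files exist; for real work, the LabKey table remains the more reliable source of file locations.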
The contents of these files are described in the Illumina DRAGEN pipeline documentation:
- DRAGEN 4.2 RNA pipeline Outputs page
- DRAGEN 4.2 RNA pipeline Gene fusion page
- DRAGEN 4.2 RNA pipeline Gene Expression quantification page
- Other pages in the same menu within the DRAGEN 4.2 DNA pipeline section, including the various metrics pages such as the QC pages
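As an illustration of working with one of these outputs, the sketch below reads transcript-level TPM values from a quant.sf file. It assumes the file is tab-separated with a header row containing `Name` and `TPM` columns, as in Salmon-style quantification output; confirm the layout against the Gene Expression quantification page before relying on it. The function name is hypothetical.

```python
import csv


def read_tpm(lines, name_col="Name", tpm_col="TPM"):
    """Return {transcript name: TPM} from the lines of a quant.sf file.

    `lines` is any iterable of text lines, for example an open file handle.
    The column names are assumptions about the file layout.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row[name_col]: float(row[tpm_col]) for row in reader}


# Example usage (path components are the placeholders from the tree above):
# with open("RNA_PLATEKEY.quant.sf", newline="") as handle:
#     tpm_by_transcript = read_tpm(handle)
```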
RNA-Seq QC output
We ran the sample-level RNA sequencing data through an internally developed pipeline to generate quality control metrics across the entire cohort. The evaluation covered raw read quality, alignment quality, and matching of RNA-Seq samples to whole genome DNA samples. Some sample-level output files from the tools below have been aggregated into simple TSV files (see below) to allow comparison across the full dataset.
The table below lists the software the pipeline used to generate QC metrics and the output files that fed into the aggregated files. FastQC, RNA-SeQC 2 and RSeQC were used to generate generic quality metrics, whereas Somalier was used to assess relatedness between the WGS and RNA-Seq data, to confirm that each RNA-Seq sample matched the expected WGS data.
Tool | Version | Original output file types |
---|---|---|
FastQC | 0.12.1 | summary.txt |
RNA-SeQC 2 | 2.4.2 | *.bam.metrics.tsv |
RSeQC | 5.0.1 | *.geneBodyCoverage.txt |
Somalier | 0.2.18 | *.pairs.tsv, *.samples.tsv |
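To illustrate what aggregating sample-level output into a single TSV can look like, the sketch below combines FastQC summary.txt files, whose lines have three tab-separated fields (status, module name, file name), into one row per sample. This is not the internal pipeline's implementation: the function names, the "sample" column header and the "NA" fill value are assumptions made for the example.

```python
import csv
import io


def parse_fastqc_summary(lines):
    """Return {module: status} from the lines of one FastQC summary.txt."""
    statuses = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def aggregate_summaries(summaries):
    """Build TSV text with one row per sample and one column per module.

    `summaries` maps a sample ID to the dict returned by
    parse_fastqc_summary. Modules missing for a sample are filled with "NA".
    """
    modules = sorted({m for per_sample in summaries.values() for m in per_sample})
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + modules)
    for sample in sorted(summaries):
        row = [summaries[sample].get(m, "NA") for m in modules]
        writer.writerow([sample] + row)
    return out.getvalue()
```

A wide layout like this makes it straightforward to scan one QC module across the whole cohort.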
Help and support
Please reach out via the Genomics England Service Desk for any issues related to the RNA-Seq datasets and tables, including "RNASeq" in the title/description of your inquiry.