Transcriptomics data - 100kGP Extension

The extension consists of a further 2,294 samples from 2,286 probands:

Three of these probands overlap with the pilot project
For a further eight probands, two samples are provided in this cohort

Data available in the Genomics England Research Environment

DRAGEN output

The data delivered by Illumina from running the DRAGEN RNA Pipeline are available under /gel_data_resources, with individual deliveries organised into subfolders by delivery date (see the example delivery below). To generate lists of file paths of interest, filter the transcriptome_file_paths_and_types LabKey table by file type and participant.

Primary folder: /gel_data_resources/RNASeq_data/Rare_Disease/
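As a rough illustration of the filtering step described above, the sketch below selects file paths from a TSV export of the transcriptome_file_paths_and_types table. The column names (`participant_id`, `file_type`, `file_path`), the file type value and the example rows are all hypothetical placeholders; check the actual table schema in LabKey before adapting it.

```python
import csv
import io


def select_paths(rows, file_type, participant_ids):
    """Return file paths for rows matching one file type and a set of participants."""
    wanted = set(participant_ids)
    return [
        row["file_path"]
        for row in rows
        if row["file_type"] == file_type and row["participant_id"] in wanted
    ]


# Stand-in for a TSV export of the LabKey table.
# Column names and values are assumptions made for this example only.
example_export = (
    "participant_id\tfile_type\tfile_path\n"
    "P0001\tBAM\t/example/delivery/sample_a/sample_a.bam\n"
    "P0001\tVCF\t/example/delivery/sample_a/sample_a.vcf.gz\n"
    "P0002\tBAM\t/example/delivery/sample_b/sample_b.bam\n"
    "P0003\tBAM\t/example/delivery/sample_c/sample_c.bam\n"
)

rows = list(csv.DictReader(io.StringIO(example_export), delimiter="\t"))
bam_paths = select_paths(rows, "BAM", ["P0001", "P0003"])

for path in bam_paths:
    print(path)
```

Writing the selected paths to a plain text file, one per line, gives a list that can be passed to downstream tools or batch scripts.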

For each sample the following output files are available:

/gel_data_resources/RNASeq_data/Rare_Disease/DELIVERY_DATE/DELIVERY_ID/
└── RNA_PLATEKEY
    ├── fastqs
    │   ├── RNA_PLATEKEY_SNN_L001_R1_001.fastq.gz
    │   ├── RNA_PLATEKEY_SNN_L001_R2_001.fastq.gz
    │   ├── ..
    │   ├── ..
    │   ├── RNA_PLATEKEY_SNN_L00N_R1_001.fastq.gz
    │   └── RNA_PLATEKEY_SNN_L00N_R2_001.fastq.gz
    ├── RNA_PLATEKEY.bam
    ├── RNA_PLATEKEY.bam.bai
    ├── RNA_PLATEKEY.bam.md5sum
    ├── RNA_PLATEKEY.Chimeric.out.junction
    ├── RNA_PLATEKEY.fusion_candidates.features.csv
    ├── RNA_PLATEKEY.fusion_candidates.filter_info
    ├── RNA_PLATEKEY.fusion_candidates.final
    ├── RNA_PLATEKEY.fusion_candidates.preliminary
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz.md5sum
    ├── RNA_PLATEKEY.fusion_candidates.vcf.gz.tbi
    ├── RNA_PLATEKEY.fusion_metrics.csv
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz.md5sum
    ├── RNA_PLATEKEY.hard-filtered.vcf.gz.tbi
    ├── RNA_PLATEKEY.insert-stats.tab
    ├── RNA_PLATEKEY.mapping_metrics.csv
    ├── RNA_PLATEKEY.metrics.json
    ├── RNA_PLATEKEY.quant.eq_classes.txt
    ├── RNA_PLATEKEY.quant.genes.sf
    ├── RNA_PLATEKEY.quant_metrics.csv
    ├── RNA_PLATEKEY.quant.sf
    ├── RNA_PLATEKEY.quant.transcript_coverage.txt
    ├── RNA_PLATEKEY.quant.transcript_fragment_lengths.txt
    ├── RNA_PLATEKEY.SJ.out.tab
    ├── RNA_PLATEKEY.SJ.saturation.txt
    ├── RNA_PLATEKEY.stats.json
    ├── RNA_PLATEKEY.time_metrics.csv
    ├── RNA_PLATEKEY.trimmer_metrics.csv
    ├── RNA_PLATEKEY.unfiltered.SJ.out.tab
    ├── RNA_PLATEKEY.vcf.gz
    ├── RNA_PLATEKEY.vcf.gz.md5sum
    ├── RNA_PLATEKEY.vcf.gz.tbi
    ├── RNA_PLATEKEY.vc_hethom_ratio_metrics.csv
    ├── RNA_PLATEKEY.vc_metrics.csv
    ├── RNA_PLATEKEY.wgs_contig_mean_cov.csv
    ├── RNA_PLATEKEY.wgs_coverage_metrics.csv
    ├── RNA_PLATEKEY.wgs_fine_hist.csv
    ├── RNA_PLATEKEY.wgs_hist.csv
    ├── RNA_PLATEKEY.wgs_overall_mean_cov.csv
    └── md5sum.txt
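
The FASTQ names in the layout above follow the pattern `RNA_PLATEKEY_SNN_L00N_R1_001.fastq.gz`. A minimal sketch of grouping such names into R1/R2 pairs per lane is shown here; the regular expression is an assumption derived only from that pattern, and the example file names are invented.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <sample>_S<number>_L<3-digit lane>_R<1|2>_001.fastq.gz
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+)_S(?P<sample_number>\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastqs(filenames):
    """Group FASTQ file names into (R1, R2) tuples keyed by (sample, lane).

    Names that do not match the pattern, and lanes missing one mate, are skipped.
    """
    grouped = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue
        key = (match["sample"], int(match["lane"]))
        grouped[key][match["read"]] = name
    return {
        key: (reads["1"], reads["2"])
        for key, reads in sorted(grouped.items())
        if set(reads) == {"1", "2"}
    }


# Invented file names for illustration.
example_names = [
    "RNA_EXAMPLE_S12_L001_R1_001.fastq.gz",
    "RNA_EXAMPLE_S12_L001_R2_001.fastq.gz",
    "RNA_EXAMPLE_S12_L002_R1_001.fastq.gz",  # mate missing, so this lane is skipped
    "RNA_EXAMPLE.bam",                        # not a FASTQ, ignored
]
pairs = pair_fastqs(example_names)
print(pairs)
```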

Descriptions of the contents of these files are available in the Illumina DRAGEN pipeline documentation:

DRAGEN 4.2 RNA pipeline Outputs page
DRAGEN 4.2 RNA pipeline Gene fusion page
DRAGEN 4.2 RNA pipeline Gene Expression quantification page
Other pages in the same menu within the DRAGEN 4.2 DNA pipeline section including various metrics pages such as the QC pages
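
As an example of working with the gene-level quantification output (`RNA_PLATEKEY.quant.genes.sf`), the sketch below lists the most highly expressed genes. It assumes the file is a tab-separated table with `Name` and `TPM` columns in the Salmon-style layout; confirm the header against the Illumina documentation or the file itself before relying on it. The gene names and values are placeholders.

```python
import csv
import io


def top_expressed(handle, n=10, min_tpm=1.0):
    """Return up to n (gene, TPM) tuples with TPM >= min_tpm, highest first."""
    reader = csv.DictReader(handle, delimiter="\t")
    genes = [(row["Name"], float(row["TPM"])) for row in reader]
    expressed = [gene for gene in genes if gene[1] >= min_tpm]
    return sorted(expressed, key=lambda gene: gene[1], reverse=True)[:n]


# Placeholder content standing in for a quant.genes.sf file.
# The column layout is an assumption; the values are invented.
example_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "GENE_A\t1500\t1300.0\t12.5\t340\n"
    "GENE_B\t900\t700.0\t0.0\t0\n"
    "GENE_C\t2400\t2200.0\t88.1\t4100\n"
)

top = top_expressed(io.StringIO(example_quant), n=2)
print(top)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.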

RNA-Seq QC output

We ran the sample-level RNA sequencing data through an internally developed pipeline to generate quality control metrics across the entire cohort. The evaluation covered raw read quality, alignment quality, and matching of RNA samples to whole genome DNA samples. Some sample-level output files from the tools listed below have been aggregated into simple TSV files (see below) to allow comparison across the full dataset.

The table below outlines the software used by the pipeline to generate QC metrics and the output files that fed into the aggregated files. FastQC, RNA-SeQC 2 and RSeQC were used to generate generic quality metrics, whereas Somalier was used to assess the relatedness between WGS and RNA-Seq data to confirm that each RNA-Seq sample matched its expected WGS data.

Tool	Version	Original output file types
FastQC	0.12.1	`summary.txt`
RNA-SeQC 2	2.4.2	`*.bam.metrics.tsv`
RSeQC	5.0.1	`*.geneBodyCoverage.txt`
Somalier	0.2.18	`.pairs.tsv` `.samples.tsv`

Sample-level and aggregated QC files output by the pipeline can be accessed at:

/gel_data_resources/RNASeq_data/qc_results/main_programme_100kGP/rare_disease_transcriptomics_extension_dataset/
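
For readers who want to build their own cohort-level view from the sample-level files, the sketch below combines several FastQC `summary.txt` files into one wide table. It relies on the FastQC `summary.txt` layout of status, module name and file name separated by tabs. This is an illustration only and not the aggregation used by the pipeline; the sample names are invented.

```python
def aggregate_fastqc(summaries):
    """Combine FastQC summary.txt contents into a header and one row per sample.

    summaries maps a sample name to the text of its summary.txt file.
    Modules missing for a sample are reported as "NA".
    """
    modules = []
    per_sample = {}
    for sample, text in summaries.items():
        statuses = {}
        for line in text.splitlines():
            if not line.strip():
                continue
            status, module, _filename = line.split("\t")
            statuses[module] = status
            if module not in modules:
                modules.append(module)
        per_sample[sample] = statuses

    header = ["sample"] + modules
    rows = [
        [sample] + [statuses.get(module, "NA") for module in modules]
        for sample, statuses in per_sample.items()
    ]
    return header, rows


# Invented sample names and shortened summaries for illustration.
example_summaries = {
    "SAMPLE_1": (
        "PASS\tBasic Statistics\tSAMPLE_1_R1.fastq.gz\n"
        "WARN\tPer base sequence content\tSAMPLE_1_R1.fastq.gz\n"
    ),
    "SAMPLE_2": "PASS\tBasic Statistics\tSAMPLE_2_R1.fastq.gz\n",
}

header, rows = aggregate_fastqc(example_summaries)
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```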

Help and support

Please contact the Genomics England Service Desk with any issues related to the RNA-Seq datasets and tables, and include "RNASeq" in the title or description of your enquiry.