Everything you need to know about reference-based RNA-seq data analysis
RNA sequencing has become the go-to technique for studying differential gene expression, variation in the cellular transcriptome, and the full population of RNA molecules in a cell type. Beyond these, RNA sequencing data analysis also provides a close look at alternative splicing, gene fusions, post-transcriptional modifications, insertion/deletion (indel) mutations, SNPs, and differences in gene expression under varying conditions.
What is reference-based RNA sequencing?
Broadly speaking, RNA sequencing analysis can be of two types:
- Reference genome-based: when a genome sequence for the organism already exists, the raw RNA sequencing reads can be aligned against that reference genome. This makes the mapping process faster and more accurate.
- Reference genome-free: when no reference genome exists for the sample organism, the transcripts must be assembled de novo from the reads themselves. This approach is not only challenging but also highly parameter dependent.
Reference-based RNA sequencing analysis refers to using a known reference genome or transcriptome of a particular organism to align and interpret the RNA seq reads, for example to identify the sets of genes or exons influenced by the activity of a known regulator.
Reference-based RNA sequencing in recent transcriptome studies
In a recent study by Brooks et al., the research team aimed to study the regulatory effects of the Pasilla (PS) gene in Drosophila melanogaster on exon usage. The experimental procedure involved depleting the PS gene product by RNA interference (RNAi).
Total RNA was isolated and used to prepare single-end and paired-end RNA sequencing libraries for the PS-depleted and the control (untreated) samples. Sequencing these libraries yielded RNA sequence reads for both sample types.
The sequence data from the two sample types were then compared to determine the effects of the PS gene on the regulation of exons. In other words, Brooks et al. identified the impact of PS depletion on alternative splicing events.
The original data from this experiment, together with full details of its RNA seq data analysis, are still available in the NCBI Gene Expression Omnibus.
What makes reference RNA seq data analysis stand out?
The typical workflow for a reference-based RNA sequencing experiment is as follows –
- The purification of the ribonucleic acid from cell extract
- Generation of complementary DNA or cDNA via reverse transcription using reverse transcriptase (RT) enzyme
- Using DNA polymerase for the synthesis of the complementary strand of cDNA.
- The preparation of the library for the sequencing process.
Up to this point, the steps are the same as in any other RNA sequencing experiment.
In reference genome-based RNA sequencing, two experimental design choices have the greatest effect on the quality of the results:
- The priming of the cDNA strand during synthesis: the team can rely on oligo-dT priming, which restricts cDNA synthesis to polyadenylated mRNA, or use random oligonucleotides that prime the reverse transcriptase at many internal sites. This choice influences which RNA molecules are represented in the sequencing data.
- Non-stranded vs. stranded RNA libraries: most experiments rely on Illumina paired-end sequencing. The team is therefore likely to deal with either unstranded RNA sequencing data or stranded RNA sequencing data such as that created with the TruSeq kits.
To avoid confusion, always rely on a comprehensive sequencing and analysis platform that can handle all kinds of stranded and unstranded libraries generated by the initial reverse transcription step.
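If you are unsure which library type you have, one common sanity check is to count how many reads agree with the annotated strand of the gene they overlap. The sketch below illustrates the idea only; the function name, the 0.8 threshold, and the input counts are assumptions made for this example, not values from any particular kit or platform.

```python
def infer_strandedness(sense_reads, antisense_reads, threshold=0.8):
    """Guess the library type from read/gene strand agreement.

    sense_reads:     reads on the same strand as the gene they overlap
    antisense_reads: reads on the opposite strand
    threshold:       hypothetical cutoff; tune it for your own data
    """
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no reads overlap annotated genes")
    sense_fraction = sense_reads / total
    if sense_fraction >= threshold:
        return "stranded (forward)"
    if sense_fraction <= 1 - threshold:
        return "stranded (reverse)"
    # Roughly half the reads on each strand suggests an unstranded library.
    return "unstranded"


print(infer_strandedness(950, 50))
print(infer_strandedness(510, 490))
```

A result close to 50/50 points to an unstranded library, while a strong bias in either direction points to a stranded one.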
What are you doing for QC?
Another important consideration for all teams is the number of replicates. Any RNA seq experiment requires a sufficient number of replicates for reproducibility and reliability of the results. You may be dealing with either of two replicate types:
- Technical replicates
- Biological replicates
Irrespective of the type, use as many replicates as is practical. Aim for between three and twelve; more replicates generally give more reliable results. Reliable RNA seq data analysis platforms can generate QC plots and tables for a visual understanding of the quality and reliability of the generated data.
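One simple replicate check you can run yourself is the correlation between the log-transformed gene counts of two replicates. The following is a minimal sketch, assuming each replicate is a list of raw read counts in the same gene order; the function names and the example counts are made up for illustration.

```python
import math


def log2_counts(counts):
    """Log-transform counts with a pseudocount of 1 so zeros stay defined."""
    return [math.log2(c + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def replicate_correlation(rep_a, rep_b):
    """Correlation of log2(count + 1) between two replicates."""
    return pearson(log2_counts(rep_a), log2_counts(rep_b))


# Hypothetical counts for five genes in two replicates.
rep1 = [12, 250, 0, 57, 1030]
rep2 = [15, 231, 1, 60, 990]
print(round(replicate_correlation(rep1, rep2), 3))
```

Replicates of the same condition should correlate strongly; a replicate that correlates poorly with its siblings is worth inspecting before any downstream analysis.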
What makes mapping on a reference eukaryotic genome tricky?
In the case of eukaryotic genomes, the reads come from mature mRNAs and therefore contain no intron sequences. Because the genome itself does contain introns, the reads cannot all be mapped to it contiguously. Instead, they fall into two groups:
- Reads that map entirely within one exon
- Reads that do not map entirely to one exon but span two or more, and so must be split across the intervening introns
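In SAM/BAM alignment output, the two groups above can be told apart from the CIGAR string: a spliced alignment contains an `N` operation, which marks a stretch of the reference skipped by the read (the intron). The sketch below assumes 1-based start positions as used in SAM; the function names and example values are illustrative only.

```python
import re

# One CIGAR token = a length followed by one SAM operation code.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    tokens = CIGAR_TOKEN.findall(cigar)
    if not tokens or "".join(n + op for n, op in tokens) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in tokens]


def classify_read(cigar):
    """Return 'spliced' if the alignment skips reference sequence (N)."""
    has_junction = any(op == "N" for _, op in parse_cigar(cigar))
    return "spliced" if has_junction else "unspliced"


def intron_intervals(start, cigar):
    """List (first, last) reference positions of each skipped region.

    start is the 1-based leftmost aligned position of the read.
    """
    pos = start
    introns = []
    for length, op in parse_cigar(cigar):
        if op == "N":
            introns.append((pos, pos + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            pos += length
    return introns


print(classify_read("100M"))                      # unspliced
print(classify_read("50M1000N50M"))               # spliced
print(intron_intervals(101, "50M1000N50M"))       # [(151, 1150)]
```

Splice-aware aligners produce such split alignments for you; the sketch only shows how to read their output.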
You can bypass the technicalities of manual mapping and quality control during the mapping step by relying on a complete cloud-based solution that combines software as a service (SaaS) and platform as a service (PaaS). Aligning the reads takes less than an hour for most RNA seq data sets against reference genomes.
Transcriptome assembly and determining the differential expression from RNA seq data
Apart from alignment, the complete RNA seq data analysis platform is capable of taking care of the following steps for you –
- Transcript reconstruction: the algorithm assigns the reads to specific locations on the genome and determines the exon splice junctions. This information is used to build transcript models from the RNA seq data generated in the previous steps.
- Transcriptome assembly: instead of using multiple desktop-based tools to assess spliced read alignments, a robust sequencing data analysis platform can assemble the transcripts and simultaneously quantify the abundance of the reads corresponding to each one.
- Transcriptome quantification: this step takes only a few minutes on a one-stop RNA sequencing data analysis platform. It assigns the reads to transcripts and estimates the expression level of each transcript.
- Estimating the level of each transcript: forget tool-hopping. The one-stop RNA seq analysis platform can also quantify the level of each transcript for each population. That gives the team the first clue about the transcription level in the experimental and control populations.
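Expression levels are often reported in a length-normalized unit such as transcripts per million (TPM), so that long transcripts do not look more abundant simply because they collect more reads. The following is a minimal sketch of that calculation, assuming you already have a read count and a length in base pairs for each transcript; it is not the method of any specific platform, and the example numbers are made up.

```python
def tpm(counts, lengths_bp):
    """Convert read counts to transcripts per million (TPM).

    counts:     reads assigned to each transcript
    lengths_bp: transcript lengths in base pairs, same order as counts
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths must have the same size")
    if any(length <= 0 for length in lengths_bp):
        raise ValueError("transcript lengths must be positive")
    # Step 1: reads per kilobase removes the transcript-length effect.
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Step 2: rescale so the values in one sample sum to one million.
    return [rate / total * 1_000_000 for rate in rates]


# Two hypothetical transcripts: the second has twice the reads
# but is also twice as long, so both get the same TPM.
print(tpm([100, 200], [1000, 2000]))  # [500000.0, 500000.0]
```

Because the values in each sample sum to one million, TPM makes it easier to compare the relative abundance of a transcript across samples.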
The goal of any differential expression (DE) analysis is to find the transcript or gene expression differences between two populations that have been exposed to different conditions, treatments, or developmental stages. Therefore, a DE analysis has two goals:
- To estimate the magnitude of the differences between the expression levels
- To estimate the statistical significance of those differences
Using a platform streamlines these two steps as well. The built-in algorithms and tools directly generate detailed read count and expression count tables. The expression count tells you how many reads align to the target exon or exons of a gene. The final results present the differential expression counts in chart and table formats, which makes understanding and sharing the reports straightforward.
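To make the two DE goals concrete, here is a toy sketch for a single gene: the magnitude is expressed as a log2 fold change, and the significance as an exact permutation p-value. This is a simplified stand-in chosen because it needs only the standard library; dedicated DE tools typically fit count-based statistical models instead. The function names, the pseudocount, and the example counts are all assumptions, and the counts are assumed to be already normalized.

```python
import math
from itertools import combinations
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """Magnitude: log2 ratio of mean expression, treated over control."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


def permutation_p_value(control, treated):
    """Significance: share of relabelings at least as extreme as observed."""
    pooled = list(control) + list(treated)
    observed = abs(mean(treated) - mean(control))
    hits = 0
    total = 0
    # Try every way of choosing which samples are called "treated".
    for chosen in combinations(range(len(pooled)), len(treated)):
        picked = set(chosen)
        group_a = [pooled[i] for i in picked]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in picked]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical normalized counts for one gene, three replicates each.
control = [100, 110, 90]
treated = [400, 420, 380]
print(round(log2_fold_change(control, treated), 2))
print(permutation_p_value(control, treated))
```

With three replicates per group there are only 20 possible relabelings, so the smallest achievable p-value is 0.1, which is one more reason to favor a higher replicate count.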
Additionally, you can access a volcano plot and a heatmap of the results. The volcano plot shows directly which genes are upregulated or downregulated in the experimental condition relative to the control. Most importantly, you can get these plots in ready-to-publish, high-resolution format for either sharing or downstream analysis.
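A volcano plot places each gene at its log2 fold change on the x-axis and the negative log10 of its p-value on the y-axis, so strongly changed and highly significant genes end up in the upper corners. The sketch below computes those coordinates and a label for one gene; the cutoffs of 1.0 and 0.05 are common conventions used here as assumptions, and the function name is hypothetical.

```python
import math


def volcano_point(log2_fc, adj_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene on a volcano plot.

    log2_fc: log2 fold change, experimental over control
    adj_p:   adjusted p-value, must be greater than 0
    """
    if not 0 < adj_p <= 1:
        raise ValueError("adj_p must be in the interval (0, 1]")
    y = -math.log10(adj_p)
    if adj_p < p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif adj_p < p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return (log2_fc, y, label)


# Hypothetical results for three genes.
for gene, fc, p in [("geneA", 2.0, 0.001), ("geneB", -1.5, 0.01), ("geneC", 3.0, 0.2)]:
    print(gene, volcano_point(fc, p)[2])
```

Note that the third gene has a large fold change but is still labeled not significant, which is exactly the distinction the plot is meant to make visible.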