## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'

Overview

Here we will walk through a gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from some sample FASTQ files, see how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads within each gene for each sample. We will then perform quality assessment and explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

RNA-Seq Technology

Figure 1: Overview of RNA-seq technology.
Back to Table of Contents

NGS data types

Raw: Sequencing Reads (FASTQ)

  • A FASTQ file normally uses four lines per sequence.
  • Line 1 begins with a ‘@’ character and is followed by a sequence identifier.
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a ‘+’ character, is optionally followed by the same sequence identifier.
  • Line 4 encodes the Phred quality values for the sequence in line 2, each value represents the error probability of a given base call.

FASTQ Quality Scores:

  • Quality score represents the error probability of a given basecall
  • In a FASTQ file, quality scores are often represented using the ASCII alphabet
  • For example, a Phred score of 40 can be represented as the ASCII char “I” (40+33= ASCII #73), and an Illumina score of 40 as “h” (40+64=ASCII #104)
  • The range of scores will depend on the technology and the base caller used, but will typically be up to 40.

Derived

  • Alignments against reference genome
    • SAM / BAM: Sequence Alignment Map (SAM) is a tab-delimited alignment format consisting of a header section (lines starting with @) and an alignment section with 11 columns (Link). BAM is the compressed, indexed and binary version of this format. The below sample alignment contains the following features: (1) bases in lower cases are clipped from the alignment; (2) read r001/1 and r001/2 constitute a read pair; (3) r003 is a chimeric read; (4) r004 represents a split alignment.


Back to Table of Contents
  • Annotations
    • GFF / GTF
    • GFF3
Back to Table of Contents

Analysis Workfow of RNA-Seq Gene Expression Data

  1. Alignment of RNA reads to reference
    • Reference can be genome or transcriptome.
  2. Count reads overlapping with annotation features of interest
    • Most common: counts for exonic gene regions, but many viable alternatives exist here: counts per exons, genes, introns, etc.
  3. Normalization
    • Main adjustment for sequencing depth and compositional bias.
  4. Identification of Differentially Expressed Genes (DEGs)
    • Identification of genes with significant expression differences.
    • Identification of expressed genes possible for strongly expressed ones.
  5. Specialty applications
    • Splice variant discovery (semi-quantitative), gene discovery, antisense expressions, etc.
  6. Cluster Analysis
    • Identification of genes with similar expression profiles across many samples.
  7. Enrichment Analysis of Functional Annotations
    • Gene ontology analysis of obtained gene sets from steps 5-6.

Important Considerations for NGS Alignments

Back to Table of Contents

Normalization

Careful with RPKM/FPKM Values

RPKM Concept (FPKM is paired-end version of it)
- RPKM (FPKM): reads (fragments) per kp per million mapped reads
- For a given gene, larger counts imply more information; RPKM etc., treat all estimates as equally informative
- RPKM/FPKM are not suitable for statistical testing. Why?
- Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case, we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios.
- Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!

The following should make it more clear:

Every stack of the bar plot represents the number of counts of one gene. In this example you have 1.5 more depth in condition B compared to condition A. Only one gene (red) is differentially expressed. With the RPKM normalization all the other genes will look downregulated in the second sample (which is wrong). The size-factor normalization can instead produce values which are stable for all the genes (implemented in DESeq2).

Back to Table of Contents

RNA-Seq Data Normalization Methods

  • Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.
  • Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.
  • Median (Med): Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors.
  • DESeq: This normalization method is included in the DESeq Bioconductor package and is based on the hypothesis that most genes are not DE.
  • Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010). It is also based on the hypothesis that most genes are not DE.
  • Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.
  • Reads Per Kilobase per Million mapped reads (RPKM): This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization.

Software for RNA-Seq DEG Analysis

RNA-Seq Analysis

Data Sets and Experimental Variables

The mini sample FASTQ files used in this overview vignette are available from here. Preferentially, they can be loaded as shown below. The chosen data set SRP010938 contains 18 slimmed down read sets from Arabidposis thaliana (Howard et al., 2013). To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set requires less than 200MB disk storage space.

Start the analysis by opening in your R session the Rrnaseq.R script which contains the code shown in this vignette in pure text format. Set the current working directory as below:

setwd("Rrnaseq")

The working directory of the sample data contains the following pre-configured directory structure:

The sample workflows provided by this vignette are based on the above directory structure, where directory names are indicated in grey. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.

In this RNA-Seq demo, we are using a Bioconductor package, systemPipeR, a R utility for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke T, 2015) [Link].

Back to Table of Contents

Structure of targets file

The FASTQ files are organized in the provided targets.txt file. This is the only file in this analysis workflow that needs to be generated manually, e.g. in a spreadsheet program. To import targets.txt, we run the following commands from R. The targets file defines all input files (e.g. FASTQ, BAM, BCF) and sample comparisons of an analysis workflow. The following shows the format of a sample targets file provided by this package. In a target file with a single type of input files, here FASTQ files of single end (SE) reads, the first three columns are mandatory including their column names, while it is four mandatory columns for FASTQ files for PE reads. All subsequent columns are optional and any number of additional columns can be added as needed.

Structure of targets file for single end (SE) samples

library(systemPipeR)
args <- systemArgs(sysma="./param/tophat.param", mytargets="targets.txt")
targetsin(args)
##                    FileName SampleName Factor SampleLong Experiment        Date
## 1  ./data/SRR446027_1.fastq        M1A     M1  Mock.1h.A          1 23-Mar-2012
## 2  ./data/SRR446028_1.fastq        M1B     M1  Mock.1h.B          1 23-Mar-2012
## 3  ./data/SRR446029_1.fastq        A1A     A1   Avr.1h.A          1 23-Mar-2012
## 4  ./data/SRR446030_1.fastq        A1B     A1   Avr.1h.B          1 23-Mar-2012
## 5  ./data/SRR446031_1.fastq        V1A     V1   Vir.1h.A          1 23-Mar-2012
## 6  ./data/SRR446032_1.fastq        V1B     V1   Vir.1h.B          1 23-Mar-2012
## 7  ./data/SRR446033_1.fastq        M6A     M6  Mock.6h.A          1 23-Mar-2012
## 8  ./data/SRR446034_1.fastq        M6B     M6  Mock.6h.B          1 23-Mar-2012
## 9  ./data/SRR446035_1.fastq        A6A     A6   Avr.6h.A          1 23-Mar-2012
## 10 ./data/SRR446036_1.fastq        A6B     A6   Avr.6h.B          1 23-Mar-2012
## 11 ./data/SRR446037_1.fastq        V6A     V6   Vir.6h.A          1 23-Mar-2012
## 12 ./data/SRR446038_1.fastq        V6B     V6   Vir.6h.B          1 23-Mar-2012
## 13 ./data/SRR446039_1.fastq       M12A    M12 Mock.12h.A          1 23-Mar-2012
## 14 ./data/SRR446040_1.fastq       M12B    M12 Mock.12h.B          1 23-Mar-2012
## 15 ./data/SRR446041_1.fastq       A12A    A12  Avr.12h.A          1 23-Mar-2012
## 16 ./data/SRR446042_1.fastq       A12B    A12  Avr.12h.B          1 23-Mar-2012
## 17 ./data/SRR446043_1.fastq       V12A    V12  Vir.12h.A          1 23-Mar-2012
## 18 ./data/SRR446044_1.fastq       V12B    V12  Vir.12h.B          1 23-Mar-2012
Back to Table of Contents

Structure of targets file for paired end (PE) samples

##                  FileName1                FileName2 SampleName Factor SampleLong Experiment
## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq        M1A     M1  Mock.1h.A          1
## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq        M1B     M1  Mock.1h.B          1

Sample comparisons

Sample comparisons are defined in the header lines of the targets file starting with ‘# <CMP>’.

# Project ID: Arabidopsis - Pseudomonas alternative splicing study (SRA: SRP010938; PMID: 24098335)
# The following line(s) allow to specify the contrasts needed for comparative analyses, such as DEG identification. All possible comparisons can be specified with 'CMPset: ALL'.
# <CMP> CMPset1: M1-A1, M1-V1, A1-V1, M6-A6, M6-V6, A6-V6, M12-A12, M12-V12, A12-V12
# <CMP> CMPset2: ALL

The function readComp imports the comparison and stores them in a list. Alternatively, readComp can obtain the comparison information from the corresponding SYSargs object (see below). Note, the header lines are optional. They are mainly useful for controlling comparative analysis according to certain biological expectations, such as simple pairwise comparisons in RNA-Seq experiments.

readComp(file=args, format="vector", delim="-")
## $CMPset1
## [1] "M1-A1"   "M1-V1"   "A1-V1"   "M6-A6"   "M6-V6"   "A6-V6"   "M12-A12" "M12-V12" "A12-V12"
## 
## $CMPset2
##  [1] "M1-A1"   "M1-V1"   "M1-M6"   "M1-A6"   "M1-V6"   "M1-M12"  "M1-A12"  "M1-V12"  "A1-V1"  
## [10] "A1-M6"   "A1-A6"   "A1-V6"   "A1-M12"  "A1-A12"  "A1-V12"  "V1-M6"   "V1-A6"   "V1-V6"  
## [19] "V1-M12"  "V1-A12"  "V1-V12"  "M6-A6"   "M6-V6"   "M6-M12"  "M6-A12"  "M6-V12"  "A6-V6"  
## [28] "A6-M12"  "A6-A12"  "A6-V12"  "V6-M12"  "V6-A12"  "V6-V12"  "M12-A12" "M12-V12" "A12-V12"
Back to Table of Contents

Structure of param file and SYSargs container

The param file defines the parameters of the command-line software. The following shows the format of a sample param file provided by this package.

parampath <- system.file("extdata", "tophat.param", package="systemPipeR")
read.delim(parampath, comment.char = "#")
##      PairSet         Name                                  Value
## 1    modules         <NA>                          bowtie2/2.1.0
## 2    modules         <NA>                          tophat/2.0.8b
## 3   software         <NA>                                 tophat
## 4      cores           -p                                      4
## 5      other         <NA> -g 1 --segment-length 25 -i 30 -I 3000
## 6   outfile1           -o                            <FileName1>
## 7   outfile1         path                             ./results/
## 8   outfile1       remove                                   <NA>
## 9   outfile1       append                                .tophat
## 10  outfile1 outextension              .tophat/accepted_hits.bam
## 11 reference         <NA>                    ./data/tair10.fasta
## 12   infile1         <NA>                            <FileName1>
## 13   infile1         path                                   <NA>
## 14   infile2         <NA>                            <FileName2>
## 15   infile2         path                                   <NA>

The systemArgs function imports the definitions of both the param file and the targets file, and stores all relevant information as SYSargs object. To run the pipeline without command-line software, one can assign NULL to sysma instead of a param file. In addition, one can start the systemPipeR workflow with pre-generated BAM files by providing a targets file where the FileName column gives the paths to the BAM files and sysma is assigned NULL.

args <- suppressWarnings(systemArgs(sysma="./param/tophat.param", mytargets="targets.txt"))
args
## An instance of 'SYSargs' for running 'tophat' on 18 samples

FASTQ quality report

The following shows how to create read quality reports with the seeFastq function from systemPipeR. The seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution.

fqlist <- seeFastq(fastq=infile1(args), batchsize=10000, klength=8)
pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist))
seeFastqPlot(fqlist)
dev.off()