## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
Here we will walk through a gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from some sample FASTQ files, see how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads within each gene for each sample. We will then perform quality assessment and explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.
FASTQ Quality Scores:
RPKM Concept (FPKM is paired-end version of it)
- RPKM (FPKM): reads (fragments) per kp per million mapped reads
- For a given gene, larger counts imply more information; RPKM etc., treat all estimates as equally informative
- RPKM/FPKM are not suitable for statistical testing. Why?
- Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case, we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios.
- Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
The following should make it more clear:
Every stack of the bar plot represents the number of counts of one gene. In this example you have 1.5 more depth in condition B compared to condition A. Only one gene (red) is differentially expressed. With the RPKM normalization all the other genes will look downregulated in the second sample (which is wrong). The size-factor normalization can instead produce values which are stable for all the genes (implemented in DESeq2).
The mini sample FASTQ files used in this overview vignette are available from
here. Preferentially, they can be loaded as shown below. The chosen data set
SRP010938 contains 18 slimmed down read sets from Arabidposis thaliana (Howard et al., 2013). To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set requires less than 200MB disk storage space.
Start the analysis by opening in your R session the
Rrnaseq.R script which contains the code shown in this vignette in pure text format. Set the current working directory as below:
The working directory of the sample data contains the following pre-configured directory structure:
The sample workflows provided by this vignette are based on the above directory structure, where directory names are indicated in grey. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.
In this RNA-Seq demo, we are using a Bioconductor package,
systemPipeR, a R utility for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke T, 2015) [Link].
The FASTQ files are organized in the provided targets.txt file. This is the only file in this analysis workflow that needs to be generated manually, e.g. in a spreadsheet program. To import targets.txt, we run the following commands from R. The
targets file defines all input files (e.g. FASTQ, BAM, BCF) and sample comparisons of an analysis workflow. The following shows the format of a sample
targets file provided by this package. In a target file with a single type of input files, here FASTQ files of single end (SE) reads, the first three columns are mandatory including their column names, while it is four mandatory columns for FASTQ files for PE reads. All subsequent columns are optional and any number of additional columns can be added as needed.
targetsfile for single end (SE) samples
library(systemPipeR) args <- systemArgs(sysma="./param/tophat.param", mytargets="targets.txt") targetsin(args)
## FileName SampleName Factor SampleLong Experiment Date ## 1 ./data/SRR446027_1.fastq M1A M1 Mock.1h.A 1 23-Mar-2012 ## 2 ./data/SRR446028_1.fastq M1B M1 Mock.1h.B 1 23-Mar-2012 ## 3 ./data/SRR446029_1.fastq A1A A1 Avr.1h.A 1 23-Mar-2012 ## 4 ./data/SRR446030_1.fastq A1B A1 Avr.1h.B 1 23-Mar-2012 ## 5 ./data/SRR446031_1.fastq V1A V1 Vir.1h.A 1 23-Mar-2012 ## 6 ./data/SRR446032_1.fastq V1B V1 Vir.1h.B 1 23-Mar-2012 ## 7 ./data/SRR446033_1.fastq M6A M6 Mock.6h.A 1 23-Mar-2012 ## 8 ./data/SRR446034_1.fastq M6B M6 Mock.6h.B 1 23-Mar-2012 ## 9 ./data/SRR446035_1.fastq A6A A6 Avr.6h.A 1 23-Mar-2012 ## 10 ./data/SRR446036_1.fastq A6B A6 Avr.6h.B 1 23-Mar-2012 ## 11 ./data/SRR446037_1.fastq V6A V6 Vir.6h.A 1 23-Mar-2012 ## 12 ./data/SRR446038_1.fastq V6B V6 Vir.6h.B 1 23-Mar-2012 ## 13 ./data/SRR446039_1.fastq M12A M12 Mock.12h.A 1 23-Mar-2012 ## 14 ./data/SRR446040_1.fastq M12B M12 Mock.12h.B 1 23-Mar-2012 ## 15 ./data/SRR446041_1.fastq A12A A12 Avr.12h.A 1 23-Mar-2012 ## 16 ./data/SRR446042_1.fastq A12B A12 Avr.12h.B 1 23-Mar-2012 ## 17 ./data/SRR446043_1.fastq V12A V12 Vir.12h.A 1 23-Mar-2012 ## 18 ./data/SRR446044_1.fastq V12B V12 Vir.12h.B 1 23-Mar-2012
targetsfile for paired end (PE) samples
## FileName1 FileName2 SampleName Factor SampleLong Experiment ## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq M1A M1 Mock.1h.A 1 ## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq M1B M1 Mock.1h.B 1
Sample comparisons are defined in the header lines of the
targets file starting with ‘
# Project ID: Arabidopsis - Pseudomonas alternative splicing study (SRA: SRP010938; PMID: 24098335) # The following line(s) allow to specify the contrasts needed for comparative analyses, such as DEG identification. All possible comparisons can be specified with 'CMPset: ALL'. # <CMP> CMPset1: M1-A1, M1-V1, A1-V1, M6-A6, M6-V6, A6-V6, M12-A12, M12-V12, A12-V12 # <CMP> CMPset2: ALL
readComp imports the comparison and stores them in a
readComp can obtain the comparison information from the corresponding
SYSargs object (see below). Note, the header lines are optional. They are mainly useful for controlling comparative analysis according to certain biological expectations, such as simple pairwise comparisons in RNA-Seq experiments.
readComp(file=args, format="vector", delim="-")
## $CMPset1 ##  "M1-A1" "M1-V1" "A1-V1" "M6-A6" "M6-V6" "A6-V6" "M12-A12" "M12-V12" "A12-V12" ## ## $CMPset2 ##  "M1-A1" "M1-V1" "M1-M6" "M1-A6" "M1-V6" "M1-M12" "M1-A12" "M1-V12" "A1-V1" ##  "A1-M6" "A1-A6" "A1-V6" "A1-M12" "A1-A12" "A1-V12" "V1-M6" "V1-A6" "V1-V6" ##  "V1-M12" "V1-A12" "V1-V12" "M6-A6" "M6-V6" "M6-M12" "M6-A12" "M6-V12" "A6-V6" ##  "A6-M12" "A6-A12" "A6-V12" "V6-M12" "V6-A12" "V6-V12" "M12-A12" "M12-V12" "A12-V12"
param file defines the parameters of the command-line software. The following shows the format of a sample
param file provided by this package.
parampath <- system.file("extdata", "tophat.param", package="systemPipeR") read.delim(parampath, comment.char = "#")
## PairSet Name Value ## 1 modules <NA> bowtie2/2.1.0 ## 2 modules <NA> tophat/2.0.8b ## 3 software <NA> tophat ## 4 cores -p 4 ## 5 other <NA> -g 1 --segment-length 25 -i 30 -I 3000 ## 6 outfile1 -o <FileName1> ## 7 outfile1 path ./results/ ## 8 outfile1 remove <NA> ## 9 outfile1 append .tophat ## 10 outfile1 outextension .tophat/accepted_hits.bam ## 11 reference <NA> ./data/tair10.fasta ## 12 infile1 <NA> <FileName1> ## 13 infile1 path <NA> ## 14 infile2 <NA> <FileName2> ## 15 infile2 path <NA>
systemArgs function imports the definitions of both the
param file and the
targets file, and stores all relevant information as
SYSargs object. To run the pipeline without command-line software, one can assign
sysma instead of a
param file. In addition, one can start the
systemPipeR workflow with pre-generated BAM files by providing a targets file where the
FileName column gives the paths to the BAM files and
sysma is assigned
args <- suppressWarnings(systemArgs(sysma="./param/tophat.param", mytargets="targets.txt")) args
## An instance of 'SYSargs' for running 'tophat' on 18 samples
The following shows how to create read quality reports with the
seeFastq function from systemPipeR. The
seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution.
fqlist <- seeFastq(fastq=infile1(args), batchsize=10000, klength=8) pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist)) seeFastqPlot(fqlist) dev.off()