Packages, vignettes, work flows
biocLite.R script to install Bioconductor packages. To install core packages, type the following in an R command window:
Install specific packages, e.g., “GenomicFeatures” and “AnnotationDbi”, with
Bioconductor packages, especially those in the development branch, are updated fairly regularly. To identify packages requiring update within your version of Bioconductor, start a new session of R and enter
source("http://bioconductor.org/biocLite.R") biocLite() ## R version 3.0 or later
Some versions of R support more than one version of Bioconductor. To use the latest version of Bioconductor for your version of R, enter
source("http://bioconductor.org/biocLite.R") biocLite("BiocUpgrade") ## R version 2.15 or later
Read the help page for
?BiocUpgrade for additional details. Remember that more recent versions of Bioconductor may be available if your version of R is out-of-date.
Bioconductor provides software to help analyze diverse high-throughput genomic data. Common workflows include:
RNA-seq is a high resolution Next-Generation Sequencing (NGS) method to assess the transcriptome of an organism and compare transcriptional changes between organisms / treatments to ascertain specific pathways / genes that are moving in response. With this powerful approach, one can:
- Measure gene expression,
- Discover and annotate complete transcripts,
- Characterize alternative splicing and polyadenylation.
The types of information that can be gained from RNA-seq can be divided into two broad categories: qualitative and quantitative.
Qualitative data includes identifying expressed transcripts, and identifying exon/intron boundaries, transcriptional start sites (TSS), and poly-A sites. Here, we can refer to this type of information as “annotation”.
Quantitative data includes measuring differences in expression, alternative splicing, alternative TSS, and alternative polyadenylation between two or more treatments or groups. Here we focus specifically on experiments to measure differential gene expression (DGE).
These two types of data are related and to some extent, are inseparable. For instance, high quality annotation can lead to greater accuracy in mapping which can lead to the higher quality measures of gene expression. Similarly, accurate annotation requires some degree of quantification.
Today, we will walk through a gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from some sample FASTQ files, see how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads within each gene for each sample. We will then perform quality assessment and explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.
A typical RNA-seq experiment consists of the following steps:
FASTQ Quality Scores:
RPKM Concept (FPKM is paired-end version of it)
- RPKM (FPKM): reads (fragments) per kb per million mapped reads
- For a given gene, larger counts imply more information; RPKM etc., treat all estimates as equally informative
- RPKM/FPKM are not suitable for statistical testing. Why?
- Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case, we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios.
- Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!
Programs like edgeR and DESeq2 want to make use of the count nature of RNA-Seq data to increase statistical power. See here a simplified toy example:
Scenario 1: a 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.
Scenario 2: a 300-bp transcript has 10 counts in sample A and 7 counts in sample B.
Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM is the same in sample A in both scenarios, and in sample B in both scenarios. In scenario A, we can be more confident that there is a true difference in the expression level than in scenario B (although we would want replicates of course).
The following should make it more clear:
Every stack of the bar plot represents the number of counts of one gene. In this example you have 1.5 more depth in condition B compared to condition A. Only one gene (red) is differentially expressed. With the RPKM normalization, all the other genes will look downregulated in the second sample (which is wrong). The size-factor normalization can instead produce values which are stable for all the genes (implemented in DESeq2).
estimateSizeFactorsForMatrixfunction, scaling factors can be calculated for each run. After dividing gene counts by each scaling factor, DESeq values are calculated as the total of rescaled gene counts of all runs.
calcNormFactorsfunction in the package, and then rescaled gene counts are obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs.
See comparison of various normalization techniques here.
The mini sample FASTQ files used in this overview vignette and the associated workflow reporting vignettes can be loaded via the
systemPipeRdata package as shown below. The chosen data set
SRP010938 contains 18 paired-end (PE) read sets from Arabidposis thaliana (Howard et al., 2013). To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set requires less than 200MB disk storage space.
In this RNA-Seq demo, we are using a Bioconductor package,
systemPipeR, a R utility for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke T, 2015) [Link].
Start the analysis by opening in your R session the
RNAseq.R script which contains the code shown in this vignette in pure text format. The following generates a fully populated
systemPipeR workflow environment (here for RNA-Seq) in the current working directory of an R session:
library(systemPipeRdata) genWorkenvir(workflow="rnaseq") setwd("rnaseq")
The working environment of the sample data as loaded above contains the following pre-configured directory structure:
The following parameter files are included in each workflow template:
targets.txt: initial one provided by user; downstream
targets_*.txtfiles are generated automatically
*.param: defines parameter for input/output file operations, e.g.
*_run.sh: optional bash script, e.g.:
.BatchJobs: defines type of scheduler for
*.tmpl: specifies parameters of scheduler used by a system, e.g. Torque, SGE, StarCluster, Slurm, etc.
The sample workflows provided by this vignette are based on the above directory structure, where directory names are indicated in grey. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.
The FASTQ files are organized in the provided targets.txt file. This is the only file in this analysis workflow that needs to be generated manually, e.g. in a spreadsheet program. To import targets.txt, we run the following commands from R. The
targets file defines all input files (e.g. FASTQ, BAM, BCF) and sample comparisons of an analysis workflow. The following shows the format of a sample
targets file provided by this package. In a target file with a single type of input files, here FASTQ files of single end (SE) reads, the first three columns are mandatory including their column names, while it is four mandatory columns for FASTQ files for PE reads. All subsequent columns are optional and any number of additional columns can be added as needed.
targetsfile for single end (SE) samples
library(systemPipeR) targetspath <- system.file("extdata", "targets.txt", package="systemPipeR") read.delim(targetspath, comment.char = "#")
## FileName SampleName Factor SampleLong Experiment Date ## 1 ./data/SRR446027_1.fastq M1A M1 Mock.1h.A 1 23-Mar-2012 ## 2 ./data/SRR446028_1.fastq M1B M1 Mock.1h.B 1 23-Mar-2012 ## 3 ./data/SRR446029_1.fastq A1A A1 Avr.1h.A 1 23-Mar-2012 ## 4 ./data/SRR446030_1.fastq A1B A1 Avr.1h.B 1 23-Mar-2012 ## 5 ./data/SRR446031_1.fastq V1A V1 Vir.1h.A 1 23-Mar-2012 ## 6 ./data/SRR446032_1.fastq V1B V1 Vir.1h.B 1 23-Mar-2012 ## 7 ./data/SRR446033_1.fastq M6A M6 Mock.6h.A 1 23-Mar-2012 ## 8 ./data/SRR446034_1.fastq M6B M6 Mock.6h.B 1 23-Mar-2012 ## 9 ./data/SRR446035_1.fastq A6A A6 Avr.6h.A 1 23-Mar-2012 ## 10 ./data/SRR446036_1.fastq A6B A6 Avr.6h.B 1 23-Mar-2012 ## 11 ./data/SRR446037_1.fastq V6A V6 Vir.6h.A 1 23-Mar-2012 ## 12 ./data/SRR446038_1.fastq V6B V6 Vir.6h.B 1 23-Mar-2012 ## 13 ./data/SRR446039_1.fastq M12A M12 Mock.12h.A 1 23-Mar-2012 ## 14 ./data/SRR446040_1.fastq M12B M12 Mock.12h.B 1 23-Mar-2012 ## 15 ./data/SRR446041_1.fastq A12A A12 Avr.12h.A 1 23-Mar-2012 ## 16 ./data/SRR446042_1.fastq A12B A12 Avr.12h.B 1 23-Mar-2012 ## 17 ./data/SRR446043_1.fastq V12A V12 Vir.12h.A 1 23-Mar-2012 ## 18 ./data/SRR446044_1.fastq V12B V12 Vir.12h.B 1 23-Mar-2012
To work with the custom data, users need to generate a
targets file containing the paths to their own FASTQ files and then provide under
targetspath the path to the corresponding
targetsfile for paired end (PE) samples
targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR") read.delim(targetspath, comment.char = "#")[1:2,1:6]
## FileName1 FileName2 SampleName Factor SampleLong Experiment ## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq M1A M1 Mock.1h.A 1 ## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq M1B M1 Mock.1h.B 1
Sample comparisons are defined in the header lines of the
targets file starting with ‘
##  "# Project ID: Arabidopsis - Pseudomonas alternative splicing study (SRA: SRP010938; PMID: 24098335)" ##  "# The following line(s) allow to specify the contrasts needed for comparative analyses, such as DEG identification. All possible comparisons can be specified with 'CMPset: ALL'." ##  "# <CMP> CMPset1: M1-A1, M1-V1, A1-V1, M6-A6, M6-V6, A6-V6, M12-A12, M12-V12, A12-V12" ##  "# <CMP> CMPset2: ALL"
readComp imports the comparison and stores them in a
readComp can obtain the comparison information from the corresponding
SYSargs object (see below). Note, the header lines are optional. They are mainly useful for controlling comparative analysis according to certain biological expectations, such as simple pairwise comparisons in RNA-Seq experiments.
readComp(file=targetspath, format="vector", delim="-")
## $CMPset1 ##  "M1-A1" "M1-V1" "A1-V1" "M6-A6" "M6-V6" "A6-V6" "M12-A12" "M12-V12" "A12-V12" ## ## $CMPset2 ##  "M1-A1" "M1-V1" "M1-M6" "M1-A6" "M1-V6" "M1-M12" "M1-A12" "M1-V12" "A1-V1" ##  "A1-M6" "A1-A6" "A1-V6" "A1-M12" "A1-A12" "A1-V12" "V1-M6" "V1-A6" "V1-V6" ##  "V1-M12" "V1-A12" "V1-V12" "M6-A6" "M6-V6" "M6-M12" "M6-A12" "M6-V12" "A6-V6" ##  "A6-M12" "A6-A12" "A6-V12" "V6-M12" "V6-A12" "V6-V12" "M12-A12" "M12-V12" "A12-V12"