1 Integrated Development Environment (IDE) for R

  1. RStudio: a new Alternative Working Environment for R that works well for beginners and developers (Homepage)
Back to Table of Contents

2 Bioconductor

2.1 What is Bioconductor

Packages, vignettes, work flows

Helpful Links

Back to Table of Contents

2.2 Install Bioconductor Packages

Use the biocLite.R script to install Bioconductor packages. To install core packages, type the following in an R command window:


Install specific packages, e.g., “GenomicFeatures” and “AnnotationDbi”, with

biocLite(c("GenomicFeatures", "AnnotationDbi"))

2.3 Update Installed Bioconductor Packages

Bioconductor packages, especially those in the development branch, are updated fairly regularly. To identify packages requiring update within your version of Bioconductor, start a new session of R and enter

biocLite()                  ## R version 3.0 or later

2.4 Upgrading installed Bioconductor packages

Some versions of R support more than one version of Bioconductor. To use the latest version of Bioconductor for your version of R, enter

biocLite("BiocUpgrade")     ## R version 2.15 or later

Read the help page for ?BiocUpgrade for additional details. Remember that more recent versions of Bioconductor may be available if your version of R is out-of-date.

Back to Table of Contents

2.5 Bioconductor Workflows

Bioconductor provides software to help analyze diverse high-throughput genomic data. Common workflows include:

2.5.1 Basic Workflows

  • Sequence Analysis: Import fasta, fastq, BAM, gff, bed, wig, and other sequence formats. Trim, transform, align, and manipulate sequences. Perform quality assessment, ChIP-seq, differential expression, RNA-seq, and other workflows. Access the Sequence Read Archive.
  • Oligonucleotide Arrays: Import Affymetrix, Illumina, Nimblegen, Agilent, and other platforms. Perform quality assessment, normalization, differential expression, clustering, classification, gene set enrichment, genetical genomics and other workflows for expression, exon, copy number, SNP, methylation and other assays. Access GEO, ArrayExpress, Biomart, UCSC, and other community resources.
  • Annotation Resources: Introduction to using gene, pathway, gene ontology, homology annotations and the AnnotationHub. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
  • Annotating Genomic Ranges: Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
  • Annotating Genomic Variants: Read and write VCF files. Identify structural location of variants and compute amino acid coding changes for non-synonymous variants. Use SIFT and PolyPhen database packages to predict consequence of amino acid coding changes.
  • Changing genomic coordinate systems: rtracklayer, etc.

2.5.2 Advanced Workflows

  • RNA-Seq workflow: gene-level exploratory analysis and differential expression
  • High Throughput Assays: Import, transform, edit, analyze and visualize flow cytometric, mass spec, HTqPCR, cell-based, and other assays
  • Mass spectrometry and proteomics
  • Transcription Factor Binding: Finding Candidate Binding Sites for Known Transcription Factors via Sequence Matching.
  • Cloud-enabled cis-eQTL search and annotation
  • Differential Binding from ChIP-seq data.

3 Overview of RNA-Seq

RNA-seq is a high resolution Next-Generation Sequencing (NGS) method to assess the transcriptome of an organism and compare transcriptional changes between organisms / treatments to ascertain specific pathways / genes that are moving in response. With this powerful approach, one can:
- Measure gene expression,
- Discover and annotate complete transcripts,
- Characterize alternative splicing and polyadenylation.

The types of information that can be gained from RNA-seq can be divided into two broad categories: qualitative and quantitative.

  1. Qualitative data includes identifying expressed transcripts, and identifying exon/intron boundaries, transcriptional start sites (TSS), and poly-A sites. Here, we can refer to this type of information as “annotation”.

  2. Quantitative data includes measuring differences in expression, alternative splicing, alternative TSS, and alternative polyadenylation between two or more treatments or groups. Here we focus specifically on experiments to measure differential gene expression (DGE).

These two types of data are related and to some extent, are inseparable. For instance, high quality annotation can lead to greater accuracy in mapping which can lead to the higher quality measures of gene expression. Similarly, accurate annotation requires some degree of quantification.

Today, we will walk through a gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from some sample FASTQ files, see how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads within each gene for each sample. We will then perform quality assessment and explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.

3.1 RNA-Seq Technology

A typical RNA-seq experiment consists of the following steps:

Figure 1: Overview of RNA-seq technology.
Back to Table of Contents

3.2 NGS data types

3.2.1 Raw: Sequencing Reads (FASTQ)

  • A FASTQ file normally uses four lines per sequence.
  • Line 1 begins with a ‘@’ character and is followed by a sequence identifier.
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a ‘+’ character, is optionally followed by the same sequence identifier.
  • Line 4 encodes the Phred quality values for the sequence in line 2, each value represents the error probability of a given base call.

FASTQ Quality Scores:

  • Quality score represents the error probability of a given basecall
  • In a FASTQ file, quality scores are often represented using the ASCII alphabet
  • For example, a Phred score of 40 can be represented as the ASCII char “I” (40+33= ASCII #73), and an Illumina score of 40 as “h” (40+64=ASCII #104)
  • The range of scores will depend on the technology and the base caller used, but will typically be up to 40.

3.2.2 Derived

  • Alignments against reference genome
    • SAM / BAM: Sequence Alignment Map (SAM) is a tab-delimited alignment format consisting of a header section (lines starting with @) and an alignment section with 11 columns (Link). BAM is the compressed, indexed and binary version of this format. The below sample alignment contains the following features: (1) bases in lower cases are clipped from the alignment; (2) read r001/1 and r001/2 constitute a read pair; (3) r003 is a chimeric read; (4) r004 represents a split alignment.

Back to Table of Contents
  • Annotations
    • GFF / GTF
    • GFF3
Back to Table of Contents

3.3 Analysis Workflow of RNA-Seq Gene Expression Data

  1. Alignment of RNA reads to reference
    • Reference can be genome or transcriptome.
  2. Count reads overlapping with annotation features of interest
    • Most common: counts for exonic gene regions, but many viable alternatives exist here: counts per exons, genes, introns, etc.
  3. Normalization
    • Main adjustment for sequencing depth and compositional bias.
  4. Identification of Differentially Expressed Genes (DEGs)
    • Identification of genes with significant expression differences.
    • Identification of expressed genes possible for strongly expressed ones.
  5. Specialty applications
    • Splice variant discovery (semi-quantitative), gene discovery, antisense expressions, etc.
  6. Cluster Analysis
    • Identification of genes with similar expression profiles across many samples.
  7. Enrichment Analysis of Functional Annotations
    • Gene ontology analysis of obtained gene sets from steps 5-6.

3.4 Important Considerations for NGS Alignments

Back to Table of Contents

3.5 Normalization

3.5.1 Careful with RPKM/FPKM Values

RPKM Concept (FPKM is paired-end version of it)
- RPKM (FPKM): reads (fragments) per kb per million mapped reads
- For a given gene, larger counts imply more information; RPKM etc., treat all estimates as equally informative
- RPKM/FPKM are not suitable for statistical testing. Why?
- Consider the following example: in two libraries, each with one million reads, gene X may have 10 reads for treatment A and 5 reads for treatment B, while it is 100x as many after sequencing 100 millions reads from each library. In the latter case, we can be much more confident that there is a true difference between the two treatments than in the first one. However, the RPKM values would be the same for both scenarios.
- Thus, RPKM/FPKM are useful for reporting expression values, but not for statistical testing!

Programs like edgeR and DESeq2 want to make use of the count nature of RNA-Seq data to increase statistical power. See here a simplified toy example:

Scenario 1: a 30000-bp transcript has 1000 counts in sample A and 700 counts in sample B.

Scenario 2: a 300-bp transcript has 10 counts in sample A and 7 counts in sample B.

Assume that the sequencing depths are the same in both samples and both scenarios. Then the RPKM is the same in sample A in both scenarios, and in sample B in both scenarios. In scenario A, we can be more confident that there is a true difference in the expression level than in scenario B (although we would want replicates of course).

The following should make it more clear:

Every stack of the bar plot represents the number of counts of one gene. In this example you have 1.5 more depth in condition B compared to condition A. Only one gene (red) is differentially expressed. With the RPKM normalization, all the other genes will look downregulated in the second sample (which is wrong). The size-factor normalization can instead produce values which are stable for all the genes (implemented in DESeq2).

Back to Table of Contents

3.5.2 RNA-Seq Data Normalization Methods

  • Total count (TC): Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.
  • Upper Quartile (UQ): Very similar in principle to TC, the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors.
  • Median (Med): Also similar to TC, the total counts are replaced by the median counts different from 0 in the computation of the normalization factors. That is, the median is calculated as the median of gene counts of all runs.
  • DESeq: This normalization method is included in the DESeq Bioconductor package and is based on the hypothesis that most genes are not DE. The method is based on a negative binomial distribution model, with variance and mean linked by local regression, and presents an implementation that gives scale factors. Within the DESeq package, and with the estimateSizeFactorsForMatrix function, scaling factors can be calculated for each run. After dividing gene counts by each scaling factor, DESeq values are calculated as the total of rescaled gene counts of all runs.
  • Trimmed Mean of M-values (TMM): This normalization method is implemented in the edgeR Bioconductor package (Robinson et al., 2010). It is also based on the hypothesis that most genes are not DE. Scaling factors are calculated using the calcNormFactors function in the package, and then rescaled gene counts are obtained by dividing gene counts by each scaling factor for each run. TMM is the sum of rescaled gene counts of all runs.
  • Quantile (Q): First proposed in the context of microarray data, this normalization method consists in matching distributions of gene counts across lanes.
  • Reads Per Kilobase per Million mapped reads (RPKM): This approach was initially introduced to facilitate comparisons between genes within a sample and combines between- and within-sample normalization. This approach quantifies gene expression from RNA-Seq data by normalizing for the total transcript length and the number of sequencing reads.

See comparison of various normalization techniques here.

3.6 Software for RNA-Seq DEG Analysis

4 RNA-Seq Analysis

4.1 Data Sets and Experimental Variables

The mini sample FASTQ files used in this overview vignette and the associated workflow reporting vignettes can be loaded via the systemPipeRdata package as shown below. The chosen data set SRP010938 contains 18 paired-end (PE) read sets from Arabidposis thaliana (Howard et al., 2013). To minimize processing time during testing, each FASTQ file has been subsetted to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files (provided in the same download) have been truncated accordingly. This way the entire test sample data set requires less than 200MB disk storage space.

In this RNA-Seq demo, we are using a Bioconductor package, systemPipeR, a R utility for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke T, 2015) [Link].

Start the analysis by opening in your R session the RNAseq.R script which contains the code shown in this vignette in pure text format. The following generates a fully populated systemPipeR workflow environment (here for RNA-Seq) in the current working directory of an R session:


The working environment of the sample data as loaded above contains the following pre-configured directory structure:

The following parameter files are included in each workflow template:

  1. targets.txt: initial one provided by user; downstream targets_*.txt files are generated automatically
  2. *.param: defines parameter for input/output file operations, e.g. trim.param, bwa.param, vartools.param, …
  3. * optional bash script, e.g.:
  4. Compute cluster environment (skip on single machine):
    • .BatchJobs: defines type of scheduler for BatchJobs
    • *.tmpl: specifies parameters of scheduler used by a system, e.g. Torque, SGE, StarCluster, Slurm, etc.

The sample workflows provided by this vignette are based on the above directory structure, where directory names are indicated in grey. Users can change this structure as needed, but need to adjust the code in their workflows accordingly.

Back to Table of Contents

4.1.1 Structure of targets file

The FASTQ files are organized in the provided targets.txt file. This is the only file in this analysis workflow that needs to be generated manually, e.g. in a spreadsheet program. To import targets.txt, we run the following commands from R. The targets file defines all input files (e.g. FASTQ, BAM, BCF) and sample comparisons of an analysis workflow. The following shows the format of a sample targets file provided by this package. In a target file with a single type of input files, here FASTQ files of single end (SE) reads, the first three columns are mandatory including their column names, while it is four mandatory columns for FASTQ files for PE reads. All subsequent columns are optional and any number of additional columns can be added as needed. Structure of targets file for single end (SE) samples

targetspath <- system.file("extdata", "targets.txt", package="systemPipeR") 
read.delim(targetspath, comment.char = "#")
##                    FileName SampleName Factor SampleLong Experiment        Date
## 1  ./data/SRR446027_1.fastq        M1A     M1  Mock.1h.A          1 23-Mar-2012
## 2  ./data/SRR446028_1.fastq        M1B     M1  Mock.1h.B          1 23-Mar-2012
## 3  ./data/SRR446029_1.fastq        A1A     A1   Avr.1h.A          1 23-Mar-2012
## 4  ./data/SRR446030_1.fastq        A1B     A1   Avr.1h.B          1 23-Mar-2012
## 5  ./data/SRR446031_1.fastq        V1A     V1   Vir.1h.A          1 23-Mar-2012
## 6  ./data/SRR446032_1.fastq        V1B     V1   Vir.1h.B          1 23-Mar-2012
## 7  ./data/SRR446033_1.fastq        M6A     M6  Mock.6h.A          1 23-Mar-2012
## 8  ./data/SRR446034_1.fastq        M6B     M6  Mock.6h.B          1 23-Mar-2012
## 9  ./data/SRR446035_1.fastq        A6A     A6   Avr.6h.A          1 23-Mar-2012
## 10 ./data/SRR446036_1.fastq        A6B     A6   Avr.6h.B          1 23-Mar-2012
## 11 ./data/SRR446037_1.fastq        V6A     V6   Vir.6h.A          1 23-Mar-2012
## 12 ./data/SRR446038_1.fastq        V6B     V6   Vir.6h.B          1 23-Mar-2012
## 13 ./data/SRR446039_1.fastq       M12A    M12 Mock.12h.A          1 23-Mar-2012
## 14 ./data/SRR446040_1.fastq       M12B    M12 Mock.12h.B          1 23-Mar-2012
## 15 ./data/SRR446041_1.fastq       A12A    A12  Avr.12h.A          1 23-Mar-2012
## 16 ./data/SRR446042_1.fastq       A12B    A12  Avr.12h.B          1 23-Mar-2012
## 17 ./data/SRR446043_1.fastq       V12A    V12  Vir.12h.A          1 23-Mar-2012
## 18 ./data/SRR446044_1.fastq       V12B    V12  Vir.12h.B          1 23-Mar-2012

To work with the custom data, users need to generate a targets file containing the paths to their own FASTQ files and then provide under targetspath the path to the corresponding targets file.

Back to Table of Contents Structure of targets file for paired end (PE) samples

targetspath <- system.file("extdata", "targetsPE.txt", package="systemPipeR")
read.delim(targetspath, comment.char = "#")[1:2,1:6]
##                  FileName1                FileName2 SampleName Factor SampleLong Experiment
## 1 ./data/SRR446027_1.fastq ./data/SRR446027_2.fastq        M1A     M1  Mock.1h.A          1
## 2 ./data/SRR446028_1.fastq ./data/SRR446028_2.fastq        M1B     M1  Mock.1h.B          1 Sample comparisons

Sample comparisons are defined in the header lines of the targets file starting with ‘# <CMP>’.

## [1] "# Project ID: Arabidopsis - Pseudomonas alternative splicing study (SRA: SRP010938; PMID: 24098335)"                                                                              
## [2] "# The following line(s) allow to specify the contrasts needed for comparative analyses, such as DEG identification. All possible comparisons can be specified with 'CMPset: ALL'."
## [3] "# <CMP> CMPset1: M1-A1, M1-V1, A1-V1, M6-A6, M6-V6, A6-V6, M12-A12, M12-V12, A12-V12"                                                                                             
## [4] "# <CMP> CMPset2: ALL"

The function readComp imports the comparison and stores them in a list. Alternatively, readComp can obtain the comparison information from the corresponding SYSargs object (see below). Note, the header lines are optional. They are mainly useful for controlling comparative analysis according to certain biological expectations, such as simple pairwise comparisons in RNA-Seq experiments.

readComp(file=targetspath, format="vector", delim="-")
## $CMPset1
## [1] "M1-A1"   "M1-V1"   "A1-V1"   "M6-A6"   "M6-V6"   "A6-V6"   "M12-A12" "M12-V12" "A12-V12"
## $CMPset2
##  [1] "M1-A1"   "M1-V1"   "M1-M6"   "M1-A6"   "M1-V6"   "M1-M12"  "M1-A12"  "M1-V12"  "A1-V1"  
## [10] "A1-M6"   "A1-A6"   "A1-V6"   "A1-M12"  "A1-A12"  "A1-V12"  "V1-M6"   "V1-A6"   "V1-V6"  
## [19] "V1-M12"  "V1-A12"  "V1-V12"  "M6-A6"   "M6-V6"   "M6-M12"  "M6-A12"  "M6-V12"  "A6-V6"  
## [28] "A6-M12"  "A6-A12"  "A6-V12"  "V6-M12"  "V6-A12"  "V6-V12"  "M12-A12" "M12-V12" "A12-V12"
Back to Table of Contents

4.1.2 Structure of param file and SYSargs container