Contents

1 Overview

1.1 ChIP-Seq Technology

ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. Mapping the chromosomal locations of transcription factors, nucleosomes, histone modifications, chromatin remodeling enzymes, chaperones, and polymerases is one of the key tasks of modern biology, as evidenced by the Encyclopedia of DNA Elements (ENCODE) Project. By combining chromatin immunoprecipitation (ChIP) assays with sequencing, ChIP-sequencing (ChIP-Seq) is a powerful method for identifying genome-wide DNA binding sites for transcription factors and other proteins. Following ChIP protocols, DNA-bound protein is immunoprecipitated using a specific antibody. The bound DNA is then co-precipitated, purified, and sequenced. It can be used to map global binding sites precisely for any protein of interest (see animation).

First described in 2007, the ChIP-Seq technology allows in vivo determination of where a protein binds the genome, which can be transcription factors, DNA-binding enzymes, histones, chaperones, or nucleosomes. ChIP-seq first cross-links bound proteins to chromatin, fragments the chromatin, captures the DNA fragments bound to one protein using an antibody specific to it, and sequences the ends of the captured fragments using Next-Generation Sequencing (NGS) [Bailey et al., 2013]. Major steps:

+ Cross-linking: add Formaldehyde
+ Hydrolysis (breakdown of cell memberane)
+ Lyses of DNA fragments (shearing), e.g. treat with specific endonuclease enzymes (dpn1)
+ Add specific antibody (with beads) against the protein of interest. This will drag the protein down by immuno-precipitation
+ Separate DNA from the protein (unlinking), and purify the DNA
+ DNA fragments are amplified and fluorescently tagged, library construction, and fragments are then sequenced

Goal: A common aim in ChIP-seq experiments is to identify changes in protein binding patterns between conditions, i.e. Differential Binding (DB). ChIP-seq is now the most widely used procedure for genome-wide assays of protein-DNA interactions (Furey TS, 2012), and its use in mapping histone modifications has been seminal in epigenetics research (Ku et al., 2011).

Figure 1: An overview of ChIP-Seq technology.


Probably the most discussed issue in ChIP-seq experiments is the best method to find true “peaks” in the data. A peak is a site where multiple reads have mapped and produced a pileup (see above). ChIP sequencing is most often performed with single-end reads, and ChIP fragments are sequenced from their 5’ ends only. This creates two distinct peaks; one on each strand with the binding site falling in the middle of these peaks, the distance from the middle of the peaks to the binding site is often referred to as the “shift”.

Back to Table of Contents

1.2 ChIP-Seq Workflow

A typical ChIP-Seq workflow includes the following steps:

  1. Reads preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: Bowtie2 or rsubread
  3. Alignment stats
  4. Peak calling: MACS2, BayesPeak
  5. Peak annotation with genomic context
  6. Differential binding analysis
  7. GO term enrichment analysis
  8. Motif analysis
Figure 2: A typical ChIP-Seq data analysis workflow.


Back to Table of Contents

1.3 Important considerations

1.3.1 Sequencing Depth

Effective analysis of ChIP-seq data requires sufficient coverage by sequence reads (sequencing depth). It mainly depends on the size of the genome, and the number and size of the binding sites of the protein.

  • For mammalian transcription factors (TFs) and chromatin modifications such as enhancer-associated histone marks: 20 million reads are adequate (4 million reads for worm and fly TFs)
  • Proteins with more binding sites (e.g., RNA Pol II) or broader factors: need more reads, up to 60 million for mammalian ChIP-seq
  • Sequencing depth rules of thumb: >10M reads for narrow peaks, >20M for broad peaks
  • Long & paired-end reads useful but not essential
  • Replicates: The good news here is that, unlike RNA-Seq, more than 2 replicates does not significantly increase the number of targets
  • Control samples should be sequenced significantly deeper than the ChIP ones in a TF experiment and in experiments involving diffused broad-domain chromatin data
  • This ensures sufficient coverage of a substantial portion of the genome and non-repetitive autosomal DNA regions.

1.3.2 Quality Metrics

Preprocessing of ChIP-seq data, in general, is similar to that of any other sequencing data; assess the quality of raw reads to identify possible sequencing errors or biases.

  • FastQC can be used for an overview of the data quality
  • Usually the Phred quality scores are used to describe the confidence of each base call in each sequence tag, and can be used to filter low-quality reads
  • It may also be necessary to trim the end of reads that are of low quality
  • In addition, library complexity is a common quality measure for ChIP-seq libraries (e.g. using the preseq package)
  • Library complexity is linked to many factors such as antibody quality, over-cross-linking, amount of material, sonication, or over-amplification by PCR

1.3.3 Read Mapping

  • It is important to consider the percentage of uniquely mapped reads reported by the mapper
  • The percentage varies between organisms; for human, mouse, or Arabidopsis, above 70% uniquely mapped reads is normal
  • Less than 50% mapping may be cause for concern
  • A low percentage of uniquely mapped reads is often due to, (i) either excessive amplification in the PCR step, (ii) inadequate read length, or (iii) problems with the sequencing platform
  • In some cases, it may be unavoidable (e.g. if the protein binds frequently in repetitive DNA)
  • Read mappers are designed to allow a (user-settable) number of mismatches in the reads
  • Most peak calling algorithms would ignore the multi-mapping reads (filtered out).

1.4 Bioconductor Resources for ChIP-Seq

1.4.1 General Purpose Resources

  • GenomicRanges (Link): High-level infrastructure for range data
  • Rsamtools (Link): BAM support
  • DiffBind (Link): Differential binding analysis of ChIP-Seq peak data
  • rtracklayer (Link): Annotation imports, interface to online genome browsers
  • DESeq (Link): RNA-Seq analysis
  • edgeR (Link): RNA-Seq analysis
  • chipseq (Link): Utilities for ChIP-Seq analysis
  • ChIPpeakAnno (Link): Annotating peaks with genome context information
  • MotifDb (Link): Collection of motif databases
  • motifStack (Link): Stacked logo plots
  • PWMEnrich (Link): Position Weight Matrix (PWM) enrichment analysis
  • Bioc Workflow (Link): Overview of resources

1.4.2 Peak Calling in R

  • BayesPeak (Link): Hidden Markov Models (HMM) and Bayesian statistics
  • PICS (Link): Probabilistic inference
  • MOSAiCS (Link): Model-based analysis of ChIP-Seq data
  • iSeq (Link): Bayesian hidden Ising Models
  • ChIPseqR (Link):
  • CSAR (Link): Tests based on Poisson distribution
  • ChIP-Seq (Link):
  • SPP (Link):
  • NarrowPeaks (Link): Shape-based analysis of variation; applies a functional version of Principal Component Analysis (FPCA)

1.5 Peak Callers (command-line tools)

Key reference: Pepke et al. (2009) Computation for ChIP-seq and RNA-seq studies. Nature Methods 6(11s): S22-S32.

Back to Table of Contents

2 ChIP-Seq analysis

2.1 Load workflow environment

2.1.1 Load packages and Sample data

In this ChIP-Seq demo, we are using a Bioconductor package, systemPipeR, a R utility for building end-to-end analysis pipelines with automated report generation for next generation sequence (NGS) applications such as RNA-Seq, ChIP-Seq, VAR-Seq and Ribo-Seq (Girke T, 2015) [Link]. So at first, the systemPipeR package needs to be loaded to perform the analysis steps shown in this tutorial:

library(systemPipeR)

Load workflow environment with sample data into your current working directory.

library(systemPipeRdata)
genWorkenvir(workflow="chipseq")
setwd("chipseq")

In the workflow environments generated by genWorkenvir, all data inputs are stored in a data/ directory and all analysis results will be written to a separate results/ directory, while the *.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory. Additional parameter files are stored under param/.

To work with real data, users would want to organize their own data similarly and substitute all test data for their own data. To rerun an established workflow on new data, the initial targets file along with the corresponding FASTQ files are usually the only inputs the user needs to provide.

2.1.2 Experiment definition provided by targets file

The targets file defines all FASTQ files and sample comparisons of the analysis workflow.

targetspath <- system.file("extdata", "targets_chip.txt", package="systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")
targets[1:4,-c(5,6)]
##                   FileName SampleName Factor SampleLong SampleReference
## 1 ./data/SRR446027_1.fastq        M1A     M1  Mock.1h.A                
## 2 ./data/SRR446028_1.fastq        M1B     M1  Mock.1h.B                
## 3 ./data/SRR446029_1.fastq        A1A     A1   Avr.1h.A             M1A
## 4 ./data/SRR446030_1.fastq        A1B     A1   Avr.1h.B             M1B
Back to Table of Contents

2.2 Read pre-processing

2.2.1 Read quality filtering and trimming

The following example shows how one can design a custom read preprocessing function using utilities provided by the ShortRead package, and then apply it with preprocessReads in batch mode to all FASTQ samples referenced in the corresponding SYSargs instance (args object below). More detailed information on read preprocessing is provided in systemPipeR’s main vignette [Link].

args <- systemArgs(sysma="param/trim.param", mytargets="targets_chip.txt")
filterFct <- function(fq, cutoff=20, Nexceptions=0) {
     qcount <- rowSums(as(quality(fq), "matrix") <= cutoff)
     fq[qcount <= Nexceptions] # Retains reads where Phred scores are >= cutoff with N exceptions
 }
preprocessReads(args=args, Fct="filterFct(fq, cutoff=20, Nexceptions=0)", batchsize=100000)
writeTargetsout(x=args, file="targets_chip_trim.txt", overwrite=TRUE)

2.2.2 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution (Link). The results are written to a PDF file named fastqReport.pdf.

args <- systemArgs(sysma="param/bowtieSE.param", mytargets="targets_chip_trim.txt")
fqlist <- seeFastq(fastq=infile1(args), batchsize=100000, klength=8)
pdf("./results/fastqReport.pdf", height=18, width=4*length(fqlist))
seeFastqPlot(fqlist)
dev.off()
Figure 3: QC report for 18 FASTQ files.


Back to Table of Contents

2.3 Alignments

2.3.1 Read mapping with Bowtie2

The NGS reads are aligned with Bowtie2 against the reference genome sequence (Langmead and Salzberg, 2012). The parameter settings of the aligner are defined in the bowtieSE.param file. In ChIP-Seq experiments, it is usually more appropriate to eliminate reads mapping to multiple locations. To achieve this, users can remove the argument setting ‘-k 50 –non-deterministic’ in the bowtieSE.param file. For details, please see Bowtie2 manual.

args <- systemArgs(sysma="param/bowtieSE.param", mytargets="targets_chip_trim.txt")
sysargs(args)[1] # Command-line parameters for first FASTQ file
moduleload(modules(args)) # Skip if a module system is not used
system("bowtie2-build ./data/tair10.fasta ./data/tair10.fasta") # Indexes reference genome
runCommandline(args)
writeTargetsout(x=args, file="targets_bam.txt", overwrite=TRUE)

Check whether all BAM files have been created:

file.exists(outpaths(args))

2.3.2 Read and Alignment stats

The following provides an overview of the number of reads in each sample and how many of them are aligned to the reference genome

read_statsDF <- alignStats(args=args)
write.table(read_statsDF, "results/alignStats.xls", row.names=FALSE, quote=FALSE, sep="\t")
read.delim("results/alignStats.xls")

2.4 Peak calling with MACS2

The read count data generated by ChIP-seq is massive. So how to predict the DNA-binding sites from this read count data? For this, various Peak Calling methods have been developed. The overall goal here is to predict the regions of the genome where the ChIPed protein is bound by finding regions with significant numbers of mapped reads (peaks). The most popular method is MACS2 which empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites.

Few things to consider: