Tools for ChIP-seq data analysis

Quality metrics of sequence reads
FastQC A quality control tool for high throughput sequence data.

Mapper
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, the most recent of the three, is generally recommended for high-quality queries because it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack on 70-100bp Illumina reads.

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.

SOAPaligner/soap2 is a member of SOAP (Short Oligonucleotide Analysis Package). It is an updated version of the SOAP software for short oligonucleotide alignment.

Maq stands for Mapping and Assembly with Quality. It builds assemblies by mapping short reads to reference sequences.

Quality metrics of read counts
CHANCE (ChIP-seq Analytics and Confidence Estimation) is software for assessing the quality of ChIP-seq experiments and providing feedback for the optimization of ChIP and library generation protocols.

phantompeakqualtools Computes quick but highly informative enrichment and quality measures and fragment lengths for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data.
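
The fragment-length estimate mentioned above rests on the observation that 5' ends of reverse-strand reads tend to lie about one fragment length downstream of 5' ends of forward-strand reads. The following is a minimal sketch of that idea, not the phantompeakqualtools implementation: it scores each candidate shift by a raw overlap count rather than a normalized correlation, and the function names and the `max_shift` parameter are hypothetical.

```python
from collections import Counter


def strand_shift_scores(forward_starts, reverse_starts, max_shift):
    """Score each shift by how many forward/reverse 5' end pairs it aligns.

    forward_starts, reverse_starts: iterables of integer 5' end positions
    on a single chromosome (assumed input format).
    """
    fwd = Counter(forward_starts)
    rev = Counter(reverse_starts)
    scores = {}
    for shift in range(max_shift + 1):
        # Count read pairs where a reverse 5' end sits exactly `shift`
        # bases downstream of a forward 5' end.
        scores[shift] = sum(n * rev.get(pos + shift, 0) for pos, n in fwd.items())
    return scores


def estimate_fragment_shift(forward_starts, reverse_starts, max_shift=500):
    """Return the shift with the highest score (first one on ties)."""
    scores = strand_shift_scores(forward_starts, reverse_starts, max_shift)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    forward = [100, 200, 300]
    reverse = [250, 350, 450]
    print(estimate_fragment_shift(forward, reverse))
```

With real data the scores are usually computed per chromosome on binned coverage and normalized, so treat this only as an illustration of the strand-offset principle.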

Peak caller
MACS2 provides quantitative measurements of ChIP-Seq enrichment across the whole genome and is capable of identifying differential binding sites.

SPP A ChIP-seq peak calling algorithm, implemented as an R package, that accounts for the offset in forward-strand and reverse-strand reads to improve resolution, compares enrichment in signal to background or control experiments, and can also estimate whether the available number of reads is sufficient to achieve saturation, meaning that additional reads would not allow identification of additional peaks.

ZINBA (Zero Inflated Negative Binomial Algorithm) is a computational and statistical framework used to call regions of the genome enriched for sequencing reads originating from a diverse array of biological experiments. We collectively refer to the sequencing data derived from these experiments as DNA-seq, including FAIRE-seq, ChIP-seq, and DNase-seq experiments.

BayesPeak is an implementation of the BayesPeak algorithm for peak-calling in ChIP-seq data.

SICER A clustering approach for identification of enriched domains from histone modification ChIP-Seq data.

Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled genome to build a transcriptome ab initio. The statistical methods to estimate read coverage significance are also applicable to other sequencing data. Scripture also has modules for ChIP-Seq peak calling.

Linear normalization
Sequencing depth normalization scales the samples so that the total number of reads is the same in each. Many existing methods focus on normalization against control samples.
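
As an illustration of sequencing depth normalization, the sketch below rescales per-bin read counts so that every sample sums to the same total. Scaling to the smallest sample is an assumption made here for the example (a fixed target such as one million reads is an equally common choice), and the function name and input layout are hypothetical.

```python
def scale_to_common_depth(samples):
    """Rescale each sample's counts so all samples share the same total.

    samples: dict mapping sample name -> list of per-bin read counts
    (assumed layout). The target total is the smallest sample total.
    """
    totals = {name: sum(counts) for name, counts in samples.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("every sample needs at least one read")
    target = min(totals.values())
    scaled = {}
    for name, counts in samples.items():
        factor = target / totals[name]  # one linear factor per sample
        scaled[name] = [count * factor for count in counts]
    return scaled


if __name__ == "__main__":
    data = {"chip": [4, 4], "input": [2, 2]}
    print(scale_to_common_depth(data))
```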

RPKM (Reads Per Kilobase of sequence range per Million mapped reads) adjusts for the bias caused by the higher probability of reads falling into longer regions.
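
The RPKM definition translates directly into a small calculation: divide the read count by the region length in kilobases and by the number of mapped reads in millions. The sketch below shows this; the function and parameter names are chosen for illustration only.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    kilobases = region_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


if __name__ == "__main__":
    # 100 reads in a 2 kb region, 10 million mapped reads in the library.
    print(rpkm(100, 2_000, 10_000_000))
```

Because both the region length and the library size appear in the denominator, a region twice as long with twice as many reads gets the same RPKM value.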

Non-linear normalization
Lowess Normalization

Quantile Normalization
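
Quantile normalization forces every sample to share the same distribution of values: each value is replaced by the mean, across samples, of the values that hold the same rank. The following is a minimal sketch under simplifying assumptions: all samples have the same number of regions, and ties are broken by input order rather than averaged. The function name is hypothetical.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of equally long lists of numbers."""
    n = len(samples[0])
    if any(len(sample) != n for sample in samples):
        raise ValueError("all samples must have the same length")

    # Mean of the k-th smallest value across samples, for each rank k.
    sorted_samples = [sorted(sample) for sample in samples]
    rank_means = [sum(column) / len(samples) for column in zip(*sorted_samples)]

    normalized = []
    for sample in samples:
        # Indices of this sample ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: sample[i])
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every sample contains the same set of values, only in a different order, so between-sample differences in overall signal distribution are removed.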

Assessment of Reproducibility
IDR The IDR (Irreproducible Discovery Rate) framework is a unified approach to measure the reproducibility of findings identified from replicate experiments and provide highly stable thresholds based on reproducibility.

Differential Binding Analysis
MACS2 provides quantitative measurements of ChIP-Seq enrichment across the whole genome and is capable of identifying differential binding sites.

DBChIP is a Bioconductor package, which detects differentially bound sharp binding sites across multiple conditions, with or without matching control samples.

MAnorm performs quantitative comparison of ChIP-Seq data sets describing transcription factor binding sites and epigenetic modifications.

DIME is an R package that considers an ensemble of finite mixture models combined with a local false discovery rate (fdr) for analyzing ChIP-seq data comparing two samples. The package can also be used to identify differences in other high-throughput data such as microarray and DNA methylation data.

ChIPDiff provides a solution for the identification of Differential Histone Modification Sites (DHMSs) by comparing two ChIP-seq libraries (L1 and L2). An HMM is employed in ChIPDiff to infer the states of histone modification changes.

Peak Annotation
CEAS (Cis-regulatory Element Annotation System) provides statistics on ChIP enrichment at important genome features such as specific chromosomes, promoters, gene bodies, or exons, and infers the genes most likely to be regulated by a binding factor. CEAS also enables biologists to visualize the average ChIP enrichment signals over specific genomic features, allowing continuous and broad ChIP enrichment to be perceived which might be too subtle to detect from ChIP peaks alone.

ChIPpeakAnno is a Bioconductor package. The package includes functions to retrieve the sequences around the peak, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. Starting with version 2.0.5, new functions have been added for finding the peaks with bi-directional promoters with summary statistics (peaksNearBDP), for summarizing the occurrence of motifs in peaks (summarizePatternInPeaks) and for adding other IDs to annotated peaks or enrichedGO (addGeneIDs).

Motif Analysis
MEME-ChIP performs comprehensive motif analysis (including motif discovery) on LARGE (50MB maximum) sets of nucleotide sequences such as those identified by ChIP-seq or CLIP-seq experiments.

Mark duplicates
Picard MarkDuplicates Examines aligned records in the supplied SAM or BAM file to locate duplicate molecules.
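
To illustrate what locating duplicates means for single-end data, the sketch below flags a read as a duplicate when an earlier read has the same chromosome, 5' position and strand. This is a deliberately simplified stand-in, not Picard's algorithm: Picard's own criteria are more elaborate (for example in how it handles read pairs and chooses which copy to keep), and the tuple layout and function name here are hypothetical.

```python
def flag_duplicates(reads):
    """Return one boolean per read: True if it repeats an earlier read.

    reads: iterable of (chromosome, five_prime_position, strand) tuples
    (assumed layout). The first read seen at each key is kept as the
    representative; later reads with the same key are flagged.
    """
    seen = set()
    flags = []
    for chrom, position, strand in reads:
        key = (chrom, position, strand)
        flags.append(key in seen)
        seen.add(key)
    return flags


if __name__ == "__main__":
    example = [
        ("chr1", 100, "+"),
        ("chr1", 100, "+"),  # same start and strand: duplicate
        ("chr1", 100, "-"),  # opposite strand: not a duplicate
        ("chr2", 100, "+"),
    ]
    print(flag_duplicates(example))
```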