Stuart M. Brown, Ph.D.

Center for Health Informatics & Bioinformatics

NYU School of Medicine

The dataset for this tutorial is available here.


• Combine sequencing with Chromatin-Immunoprecipitaion

• Select (and identify) fragments of DNA that interact with specific proteins such as:

– Transcription factors

– Modified histones

– Methylation

– RNA Polymerase (survey actively transcribe portions of the genome)

– DNA polymerase (investigate DNA replication)

– DNA repair enzmes


• [Pre-sequencing technology]

• Do chromatin IP with YFA (Your Favorite Antibody)

• Take IP-purified DNA fragments, label & hybridize to a microarray containing (putative) promoter (or TF binding) sequences from lots of genes

• Estimate binding, relate to DNA binding of protein targeted by antibody

– limited to well annotated genomes

– need to build special microarrays

– suffers from hybridization bias

– assumes all TF binding sites are known and correctly located on genome


– Place millions of short read sequence ‘tags’ (25-50 bp) on the genome

– Finds perfect, 1, and 2 mismatch alignments; (Illumina ELAND, BWA, Bowtie)

– Aligns ~80% of PF tags to human/mouse genome

– Can parse output files to get only unique alignments (removes 2%-5% of ‘multi-map’ reads)

ChIP-seq for TF

Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36: 5221-31


• How many sequence reads are needed to find all of the binding targets in the genome?

• Look for plateau

ChIP-seq Challenges

• We want to find the peaks (enriched regions = protein binding sites on genome)

• Goals include: accuracy (location of peak on genome), sensitivity, & reproducibility

• Challenges: non-random background, PCR artifacts, difficult to estimate false negatives

• Very difficult to compare samples to find changes in TF binding (many borderline peaks)


• Find enriched regions on the genome (high tag density) = “peaks”

– Enriched vs. what?

• A statistical approach assumes an evenly distributed or randomly distributed background

– Poisson distribution of background is obviously not true

– Any threshold is essentially arbitrary

Different IP Targets

• Huge difference between Transcription Factors and Histone modification as targets of IP

Compare to Background

• Goal is to make ‘fold change’ measurements

• What is the appropriate background?

– Input DNA (no IP)

– IP with non-specific antibody (IgG)

• Must first identify “peak region” in sample, then compare tag counts vs BG

• Must collect and sequence an input DNA sample, then align input data file to ref. genome along with ChIP sample.


• How to compare lanes with different numbers of reads?

• Will bias fold-change calculations

• Simple method – set all counts in ‘peak’ regions as per million reads

– This does not work well for >2x differences in read counts.

Remove Duplicates

• In some ChIP-seq samples, PCR amplification of IP-enriched DNA creates artifacts (highly duplicated fragments)

• Huge differences depending on target of antibody, complexity of sample after IP.

% of Duplicates varies


• Unix command line software, most popular method for ChIP-seq peakcalling.

• Includes comparison to background (input), normalization, removal of duplicates, and statistical measure of False Discovery (FDR%).

Zhang et al. Model-based Analysis of ChIP-Seq (MACS).
Genome Biol (2008) vol. 9 (9) pp. R137


• Peaks near promoters of known genes (TSS)

• Generally a high %

• As parameters become less stringent, more peaks are found, % near TSS declines

• Estimate false positive rate

• Pure statistical (Poisson or Monte Carlo)

• Compare 2 bg sampels (QuEST)

• Reverse sample & bg (MACS)

• Can’t estimate false negative rate – don’t know ‘true’ number of binding sites


• Location of ChIP-seq peaks with respect to annotated genes

• TF proteins often bind near promoters (TSS)

• Gene annotation data are generally available in GFF (Generic Feature File) format. Can be downloaded from UCSC or any GMOD compliant genome database

Annotation tools

• Bedtools – a command line toolkit to find overlaps between genomic features (ie. peaks and genes) <http://code.google.com/p/bedtools>

• Galaxy – a web toolkit for a wide variety of gnomics data analysis. Includes overlap of genomic features <https://main.g2.bx.psu.edu>

Motif finding

• Many TF proteins bind a specific pattern of DNA bases – a motif

• By extracting the sequences at Chip-seq peaks, you can look for known binding sites, or attempt to compute new ones.

Software Workflow (tutorial)

• QC of sequence files (FASTQC)

• Align ChIP and Input data files to Ref. genome (BWA)

• Check % alignment, count & remove duplicates, convert to indexed BAM format (SAMtools)

• Peak calling (MACS)

• Visualize peaks (IGV)

• Annotate (Galaxy)

Login using Secure Shell (ssh)

: your NYULMC Kerberos id

: ec2-54-242-89-16.compute-1.amazonaws.com

• -bash-3.2$ bwa aln –I -f chipseq.tutorial.fastq.sai /data/tutorial/ChIPseq/Hg19DB/version0.6.0/genome.fa chipseq.tutorial.fastq

• -bash-3.2$ bwa samse -f chipseq.tutorial.fastq.sam /data/tutorial/ ChIPseq/Hg19DB/version0.6.0/genome.fa chipseq.tutorial.fastq.sai

• -bash-3.2$ samtools view -b -S -q 25 -t /data/tutorial/ChIPseq/Hg19DB/version0.6.0/genome.fai -o chipseq.tutorial.fastq.bam chipseq.tutorial.fastq.sam

• -bash-3.2$ samtools sort chipseq.tutorial.fastq.bam chipseq.tutorial.fastq.sort

• -bash-3.2$ samtools rmdup -s chipseq.tutorial.fastq.sort.bam chipseq.tutorial.fastq.sort.rmdup.bam

• -bash-3.2$ samtools index chipseq.tutorial.fastq.sort.rmdup.bam chipseq.tutorial.fastq.sort.rmdup.bam.bai

• REPEAT The Same Steps using control.tutorial.fastq

• -bash-3.2$ macs14 –t chipseq.tutorial.fastq.sort.rmdup.bam –c control.tutorial.fastq.sort.rmdup.bam -f BAM -g hs -n chipseq-tutorial-macs141