Next Generation DNA Sequencing Resources
Cloud Computing for RNA-seq
Alignment for NGS
Subscribe to RSS
Stuart M. Brown, Ph.D.
Center for Health Informatics & Bioinformatics
NYU School of Medicine
The dataset for this tutorial is available
• Combine sequencing with Chromatin-Immunoprecipitaion
• Select (and identify) fragments of DNA that interact with specific proteins such as:
– Transcription factors
– Modified histones
– RNA Polymerase (survey actively transcribe portions of the genome)
– DNA polymerase (investigate DNA replication)
– DNA repair enzmes
• [Pre-sequencing technology]
• Do chromatin IP with YFA
(Your Favorite Antibody)
• Take IP-purified DNA fragments, label & hybridize to a microarray containing (putative) promoter (or TF binding) sequences from lots of genes
• Estimate binding, relate to DNA binding of protein targeted by antibody
– limited to well annotated genomes
– need to build special microarrays
– suffers from hybridization bias
– assumes all TF binding sites are known and correctly located on genome
– Place millions of short read sequence ‘tags’ (25-50 bp) on the genome
– Finds perfect, 1, and 2 mismatch alignments; (Illumina ELAND, BWA, Bowtie)
– Aligns ~80% of PF tags to human/mouse genome
– Can parse output files to get only unique alignments (removes 2%-5% of ‘multi-map’ reads)
ChIP-seq for TF
Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36: 5221-31
• How many sequence reads are needed to find all of the binding targets in the genome?
• Look for plateau
• We want to find the peaks (enriched regions = protein binding sites on genome)
• Goals include: accuracy (location of peak on genome), sensitivity, & reproducibility
• Challenges: non-random background, PCR artifacts, difficult to estimate false negatives
• Very difficult to compare samples to find changes in TF binding (many borderline peaks)
• Find enriched regions on the genome (high tag density) = “peaks”
– Enriched vs. what?
• A statistical approach assumes an evenly distributed or randomly distributed background
– Poisson distribution of background is obviously not true
– Any threshold is essentially arbitrary
Different IP Targets
• Huge difference between Transcription Factors and Histone modification as targets of IP
Compare to Background
• Goal is to make ‘fold change’ measurements
• What is the appropriate background?
– Input DNA (no IP)
– IP with non-specific antibody (IgG)
• Must first identify “peak region” in sample, then compare tag counts vs BG
• Must collect and sequence an input DNA sample, then align input data file to ref. genome along with ChIP sample.
• How to compare lanes with different numbers of reads?
• Will bias fold-change calculations
• Simple method – set all counts in ‘peak’ regions as per million reads
– This does not work well for >2x differences in read counts.
• In some ChIP-seq samples, PCR amplification of IP-enriched DNA creates artifacts (highly duplicated fragments)
• Huge differences depending on target of antibody, complexity of sample after IP.
% of Duplicates varies
• Unix command line software, most popular method for ChIP-seq peakcalling.
• Includes comparison to background (input), normalization, removal of duplicates, and statistical measure of False Discovery (FDR%).
Zhang et al. Model-based Analysis of ChIP-Seq (MACS).
Genome Biol (2008) vol. 9 (9) pp. R137
• Peaks near promoters of known genes (TSS)
• Generally a high %
• As parameters become less stringent, more peaks are found, % near TSS declines
• Estimate false positive rate
• Pure statistical (Poisson or Monte Carlo)
• Compare 2 bg sampels (QuEST)
• Reverse sample & bg (MACS)
• Can’t estimate false negative rate – don’t know ‘true’ number of binding sites
• Location of ChIP-seq peaks with respect to annotated genes
• TF proteins often bind near promoters (TSS)
• Gene annotation data are generally available in GFF (Generic Feature File) format. Can be downloaded from UCSC or any GMOD compliant genome database
• Bedtools – a command line toolkit to find overlaps between genomic features (ie. peaks and genes) <http://code.google.com/p/bedtools>
• Galaxy – a web toolkit for a wide variety of gnomics data analysis. Includes overlap of genomic features <https://main.g2.bx.psu.edu>
• Many TF proteins bind a specific pattern of DNA bases – a motif
• By extracting the sequences at Chip-seq peaks, you can look for known binding sites, or attempt to compute new ones.
Software Workflow (tutorial)
• QC of sequence files (
data files to Ref. genome (
• Check % alignment, count & remove duplicates, convert to indexed BAM format (
• Peak calling (
• Visualize peaks (
• Annotate (
Login using Secure Shell (ssh)
: your NYULMC Kerberos id
• -bash-3.2$ bwa aln –I -f chipseq.tutorial.fastq.sai /data/tutorial/ChIPseq/Hg19DB/version0.6.0/genome.fa chipseq.tutorial.fastq
• -bash-3.2$ bwa samse -f chipseq.tutorial.fastq.sam /data/tutorial/ ChIPseq/Hg19DB/version0.6.0/genome.fa chipseq.tutorial.fastq.sai
• -bash-3.2$ samtools view -b -S -q 25 -t /data/tutorial/ChIPseq/Hg19DB/version0.6.0/genome.fai -o chipseq.tutorial.fastq.bam chipseq.tutorial.fastq.sam
• -bash-3.2$ samtools sort chipseq.tutorial.fastq.bam chipseq.tutorial.fastq.sort
• -bash-3.2$ samtools rmdup -s chipseq.tutorial.fastq.sort.bam chipseq.tutorial.fastq.sort.rmdup.bam
• -bash-3.2$ samtools index chipseq.tutorial.fastq.sort.rmdup.bam chipseq.tutorial.fastq.sort.rmdup.bam.bai
• REPEAT The Same Steps using
• -bash-3.2$ macs14 –t chipseq.tutorial.fastq.sort.rmdup.bam –c control.tutorial.fastq.sort.rmdup.bam -f BAM -g hs -n chipseq-tutorial-macs141
Stuart Brown’s Book
Getting Genetics Done
Copyright: (c) 2013