• As peak-calling parameters become less stringent, more peaks are found, but the % of peaks near a TSS declines
• Estimate false positive rate
• Purely statistical (Poisson or Monte Carlo)
• Compare 2 bg samples (QuEST)
• Reverse sample & bg (MACS)
• Can’t estimate false negative rate – don’t know ‘true’ number of binding sites
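As a rough illustration of the first and third approaches above, the sketch below computes a Poisson tail probability for a window's read count and a swap-based empirical FDR. The function names, the background rate, and the peak counts are hypothetical, and this is not how MACS or QuEST implement these calculations internally; it only shows the idea.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of seeing k or more reads by background alone."""
    if k <= 0:
        return 1.0
    # Accumulate P(X = 0) ... P(X = k-1), then take the complement.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def empirical_fdr(n_peaks_normal, n_peaks_swapped):
    """Swap-based estimate: peaks called with sample and bg reversed,
    divided by peaks called the normal way (same parameters assumed)."""
    if n_peaks_normal == 0:
        return 0.0
    return n_peaks_swapped / n_peaks_normal

# Hypothetical numbers: 2 background reads expected per window, 10 observed.
p_value = poisson_sf(10, 2.0)
# Hypothetical counts: 1000 peaks called normally, 25 with sample and bg swapped.
fdr = empirical_fdr(1000, 25)
print(p_value, fdr)
```

Loosening the p-value cutoff raises both counts, but the swapped count typically grows faster, which is one way to see the stringency trade-off noted above.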
Annotation
• Location of ChIP-seq peaks with respect to annotated genes
• TF proteins often bind near promoters (TSS)
• Gene annotation data are generally available in GFF (General Feature Format) files, which can be downloaded from UCSC or any GMOD-compliant genome database
<http://gmod.org/wiki/GFF>
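A minimal sketch of this kind of annotation, assuming tab-delimited GFF lines with the standard nine columns and a peak summit given as a 1-based coordinate. The function names and the two example gene records are made up for illustration; real files may need extra handling (for example, transcript-level features or alternative TSSs).

```python
def tss_from_gff(lines, feature_type="gene"):
    """Collect TSS positions per chromosome from GFF lines.
    TSS is taken as 'start' on the + strand and 'end' on the - strand."""
    tss = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        chrom, start, end = cols[0], int(cols[3]), int(cols[4])
        strand, attrs = cols[6], cols[8]
        pos = end if strand == "-" else start
        tss.setdefault(chrom, []).append((pos, attrs))
    return tss

def nearest_tss(tss, chrom, summit):
    """Return (attributes, distance) of the TSS closest to a peak summit, or None."""
    candidates = tss.get(chrom, [])
    if not candidates:
        return None
    pos, attrs = min(candidates, key=lambda t: abs(t[0] - summit))
    return attrs, abs(pos - summit)

# Hypothetical annotation records (not from a real genome).
gff = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=geneA",
    "chr1\tdemo\tgene\t5000\t6000\t.\t-\t.\tID=geneB",
]
tss = tss_from_gff(gff)
print(nearest_tss(tss, "chr1", 5800))
```

Counting how many peaks fall within some chosen distance of a TSS gives the "% near TSS" figure used to judge peak-calling stringency.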
Annotation tools
• Bedtools – a command line toolkit to find overlaps between genomic features (i.e. peaks and genes) <http://code.google.com/p/bedtools>
• Galaxy – a web toolkit for a wide variety of genomics data analysis. Includes overlap of genomic features <https://main.g2.bx.psu.edu>
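To make "overlap of genomic features" concrete, here is a small sketch of the underlying test, assuming BED-style (chrom, start, end) intervals that are 0-based and half-open. This is not Bedtools or Galaxy code; the function names and coordinates are illustrative only, and the pairwise loop is only suitable for small inputs.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect(peaks, genes):
    """Return every (peak, gene) pair that overlaps; O(n*m) comparison."""
    return [(p, g) for p in peaks for g in genes if overlaps(p, g)]

# Hypothetical intervals.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
genes = [("chr1", 150, 400), ("chr1", 600, 700)]
for peak, gene in intersect(peaks, genes):
    print(peak, "overlaps", gene)
```

Note that with half-open coordinates, a peak ending at 600 and a gene starting at 600 are adjacent rather than overlapping.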
Motif finding
• Many TF proteins bind a specific pattern of DNA bases – a motif
• By extracting the sequences at ChIP-seq peaks, you can look for known binding sites or attempt to discover new motifs.
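Looking for a known binding site can be sketched as a scan for a consensus pattern on both strands. The example below assumes the motif is written with IUPAC ambiguity codes and uses a made-up peak sequence; the function names are hypothetical. Dedicated motif tools generally score position weight matrices rather than exact consensus matches, so treat this as the simplest possible version.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif(seq, consensus):
    """Return sorted (start, strand) hits; start is 0-based on the forward strand."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")  # lookahead so overlapping hits are found
    seq = seq.upper()
    n = len(consensus)
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    # Scan the reverse complement and map hit positions back to forward coordinates.
    hits += [(len(seq) - m.start() - n, "-") for m in pattern.finditer(revcomp(seq))]
    return sorted(hits)

# Made-up peak sequence scanned for a GATA-like consensus (WGATAR).
print(find_motif("CCAGATAAGGTTATCTCC", "WGATAR"))
```

A palindromic consensus will be reported once per strand at the same position, so downstream counts may need de-duplication.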
Software Workflow (tutorial)
• QC of sequence files (FastQC)
• Align ChIP and Input data files to Ref. genome (BWA)
• Check % alignment, count & remove duplicates, convert to indexed BAM format (SAMtools)