Quality Control
Stuart M. Brown
NYU Center for Health Informatics
& Bioinformatics

Long Term Data Management Plan

  • FACT: Next-Gen Seq data lives forever

    • Every file from every lane of every run will be reanalyzed in the future, in ways that you have not yet thought of
    • Data is aggregated across multiple runs and re-analyzed (differential expression, SNP calling, etc.)
    • You will need to submit original ‘raw’ data files to NCBI archive
    • Manage it well centrally, or manage it poorly (and differently) in each of 20-50 different investigators’ labs
    • Plan to double your data storage every year.
  • Two words: OFFSITE BACKUP (!)
    • “Hurricane Sandy”
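
The doubling rule above is easy to turn into a capacity projection. A minimal sketch, assuming a constant yearly growth factor; the 50 TB starting point is a made-up example, not a figure from these slides.

```python
from typing import List


def projected_storage_tb(current_tb: float, years: int,
                         growth_factor: float = 2.0) -> List[float]:
    """Projected storage need for year 0..years at a constant yearly growth factor."""
    return [current_tb * growth_factor ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Hypothetical starting capacity of 50 TB (illustrative only).
    for year, tb in enumerate(projected_storage_tb(50, 5)):
        print(f"year {year}: {tb:,.0f} TB")
```

Remember that an offsite backup copy needs the same capacity again.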

QC and Problems

  • Bioinformatics generally must take the lead role in NGS quality control.
  • QC software is part of the data processing pipeline.
  • You don’t really know if an experiment has worked until the reads are aligned.
  • Most NGS problems are diagnosed through data analysis

QC Metrics

  • Pre-Alignment: Vendor software, FastQC
    • # raw clusters
    • % filtered by vendor software
    • % error and per-base quality scores
    • % duplication (PCR artifacts)
  • After Alignment

    • % alignment to reference genome
    • average depth of genome or transcriptome coverage
    • over-represented sequences
    • for ChIP-seq:

      • clustering of reads into peaks
      • distance from peaks to promoters (TSS)
    • for RNA-seq

      • % rRNA
      • genomic DNA contamination (reads map to non-coding regions)
      • 3’ (or 5’) bias
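
Two of the post-alignment metrics above (% alignment and average depth) can be estimated directly from SAM records. A rough sketch under stated assumptions: input is plain SAM text, each read is counted once via its primary record, and depth is approximated as aligned read bases divided by reference length (clipping and indels are ignored). The function name and the toy records are illustrative, not output from a real run.

```python
from typing import Dict, Iterable

# FLAG bits defined in the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

# Toy records (made up): two aligned reads, one unaligned read,
# and one secondary alignment that must not be counted twice.
TOY_SAM = [
    "@SQ\tSN:chr1\tLN:40",
    "r1\t0\tchr1\t1\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t5\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r1\t256\tchr1\t30\t0\t10M\t*\t0\t0\t*\t*",
]


def alignment_summary(sam_lines: Iterable[str], reference_length: int) -> Dict[str, float]:
    """Return read count, % aligned, and approximate mean depth."""
    total_reads = 0
    aligned_reads = 0
    aligned_bases = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\r\n").split("\t")
        flag = int(fields[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total_reads += 1
        if not flag & UNMAPPED:
            aligned_reads += 1
            aligned_bases += len(fields[9])  # SEQ length as a proxy for aligned bases
    pct_aligned = 100.0 * aligned_reads / total_reads if total_reads else 0.0
    mean_depth = aligned_bases / reference_length if reference_length else 0.0
    return {"total_reads": total_reads,
            "pct_aligned": pct_aligned,
            "mean_depth": mean_depth}


if __name__ == "__main__":
    print(alignment_summary(TOY_SAM, reference_length=40))
```

For real BAM files, a dedicated tool or library is the practical choice; the point here is only what the numbers mean.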

Summary.html (GAIIx)

Summary Information For Experiment on Machine HWI-EAS305


  • FastQC: Simon Andrews, Babraham Bioinformatics

  • Java app, runs on any computer
  • per-base sequence quality
  • average whole read quality scores
  • per-base GC content
  • sequence length (all same for Illumina)
  • sequence duplications
  • overrepresented sequences
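
To make two of these checks concrete, here is a small sketch that computes mean quality per read position and the percentage of exact-duplicate reads from FASTQ text. It assumes 4-line records and Phred+33 quality encoding, and it is not the tool's own implementation (which samples reads and bins results). All function names and the toy reads are illustrative.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (sequence, quality string)

# Toy reads (made up): r1 and r2 share a sequence, r2 has lower quality.
TOY_FASTQ = """@r1
ACGT
+
IIII
@r2
ACGT
+
5555
@r3
TTTT
+
IIII
""".splitlines()


def parse_fastq(lines: Iterable[str]) -> List[Record]:
    """Return (sequence, quality) pairs from 4-line FASTQ records."""
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    return [(cleaned[i + 1], cleaned[i + 3]) for i in range(0, len(cleaned) - 3, 4)]


def mean_quality_per_position(records: List[Record], phred_offset: int = 33) -> List[float]:
    """Mean Phred score at each read position (reads may differ in length)."""
    totals: List[int] = []
    counts: List[int] = []
    for _, qual in records:
        for pos, char in enumerate(qual):
            if pos == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - phred_offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def percent_duplicate_reads(records: List[Record]) -> float:
    """Percent of reads whose exact sequence was already seen."""
    seq_counts = Counter(seq for seq, _ in records)
    total = sum(seq_counts.values())
    duplicates = total - len(seq_counts)
    return 100.0 * duplicates / total if total else 0.0


if __name__ == "__main__":
    reads = parse_fastq(TOY_FASTQ)
    print(mean_quality_per_position(reads))
    print(percent_duplicate_reads(reads))
```

A falling quality curve toward the 3’ end of the read is normal for Illumina data; a high duplicate percentage is a reason to check for PCR artifacts.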

Preliminary QC (pre-alignment)

Depth of Coverage

  • With the Illumina HiSeq producing >200 million reads per sample, what depth of coverage is needed for RNA-seq?

  • Can we multiplex several samples per lane and save $$ on sequencing?
  • For expression profiling (and detection of differentially expressed genes), probably yes, 2-4 samples per lane is practical
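
The multiplexing trade-off is simple arithmetic. A sketch, assuming a lane yield of 200 million reads (taken from the figure quoted above) and an even split across barcoded samples; real lanes are never perfectly balanced.

```python
def reads_per_sample(reads_per_lane: int, samples_per_lane: int) -> float:
    """Expected reads per sample if the lane's yield is split evenly."""
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return reads_per_lane / samples_per_lane


if __name__ == "__main__":
    lane_yield = 200_000_000  # assumed yield per lane
    for n in (1, 2, 4):
        print(f"{n} sample(s) per lane: {reads_per_sample(lane_yield, n):,.0f} reads each")
```

At 2-4 samples per lane, each sample still gets roughly 50-100 million reads under these assumptions.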

Depth of Coverage, cont’d

At 100 million reads, 81% of genes are detected at FPKM ≥ 0.05.

Each additional 100 million reads detects about 3% more genes.
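
To see what the FPKM ≥ 0.05 threshold means in raw counts, the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments) can be inverted. A sketch using that definition; the 2 kb transcript length is a made-up example, and for single-end data "fragments" and "reads" are the same thing.

```python
def fpkm(fragments: float, transcript_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def min_fragments_for_fpkm(threshold: float, transcript_length_bp: int,
                           total_mapped_fragments: int) -> float:
    """Fragment count at which a transcript reaches the given FPKM threshold."""
    return threshold * transcript_length_bp * total_mapped_fragments / 1e9


if __name__ == "__main__":
    # Hypothetical 2 kb transcript in a library of 100 million mapped fragments.
    needed = min_fragments_for_fpkm(0.05, 2_000, 100_000_000)
    print(f"fragments needed to reach FPKM 0.05: {needed:.1f}")
```

Under these assumptions the threshold corresponds to about 10 fragments on a 2 kb transcript, which is why very deep sequencing mostly adds genes expressed at very low levels.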

Picard toolkit


Huge collection of tools to work on SAM/BAM files

  • find barcodes and demultiplex
  • GC bias
  • insert size
  • mark (and remove) duplicates
  • quality score distribution, mean quality by cycle
  • RNA-seq metrics (3’/5’ bias; %rRNA; percent of bases of aligned reads in coding, intron, UTR, intergenic regions)
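
The idea behind duplicate marking can be shown in a few lines. This is a deliberately simplified sketch, not Picard's algorithm: reads are treated as duplicates when they share chromosome, 5’ position, and strand, and the read with the highest summed base quality in each group is kept. The `Alignment` type and its fields are hypothetical names for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    five_prime_pos: int
    strand: str            # "+" or "-"
    base_quality_sum: int  # sum of base qualities, used to pick the read to keep


def mark_duplicates(alignments: List[Alignment]) -> Dict[str, bool]:
    """Map read name -> True if the read is flagged as a duplicate."""
    best: Dict[Tuple[str, int, str], Alignment] = {}
    for aln in alignments:
        key = (aln.chrom, aln.five_prime_pos, aln.strand)
        # Keep the highest-quality read per group; on ties the first one wins.
        if key not in best or aln.base_quality_sum > best[key].base_quality_sum:
            best[key] = aln
    return {
        aln.name: best[(aln.chrom, aln.five_prime_pos, aln.strand)].name != aln.name
        for aln in alignments
    }


if __name__ == "__main__":
    # Toy alignments (made up): "a" and "b" start at the same place on the same strand.
    toy = [
        Alignment("a", "chr1", 100, "+", 300),
        Alignment("b", "chr1", 100, "+", 350),
        Alignment("c", "chr1", 100, "-", 200),
        Alignment("d", "chr2", 100, "+", 100),
    ]
    print(mark_duplicates(toy))
```

Whether to remove marked duplicates depends on the assay: highly expressed genes in RNA-seq legitimately produce many reads with identical start positions.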

Sample prep can create 3’ or 5’ bias

Picard tools: CollectRnaSeqMetrics

  • 5’ bias (strand-oriented protocol)
  • no bias (low coverage at ends of transcript)
  • 3’ bias (poly-A selection)
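
End bias can be summarized as a single number per transcript. A minimal sketch, assuming per-base coverage for one transcript ordered 5’ to 3’ and comparing the mean depth of the first and last 100 bases; this is an illustration of the idea, not the statistic Picard computes, and the window size is an arbitrary choice.

```python
from statistics import mean
from typing import Sequence


def five_to_three_prime_ratio(coverage: Sequence[float], window: int = 100) -> float:
    """Ratio of mean depth at the 5' end to mean depth at the 3' end.

    coverage: per-base depth along one transcript, ordered 5' -> 3'.
    A ratio well below 1 suggests 3' bias; well above 1 suggests 5' bias.
    """
    if len(coverage) < 2 * window:
        raise ValueError("transcript is shorter than two windows")
    five_prime = mean(coverage[:window])
    three_prime = mean(coverage[-window:])
    if three_prime == 0:
        raise ValueError("no coverage at the 3' end")
    return five_prime / three_prime


if __name__ == "__main__":
    # Toy coverage profile (made up) rising toward the 3' end.
    profile = [10] * 100 + [20] * 300 + [40] * 100
    print(five_to_three_prime_ratio(profile))
```

In practice the ratio is taken over many well-expressed transcripts and summarized (for example by the median), since any single transcript is noisy.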

High Background Error Rate

data from Zavadil lab

Good SNP

data from Zavadil lab