//
Alignment for NGS

Alignment for NGS: Tutorial

NGS Sequencing Informatics

Stuart Brown, Stratos Efstathiadis

NYU Center for Health Informatics & Bioinformatics

Alignment

  • Most NGS experiments involve sequence reads collected from known genomes: gene expression, variants, ChIP-seq, etc.

  • First step in data analysis is to align reads to the reference genome.

  • This is computationally demanding due to the huge volume of NGS data and the large size of the reference genome (human = 3.2 Gb)

Tutorial Data

  • We can use a small data set for this alignment tutorial that was originally created by the Galaxy team for their ChIP-seq tutorial.

  • It contains a small number of reads that map just to one part of Chromosome 19 on the Mouse genome.

    https://main.g2.bx.psu.edu/u/james/p/exercise-chip-seq

  • The small size of this data set will make all of the compute operations go more quickly. A real data set will take MUCH LONGER at every step, and use more
    storage space.

Unix

  • For this tutorial, we will assume that you have access to a computer with a Unix or Linux operating system installed plus the standard software compilers
    and libraries needed to install and run software.

  • This can be done on any Macintosh computer, your own Unix server, or a “cloud” server such as the Amazon Elastic Computer Cloud (EC2)

  • See our Unix and HPC tutorials for more details on setting up your system and basic Unix commands

NGS Alignment

  • Pre-NGS alignment algorithms such as Smith-Waterman and BLAST are both too slow and too sensitive (able
    to align sequences with many mismatches)

  • NGS reads should be near-perfect matches to the reference genome.

  • Need to process hundreds of millions of reads in a few hours.

    • time is critical because alignment is used for QC of NGS experiments, lab may be waiting
      before running next sample

There are many NGS aligners

  • Bfast

  • BioScope

  • Bowtie

  • BWA

  • CLC bio

  • CloudBurst

  • Eland/Eland2

  • GenomeMapper

  • GnuMap

  • Karma

  • MAQ

  • MOM

  • Mosaik

  • MrFAST/MrsFAST

  • NovoAlign

  • PASS

  • PerM

  • RazerS

  • RMAP

  • SSAHA2

  • Segemehl

  • SeqMap

  • SHRiMP

  • Slider/SliderII

  • SOAP/SOAP2

  • Srprism

  • Stampy

  • vmatch

  • ZOOM

Alignment Software

  • Illumina provides the ELAND alignment tool

    • fast, max of 2 mismatches in 32 base ‘seed’

    • uses a hash index of the NGS reads (query data), which uses more RAM (and runs slower) as machines produce more data
    • constantly improving its handling of indels
  • BWA
    and Bowtie are popular open source tools based on the Burrows Wheeler Transformation

    • fast
    • only works well on sequences with <3% difference from the reference genome

    • better gapped alignments (indels) than ELAND

    • provides a “mapping quality” score for the alignment of each read

Burrows Wheeler Transformation

  • Invented by David Wheeler in 1983 (Bell Labs).

    A Block Sorting Lossless Data Compression Algorithm
    . Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment Corporation, Burrows M, Wheeler DJ. 1994

  • Originally developed for compressing large files (bzip2)

  • Lossless, fully reversible

  • Align reads to a BW transformed suffix array index of the reference genome, which enables fast exact matching

  • Short-read algorithm: alters the read sequence to match the reference exactly.

  • Results in great gains in speed, efficiency, and reduces RAM requirement vs. other types of indexes

BWA Features

  • Compresses the genome as a BWT index

  • Identical sequences (genome repeats) only occur once in the index, reads that match the genome in multiple locations get a mapping quality of 0

  • Fast and moderate memory footprint (<4GB)

  • Gapped alignment for both SE and PE reads

  • Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing.

  • Non-unique reads are placed randomly with a mapping quality of 0; all hits can be output in a concise format.

  • The default configuration works for most typical input.

  • Automatically adjust parameters based on read lengths and error rates.

Install BWA software

Prepare Reference Genome Index

  • Obtain a file that contains the reference genome of your organism.

  • Download from UCSC Genome Browser
    http://hgdownload.soe.ucsc.edu/downloads.html

  • [Note: to align the ch19 tutorial data, you will need to download the mouse mm9 genome, or just mouse chromosome 19]

  • Unzip files for individual Ref Geneome chromosomes and join them together:

    tar zvfx chromFa.tar.gz

    cat *.fa > ref.fa
  • Index the reference genome :

    bwa index -a bwtsw ref.fa

Align your FASTQ file

  • Your data should be in FASTQ format

  • Align to the index:

    bwa aln ref.fa data.fastq.gz > data.sai
  • Export the alignment in SAM format

    bwa samse ref.fa data.sai data.fastq.gz > data.aln.sam
  • For paired end data, align two FASTQ files to index, then use sampe to export one SAM file

    bwa sampe ref.fa read1.sai read2.fq.gz > pe_aln.sam

Alignment Output Format

  • SAM format is a human readable text file

  • Shows each read, quality string for bases, alignment position on the genome, alignment quality score, and any mismatches or indels of the read vs reference
    (using CIGAR code).

  • All current NGS aligners now produce output in the BAM format, a compressed binary file, indexed by genome position.

SAM format: aligned to reference genome

SAMTools/Picard
http://samtools.sourceforge.net/

  • SAMTools is a simple toolkit to transform SAM to BAM, merge, sort, index

  • Can also calculate statistics (mean quality, depth of coverage, etc.)

  • filter duplicate reads

  • create multiple alignments of all reads over a genomic interval

  • call variants

SAMTools commands

  • Download and install SAMTools

    http://sourceforge.net/projects/samtools/files/samtools/

  • Make the BAM file:

    samtools view –bt ref.fa –o data.bam data.sam
  • Sort the BAM file:

    samtools sort data.bam data.sorted.bam
  • Index the BAM file:

    samtools index data.sorted.bam data.st_index.bam

Visualization

  • BAM file format also simplifies visualization of NGS data.

  • Index allows viewer to quickly find and load only read data for a specific genomic interval.

  • Integrative Genomics Viewer (IGV) from Broad Institute is our current favorite.

  • Use the Java Webstart to download and run IGV (use the 1.2 GB version): http://www.broadinstitute.org/igv/

    Note: if you are using a remote server (your own or Cloud), then you will need to download the BAM and .bai files from the server to your desktop/laptop in
    order to visualize.
  • IGV is easy to learn to use.

  • It has a nice Users Guide:

    http://www.broadinstitute.org/software/igv/UserGuide
  • •Very briefly, load the BAM file and choose the genome (mouse mm9 for the tutorial data)