Alignment for NGS: Tutorial
NGS Sequencing Informatics
Stuart Brown, Stratos Efstathiadis
NYU Center for Health Informatics & Bioinformatics
Alignment
-
Most NGS experiments involve sequence reads collected from known genomes: gene expression, variants, ChIP-seq, etc.
-
First step in data analysis is to align reads to the reference genome.
-
This is computationally demanding due to the huge volume of NGS data and the large size of the reference genome (human = 3.2 Gb)
Tutorial Data
-
We can use a small data set for this alignment tutorial that was originally created by the Galaxy team for their ChIP-seq tutorial.
-
It contains a small number of reads that map just to one part of Chromosome 19 on the Mouse genome.
https://main.g2.bx.psu.edu/u/james/p/exercise-chip-seq
-
The small size of this data set will make all of the compute operations go more quickly. A real data set will take MUCH LONGER at every step, and use more
storage space.
Unix
-
For this tutorial, we will assume that you have access to a computer with a Unix or Linux operating system installed plus the standard software compilers
and libraries needed to install and run software.
-
This can be done on any Macintosh computer, your own Unix server, or a “cloud” server such as the Amazon Elastic Computer Cloud (EC2)
-
See our Unix and HPC tutorials for more details on setting up your system and basic Unix commands
NGS Alignment
-
Pre-NGS alignment algorithms such as Smith-Waterman and BLAST are both too slow and too sensitive (able
to align sequences with many mismatches)
-
NGS reads should be near-perfect matches to the reference genome.
-
Need to process hundreds of millions of reads in a few hours.
- time is critical because alignment is used for QC of NGS experiments, lab may be waiting
before running next sample
There are many NGS aligners
-
Bfast
-
BioScope
-
Bowtie
-
BWA
-
CLC bio
-
CloudBurst
-
Eland/Eland2
-
GenomeMapper
-
GnuMap
-
Karma
-
MAQ
-
MOM
-
Mosaik
-
MrFAST/MrsFAST
-
NovoAlign
-
PASS
-
PerM
-
RazerS
-
RMAP
-
SSAHA2
-
Segemehl
-
SeqMap
-
SHRiMP
-
Slider/SliderII
-
SOAP/SOAP2
-
Srprism
-
Stampy
-
vmatch
-
ZOOM
Alignment Software
-
Illumina provides the ELAND alignment tool
-
fast, max of 2 mismatches in 32 base ‘seed’
- uses a hash index of the NGS reads (query data), which uses more RAM (and runs slower) as machines produce more data
-
constantly improving its handling of indels
-
BWA
and Bowtie are popular open source tools based on the Burrows Wheeler Transformation
- fast
-
only works well on sequences with <3% difference from the reference genome
-
better gapped alignments (indels) than ELAND
-
provides a “mapping quality” score for the alignment of each read
Burrows Wheeler Transformation
-
Invented by David Wheeler in 1983 (Bell Labs).
A Block Sorting Lossless Data Compression Algorithm
. Systems Research Center Technical Report No 124. Palo Alto, CA: Digital Equipment Corporation, Burrows M, Wheeler DJ. 1994
-
Originally developed for compressing large files (bzip2)
-
Lossless, fully reversible
-
Align reads to a BW transformed suffix array index of the reference genome, which enables fast exact matching
-
Short-read algorithm: alters the read sequence to match the reference exactly.
-
Results in great gains in speed, efficiency, and reduces RAM requirement vs. other types of indexes
BWA Features
-
Compresses the genome as a BWT index
-
Identical sequences (genome repeats) only occur once in the index, reads that match the genome in multiple locations get a mapping quality of 0
-
Fast and moderate memory footprint (<4GB)
-
Gapped alignment for both SE and PE reads
-
Effective pairing to achieve high alignment accuracy; suboptimal hits considered in pairing.
-
Non-unique reads are placed randomly with a mapping quality of 0; all hits can be output in a concise format.
-
The default configuration works for most typical input.
-
Automatically adjust parameters based on read lengths and error rates.
Install BWA software
Prepare Reference Genome Index
-
Obtain a file that contains the reference genome of your organism.
-
Download from UCSC Genome Browser
http://hgdownload.soe.ucsc.edu/downloads.html
-
[Note: to align the ch19 tutorial data, you will need to download the mouse mm9 genome, or just mouse chromosome 19]
-
Unzip files for individual Ref Geneome chromosomes and join them together:
-
tar zvfx chromFa.tar.gz
-
cat *.fa > ref.fa
-
Index the reference genome :
-
bwa index -a bwtsw ref.fa
Align your FASTQ file
-
Your data should be in FASTQ format
-
Align to the index:
-
bwa aln ref.fa data.fastq.gz > data.sai
-
Export the alignment in SAM format
-
bwa samse ref.fa data.sai data.fastq.gz > data.aln.sam
-
For paired end data, align two FASTQ files to index, then use sampe to export one SAM file
- bwa sampe ref.fa read1.sai read2.fq.gz > pe_aln.sam
Alignment Output Format
-
SAM format is a human readable text file
-
Shows each read, quality string for bases, alignment position on the genome, alignment quality score, and any mismatches or indels of the read vs reference
(using CIGAR code).
-
All current NGS aligners now produce output in the BAM format, a compressed binary file, indexed by genome position.
SAM format: aligned to reference genome
SAMTools/Picard
http://samtools.sourceforge.net/
-
SAMTools is a simple toolkit to transform SAM to BAM, merge, sort, index
-
Can also calculate statistics (mean quality, depth of coverage, etc.)
-
filter duplicate reads
-
create multiple alignments of all reads over a genomic interval
-
call variants
SAMTools commands
Visualization
-
BAM file format also simplifies visualization of NGS data.
-
Index allows viewer to quickly find and load only read data for a specific genomic interval.
-
Integrative Genomics Viewer (IGV) from Broad Institute is our current favorite.
-
Use the Java Webstart to download and run IGV (use the 1.2 GB version): http://www.broadinstitute.org/igv/
-
Note: if you are using a remote server (your own or Cloud), then you will need to download the BAM and .bai files from the server to your desktop/laptop in
order to visualize.
-
IGV is easy to learn to use.
-
It has a nice Users Guide:
-
http://www.broadinstitute.org/software/igv/UserGuide
-
•Very briefly, load the BAM file and choose the genome (mouse mm9 for the tutorial data)
Stuart Brown’s Book

: