Align paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)

Introduction

Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) are stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.

Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).

align short paired-reads to the reference genome

recreate the reference genome dictionary

The HiSeq_UCSC_hg19.dict file somehow got corrupted before cloning the laptops. We will re-create it now using PICARD

# select the reference genome
genome=HiSeq_UCSC_hg19.fa
 
java -jar $PICARD/picard.jar CreateSequenceDictionary \
	R=ref/${genome} \
	O=HiSeq_UCSC_hg19.dict \
	2>CreateSequenceDictionary.err
 
# compare the two versions of the dictionary
cat ref/HiSeq_UCSC_hg19.dict
cat HiSeq_UCSC_hg19.dict
 
# if this worked, overwrite the original file with the new copy with
mv HiSeq_UCSC_hg19.dict ref/

With this new dictionary file, we should hopefully take this exercise back on track.

prepare the reference genome for BWA alignment

BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands. This step is required but lengthy and will not be repeated today but is provided for info for those who wish to create their ons hash after today's session.

list of bwa commands

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>

Usage: bwa <command> [options]

Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries

shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.

Read the create a BWA index section for more info and command examples.

note on adding missing @PG line to the BWA output for traceability

We describe here another software suite that will be used after mapping to improve BAM content (see scripts below for aln and mem mapping)

Although not required, this will create a better BAM header with information on how the BAM data was obtained. We use here polishBam (online help:^[1]) from the software suite BamUtil developped in the Abecasis Group ^[2]. Other bamUtil utilities will prove useful in your pipelines (read the doc ;-)).

list of bamUtil (bam) commands

Set of tools for operating on SAM/BAM files.
Version: 1.0.11; Built: Thu Apr  3 16:14:22 CEST 2014 by bits
 
Tools to Rewrite SAM/BAM Files: 
 convert - Convert SAM/BAM to SAM/BAM
 writeRegion - Write a file with reads in the specified region and/or have the specified read name
 splitChromosome - Split BAM by Chromosome
 splitBam - Split a BAM file into multiple BAM files based on ReadGroup
 findCigars - Output just the reads that contain any of the specified CIGAR operations.
 
Tools to Modify & write SAM/BAM Files: 
 clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName
 filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the qual
ity of mismatches is too high
 revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
 squeeze -  reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the re
ference, binning quality scores, and replacing readNames with unique integers
 trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!'
 mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary
 polishBam - adds/updates header lines & adds the RG tag to each record
 dedup - Mark Duplicates
 recab - Recalibrate
 
Informational Tools
 validate - Validate a SAM/BAM File
 diff - Diff 2 coordinate sorted SAM/BAM files.
 stats - Stats a SAM/BAM File
 gapInfo - Print information on the gap between read pairs in a SAM/BAM File.
 
Tools to Print Information In Readable Format
 dumpHeader - Print SAM/BAM Header
 dumpRefInfo - Print SAM/BAM Reference Name Information
 dumpIndex - Print BAM Index File in English
 readReference - Print the reference string for the specified region
 explainFlags - Describe flags
 
Additional Tools
 bam2FastQ - Convert the specified BAM file to fastQs.
 
Dummy/Example Tools
 readIndexedBam - Read Indexed BAM By Reference and write it from reference id -1 to maxRefId
 
Usage: 
        bam <tool> [<tool arguments>]
The usage for each tool is described by specifying the tool with no arguments.

list of polishBam parameters

Version: 1.0.11; Built: Thu Apr  3 16:14:22 CEST 2014 by bits
 
 polishBam - adds/updates header lines & adds the RG tag to each record
Usage: polishBam (options) --in <inBamFile> --out <outBamFile>
 
Required parameters : 
-i/--in : input BAM file
-o/--out : output BAM file
Optional parameters :
-v : turn on verbose mode
-l/--log : writes logfile with specified name.
--HD : add @HD header line
--RG : add @RG header line
--PG : add @PG header line
-f/--fasta : fasta reference file to compute MD5sums and update SQ tags
--AS : AS tag for genome assembly identifier
--UR : UR tag for @SQ tag (if different from --fasta)
--SP : SP tag for @SQ tag
--checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option

align the reads in pairs to the reference genome

This can now be done with the 7.5 million chr21 read-pairs and BWA in two main ways detailed below. These instrumental steps are performed by a bash script where all necessary information and parameters are defined and can be edited for reuse.

Because our laptops have 'only' 6GB RAM, we can only use in the next part 2 threads for BWA aln and 1 thread for BWA mem respectively (BWA sampe uses 1 thread by default)

align using the bwa aln algorithm

The following script seems long but it mainly prepares a relatively short command and adds important information to be able to trace back the obtained data. This kind of detailed annotation will make a big difference few month after running the analysis and when referees will ask you what arguments were used when mapping was done. It also allows comparing different runs with knowledge of the different applied steps.

bwa aln specific parameters

Usage: bwa aln [options] <prefix> <in.fq>

Options: -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
-o INT maximum number or fraction of gap opens [1]
-e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
-i INT do not put an indel within INT bp towards the ends [5]
-d INT maximum occurrences for extending a long deletion [10]
-l INT seed length [32]
-k INT maximum differences in the seed [2]
-m INT maximum entries in the queue [2000000]
-t INT number of threads [1]
-M INT mismatch penalty [3]
-O INT gap open penalty [11]
-E INT gap extension penalty [4]
-R INT stop searching when there are >INT equally best hits [30]
-q INT quality threshold for read trimming down to 35bp [0]
-f FILE file to write output to instead of stdout
-B INT length of barcode
-L log-scaled gap penalty for long deletions
-N non-iterative mode: search for all n-difference hits (slooow)
-I the input is in the Illumina 1.3+ FASTQ-like format
-b the input read file is in the BAM format
-0 use single-end reads only (effective with -b)
-1 use the 1st read in a pair (effective with -b)
-2 use the 2nd read in a pair (effective with -b)
-Y filter Casava-filtered sequences

with bwa sampe: the '-r' parameter allows specifying the '@RG' line while with the new bwa mem command, '-r' is replaced by '-R'

#! /usr/bin/env bash
## script: 'mapping_bwa-aln_sampe.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-08
 
# required: 
# bwa version: 0.7.x+
# Picard Version: 1.x+
# bamutils Version: 1.0.11+
 
## define global variables
refgen=ref/HiSeq_UCSC_hg19.fa
 
## select one of the following blocks and comment the other
## full trainer data 
# infolder=Final_Results/reads_results
# f=shuffled_PE_NA18507_GAIIx_100_chr21
# outfolder=hg19_bwa-mapping/full_chromosome
# pfx=all
 
## training read sample
infolder=Final_Results/reads_results
f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21
pfx=sample
 
# create new folder
outfolder=hg19_bwa-mapping
mkdir -p ${outfolder}
 
# common
fq1=${infolder}/${f}_2_1.fq.gz
fq2=${infolder}/${f}_2_2.fq.gz
 
# using 'nthr' processors in parallel -  each thread needs little less than 3Gb
nthr=2
 
# create a valid '@RG' line required by many downstream tools (GATK) and not provided by BWA natively
rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.aln'
 
echo "# mapping ${f} reads from the shuffled BAM"
 
echo "# mapping reads with *bwa aln*"
cat /dev/null > ${outfolder}/${pfx}_bwa-aln.err
 
echo "# mapping left mate reads"
# note '>>' io > to 'add' to the existing error log
bwa1=${outfolder}/${pfx}_data.R1.bwa
bwa aln -t ${nthr} ${refgen} ${fq1} >${bwa1} \
	2>>${outfolder}/${pfx}_bwa-aln.err
 
echo "# mapping right mate reads"
 
bwa2=${outfolder}/${pfx}_data.R2.bwa
bwa aln -t ${nthr} ${refgen} ${fq2} >${bwa2} \
	2>>${outfolder}/${pfx}_bwa-aln.err
 
echo "# assembling paired results"
 
# the merging command **sampe** is run on a single processor and IS SLOW!!
# we add during this command the '@RG' line defined in ${rgstring}
 
## default numeric settings are left unchanged as:
#  -a INT   maximum insert size [500] <== !! large insert libraries
#  -o INT   maximum occurrences for one end [100000]
#  -n INT   maximum hits to output for paired reads [3]
#  -N INT   maximum hits to output for discordant pairs [10]
#  -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
 
bwape=${f}_aln-pe
 
cmd="bwa sampe -r \"${rgstring}\" \
	${refgen} \
	${bwa1} ${bwa2} \
	${fq1} ${fq2} > ${outfolder}/${bwape}.sam"
 
# execute the command
eval ${cmd} 2>${outfolder}/${pfx}_bwa-sampe.err
 
########################### post process results ##############
#### add @PG group to results with **bamUtil 'bam polishBam' **
# important formatting issues here
# $'\t' add a 'tab' character
# $'\'' add single quotes around ${cmd} to avoid it being interpreted
# ${cmd/\t/\ } replaces <tab> characters within the $cmd string with <space> to keep it in one nice line
# finally, $pgstring is called quoted to be replaced by its building variables
 
# we first detect the current version of BWA to store it
bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*')
 
# we then create a '@PG' line to store our action and include the 'BWA version number' AND the 'full command'
pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd/\t/\ }$'\''
 
# we will clean up the original SAM upon success to save space
# add a program '@PG' line recording the applied $cmd
bam polishBam --PG "${pgstring}" \
	-i ${outfolder}/${bwape}.sam \
	-o ${outfolder}/${bwape}_pg.sam \
	-l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam
 
# at this point the two .bwa files and the corrected .sam file might also be deleted
 
##### convert to sorted bam, index & cleanup
java -Xmx4g -jar $PICARD/picard.jar SortSam \
	I=${outfolder}/${bwape}_pg.sam \
	O=${outfolder}/${bwape}.bam \
	SO=coordinate \
	CREATE_INDEX=true \
	VALIDATION_STRINGENCY=LENIENT
 
###### get basic stats from the resulting bam file
samtools flagstat ${outfolder}/${bwape}.bam \
	>${outfolder}/${bwape}_flagstat.txt
 
echo "# finished mapping ${f} reads with BWA aln + BWA sampe"

bwa aln mapping results on the 10% sample

cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_flagstat.txt
 
1498242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1478639 + 0 mapped (98.69%:-nan%)
1498242 + 0 paired in sequencing
749121 + 0 read1
749121 + 0 read2
1471564 + 0 properly paired (98.22%:-nan%)
1474852 + 0 with itself and mate mapped
3787 + 0 singletons (0.25%:-nan%)
2466 + 0 with mate mapped to a different chr
1919 + 0 with mate mapped to a different chr (mapQ>=5)

align using the bwa mem algorithm

Alternatively run the more recent and recommended bwa mem command directly from the paired fastQ files.

bwa mem specific parameters

Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-c INT skip seeds with more than INT occurrences [10000]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
-A INT score for a sequence match [1]
-B INT penalty for a mismatch [4]
-O INT gap open penalty [6]
-E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
-L INT penalty for clipping [5]
-U INT penalty for an unpaired read pair [17]

Input/output options:

-p first query file consists of interleaved paired-end sequences
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-M mark shorter split hits as secondary (for Picard/GATK compatibility)

Note: Please read the man page for detailed description of the command line and options.

#! /usr/bin/env bash
## script: 'mapping_bwa-mem.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-08
 
# required: 
# bwa version: 0.7.5a-r405
# Picard Version: 1.112+
# bamutils Version: 1.0.11
 
## define global variables
refgen=ref/HiSeq_UCSC_hg19.fa
 
## select one of the following blocks and comment the other
## full trainer data
# infolder=Final_Results/reads_results
# f=shuffled_PE_NA18507_GAIIx_100_chr21
# outfolder=hg19_bwa-mapping/full_chromosome
# pfx=all
 
## your picard sample
infolder=Final_Results/reads_results
f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21
pfx=sample
 
# create new folder
outfolder=hg19_bwa-mapping
mkdir -p ${outfolder}
 
# common
fq1=${infolder}/${f}_2_1.fq.gz
fq2=${infolder}/${f}_2_2.fq.gz
 
# marking secondary hits with -M to comply with Picard
# using 'nthr' processors in parallel (again limited by our RAM!) 
# mem is more demanding and needs more than 3Gb per thread
nthr=1
 
rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem'
 
echo
echo "# mapping interleaved reads with **bwa mem**"
 
## default numeric settings are left unchanged as:
#  -k INT     minimum seed length [19]
#  -w INT     band width for banded alignment [100]
#  -d INT     off-diagonal X-dropoff [100]
#  -r FLOAT   look for internal seeds inside a seed longer than {-k}
#             * FLOAT [1.5]
#  -c INT     skip seeds with more than INT occurrences [10000]
#  -A INT     score for a sequence match [1]
#  -B INT     penalty for a mismatch [4]
#  -O INT     gap open penalty [6]
#  -E INT     gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
#  -L INT     penalty for clipping [5]
#  -U INT     penalty for an unpaired read pair [17]
#  -T INT     minimum score to output [30]
# '-p': first query file consists of interleaved paired-end sequences
#
# '-M': mark shorter split hits as secondary (for Picard/GATK compatibility)
 
bwape=${f}_mem-pe
 
# store the full command line to include it in the next part
cmd="bwa mem -R \"${rgstring}\" \
	-M \
	-t ${nthr} \
	${refgen} \
	${fq1} ${fq2} > ${outfolder}/${bwape}.sam"
 
# display the full command with expanded variables
echo "# the command will be:"
echo ${cmd}
 
# execute the command
eval ${cmd} 2>${outfolder}/${pfx}_bwa_mem.err
 
########################### post process results ##############
#### add @PG group to results with **bamUtil 'bam polishBam' **
# important formatting issues here
# $'\t' add a 'tab' character
# $'\'' add single quotes around ${cmd} to avoid it being interpreted
# ${cmd/\t/\ } replaces <tab> characters within the $cmd string with <space> to keep it in one nice line
# finally, $pgstring is called quoted to be replaced by its building variables
 
bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*')
pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd/\t/\ }
 
echo "# now adding a PG field to the data"
 
# clean up initial file after success to save space
bam polishBam --PG "${pgstring}" \
	-i ${outfolder}/${bwape}.sam \
	-o ${outfolder}/${bwape}_pg.sam \
	-l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam
 
##### convert to sorted bam, index & cleanup
java -Xmx4g -jar $PICARD/picard.jar SortSam \
	I=${outfolder}/${bwape}_pg.sam \
	O=${outfolder}/${bwape}.bam \
	SO=coordinate \
	CREATE_INDEX=true \
	VALIDATION_STRINGENCY=LENIENT
 
##### get basic stats from the resulting bam file
samtools flagstat ${outfolder}/${bwape}.bam \
	>${outfolder}/${bwape}_flagstat.txt
 
echo "# finished mapping ${f} reads with BWA mem"

bwa mem mapping results on the 10% sample

cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt
 
1500296 + 0 in total (QC-passed reads + QC-failed reads)
2054 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1498454 + 0 mapped (99.88% : N/A)
1498242 + 0 paired in sequencing
749121 + 0 read1
749121 + 0 read2
1475988 + 0 properly paired (98.51% : N/A)
1494678 + 0 with itself and mate mapped
1722 + 0 singletons (0.11% : N/A)
10614 + 0 with mate mapped to a different chr
7042 + 0 with mate mapped to a different chr (mapQ>=5)

The mapping is much faster with the multithreaded bwa mem algorithm than with the single core sampe step of the classical bwa pipeline.

what did we learn here?

using '>>' instead of '>' allows 'appending' content to an existing file without loosing what was already written
adding '2> logfile....txt' after a command allows storing all error messages to a file and catch execution errors of very long execution output
storing a complex command into a variable allows:
- echo the same string to the terminal as a comment at run-time (not done here but achieved with eg: echo "# "${cmd})
- recycle the command text for other uses (eg log the full command in the BAM file as a PG-tag)
- execute the command through the use of a bash 'eval' call
running 'bam polishBam' on a BAM/SAM file allows annotating it to keep track of important information (eg RG tag from the ${cmd} string)

identify duplicate reads with Picard MarkDuplicates

The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated picard tool.

You can read more about read duplicates in our [Q&A section about read duplicates]

We added 'LENIENT' for residual error messages about "MAPQ should be 0 for unmapped read".

the next command will leave duplicates in but 'mark' them, while adding a line with REMOVE_DUPLICATES=TRUE would physically remove the duplicates from the output

#! /usr/bin/env bash
## script: 'picard_markdup.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-08
 
# required: 
# Picard Version: 1.112+
 
## full chromosome samples
# infolder="hg19_bwa-mapping/full_chromosome"
 
## 10 pct samples
# process all bam files in 'hg19_bwa-mapping'
infolder="hg19_bwa-mapping"
 
for infile in ${infolder}/*.bam; do
 
# the folder is part of ${infile} and does not need to be specified
# create outfile base path + name
outname=${infile%%-pe.bam}
 
echo "# marking duplicates in ${infile}"
 
java -Xmx4g -jar $PICARD/picard.jar MarkDuplicates \
        I=${infile} \
        O=${outname}_mdup.bam \
        ASSUME_SORTED=TRUE \
        REMOVE_DUPLICATES=FALSE \
        CREATE_INDEX=TRUE \
        M=${outname}_duplicate_metrics.txt \
        VALIDATION_STRINGENCY=LENIENT \
        2>${outname}_MarkDuplicates.err
echo
 
echo "# results are"
head -8 ${outname}_duplicate_metrics.txt| tail -2 | transpose -t | column -t
echo
 
done

MarkDuplicate results

# marking duplicates in hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe.bam

# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 3787
READ_PAIRS_EXAMINED 737426
UNMAPPED_READS 19603
UNPAIRED_READ_DUPLICATES 146
READ_PAIR_DUPLICATES 29
READ_PAIR_OPTICAL_DUPLICATES 10
PERCENT_DUPLICATION 0.000138
ESTIMATED_LIBRARY_SIZE 14309816538

# marking duplicates in hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam

# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 1722
READ_PAIRS_EXAMINED 747339
UNMAPPED_READS 1842
UNPAIRED_READ_DUPLICATES 101
READ_PAIR_DUPLICATES 61
READ_PAIR_OPTICAL_DUPLICATES 12
PERCENT_DUPLICATION 0.000149
ESTIMATED_LIBRARY_SIZE 5698706455

download exercise files

Download exercise files here

Use the right application to open the files present in ex3-files

References:

[1] ttp://genome.sph.umich.edu/wiki/BamUtil:_polishBam

[2] ttp://genome.sph.umich.edu/wiki/Main_Page

[1]

[2]

NGS Exercise.3

Contents

Introduction

align short paired-reads to the reference genome

recreate the reference genome dictionary

prepare the reference genome for BWA alignment

note on adding missing @PG line to the BWA output for traceability

align the reads in pairs to the reference genome

align using the bwa aln algorithm

align using the bwa mem algorithm

what did we learn here?

identify duplicate reads with Picard MarkDuplicates

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox