[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.1 | NGS-Var Exercise.3 ]

Align paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)

1 Introduction
2 prepare the reference genome for BWA alignment
- 2.1 note on adding missing @PG line to the BWA output for traceability
3 align the reads in pairs to the reference genome using the bwa mem algorithm
4 identify duplicate reads with Picard MarkDuplicates
5 download exercise files

Introduction

Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) are stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.

Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).

prepare the reference genome for BWA alignment

BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using the BWA commands **bwa index**. This step is required but lengthy and will not be repeated today. Please note that **bwa index** requires 5 to 6GB RAM to operate on the human 3G-base hg19 reference genome ^[1]. Recent versions of **bwa mem**, used to align the reads, require 3.5GB RAM just to load the hg19 genome hash but are happier with 6GB or more. If you plan to install a computer for training or production, please consider having at least 8GB RAM or you will run into trouble.

Please refer to NGS_variant_analysis:_laptop_configuration_and_files covering laptop installation steps for more information

list of bwa commands

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>

Usage: bwa <command> [options]

Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries

shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ

Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.

note on adding missing @PG line to the BWA output for traceability

We introduce here another software suite that will be used after mapping to improve BAM content (used in the bwa mem script)

Although not required, this will create a better BAM header with information on how the BAM data was obtained. We use here polishBam (online help:^[2]) from the software suite BamUtil developped in the Abecasis Group ^[3]. Other bamUtil utilities will prove useful in your pipelines (read the doc ;-)).

list of bamUtil (bam) commands

Set of tools for operating on SAM/BAM files.
Version: 1.0.14; Built: Wed Feb 17 10:23:06 CET 2016 by splaisan
 
Tools to Rewrite SAM/BAM Files: 
 convert - Convert SAM/BAM to SAM/BAM
 writeRegion - Write a file with reads in the specified region and/or have the specified read name
 splitChromosome - Split BAM by Chromosome
 splitBam - Split a BAM file into multiple BAM files based on ReadGroup
 findCigars - Output just the reads that contain any of the specified CIGAR operations.
 
Tools to Modify & write SAM/BAM Files: 
 clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName
 filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high
 revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags
 squeeze -  reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers
 trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' or softclipping the ends (resulting file will not be sorted)
 mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary
 polishBam - adds/updates header lines & adds the RG tag to each record
 dedup - Mark Duplicates
 dedup_LowMem - Mark Duplicates using only a little memory
 recab - Recalibrate
 
Informational Tools
 validate - Validate a SAM/BAM File
 diff - Diff 2 coordinate sorted SAM/BAM files.
 stats - Stats a SAM/BAM File
 gapInfo - Print information on the gap between read pairs in a SAM/BAM File.
 
Tools to Print Information In Readable Format
 dumpHeader - Print SAM/BAM Header
 dumpRefInfo - Print SAM/BAM Reference Name Information
 dumpIndex - Print BAM Index File in English
 readReference - Print the reference string for the specified region
 explainFlags - Describe flags
 
Additional Tools
 bam2FastQ - Convert the specified BAM file to fastQs.
 
Dummy/Example Tools
 readIndexedBam - Read Indexed BAM By Reference and write it from reference id -1 to maxRefId
 
Usage: 
        bam <tool> [<tool arguments>]
The usage for each tool is described by specifying the tool with no arguments.

list of polishBam parameters

Version: 1.0.14; Built: Wed Feb 17 10:23:06 CET 2016 by splaisan
 
 polishBam - adds/updates header lines & adds the RG tag to each record
Usage: polishBam (options) --in <inBamFile> --out <outBamFile>
 
Required parameters : 
-i/--in : input BAM file
-o/--out : output BAM file
Optional parameters :
-v : turn on verbose mode
-l/--log : writes logfile with specified name.
--HD : add @HD header line
--RG : add @RG header line
--PG : add @PG header line
--CO : add @CO header line
-f/--fasta : fasta reference file to compute MD5sums and update SQ tags
--AS : AS tag for genome assembly identifier
--UR : UR tag for @SQ tag (if different from --fasta)
--SP : SP tag for @SQ tag
--checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option

align the reads in pairs to the reference genome using the bwa mem algorithm

This can now be done with the 7.5 million chr21 read-pairs and BWA as detailed below. These instrumental steps are performed by a bash script where all necessary information and parameters are defined and can be edited for reuse.

{{Tip|Because our laptops have 'only' 6GB RAM, we can only use in the next part 1 thread for BWA mem. Stronger computers could dedicate more resources to the job and make it end faster.

Run the more recent and recommended bwa mem command directly from the paired fastQ files.

bwa mem specific parameters

Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]

Algorithm options:

-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-c INT skip seeds with more than INT occurrences [10000]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
-A INT score for a sequence match [1]
-B INT penalty for a mismatch [4]
-O INT gap open penalty [6]
-E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
-L INT penalty for clipping [5]
-U INT penalty for an unpaired read pair [17]

Input/output options:

-p first query file consists of interleaved paired-end sequences
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-M mark shorter split hits as secondary (for Picard/GATK compatibility)

Note: Please read the man page for detailed description of the command line and options.

#! /usr/bin/env bash
## script: 'mapping_bwa-mem.sh'
## ©SP-BITS, 2013 v1.1
# last edit: 2016-02-18
 
# required: 
# bwa version: 0.7.12-r1039
# Picard Version: 2.1.0
# bamutils Version: 1.0.12b
 
## define global variables
refgen=ref/HiSeq_UCSC_hg19.fa
 
## select one of the folowing three blocks and comment the other two
infolder=reads
 
## full data
# f=shuffled_PE_NA18507_GAIIx_100_chr21
# pfx=all
 
## 10% sample
#f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21
#pfx=sample
 
## 1% sample
f=shuffled_1pc_PE_NA18507_GAIIx_100_chr21
pfx=sample
 
# create new folder
outfolder=hg19_bwa-mapping
mkdir -p ${outfolder}
 
# common
fq1=${infolder}/${f}_2_1.fq.gz
fq2=${infolder}/${f}_2_2.fq.gz
 
# marking secondary hits with -M to comply with Picard
# using 'nthr' processors in parallel (again limited by our RAM!) 
# mem is more demanding and needs more than 3Gb per thread
nthr=1
 
rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem'
 
echo
echo "# mapping paired reads with **bwa mem**"
 
## default numeric settings are left unchanged as:
#  -k INT     minimum seed length [19]
#  -w INT     band width for banded alignment [100]
#  -d INT     off-diagonal X-dropoff [100]
#  -r FLOAT   look for internal seeds inside a seed longer than {-k}
#             * FLOAT [1.5]
#  -c INT     skip seeds with more than INT occurrences [10000]
#  -A INT     score for a sequence match [1]
#  -B INT     penalty for a mismatch [4]
#  -O INT     gap open penalty [6]
#  -E INT     gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
#  -L INT     penalty for clipping [5]
#  -U INT     penalty for an unpaired read pair [17]
#  -T INT     minimum score to output [30]
# '-p': first query file consists of interleaved paired-end sequences
#
# '-M': mark shorter split hits as secondary (for Picard/GATK compatibility)
 
bwape=${f}_mem-pe
 
# store the full command line to include it in the next part
cmd="bwa mem -R \"${rgstring}\" \
	-M \
	-t ${nthr} \
	${refgen} \
	${fq1} ${fq2} > ${outfolder}/${bwape}.sam"
 
# execute the command
eval ${cmd} 2>${outfolder}/${pfx}_bwa_mem.err
 
########################### post process results ##############
#### add @PG group to results with **bamUtil 'bam polishBam' **
# important formatting issues here
# $'\t' add a 'tab' character
# $'\'' add single quotes around ${cmd} to avoid it being interpreted
# finally, $pgstring is called quoted to be replaced by its building variables
 
# we first detect the current version of BWA to store it
bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*')
 
# we then create a '@PG' line to store our action and include the 'BWA version number' AND the 'full command'
# clean extra spaces with sed and the POSIX character class [:space:]
# some explanation: [:space:] is functionally identical to [ \t\r\n\v\f] 
cmd=$(echo ${cmd} | sed -e "s/[[:space:]]\+/ /g")
 
pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd}$'\''
 
# clean up initial file after success to save space
bam polishBam --PG "${pgstring}" \
	-i ${outfolder}/${bwape}.sam \
	-o ${outfolder}/${bwape}_pg.sam \
	-l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam
 
##### convert to sorted bam, index & cleanup
java -jar $PICARD/picard.jar SortSam \
	I=${outfolder}/${bwape}_pg.sam \
	O=${outfolder}/${bwape}.bam \
	SO=coordinate \
	CREATE_INDEX=true \
	VALIDATION_STRINGENCY=LENIENT && rm ${outfolder}/${bwape}_pg.sam
 
##### get basic stats from the resulting bam file
samtools flagstat ${outfolder}/${bwape}.bam \
	>${outfolder}/${bwape}_flagstat.txt
 
echo "# finished mapping ${f} reads with BWA mem"

bwa mem mapping results on the 10% sample and full data

cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt
 
1500296 + 0 in total (QC-passed reads + QC-failed reads)
2054 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
1498454 + 0 mapped (99.88% : N/A)
1498242 + 0 paired in sequencing
749121 + 0 read1
749121 + 0 read2
1475990 + 0 properly paired (98.51% : N/A)
1494678 + 0 with itself and mate mapped
1722 + 0 singletons (0.11% : N/A)
10614 + 0 with mate mapped to a different chr
7042 + 0 with mate mapped to a different chr (mapQ>=5)
 
## as a comparison, below are the results obtained with the full data
 
cat  shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt
15010573 + 0 in total (QC-passed reads + QC-failed reads)
20379 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
14992193 + 0 mapped (99.88% : N/A)
14990194 + 0 paired in sequencing
7495097 + 0 read1
7495097 + 0 read2
14769700 + 0 properly paired (98.53% : N/A)
14954678 + 0 with itself and mate mapped
17136 + 0 singletons (0.11% : N/A)
104906 + 0 with mate mapped to a different chr
70173 + 0 with mate mapped to a different chr (mapQ>=5)

what did we learn here?

using '>>' instead of '>' allows 'appending' content to an existing file without loosing what was already written
adding '2> logfile....txt' after a command allows storing all error messages to a file and catch execution errors of very long execution output
storing a complex command into a variable allows:
- echo the same string to the terminal as a comment at run-time (not done here but achieved with eg: echo "# "${cmd})
- recycle the command text for other uses (eg log the full command in the BAM file as a PG-tag)
- execute the command through the use of a bash 'eval' call
running 'bam polishBam' on a BAM/SAM file allows annotating it to keep track of important information (eg RG tag from the ${cmd} string)

identify duplicate reads with Picard MarkDuplicates

The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated picard tool.

You can read more about read duplicates in our [Q&A section about read duplicates]

We added 'LENIENT' for residual error messages about "MAPQ should be 0 for unmapped read".

the next command will leave duplicates in but 'mark' them, while adding a line with REMOVE_DUPLICATES=TRUE would physically remove the duplicates from the output

##! /usr/bin/env bash
## script: 'picard_markdup.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2016-02-19
 
# required: 
# Picard Version: 2.1.0
 
# 10% sample
infile="hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam"
 
# full data
# infile="hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam"
 
echo "# marking duplicates in ${infile}"
 
# we extract part of the infile and replace the end of it by a new text and extension
# ${infile%%-pe.bam} removes the text '-pe.bam' from the end of 'infile'
 
## the reads are sorted by coordinate in the bam data
## this is required to run MarkDuplicates
 
java -jar $PICARD/picard.jar MarkDuplicates \
        I=${infile} \
        O=${infile%%-pe.bam}_mdup.bam \
        ASSUME_SORTED=TRUE \
        REMOVE_DUPLICATES=FALSE \
        CREATE_INDEX=TRUE \
        M=${infile%%-pe.bam}_duplicate_metrics.txt \
        VALIDATION_STRINGENCY=LENIENT \
        2>${infile%%-pe.bam}_MarkDuplicates.err
echo
 
echo "# results are"
head -8 ${infile%%-pe.bam}_duplicate_metrics.txt| tail -2 | transpose -t | column -t

MarkDuplicate results

# marking duplicates in hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam

# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 1722
READ_PAIRS_EXAMINED 747339
UNMAPPED_READS 1842
UNPAIRED_READ_DUPLICATES 101
READ_PAIR_DUPLICATES 61
READ_PAIR_OPTICAL_DUPLICATES 12
PERCENT_DUPLICATION 0.000149
ESTIMATED_LIBRARY_SIZE 5698706455

# as a comparison, marking duplicates in hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam

# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 17136
READ_PAIRS_EXAMINED 7477339
UNMAPPED_READS 18380
UNPAIRED_READ_DUPLICATES 4679
READ_PAIR_DUPLICATES 5450
READ_PAIR_OPTICAL_DUPLICATES 1310
PERCENT_DUPLICATION 0.001041
ESTIMATED_LIBRARY_SIZE 6747629693

download exercise files

Download exercise files here

Use the right application to open the files present in ex2-files

References:

[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.1 | NGS-Var Exercise.3 ]

[1] ttp://seqanswers.com/forums/showthread.php?p=102458

[2] ttp://genome.sph.umich.edu/wiki/BamUtil:_polishBam

[3] ttp://genome.sph.umich.edu/wiki/Main_Page

[1]

[2]

[3]

NGS-Var Exercise.2

Contents

Introduction

prepare the reference genome for BWA alignment

note on adding missing @PG line to the BWA output for traceability

align the reads in pairs to the reference genome using the bwa mem algorithm

identify duplicate reads with Picard MarkDuplicates

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox