NGS Exercise.3
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d | NGS Exercise.3e ]
# updated 2014 version
Align paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)
Contents
- 1 Introduction
- 2 align short paired-reads to the reference genome
- 2.1 recreate the reference genome dictionary
- 2.2 prepare the reference genome for BWA alignment
- 2.3 note on adding missing @PG line to the BWA output for traceability
- 2.4 align the reads in pairs to the reference genome
- 2.5 what did we learn here?
- 2.6 identify duplicate reads with Picard MarkDuplicates
- 3 download exercise files
Introduction
Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) are stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.
Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).
align short paired-reads to the reference genome
recreate the reference genome dictionary
The HiSeq_UCSC_hg19.dict file somehow got corrupted before cloning the laptops. We will re-create it now using PICARD
# select the reference genome genome=HiSeq_UCSC_hg19.fa java -jar $PICARD/picard.jar CreateSequenceDictionary \ R=ref/${genome} \ O=HiSeq_UCSC_hg19.dict \ 2>CreateSequenceDictionary.err # compare the two versions of the dictionary cat ref/HiSeq_UCSC_hg19.dict cat HiSeq_UCSC_hg19.dict # if this worked, overwrite the original file with the new copy with mv HiSeq_UCSC_hg19.dict ref/
With this new dictionary file, we should hopefully take this exercise back on track.
prepare the reference genome for BWA alignment
BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using one of BWA commands. This step is required but lengthy and will not be repeated today but is provided for info for those who wish to create their ons hash after today's session.
list of bwa commands
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
Read the create a BWA index section for more info and command examples.
note on adding missing @PG line to the BWA output for traceability
We describe here another software suite that will be used after mapping to improve BAM content (see scripts below for aln and mem mapping)
Although not required, this will create a better BAM header with information on how the BAM data was obtained. We use here polishBam (online help:[1]) from the software suite BamUtil developped in the Abecasis Group [2]. Other bamUtil utilities will prove useful in your pipelines (read the doc ;-)).
list of bamUtil (bam) commands
Set of tools for operating on SAM/BAM files. Version: 1.0.11; Built: Thu Apr 3 16:14:22 CEST 2014 by bits Tools to Rewrite SAM/BAM Files: convert - Convert SAM/BAM to SAM/BAM writeRegion - Write a file with reads in the specified region and/or have the specified read name splitChromosome - Split BAM by Chromosome splitBam - Split a BAM file into multiple BAM files based on ReadGroup findCigars - Output just the reads that contain any of the specified CIGAR operations. Tools to Modify & write SAM/BAM Files: clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the qual ity of mismatches is too high revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags squeeze - reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the re ference, binning quality scores, and replacing readNames with unique integers trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary polishBam - adds/updates header lines & adds the RG tag to each record dedup - Mark Duplicates recab - Recalibrate Informational Tools validate - Validate a SAM/BAM File diff - Diff 2 coordinate sorted SAM/BAM files. stats - Stats a SAM/BAM File gapInfo - Print information on the gap between read pairs in a SAM/BAM File. Tools to Print Information In Readable Format dumpHeader - Print SAM/BAM Header dumpRefInfo - Print SAM/BAM Reference Name Information dumpIndex - Print BAM Index File in English readReference - Print the reference string for the specified region explainFlags - Describe flags Additional Tools bam2FastQ - Convert the specified BAM file to fastQs. Dummy/Example Tools readIndexedBam - Read Indexed BAM By Reference and write it from reference id -1 to maxRefId Usage: bam <tool> [<tool arguments>] The usage for each tool is described by specifying the tool with no arguments.
list of polishBam parameters
Version: 1.0.11; Built: Thu Apr 3 16:14:22 CEST 2014 by bits polishBam - adds/updates header lines & adds the RG tag to each record Usage: polishBam (options) --in <inBamFile> --out <outBamFile> Required parameters : -i/--in : input BAM file -o/--out : output BAM file Optional parameters : -v : turn on verbose mode -l/--log : writes logfile with specified name. --HD : add @HD header line --RG : add @RG header line --PG : add @PG header line -f/--fasta : fasta reference file to compute MD5sums and update SQ tags --AS : AS tag for genome assembly identifier --UR : UR tag for @SQ tag (if different from --fasta) --SP : SP tag for @SQ tag --checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option
align the reads in pairs to the reference genome
This can now be done with the 7.5 million chr21 read-pairs and BWA in two main ways detailed below. These instrumental steps are performed by a bash script where all necessary information and parameters are defined and can be edited for reuse.
Because our laptops have 'only' 6GB RAM, we can only use in the next part 2 threads for BWA aln and 1 thread for BWA mem respectively (BWA sampe uses 1 thread by default)
align using the bwa aln algorithm
The following script seems long but it mainly prepares a relatively short command and adds important information to be able to trace back the obtained data. This kind of detailed annotation will make a big difference few month after running the analysis and when referees will ask you what arguments were used when mapping was done. It also allows comparing different runs with knowledge of the different applied steps.
bwa aln specific parameters
Options: -n NUM max #diff (int) or missing prob under 0.02 err rate (float) [0.04]
-o INT maximum number or fraction of gap opens [1]
-e INT maximum number of gap extensions, -1 for disabling long gaps [-1]
-i INT do not put an indel within INT bp towards the ends [5]
-d INT maximum occurrences for extending a long deletion [10]
-l INT seed length [32]
-k INT maximum differences in the seed [2]
-m INT maximum entries in the queue [2000000]
-t INT number of threads [1]
-M INT mismatch penalty [3]
-O INT gap open penalty [11]
-E INT gap extension penalty [4]
-R INT stop searching when there are >INT equally best hits [30]
-q INT quality threshold for read trimming down to 35bp [0]
-f FILE file to write output to instead of stdout
-B INT length of barcode
-L log-scaled gap penalty for long deletions
-N non-iterative mode: search for all n-difference hits (slooow)
-I the input is in the Illumina 1.3+ FASTQ-like format
-b the input read file is in the BAM format
-0 use single-end reads only (effective with -b)
-1 use the 1st read in a pair (effective with -b)
-2 use the 2nd read in a pair (effective with -b)
-Y filter Casava-filtered sequences
with bwa sampe: the '-r' parameter allows specifying the '@RG' line while with the new bwa mem command, '-r' is replaced by '-R'
#! /usr/bin/env bash ## script: 'mapping_bwa-aln_sampe.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-08 # required: # bwa version: 0.7.x+ # Picard Version: 1.x+ # bamutils Version: 1.0.11+ ## define global variables refgen=ref/HiSeq_UCSC_hg19.fa ## select one of the following blocks and comment the other ## full trainer data # infolder=Final_Results/reads_results # f=shuffled_PE_NA18507_GAIIx_100_chr21 # outfolder=hg19_bwa-mapping/full_chromosome # pfx=all ## training read sample infolder=Final_Results/reads_results f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21 pfx=sample # create new folder outfolder=hg19_bwa-mapping mkdir -p ${outfolder} # common fq1=${infolder}/${f}_2_1.fq.gz fq2=${infolder}/${f}_2_2.fq.gz # using 'nthr' processors in parallel - each thread needs little less than 3Gb nthr=2 # create a valid '@RG' line required by many downstream tools (GATK) and not provided by BWA natively rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.aln' echo "# mapping ${f} reads from the shuffled BAM" echo "# mapping reads with *bwa aln*" cat /dev/null > ${outfolder}/${pfx}_bwa-aln.err echo "# mapping left mate reads" # note '>>' io > to 'add' to the existing error log bwa1=${outfolder}/${pfx}_data.R1.bwa bwa aln -t ${nthr} ${refgen} ${fq1} >${bwa1} \ 2>>${outfolder}/${pfx}_bwa-aln.err echo "# mapping right mate reads" bwa2=${outfolder}/${pfx}_data.R2.bwa bwa aln -t ${nthr} ${refgen} ${fq2} >${bwa2} \ 2>>${outfolder}/${pfx}_bwa-aln.err echo "# assembling paired results" # the merging command **sampe** is run on a single processor and IS SLOW!! # we add during this command the '@RG' line defined in ${rgstring} ## default numeric settings are left unchanged as: # -a INT maximum insert size [500] <== !! large insert libraries # -o INT maximum occurrences for one end [100000] # -n INT maximum hits to output for paired reads [3] # -N INT maximum hits to output for discordant pairs [10] # -c FLOAT prior of chimeric rate (lower bound) [1.0e-05] bwape=${f}_aln-pe cmd="bwa sampe -r \"${rgstring}\" \ ${refgen} \ ${bwa1} ${bwa2} \ ${fq1} ${fq2} > ${outfolder}/${bwape}.sam" # execute the command eval ${cmd} 2>${outfolder}/${pfx}_bwa-sampe.err ########################### post process results ############## #### add @PG group to results with **bamUtil 'bam polishBam' ** # important formatting issues here # $'\t' add a 'tab' character # $'\'' add single quotes around ${cmd} to avoid it being interpreted # ${cmd/\t/\ } replaces <tab> characters within the $cmd string with <space> to keep it in one nice line # finally, $pgstring is called quoted to be replaced by its building variables # we first detect the current version of BWA to store it bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*') # we then create a '@PG' line to store our action and include the 'BWA version number' AND the 'full command' pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd/\t/\ }$'\'' # we will clean up the original SAM upon success to save space # add a program '@PG' line recording the applied $cmd bam polishBam --PG "${pgstring}" \ -i ${outfolder}/${bwape}.sam \ -o ${outfolder}/${bwape}_pg.sam \ -l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam # at this point the two .bwa files and the corrected .sam file might also be deleted ##### convert to sorted bam, index & cleanup java -Xmx4g -jar $PICARD/picard.jar SortSam \ I=${outfolder}/${bwape}_pg.sam \ O=${outfolder}/${bwape}.bam \ SO=coordinate \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=LENIENT ###### get basic stats from the resulting bam file samtools flagstat ${outfolder}/${bwape}.bam \ >${outfolder}/${bwape}_flagstat.txt echo "# finished mapping ${f} reads with BWA aln + BWA sampe"
bwa aln mapping results on the 10% sample
cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_flagstat.txt 1498242 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 1478639 + 0 mapped (98.69%:-nan%) 1498242 + 0 paired in sequencing 749121 + 0 read1 749121 + 0 read2 1471564 + 0 properly paired (98.22%:-nan%) 1474852 + 0 with itself and mate mapped 3787 + 0 singletons (0.25%:-nan%) 2466 + 0 with mate mapped to a different chr 1919 + 0 with mate mapped to a different chr (mapQ>=5)
align using the bwa mem algorithm
Alternatively run the more recent and recommended bwa mem command directly from the paired fastQ files.
bwa mem specific parameters
Algorithm options:
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-c INT skip seeds with more than INT occurrences [10000]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
-A INT score for a sequence match [1]
-B INT penalty for a mismatch [4]
-O INT gap open penalty [6]
-E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
-L INT penalty for clipping [5]
-U INT penalty for an unpaired read pair [17]
Input/output options:
-p first query file consists of interleaved paired-end sequences
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-M mark shorter split hits as secondary (for Picard/GATK compatibility)
Note: Please read the man page for detailed description of the command line and options.
#! /usr/bin/env bash ## script: 'mapping_bwa-mem.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-08 # required: # bwa version: 0.7.5a-r405 # Picard Version: 1.112+ # bamutils Version: 1.0.11 ## define global variables refgen=ref/HiSeq_UCSC_hg19.fa ## select one of the following blocks and comment the other ## full trainer data # infolder=Final_Results/reads_results # f=shuffled_PE_NA18507_GAIIx_100_chr21 # outfolder=hg19_bwa-mapping/full_chromosome # pfx=all ## your picard sample infolder=Final_Results/reads_results f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21 pfx=sample # create new folder outfolder=hg19_bwa-mapping mkdir -p ${outfolder} # common fq1=${infolder}/${f}_2_1.fq.gz fq2=${infolder}/${f}_2_2.fq.gz # marking secondary hits with -M to comply with Picard # using 'nthr' processors in parallel (again limited by our RAM!) # mem is more demanding and needs more than 3Gb per thread nthr=1 rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem' echo echo "# mapping interleaved reads with **bwa mem**" ## default numeric settings are left unchanged as: # -k INT minimum seed length [19] # -w INT band width for banded alignment [100] # -d INT off-diagonal X-dropoff [100] # -r FLOAT look for internal seeds inside a seed longer than {-k} # * FLOAT [1.5] # -c INT skip seeds with more than INT occurrences [10000] # -A INT score for a sequence match [1] # -B INT penalty for a mismatch [4] # -O INT gap open penalty [6] # -E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1] # -L INT penalty for clipping [5] # -U INT penalty for an unpaired read pair [17] # -T INT minimum score to output [30] # '-p': first query file consists of interleaved paired-end sequences # # '-M': mark shorter split hits as secondary (for Picard/GATK compatibility) bwape=${f}_mem-pe # store the full command line to include it in the next part cmd="bwa mem -R \"${rgstring}\" \ -M \ -t ${nthr} \ ${refgen} \ ${fq1} ${fq2} > ${outfolder}/${bwape}.sam" # display the full command with expanded variables echo "# the command will be:" echo ${cmd} # execute the command eval ${cmd} 2>${outfolder}/${pfx}_bwa_mem.err ########################### post process results ############## #### add @PG group to results with **bamUtil 'bam polishBam' ** # important formatting issues here # $'\t' add a 'tab' character # $'\'' add single quotes around ${cmd} to avoid it being interpreted # ${cmd/\t/\ } replaces <tab> characters within the $cmd string with <space> to keep it in one nice line # finally, $pgstring is called quoted to be replaced by its building variables bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*') pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd/\t/\ } echo "# now adding a PG field to the data" # clean up initial file after success to save space bam polishBam --PG "${pgstring}" \ -i ${outfolder}/${bwape}.sam \ -o ${outfolder}/${bwape}_pg.sam \ -l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam ##### convert to sorted bam, index & cleanup java -Xmx4g -jar $PICARD/picard.jar SortSam \ I=${outfolder}/${bwape}_pg.sam \ O=${outfolder}/${bwape}.bam \ SO=coordinate \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=LENIENT ##### get basic stats from the resulting bam file samtools flagstat ${outfolder}/${bwape}.bam \ >${outfolder}/${bwape}_flagstat.txt echo "# finished mapping ${f} reads with BWA mem"
bwa mem mapping results on the 10% sample
cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt 1500296 + 0 in total (QC-passed reads + QC-failed reads) 2054 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 1498454 + 0 mapped (99.88% : N/A) 1498242 + 0 paired in sequencing 749121 + 0 read1 749121 + 0 read2 1475988 + 0 properly paired (98.51% : N/A) 1494678 + 0 with itself and mate mapped 1722 + 0 singletons (0.11% : N/A) 10614 + 0 with mate mapped to a different chr 7042 + 0 with mate mapped to a different chr (mapQ>=5)
The mapping is much faster with the multithreaded bwa mem algorithm than with the single core sampe step of the classical bwa pipeline.
what did we learn here?
- using '>>' instead of '>' allows 'appending' content to an existing file without loosing what was already written
- adding '2> logfile....txt' after a command allows storing all error messages to a file and catch execution errors of very long execution output
- storing a complex command into a variable allows:
- echo the same string to the terminal as a comment at run-time (not done here but achieved with eg: echo "# "${cmd})
- recycle the command text for other uses (eg log the full command in the BAM file as a PG-tag)
- execute the command through the use of a bash 'eval' call
- running 'bam polishBam' on a BAM/SAM file allows annotating it to keep track of important information (eg RG tag from the ${cmd} string)
identify duplicate reads with Picard MarkDuplicates
The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated picard tool.
You can read more about read duplicates in our [Q&A section about read duplicates]
We added 'LENIENT' for residual error messages about "MAPQ should be 0 for unmapped read".
the next command will leave duplicates in but 'mark' them, while adding a line with REMOVE_DUPLICATES=TRUE would physically remove the duplicates from the output
#! /usr/bin/env bash ## script: 'picard_markdup.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-08 # required: # Picard Version: 1.112+ ## full chromosome samples # infolder="hg19_bwa-mapping/full_chromosome" ## 10 pct samples # process all bam files in 'hg19_bwa-mapping' infolder="hg19_bwa-mapping" for infile in ${infolder}/*.bam; do # the folder is part of ${infile} and does not need to be specified # create outfile base path + name outname=${infile%%-pe.bam} echo "# marking duplicates in ${infile}" java -Xmx4g -jar $PICARD/picard.jar MarkDuplicates \ I=${infile} \ O=${outname}_mdup.bam \ ASSUME_SORTED=TRUE \ REMOVE_DUPLICATES=FALSE \ CREATE_INDEX=TRUE \ M=${outname}_duplicate_metrics.txt \ VALIDATION_STRINGENCY=LENIENT \ 2>${outname}_MarkDuplicates.err echo echo "# results are" head -8 ${outname}_duplicate_metrics.txt| tail -2 | transpose -t | column -t echo done
MarkDuplicate results
# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 3787
READ_PAIRS_EXAMINED 737426
UNMAPPED_READS 19603
UNPAIRED_READ_DUPLICATES 146
READ_PAIR_DUPLICATES 29
READ_PAIR_OPTICAL_DUPLICATES 10
PERCENT_DUPLICATION 0.000138
ESTIMATED_LIBRARY_SIZE 14309816538
# marking duplicates in hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam
# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 1722
READ_PAIRS_EXAMINED 747339
UNMAPPED_READS 1842
UNPAIRED_READ_DUPLICATES 101
READ_PAIR_DUPLICATES 61
READ_PAIR_OPTICAL_DUPLICATES 12
PERCENT_DUPLICATION 0.000149
ESTIMATED_LIBRARY_SIZE 5698706455
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3b | NGS Exercise.3c | NGS Exercise.3d | NGS Exercise.3e ]