NGS-Var Exercise.2
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.1 | NGS-Var Exercise.3 ]
Align paired end reads to the human reference genome hg19 using the Burrow Wheeler Aligner (BWA)
Contents
Introduction
Reference mapping is the process applied to NGS reads when the reference genome is available. Mapping (aligning) reads to the reference is required in order to later pileup all alignment results and search for variants at each conflicting position. In the mapping step, each read is aligned to the reference genome and the genome coordinate of the best hit(s) are stored together with the read sequence and quality parameters in a SAM/BAM file. This is the most time consuming step of NGS analysis and its quality and completeness will condition all downstream processes.
Full mapping of an average human NGS Illumina dataset (100M read pairs) will take several days and use full computer power on a 48cpu computer with 48GB RAM (values are indicative).
prepare the reference genome for BWA alignment
BWA aligns reads to a library of possible short nucleotides (hash table). A hash table is build once for each new reference genome using the BWA commands **bwa index**. This step is required but lengthy and will not be repeated today. Please note that **bwa index** requires 5 to 6GB RAM to operate on the human 3G-base hg19 reference genome [1]. Recent versions of **bwa mem**, used to align the reads, require 3.5GB RAM just to load the hg19 genome hash but are happier with 6GB or more. If you plan to install a computer for training or production, please consider having at least 8GB RAM or you will run into trouble.
Please refer to NGS_variant_analysis:_laptop_configuration_and_files covering laptop installation steps for more information
list of bwa commands
Version: 0.7.12-r1039
Contact: Heng Li <lh3@sanger.ac.uk>
Usage: bwa <command> [options]
Command: index index sequences in the FASTA format
mem BWA-MEM algorithm
fastmap identify super-maximal exact matches
pemerge merge overlapping paired ends (EXPERIMENTAL)
aln gapped/ungapped alignment
samse generate alignment (single ended)
sampe generate alignment (paired ended)
bwasw BWA-SW for long queries
shm manage indices in shared memory
fa2pac convert FASTA to PAC format
pac2bwt generate BWT from PAC
pac2bwtgen alternative algorithm for generating BWT
bwtupdate update .bwt to the new format
bwt2sa generate SA from BWT and Occ
Note: To use BWA, you need to first index the genome with `bwa index'.
There are three alignment algorithms in BWA: `mem', `bwasw', and
`aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
first. Please `man ./bwa.1' for the manual.
note on adding missing @PG line to the BWA output for traceability
We introduce here another software suite that will be used after mapping to improve BAM content (used in the bwa mem script)
Although not required, this will create a better BAM header with information on how the BAM data was obtained. We use here polishBam (online help:[2]) from the software suite BamUtil developped in the Abecasis Group [3]. Other bamUtil utilities will prove useful in your pipelines (read the doc ;-)).
list of bamUtil (bam) commands
Set of tools for operating on SAM/BAM files. Version: 1.0.14; Built: Wed Feb 17 10:23:06 CET 2016 by splaisan Tools to Rewrite SAM/BAM Files: convert - Convert SAM/BAM to SAM/BAM writeRegion - Write a file with reads in the specified region and/or have the specified read name splitChromosome - Split BAM by Chromosome splitBam - Split a BAM file into multiple BAM files based on ReadGroup findCigars - Output just the reads that contain any of the specified CIGAR operations. Tools to Modify & write SAM/BAM Files: clipOverlap - Clip overlapping read pairs in a SAM/BAM File already sorted by Coordinate or ReadName filter - Filter reads by clipping ends with too high of a mismatch percentage and by marking reads unmapped if the quality of mismatches is too high revert - Revert SAM/BAM replacing the specified fields with their previous values (if known) and removes specified tags squeeze - reduces files size by dropping OQ fields, duplicates, & specified tags, using '=' when a base matches the reference, binning quality scores, and replacing readNames with unique integers trimBam - Trim the ends of reads in a SAM/BAM file changing read ends to 'N' and quality to '!' or softclipping the ends (resulting file will not be sorted) mergeBam - merge multiple BAMs and headers appending ReadGroupIDs if necessary polishBam - adds/updates header lines & adds the RG tag to each record dedup - Mark Duplicates dedup_LowMem - Mark Duplicates using only a little memory recab - Recalibrate Informational Tools validate - Validate a SAM/BAM File diff - Diff 2 coordinate sorted SAM/BAM files. stats - Stats a SAM/BAM File gapInfo - Print information on the gap between read pairs in a SAM/BAM File. Tools to Print Information In Readable Format dumpHeader - Print SAM/BAM Header dumpRefInfo - Print SAM/BAM Reference Name Information dumpIndex - Print BAM Index File in English readReference - Print the reference string for the specified region explainFlags - Describe flags Additional Tools bam2FastQ - Convert the specified BAM file to fastQs. Dummy/Example Tools readIndexedBam - Read Indexed BAM By Reference and write it from reference id -1 to maxRefId Usage: bam <tool> [<tool arguments>] The usage for each tool is described by specifying the tool with no arguments.
list of polishBam parameters
Version: 1.0.14; Built: Wed Feb 17 10:23:06 CET 2016 by splaisan polishBam - adds/updates header lines & adds the RG tag to each record Usage: polishBam (options) --in <inBamFile> --out <outBamFile> Required parameters : -i/--in : input BAM file -o/--out : output BAM file Optional parameters : -v : turn on verbose mode -l/--log : writes logfile with specified name. --HD : add @HD header line --RG : add @RG header line --PG : add @PG header line --CO : add @CO header line -f/--fasta : fasta reference file to compute MD5sums and update SQ tags --AS : AS tag for genome assembly identifier --UR : UR tag for @SQ tag (if different from --fasta) --SP : SP tag for @SQ tag --checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option
align the reads in pairs to the reference genome using the bwa mem algorithm
This can now be done with the 7.5 million chr21 read-pairs and BWA as detailed below. These instrumental steps are performed by a bash script where all necessary information and parameters are defined and can be edited for reuse.
{{Tip|Because our laptops have 'only' 6GB RAM, we can only use in the next part 1 thread for BWA mem. Stronger computers could dedicate more resources to the job and make it end faster.
Run the more recent and recommended bwa mem command directly from the paired fastQ files.
bwa mem specific parameters
Algorithm options:
-t INT number of threads [1]
-k INT minimum seed length [19]
-w INT band width for banded alignment [100]
-d INT off-diagonal X-dropoff [100]
-r FLOAT look for internal seeds inside a seed longer than {-k} * FLOAT [1.5]
-c INT skip seeds with more than INT occurrences [10000]
-S skip mate rescue
-P skip pairing; mate rescue performed unless -S also in use
-A INT score for a sequence match [1]
-B INT penalty for a mismatch [4]
-O INT gap open penalty [6]
-E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1]
-L INT penalty for clipping [5]
-U INT penalty for an unpaired read pair [17]
Input/output options:
-p first query file consists of interleaved paired-end sequences
-R STR read group header line such as '@RG\tID:foo\tSM:bar' [null]
-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]
-T INT minimum score to output [30]
-a output all alignments for SE or unpaired PE
-C append FASTA/FASTQ comment to SAM output
-M mark shorter split hits as secondary (for Picard/GATK compatibility)
Note: Please read the man page for detailed description of the command line and options.
#! /usr/bin/env bash ## script: 'mapping_bwa-mem.sh' ## ©SP-BITS, 2013 v1.1 # last edit: 2016-02-18 # required: # bwa version: 0.7.12-r1039 # Picard Version: 2.1.0 # bamutils Version: 1.0.12b ## define global variables refgen=ref/HiSeq_UCSC_hg19.fa ## select one of the folowing three blocks and comment the other two infolder=reads ## full data # f=shuffled_PE_NA18507_GAIIx_100_chr21 # pfx=all ## 10% sample #f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21 #pfx=sample ## 1% sample f=shuffled_1pc_PE_NA18507_GAIIx_100_chr21 pfx=sample # create new folder outfolder=hg19_bwa-mapping mkdir -p ${outfolder} # common fq1=${infolder}/${f}_2_1.fq.gz fq2=${infolder}/${f}_2_2.fq.gz # marking secondary hits with -M to comply with Picard # using 'nthr' processors in parallel (again limited by our RAM!) # mem is more demanding and needs more than 3Gb per thread nthr=1 rgstring='@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem' echo echo "# mapping paired reads with **bwa mem**" ## default numeric settings are left unchanged as: # -k INT minimum seed length [19] # -w INT band width for banded alignment [100] # -d INT off-diagonal X-dropoff [100] # -r FLOAT look for internal seeds inside a seed longer than {-k} # * FLOAT [1.5] # -c INT skip seeds with more than INT occurrences [10000] # -A INT score for a sequence match [1] # -B INT penalty for a mismatch [4] # -O INT gap open penalty [6] # -E INT gap extension penalty; a gap of size k cost {-O} + {-E}*k [1] # -L INT penalty for clipping [5] # -U INT penalty for an unpaired read pair [17] # -T INT minimum score to output [30] # '-p': first query file consists of interleaved paired-end sequences # # '-M': mark shorter split hits as secondary (for Picard/GATK compatibility) bwape=${f}_mem-pe # store the full command line to include it in the next part cmd="bwa mem -R \"${rgstring}\" \ -M \ -t ${nthr} \ ${refgen} \ ${fq1} ${fq2} > ${outfolder}/${bwape}.sam" # execute the command eval ${cmd} 2>${outfolder}/${pfx}_bwa_mem.err ########################### post process results ############## #### add @PG group to results with **bamUtil 'bam polishBam' ** # important formatting issues here # $'\t' add a 'tab' character # $'\'' add single quotes around ${cmd} to avoid it being interpreted # finally, $pgstring is called quoted to be replaced by its building variables # we first detect the current version of BWA to store it bwaver=$(expr "$(bwa 2>&1)" : '.*Version:\ \(.*\)Contact.*') # we then create a '@PG' line to store our action and include the 'BWA version number' AND the 'full command' # clean extra spaces with sed and the POSIX character class [:space:] # some explanation: [:space:] is functionally identical to [ \t\r\n\v\f] cmd=$(echo ${cmd} | sed -e "s/[[:space:]]\+/ /g") pgstring=@PG$'\t'ID:BWA$'\t'PN:bwa$'\t'VN:${bwaver}$'\t'CL:$'\''${cmd}$'\'' # clean up initial file after success to save space bam polishBam --PG "${pgstring}" \ -i ${outfolder}/${bwape}.sam \ -o ${outfolder}/${bwape}_pg.sam \ -l ${outfolder}/polishBam_${bwape}.log && rm ${outfolder}/${bwape}.sam ##### convert to sorted bam, index & cleanup java -jar $PICARD/picard.jar SortSam \ I=${outfolder}/${bwape}_pg.sam \ O=${outfolder}/${bwape}.bam \ SO=coordinate \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=LENIENT && rm ${outfolder}/${bwape}_pg.sam ##### get basic stats from the resulting bam file samtools flagstat ${outfolder}/${bwape}.bam \ >${outfolder}/${bwape}_flagstat.txt echo "# finished mapping ${f} reads with BWA mem"
bwa mem mapping results on the 10% sample and full data
cat shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt 1500296 + 0 in total (QC-passed reads + QC-failed reads) 2054 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 1498454 + 0 mapped (99.88% : N/A) 1498242 + 0 paired in sequencing 749121 + 0 read1 749121 + 0 read2 1475990 + 0 properly paired (98.51% : N/A) 1494678 + 0 with itself and mate mapped 1722 + 0 singletons (0.11% : N/A) 10614 + 0 with mate mapped to a different chr 7042 + 0 with mate mapped to a different chr (mapQ>=5) ## as a comparison, below are the results obtained with the full data cat shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_flagstat.txt 15010573 + 0 in total (QC-passed reads + QC-failed reads) 20379 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 14992193 + 0 mapped (99.88% : N/A) 14990194 + 0 paired in sequencing 7495097 + 0 read1 7495097 + 0 read2 14769700 + 0 properly paired (98.53% : N/A) 14954678 + 0 with itself and mate mapped 17136 + 0 singletons (0.11% : N/A) 104906 + 0 with mate mapped to a different chr 70173 + 0 with mate mapped to a different chr (mapQ>=5)
what did we learn here?
- using '>>' instead of '>' allows 'appending' content to an existing file without loosing what was already written
- adding '2> logfile....txt' after a command allows storing all error messages to a file and catch execution errors of very long execution output
- storing a complex command into a variable allows:
- echo the same string to the terminal as a comment at run-time (not done here but achieved with eg: echo "# "${cmd})
- recycle the command text for other uses (eg log the full command in the BAM file as a PG-tag)
- execute the command through the use of a bash 'eval' call
- running 'bam polishBam' on a BAM/SAM file allows annotating it to keep track of important information (eg RG tag from the ${cmd} string)
identify duplicate reads with Picard MarkDuplicates
The presence of PCR duplicates in NGS libraries can cause false positive variant calls at later stages. For this reason, the duplicates present after mapping to the reference genome must be marked using the dedicated picard tool.
You can read more about read duplicates in our [Q&A section about read duplicates]
We added 'LENIENT' for residual error messages about "MAPQ should be 0 for unmapped read".
the next command will leave duplicates in but 'mark' them, while adding a line with REMOVE_DUPLICATES=TRUE would physically remove the duplicates from the output
##! /usr/bin/env bash ## script: 'picard_markdup.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2016-02-19 # required: # Picard Version: 2.1.0 # 10% sample infile="hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe.bam" # full data # infile="hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam" echo "# marking duplicates in ${infile}" # we extract part of the infile and replace the end of it by a new text and extension # ${infile%%-pe.bam} removes the text '-pe.bam' from the end of 'infile' ## the reads are sorted by coordinate in the bam data ## this is required to run MarkDuplicates java -jar $PICARD/picard.jar MarkDuplicates \ I=${infile} \ O=${infile%%-pe.bam}_mdup.bam \ ASSUME_SORTED=TRUE \ REMOVE_DUPLICATES=FALSE \ CREATE_INDEX=TRUE \ M=${infile%%-pe.bam}_duplicate_metrics.txt \ VALIDATION_STRINGENCY=LENIENT \ 2>${infile%%-pe.bam}_MarkDuplicates.err echo echo "# results are" head -8 ${infile%%-pe.bam}_duplicate_metrics.txt| tail -2 | transpose -t | column -t
MarkDuplicate results
# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 1722
READ_PAIRS_EXAMINED 747339
UNMAPPED_READS 1842
UNPAIRED_READ_DUPLICATES 101
READ_PAIR_DUPLICATES 61
READ_PAIR_OPTICAL_DUPLICATES 12
PERCENT_DUPLICATION 0.000149
ESTIMATED_LIBRARY_SIZE 5698706455
# as a comparison, marking duplicates in hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.bam
# results are
LIBRARY lib-NA18507
UNPAIRED_READS_EXAMINED 17136
READ_PAIRS_EXAMINED 7477339
UNMAPPED_READS 18380
UNPAIRED_READ_DUPLICATES 4679
READ_PAIR_DUPLICATES 5450
READ_PAIR_OPTICAL_DUPLICATES 1310
PERCENT_DUPLICATION 0.001041
ESTIMATED_LIBRARY_SIZE 6747629693
download exercise files
Download exercise files here
References:
- ↑ http://seqanswers.com/forums/showthread.php?p=102458
- ↑ http://genome.sph.umich.edu/wiki/BamUtil:_polishBam
- ↑ http://genome.sph.umich.edu/wiki/Main_Page
[ Main_Page | Hands-on_introduction_to_NGS_variant_analysis-2016 | NGS-Var Exercise.1 | NGS-Var Exercise.3 ]