NGS Exercise.3c
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3d | NGS Exercise.3e ]
# updated 2014 version
Align paired end reads to the human reference genome hg19 using Bowtie2
Contents
- 1 AIM of this exercise
- 2 Method & Results
- 2.1 Map all chr21 paired reads to the human genome using Bowtie2
- 2.2 Post process and clean the mapping results
- 2.3 Call variants from the Bowtie2 mappings using samtools_0.2.0 and bcftools (htlib)
- 2.3.1 Review samtools bcftools variant calls
- 2.3.2 Perform QC on the obtained VCF files
- 2.3.3 Compare the BWA-mem and Bowtie2 variants using vcftools
- 2.3.4 Compare bowtie2 & BWA-mem results with vcf-compare
- 2.3.5 Compare bowtie2 & BWA-mem results to the Complete Genomics and HapMap3.3 sets with vcf-compare
- 3 Discussion about BWA, Bowtie2, and this work
- 4 Conclusion
- 5 download exercise files
AIM of this exercise
During the training session, the rightful question was asked of why BWA was chosen over the faster and more recent Bowtie2. After searching for elements of answer (see Which mapper should I use for DNA sequencing?), we decided to reproduce the BWA analysis on the full read set using Bowtie2 and compare the results of both experiments; This page presents the detailed results of such analysis using ~7.5million real read-pairs rather than synthetic read sets preferred by most bioinformaticians.
We acknowledge the fact that our read-set (obtained from the Illumina site) were extracted from a BAM file obtained using CASAVA and therefore do not contain difficult to map or unmapped sequences. This strategy was chosen in order to speed up our experiment and start from a smaller subset of reads while keeping a rather high genome coverage of the human chr21 which is key to a good variant calling step
Any comment on our approach or discussion about its validity are MOST WELCOME to help us improve this document.
Method & Results
BWA is not the only 'mapper' able to quickly align short reads to a reference genome. The second most popular mapper is Bowtie2; we present here the results of full genom emapping of the chr21 reads to compare BWA and Bowtie2 results and verify that the two mappers obtain comparable results with the data used during this training. We proceed with full set mapping only in order to maximize the comparison with BWA data.
A lot of the code presented in this exercise was adapted from previous 'BWA' pages. Please refer to the original pages for comments and explanatory text
Map all chr21 paired reads to the human genome using Bowtie2
list of Bowtie 2 version 2.2.2 commands
Usage:
bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
<bt2-idx> Index filename prefix (minus trailing .X.bt2).
NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
<m1> Files with #1 mates, paired with files in <m2>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads.
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Input:
-q query input files are FASTQ .fq/.fastq (default)
--qseq query input files are in Illumina's qseq format
-f query input files are (multi-)FASTA .fa/.mfa
-r query input files are raw one-sequence-per-line
-c <m1>, <m2>, <r> are sequences themselves, not files
-s/--skip <int> skip the first <int> reads/pairs in the input (none)
-u/--upto <int> stop after first <int> reads/pairs (no limit)
-5/--trim5 <int> trim <int> bases from 5'/left end of reads (0)
-3/--trim3 <int> trim <int> bases from 3'/right end of reads (0)
--phred33 qualities are Phred+33 (default)
--phred64 qualities are Phred+64
--int-quals qualities encoded as space-delimited integers
Presets: Same as:
For --end-to-end:
--very-fast -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
--fast -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
--sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
--very-sensitive -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
For --local:
--very-fast-local -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
--fast-local -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
--sensitive-local -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
--very-sensitive-local -D 20 -R 3 -N 0 -L 20 -i S,1,0.50
Alignment:
-N <int> max # mismatches in seed alignment; can be 0 or 1 (0)
-L <int> length of seed substrings; must be >3, <32 (22)
-i <func> interval between seed substrings w/r/t read len (S,1,1.15)
--n-ceil <func> func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--dpad <int> include <int> extra ref chars on sides of DP table (15)
--gbar <int> disallow gaps within <int> nucs of read extremes (4)
--ignore-quals treat all quality values as 30 on Phred scale (off)
--nofw do not align forward (original) version of read (off)
--norc do not align reverse-complement version of read (off)
--no-1mm-upfront do not allow 1 mismatch alignments before attempting to
scan for the optimal seeded alignments
--end-to-end entire read must align; no clipping (on)
OR
--local local alignment; ends might be soft clipped (off)
Scoring:
--ma <int> match bonus (0 for --end-to-end, 2 for --local)
--mp <int> max penalty for mismatch; lower qual = lower penalty (6)
--np <int> penalty for non-A/C/G/Ts in read/ref (1)
--rdg <int>,<int> read gap open, extend penalties (5,3)
--rfg <int>,<int> reference gap open, extend penalties (5,3)
--score-min <func> min acceptable alignment score w/r/t read length
(G,20,8 for local, L,-0.6,-0.6 for end-to-end)
Reporting:
(default) look for multiple alignments, report best, with MAPQ
OR
-k <int> report up to <int> alns per read; MAPQ not meaningful
OR
-a/--all report all alignments; very slow, MAPQ not meaningful
Effort:
-D <int> give up extending after <int> failed extends in a row (15)
-R <int> for reads w/ repetitive seeds, try <int> sets of seeds (2)
Paired-end:
-I/--minins <int> minimum fragment length (0)
-X/--maxins <int> maximum fragment length (500)
--fr/--rf/--ff -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed suppress unpaired alignments for paired reads
--no-discordant suppress discordant alignments for paired reads
--no-dovetail not concordant when mates extend past each other
--no-contain not concordant when one mate alignment contains other
--no-overlap not concordant when mates overlap at all
Output:
-t/--time print wall-clock time taken by search phases
--un <path> write unpaired reads that didn't align to <path>
--al <path> write unpaired reads that aligned at least once to <path>
--un-conc <path> write pairs that didn't align concordantly to <path>
--al-conc <path> write pairs that aligned concordantly at least once to <path>
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet print nothing to stderr except serious errors
--met-file <path> send metrics to file at <path> (off)
--met-stderr send metrics to stderr (off)
--met <int> report internal counters & metrics every <int> secs (1)
--no-head supppress header lines, i.e. lines starting with @
--no-sq supppress @SQ header lines
--rg-id <text> set read group id, reflected in @RG line and RG:Z: opt field
--rg <text> add <text> ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq put '*' in SEQ and QUAL fields for secondary alignments.
Performance:
-p/--threads <int> number of alignment threads to launch (1)
--reorder force SAM output order to match order of input reads
--mm use memory-mapped I/O for index; many 'bowtie's can share
Other:
--qc-filter filter out reads that are bad according to QSEQ filter
--seed <int> seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--version print version information and quit
-h/--help print this usage message
prepare the reference genome for Bowtie2 alignment
Bowtie2, just like any other mapper using a 'hash' to map short reads, requires that we provide the reference genome in the correct format. Several reference genome hash files for Bowtie2 can be obtained from several places including the Bowtie2 pages (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml[1]) and the Illumina iGenome site (http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn[2]). We will instead here build our own index files in order to use a reference genome identical to that used in the former exercises (HiSeq_UCSC_hg19.fa). A very good documentation is present on the Bowtie2 site that you should read before proceeding with the code reproduced below.
bowtie2-build command parameters
Bowtie 2 version 2.2.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea) Usage: bowtie2-build [options]* <reference_in> <bt2_index_base> reference_in comma-separated list of files with ref sequences bt2_index_base write bt2 data to files with this dir/basename *** Bowtie 2 indexes work only with v2 (not v1). Likewise for v1 indexes. *** Options: -f reference files are Fasta (default) -c reference sequences given on cmd line (as <reference_in>) --large-index force generated index to be 'large', even if ref has fewer than 4 billion nucleotides -a/--noauto disable automatic -p/--bmax/--dcv memory-fitting -p/--packed use packed strings internally; slower, less memory --bmax <int> max bucket sz for blockwise suffix-array builder --bmaxdivn <int> max bucket sz as divisor of ref len (default: 4) --dcv <int> diff-cover period for blockwise (default: 1024) --nodc disable diff-cover (algorithm becomes quadratic) -r/--noref don't build .3/.4 index files -3/--justref just build .3/.4 index files -o/--offrate <int> SA is sampled every 2^<int> BWT chars (default: 5) -t/--ftabchars <int> # of chars consumed in initial lookup (default: 10) --seed <int> seed for random number generator -q/--quiet verbose output (for debugging) -h/--help print detailed description of tool and its options --usage print this usage message --version print version information and quit
Build the Bowtie2 index files from HiSeq_UCSC_hg19.fa
The command creates hash files from the fasta sequence of the reference genome. All files are named by the fasta genome file with different added extensions. These files could be obtained premade from the above cited links but are created for the sake of training you in the underlying commands. As always, you are warmly advised to read the documentation. A new ENV variable waa created in .bashrc and exported that points to the default location for bowtie indexes; this variable was named BOWTIE2_INDEXES as recommended by the bowtie documentation and is used below.
# define variables infasta=ref/HiSeq_UCSC_hg19.fa idxfolder=$BOWTIE2_INDEXES # we use the fasta file basename to name the index name=$(basename ${infasta} ".fa") # create bowtie2 index next to fasta file bowtie2-build -f ${infasta} ${idx}/${name} 2>${idxfolder}/${name}_bowtie2-build.err
Control if the obtained Bowtie2 index files are valid
Controlling the obtained indexes before using them is always a good idea.
The recent version 2.2.2 of bowtie2 contained a bug that prevented inspect to work. Downloading and building from github is/was necessary to regain that functionality
bowtie2-inspect command parameters
Bowtie 2 version 2.2.2 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea) Usage: bowtie2-inspect [options]* <bt2_base> <bt2_base> bt2 filename minus trailing .1.bt2/.2.bt2 By default, prints FASTA records of the indexed nucleotide sequences to standard out. With -n, just prints names. With -s, just prints a summary of the index parameters and sequences. With -e, preserves colors if applicable. Options: --large-index force inspection of the 'large' index, even if a 'small' one is present. -a/--across <int> Number of characters across in FASTA output (default: 60) -n/--names Print reference sequence names only -s/--summary Print summary incl. ref names, lengths, index properties -e/--bt2-ref Reconstruct reference from .bt2 (slow, preserves colors) -v/--verbose Verbose output (for debugging) -h/--help print detailed description of tool and its options --help print this usage message
# define variables idxfolder=$BOWTIE2_INDEXES ref=${idxfolder}/HiSeq_UCSC_hg19 # create bowtie2 index next to fasta file bowtie2-inspect -s ${ref} 2>bowtie2-inspect_HiSeq_UCSC_hg19.err
bowtie2-inspect results
Flags 1 Reverse flags 5 Colorspace 0 2.0-compatible 1 SA-Sample 1 in 16 FTab-Chars 10 Sequence-1 chrM 16571 Sequence-2 chr1 249250621 Sequence-3 chr2 243199373 Sequence-4 chr3 198022430 Sequence-5 chr4 191154276 Sequence-6 chr5 180915260 Sequence-7 chr6 171115067 Sequence-8 chr7 159138663 Sequence-9 chr8 146364022 Sequence-10 chr9 141213431 Sequence-11 chr10 135534747 Sequence-12 chr11 135006516 Sequence-13 chr12 133851895 Sequence-14 chr13 115169878 Sequence-15 chr14 107349540 Sequence-16 chr15 102531392 Sequence-17 chr16 90354753 Sequence-18 chr17 81195210 Sequence-19 chr18 78077248 Sequence-20 chr19 59128983 Sequence-21 chr20 63025520 Sequence-22 chr21 48129895 Sequence-23 chr22 51304566 Sequence-24 chrX 155270560 Sequence-25 chrY 59373566
This looks OK.
Map paired reads to the reference genome using Bowtie2
In order to compare Bowtie2 and BWA, we should ideally use the same parameters in both. Obviously, this is not possible as the two programs are different and do not even have the same parameters. We will therefore do for Bowtie2 as done for BWA and use standard parameters. It is very likely that fine-tuning the cutoffs in either program can lead to optimal mappings not reached with standard parameters used in this demo. However, this would require deep knowledge which we do not have, we will therefore trust the makers and apply what has been defined by them to be adequate for standard conditions. We will however show some of these default parameters to tease you, please have a look in the help text at the variety of cutoffs you could apply.
In order to know what the standard parameters are, run the dry command 'bowtie2' (do the same for 'bwa mem') and look at the statements indicating default values in each
By default, Bowtie2 reports only the best alignment of the read (based on the mapping quality\). This can be changes using -a or -k but will considerably slow down the mapping.
#! /usr/bin/env bash ## script: 'mapping_bowtie2.sh' ## ©SP-BITS, 2014 v1.0 # last edit: 2014-05-27 # required: # bowtie2 version: 2.2.2 (github) # define variables idxfolder=$BOWTIE2_INDEXES ref=${idxfolder}/HiSeq_UCSC_hg19 # full trainer data # infolder=Final_Results/reads_results # f=shuffled_PE_NA18507_GAIIx_100_chr21 # pfx=all # outfolder=hg19_bowtie2-mapping/full_chromosome # 10 pc sample infolder=Final_Results/reads f=shuffled_10pc_PE_NA18507_GAIIx_100_chr21 pfx=sample fq1=${infolder}/${f}_2_1.fq.gz fq2=${infolder}/${f}_2_2.fq.gz outfolder=hg19_bowtie2-mapping mkdir -p ${outfolder} # output name bowtiepe="bowtie_mapped-reads" # using 'nthr' processors in parallel # computers with more than 4GB RAM will be primarily limited by cpu # you need ~3.5GB of RAM to store the full human reference but multithreading uses the same shared copy nthr=4 echo echo "# mapping paired reads with **bowtie2**" # we report here some of the options even they are kept as default # -I 0 (min insert size) # -X 500 (max insert size) # --end-to-end (default) # --sensitive -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default) ## Special parameters # --seed 2014 (to reproduce these results later on => this is only interesting in the context of a training or for benchmarking) # -u 100 (to stop after 100 mapping and test that you command is correct => remove after the first test run) cmd="bowtie2 -p ${nthr} \ -x ${ref} \ -q \ -1 ${fq1} \ -2 ${fq2} \ --phred33 \ --fr \ -I 0 \ -X 500 \ --un-gz ${outfolder}/bowtie_unmapped-reads.sam.gz \ --end-to-end \ --sensitive \ --seed 2014 \ -S ${outfolder}/${bowtiepe}.sam \ -u 100" # show it echo "# ${cmd}" # run it eval "${cmd} 2>${outfolder}/bowtie2-mapping.err" echo "# finished mapping ${f} reads with Bowtie2" echo echo "## the resulting SAM header is:" head -100 ${outfolder}/${bowtiepe}.sam | grep "^@"
bowtie2 summarized counts
7495097 (100.00%) were paired; of these:
132563 (1.77%) aligned concordantly 0 times
6688102 (89.23%) aligned concordantly exactly 1 time
674432 (9.00%) aligned concordantly >1 times
----
132563 pairs aligned concordantly 0 times; of these:
9036 (6.82%) aligned discordantly 1 time
----
123527 pairs aligned 0 times concordantly or discordantly; of these:
247054 mates make up the pairs; of these:
131691 (53.30%) aligned 0 times
78251 (31.67%) aligned exactly 1 time
37112 (15.02%) aligned >1 times
99.12% overall alignment rate
Post process and clean the mapping results
Convert mapped reads from SAM to BAM, sort, and index using samtools
A @RG line is first added to the SAM data using bamutils bam polishbam to describe the library and read group. Then the SAM is sorted, converted to BAM and the BAM version indexed using a Picard command. Finally, samtools flagstat is applied to get basic BAM counts (compare them to those reported by bowtie2 above.
bamutils bam polishbam command
-i/--in : input BAM file
-o/--out : output BAM file
Optional parameters:
-v : turn on verbose mode
-l/--log : writes logfile. <outBamFile>.log will be used if value is unspecified
--HD : add @HD header line
--RG : add @RG header line
--PG : add @PG header line
-f/--fasta : fasta reference file to compute MD5sums and update SQ tags
--AS : AS tag for genome assembly identifier
--UR : UR tag for @SQ tag (if different from --fasta)
--SP : SP tag for @SQ tag
--checkSQ : check the consistency of SQ tags (SN and LN) with existing header lines. Must be used with --fasta option
PhoneHome:
--noPhoneHome : disable PhoneHome (default enabled)
--phoneHomeThinning : adjust the PhoneHome thinning parameter (default 50)
#! /usr/bin/env bash ## script: 'post-process_bowtie2.sh' ## ©SP-BITS, 2014 v1.0 # last edit: 2014-05-27 # required: # bamutils Version: 1.0.11 # Picard Version: 1.112+ # samtools version 0.1.19+ infolder=hg19_bowtie2-mapping bowtiepe="bowtie_mapped-reads" echo "# add a RG fields to the data" rgstring=@RG$'\t'ID:NA18507$'\t'LB:lib-NA18507$'\t'PU:unkn-0.0$'\t'PL:ILLUMINA$'\t'SM:GAIIx-chr21-Bowtie2 # clean up initial file after success to save space bam polishBam --RG "${rgstring}" \ -i ${outfolder}/${bowtiepe}.sam \ -o ${outfolder}/${bowtiepe}_rg.sam \ -l ${outfolder}/polishBam_${bowtiepe}.log && rm ${outfolder}/${bowtiepe}.sam echo echo "# sort, compress, and index" ##### convert to sorted bam, index & cleanup java -Xmx4g -jar $PICARD/SortSam.jar \ I=${outfolder}/${bowtiepe}_rg.sam \ O=${outfolder}/${bowtiepe}-pe.bam \ SO=coordinate \ CREATE_INDEX=true \ VALIDATION_STRINGENCY=LENIENT && rm ${outfolder}/${bowtiepe}_rg.sam echo echo "# get basic stats from the resulting bam file" ##### get basic stats from the resulting bam file samtools flagstat ${outfolder}/${bowtiepe}-pe.bam \ >${outfolder}/${bowtiepe}_flagstat.txt echo "# finished sorting and indexing"
samtools flagstat results
1498242 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1485061 + 0 mapped (99.12%:-nan%)
1498242 + 0 paired in sequencing
749121 + 0 read1
749121 + 0 read2
1471456 + 0 properly paired (98.21%:-nan%)
1476598 + 0 with itself and mate mapped
8463 + 0 singletons (0.56%:-nan%)
4090 + 0 with mate mapped to a different chr
1910 + 0 with mate mapped to a different chr (mapQ>=5)
## results from the BWA-mem mapping on 10 pc sample
1485462 + 0 in total (QC-passed reads + QC-failed reads)
180 + 0 duplicates
1483826 + 0 mapped (99.89%:-nan%)
1485462 + 0 paired in sequencing
742791 + 0 read1
742671 + 0 read2
1469184 + 0 properly paired (98.90%:-nan%)
1482186 + 0 with itself and mate mapped
1640 + 0 singletons (0.11%:-nan%)
4664 + 0 with mate mapped to a different chr
4291 + 0 with mate mapped to a different chr (mapQ>=5)
Identify duplicate reads with Picard MarkDuplicates
#! /usr/bin/env bash ## script: 'picard_markdup-bowtie2.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-08 # required: # Picard Version: 1.112+ # declare variables infile=hg19_bowtie2-mapping/bowtie_mapped-reads-pe.bam outname=${infile%%-pe.bam} echo "# marking duplicates in ${infile}" java -Xmx4g -jar $PICARD/MarkDuplicates.jar \ I=${infile} \ O=${outname}_mdup.bam \ ASSUME_SORTED=TRUE \ REMOVE_DUPLICATES=FALSE \ CREATE_INDEX=TRUE \ M=${outname}_duplicate_metrics.txt \ VALIDATION_STRINGENCY=LENIENT \ 2>${outname}_MarkDuplicates.err echo echo "# results are" head -8 ${outname}_duplicate_metrics.txt| tail -2 | transpose -t | column -t echo
Picard MarkDuplicates counts
LIBRARY lib-NA18507 lib-NA18507
UNPAIRED_READS_EXAMINED 83867 17268
READ_PAIRS_EXAMINED 7387318 7477196
UNMAPPED_READS 131691 18534
UNPAIRED_READ_DUPLICATES 24266 4725
READ_PAIR_DUPLICATES 3371 5428
READ_PAIR_OPTICAL_DUPLICATES 1110 1310
PERCENT_DUPLICATION 0.002087 0.001041
ESTIMATED_LIBRARY_SIZE 12062126176 6783431981
The duplication rate is not alarming here with 3371 PCR duplicates and 1110 otical duplicates detected in 7'387'318 pairs
Perform integrated BAM QC with Qualimap
Qualimap command parameters
usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-nr <arg>] [-nt <arg>] [-nw <arg>] [-os] [-outdir <arg>] [-outformat <arg>] -bam <arg> input mapping file -c,--paint-chromosome-limits paint chromosome limits inside charts -gd <arg> compare with genome distribution (possible values: HUMAN or MOUSE) -gff <arg> region file (in GFF/GTF or BED format) -hm <arg> minimum size for a homopolymer to be considered in indel analysis (default is 3) -nr <arg> number of reads in the chunk (default is 500) -nt <arg> number of threads (default equals the number of cores) -nw <arg> number of windows (default is 400) -os,--outside-stats compute region outside stats (only with -gff option) -outdir <arg> output folder -outformat <arg> output report format (PDF or HTML, default is HTML) -p <arg> specify protocol to calculate correct strand reads (works only with -gff option, possible values are STRAND-SPECIFIC-FORWARD or STRAND-SPECIFIC-REVERSE, default is NON-STRAND-SPECIFIC)
#! /usr/bin/env bash ## script: 'qualimap_bamqc-bowtie2.sh' ## ©SP-BITS, 2014 v1.0 # last edit: 2014-04-09 # required: # qualimap 0.8 # R dependencies # full chromosome # outfolder="bowtie2-mapping-qualimapQC/full_chromosome" # infile="hg19_bowtie2-mapping/full_chromosome/bowtie_mapped-reads_mdup_chr21.bam" # declare variables outfolder="bowtie2-mapping-qualimapQC" mkdir -p ${outfolder} infile="hg19_bowtie2-mapping/bowtie_mapped-reads_mdup.bam" outname=$(basename "${infile}" ".bam") mygtf="ref/Homo_sapiens.hg19_chr21.gtf" # dedicate as much RAM as you can below (default=1200M) maxram="4096M" echo "# processing ${infile}" # read the full help to find more about other parameters # we output here to HTML but PDF is also available for portability and compactness # we restrict the analysis to chr21 using annotation data from the Qualimap site (-gff) # we also ask analysis outside of regions defined in the 'gtf' file (-os) ## Human genome annotations from Ensembl database (v.64) qualimap bamqc --java-mem-size=${maxram} \ -bam ${infile} \ -gd HUMAN \ -c \ -gff ${mygtf} \ -os \ -outdir ${outfolder}/${outname}_qualimap-results \ -outformat HTML # for PDF, replace last line by # -outformat PDF # -outfile ${outname}_chr21_report.pdf
Results of Qualimap analysis of 'all' chr21 mappings
extract all chr21 reads with samtools view
Because the re-mapping was done against the full human genome, some of the reads may now map to other chromosomes. For each method, we extract all reads mapping at least by one end to chr21 to a new bam file using samtools to get a clear chr21-specific picture out of our results.
#! /usr/bin/env bash # script: samtools_extract-chr21-bowtie.sh # required: # samtools Version: 0.1.19+ infile=hg19_bowtie2-mapping/bowtie_mapped-reads_mdup.bam echo "# processing ${infile}" # subset and index ONLY chr21 calls # also produce a text summary with samtools flagstat samtools view -b ${infile} chr21 > ${infile%%.bam}_chr21.bam && \ samtools index ${infile%%.bam}_chr21.bam && \ samtools flagstat ${infile%%.bam}_chr21.bam > ${infile%%.bam}_chr21.flagstat.txt
samtools flagstat counts
28800 + 0 duplicates
14773537 + 0 mapped (99.51%:-nan%)
14846590 + 0 paired in sequencing
7423315 + 0 read1
7423275 + 0 read2
14672436 + 0 properly paired (98.83%:-nan%)
14700484 + 0 with itself and mate mapped
73053 + 0 singletons (0.49%:-nan%)
18122 + 0 with mate mapped to a different chr
16582 + 0 with mate mapped to a different chr (mapQ>=5)
Compare BWA-mem and Bowtie2 mappings
The Picard tool CompareSAMs compares the headers of the two input SAM or BAM files, and, if possible, the SAMRecords. For SAMRecords, compares only the readUnmapped flag, reference name, start position and strand. Reports the number of SAMRecords that match, differ in alignment, are mapped in only one input, or are missing in one of the files.
Only the subset of hg19 mappings attributed to chr21 is analyzed here using data obtained by the trainer
#! /usr/bin/env bash # script: picard_CompareSAMs-bowtie2.sh # required: # Picard Version: 1.112+ # declare variables bam1=hg19_bwa-mapping/full_chromosome/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam bam2=hg19_bowtie2-mapping/full_chromosome/bowtie_mapped-reads_mdup_chr21.bam outfolder="hg19_bowtie2-mapping" name="bwa_vs_bowtie2" java -Xmx4g -jar $PICARD/CompareSAMs.jar \ ${bam1} \ ${bam2} >\ ${outfolder}/CompareSAMs_${name}.txt \ 2>${outfolder}/CompareSAMs_${name}.err
comparison results for bwa-mem & bowtie2 methods
Sample for read group NA18507 differs. Final_Results/hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln hg19_bowtie2-mapping/bowtie_mapped-reads_mdup_chr21.bam: GAIIx-chr21-Bowtie2 Number of program records differs. Final_Results/hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 3 hg19_bowtie2-mapping/bowtie_mapped-reads_mdup_chr21.bam: 2 Match 14501704 Differ 220357 Unmapped_both 5007 Unmapped_left 8923 Unmapped_right 52603 Missing_left 43683 Missing_right 25068 SAM files differ.
A very basic comparison of the resulting BAM files shows >98% agreement between both methods with apparently the BWA algorithm mapping more reads to chr21 than the bowtie2 method (unmapped 8923 vs 52603 ; using the default parameters!).
Call variants from the Bowtie2 mappings using samtools_0.2.0 and bcftools (htlib)
#! /usr/bin/env bash # script bcftools_call-variants.sh # required # samtools_0.2.0 (htslib) # bcftools_0.2.0 (htslib) call, vcfutils_0.2.0.pl # source user-defined functions (including vcf2index) . $HOME/.myfunctions # we call from the chr21 bowtie2 mapping results infolder="hg19_bowtie2-mapping" infile="bowtie_mapped-reads_mdup_chr21.bam" name="NA18507_bowtie2" outfolder="bcftools_htslib_variants-bowtie2" mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # call variants from pileup, filter and convert calls to VCF format # exclude calls where more than 1000 reads support a variation (samtools_0.2.0 mpileup -uf ${ref} ${infolder}/${infile} | \ bcftools_0.2.0 call -vc -O u > ${outfolder}/${name}_var_htslib.raw.bcf \ 2>${outfolder}/samtools_mpileup_${name}_var_htslib_raw.err) && \ (bcftools_0.2.0 view ${outfolder}/${name}_var_htslib.raw.bcf | \ vcfutils_0.2.0.pl varFilter -D1000 > \ ${outfolder}/${name}_var_htslib.flt-D1000.vcf \ 2>${outfolder}/samtools_mpileup_${name}_htslib_filter.err) # Sort, compress, and index the variant VCF files vcf2index ${outfolder}/${name}_var_htslib.flt-D1000.vcf && \ rm ${outfolder}/${name}_var_htslib.flt-D1000.vcf # extract the chr21 subset of calls vcftools --gzvcf ${outfolder}/${name}_var_htslib.flt-D1000.vcf.gz \ --chr chr21 \ --out ${outfolder}/chr21_${name}_var_htslib.flt-D1000 \ --recode # recoding introduces the word recode at the end of the file-name # remove the recode tag from the name mv ${outfolder}/chr21_${name}_var_htslib.flt-D1000.recode.vcf \ ${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf # sort, compress and index using our custom 'vcf2index' function) vcf2index ${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf && \ rm ${outfolder}/chr21_${name}_var_htslib.flt-D1000.vcf
Review samtools bcftools variant calls
# analysis of 'aln' calls infolder="bcftools_htslib_variants-bowtie2" infile="NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz" zgrep -v "^#" ${infolder}/${infile} | \ cut -f 1 | sort | uniq -c | sort -k 2V,2
variant counts for the bowtie data
77688 chr21
Perform QC on the obtained VCF files
# analysing the chr21 subset of our bwa-calls infolder="bcftools_htslib_variants-bowtie2" infile="NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz" bcftools_0.2.0 stats ${infolder}/${infile} > \ ${infolder}/chr21_bowtie2_var.check plot-vcfstats_0.2.0 -t "NA18507_bowtie2-calls" \ -p ${outfolder}/chr21_bowtie2_plots/ \ ${infolder}/chr21_bowtie2_var.check
The summary statistics derived from the VCF file obtained from the bowtie2 mappings is presented below.
hg19 bowtie2 variants PDF | |
---|---|
Compare the BWA-mem and Bowtie2 variants using vcftools
#! /usr/bin/env bash infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz outfolder=bcftools_htslib_variants-bowtie2/compare_chr21_bowtie2-bwa-mem mkdir -p ${outfolder} # run vcftools diff to identify differences vcftools --gzvcf ${infile1} \ --gzdiff ${infile2} \ --out ${outfolder}/compare_chr21_NA18507_bowtie2-bwa-mem
comparing bowtie and bwa data
VCFtools - v0.1.12a (C) Adam Auton and Anthony Marcketta 2009 Parameters as interpreted: --gzvcf bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz --out bcftools_htslib_variants-bowtie2/compare_chr21_bowtie2-bwa-mem/compare_chr21_NA18507_bowtie2-bwa-mem --gzdiff Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz Using zlib version: 1.2.3 Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files. After filtering, kept 1 out of 1 Individuals Using zlib version: 1.2.3 Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files. Comparing individuals in VCF files... N_combined_individuals: 2 N_individuals_common_to_both_files: 0 N_individuals_unique_to_file1: 1 N_individuals_unique_to_file2: 1 Comparing sites in VCF files... Non-matching REF. Skipping all such sites. Found 75788 SNPs common to both files. Found 1792 SNPs only in main file. Found 8889 SNPs only in second file. After filtering, kept 77688 out of a possible 77688 Sites Run Time = 57.00 seconds
Compare bowtie2 & BWA-mem results with vcf-compare
Another tool to compare VCF files is vcf-compare that produces values directly usable to draw VENN diagrams. The command can look at positions only (less stringent) or at position + genotype.
infolder=bcftools_htslib_variants-bowtie2 infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz vcf-compare ${infile1} \ ${infile2} | \ grep ^VN | cut -f 2- | column -t \ > ${infolder}/vcf-compare_bowtie2_bwa-mem.txt
command results
1792 bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (2.3%) 8889 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (10.5%) 75896 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (89.5%) \ bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (97.7%)
The obtained counts were used in 2DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.
vcftools venn diagram
2DVenn.R -a 1792 -b 8889 -i 75896 -A "bowtie2" -B "BWA mem" \ -o ${infolder}/bowtie2_bwa-mem_recall -t "bcftools_0.1.19 calls" -x 1 -u 2
Compare bowtie2 & BWA-mem results to the Complete Genomics and HapMap3.3 sets with vcf-compare
What about gold-standard calls?. We add two more important reference sets being Complete Genomics and HapMap. Looking at variants from the hapMap3.3 set indicates true positive but is restricted on variants present in that set. By contrast looking at the Complete Genomics set reveals additional variants of high quality in either BWA or Bowtie lists. This is not all what could be done to truly appreciate the quality of BWA and Bowtie but constitute an honest estimate.
infolder=bcftools_htslib_variants-bowtie2 infile1=bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz infile2=Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz infile3=public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz infile4=public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz vcf-compare ${infile1} \ ${infile2} \ ${infile3} \ ${infile4} | \ grep ^VN | cut -f 2- | column -t \ > ${infolder}/vcf-compare4.txt
command results
2 bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (0.0%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.0%) 7 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (0.0%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.1%) 11 public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (0.0%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.1%) 21 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (0.0%)\ public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (0.0%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.2%) 70 public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.7%) 90 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (0.1%)\ bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (0.1%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (0.9%) 506 bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (0.7%)\ public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (0.3%) 1284 bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (1.7%) 2635 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (3.1%)\ public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (1.3%) 6226 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (7.3%) 7193 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (8.5%)\ bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (9.3%) 9851 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (11.6%)\ bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (12.7%)\ public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (4.9%)\ public_info/NA18507_hapmap_3.3.hg19/chr21_NA18507_hapmap_3.3.hg19.vcf.gz (98.0%) 58762 Final_Results/bcftools_samtools_variants/chr21_NA18507_mem_var.flt-D1000.vcf.gz (69.3%)\ bcftools_htslib_variants-bowtie2/chr21_NA18507_bowtie2_var_htslib.flt-D1000.vcf.gz (75.6%)\ public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (29.3%) 128921 public_info/CG_variants/NA18507_CG_SRR822930.chr21.vcf.gz (64.2%)
The obtained counts were used in 4DVenn.R to plot the next figure. This Venn diagram is only slightly different from the previous one, possibly due to the reported error: 'Non-matching REF. Skipping all such sites.'.
vcftools venn diagram
4DVenn.R --ab.count 2 --bd.count 7 --cd.count 11 --bcd.count 21 --d.count 70 --abd.count 90 --ac.count 506 \ --a.count 1284 --bc.count 2635 --b.count 6226 --ab.count 7193 --abcd.count 9851 --abc.count 58762 --c.count 128921 \ -A "Bowtie2" -B "BWA-mem" -C "CG" -D "HapMap3.3" -t "4-way recall for chr21" -x 1 \ -o "bcftools_htslib_variants-bowtie2/bowtie2_4wy_recall" -u 2
Discussion about BWA, Bowtie2, and this work
It should be noted that the paired end reads used in this workshop are not all the reads obtained by sequencing the NA18507 sample but only those present in the Illumina BAM file (hg18-mapped) downloaded from the Illumina site. Using the original (unmapped) full-read set would have been more correct in that regard but would have likely added more noise than real signal to the comparison and taken more time for the full genome mapping step. All counts and figures reported in our pages were obtained without optimizing the different methods and applying standard parameters for each step which probably leaves room for improvement for expert users. By contrast, the fact that we have used real reads from one full chromosome makes our data probably more 'real' than many reported benchmarks that are based on simulated reads.
# Considering only Bowtie2 and BWA-mem derived calls, we can say that:
- of the 86577 individual calls merged from both sets; 1792 (2%) are private to bowtie2, 8889 (10%) are found only by BWA-mem, and 75896 (88%) are called from both mapping sets.
BWA-mem seems calls ~5x more private variants than Bowtie2 but both methods share the larger number of calls (80%))
# Considering Complete Genomics as the best available high-quality call-set, we can draw few conclusions about this work.
- of the 1792 variants private to bowtie2 (see 2DVenn above), 506 (28%) are also called by Complete Genomics while 1284 (72%) are not.
- of the 8889 variants private to BWA-mem (id), 2656 (30%) are also called by Complete Genomics while 6226 (70%) are not.
Bowtie2 seems a little bit less sensitive than BWA-mem based on this observation (calls 2% less of the high-quality CG variants)
# Considering HapMap3.3 calls as the best validated set available, we can refine this statement by saying that:
- Bowtie2 does not report calls validated by HapMap and absent in the BWA-set
- reversely: BWA-mem does report 21+7=28 chr21 calls unknown to bowtie but present in HapMap. This number is relatively small as compared to the 9851+90=9941 calls shared by all three sets.
# Considering the mapping time required on a robust BITS desktop machine to map all chr21 reads using all 8 cores and up to ~8GB RAM (of the available 16GB).
benchmark command details
## test run for bowtie2 # bowtie2 -p 8 \ -x /data/biodata/bowtie-index/HiSeq_UCSC_hg19 \ -q \ -1 Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \ -2 Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \ --phred33 \ --fr \ -S hg19_bowtie2-mapping-time/bowtie_mapped-reads.sam # finished mapping shuffled_PE_NA18507_GAIIx_100_chr21 reads with Bowtie2 real 16m33.786s user 118m16.500s sys 3m39.556s ## test run for bwa-mem # bwa mem \ -R "@RG\tID:NA18507\tLB:lib-NA18507\tPU:unkn-0.0\tPL:ILLUMINA\tSM:GAIIx-chr21-BWA.mem" \ -M \ -t 8 \ ref/HiSeq_UCSC_hg19.fa \ Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_1.fq.gz \ Final_Results/reads_results/shuffled_PE_NA18507_GAIIx_100_chr21_2_2.fq.gz \ > hg19_bwa-mapping-time/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe.sam # finished mapping shuffled_PE_NA18507_GAIIx_100_chr21 reads with bwa-mem real 15m32.321s user 52m40.842s sys 0m35.390s
plot of the computer activity | |
---|---|
The code used to create these plots can be obtained from the BITS github repository[3] |
- bowtie2 used less RAM than bwa-mem at full load (~4GB vs ~6GB).
- bowtie2 and bwa-mem took about the same time with 15-16 min (real time) including the time to load the reference index in RAM.
Conclusion
We conclude that with the standard parameters used for both BWA-mem (https://github.com/lh3/bwa[4]) and Bowtie2, both methods seem valid and present a small fraction of specific calls. In our hands, Bowtie2 and BWA-MEM are almost as fast with this realistic paired-end read mapping experiment and calls made on Bowtie2 mappings (--sensitive settings) missed a number validated calls made from BWA-mem mappings (standard settings).
This is in apparent contradiction with reports claiming that Bowtie is much faster than BWA! Is it possible that such claims report speed obtained with older version(s) of the softwares?
Based on the current report, we advise to use BWA-mem rather than Bowtie2 when mapping paired-end reads to a full genome as we see that this mapper allows calling more gold-standard variants AND is not slower than the competitor program Bowtie2 (both taken from their most recent version with similar settings)
# Please send your reactions to bits@vib.be
download exercise files
Download exercise files here
References:
- ↑ http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
- ↑ http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
- ↑ https://github.com/BITS-VIB/geek-tools
- ↑ https://github.com/lh3/bwa
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.2 | NGS Exercise.3 | NGS Exercise.3b | NGS Exercise.3d | NGS Exercise.3e ]