NGS Exercise.3b

From BITS wiki
Jump to: navigation, search


[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3c | NGS Exercise.4 ]
# updated 2014 version


Perform QC on the BWA alignment data


ex03_wf.png

Introduction

Each mapping is unique and can be biased by many unknown factors. Performing QC is primordial in each experiment to avoid carryover of errors to the downstream analyses.

use samtools to rapidely estimate the quality of BWA mapping results

Review duplicate counts in both mapping runs

duplicate metrics for sample & all and both bwa methods

infolder="Final_Results/hg19_bwa-mapping"
 
for metrics in ${infolder}/*_duplicate_metrics.txt; do
    echo -e "# counts for: \n# "${metrics}
    # keep only lines starting ('^') with 'lib-' or 'LIBRARY'
    grep -E "^lib-|LIBRARY" ${metrics} | transpose -t | column -t
    echo
done

what did we learn here?

Cli tools.png More Goodies:

  • echo followed by '-e' ('enable interpretation of backslash escapes') allows using '\n' to spit text lines
  • the enhanced version of grep: 'grep -E' (formerly egrep) allows searching for more than one thing at a time
# counts for:
# hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_duplicate_metrics.txt
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       3787
READ_PAIRS_EXAMINED           737426
UNMAPPED_READS                19603
UNPAIRED_READ_DUPLICATES      146
READ_PAIR_DUPLICATES          29
READ_PAIR_OPTICAL_DUPLICATES  10
PERCENT_DUPLICATION           0.000138
ESTIMATED_LIBRARY_SIZE        14309816538
 
# counts for:
# hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_duplicate_metrics.txt
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       1734
READ_PAIRS_EXAMINED           747327
UNMAPPED_READS                1854
UNPAIRED_READ_DUPLICATES      105
READ_PAIR_DUPLICATES          60
READ_PAIR_OPTICAL_DUPLICATES  12
PERCENT_DUPLICATION           0.00015
ESTIMATED_LIBRARY_SIZE        5817247850
 
# counts for: hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_duplicate_metrics.txt
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       37324
READ_PAIRS_EXAMINED           7379933
UNMAPPED_READS                193004
UNPAIRED_READ_DUPLICATES      8673
READ_PAIR_DUPLICATES          3099
READ_PAIR_OPTICAL_DUPLICATES  1058
PERCENT_DUPLICATION           0.001005
ESTIMATED_LIBRARY_SIZE        13336049883
 
# counts for: hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_duplicate_metrics.txt
LIBRARY                       lib-NA18507
UNPAIRED_READS_EXAMINED       17268
READ_PAIRS_EXAMINED           7477196
UNMAPPED_READS                18534
UNPAIRED_READ_DUPLICATES      4725
READ_PAIR_DUPLICATES          5428
READ_PAIR_OPTICAL_DUPLICATES  1310
PERCENT_DUPLICATION           0.001041
ESTIMATED_LIBRARY_SIZE        6783431981

Alternatively, changing to 'REMOVE_DUPLICATES=TRUE' physically removes the duplicates from the output. This physical removal is not required for variant calling and is only added here for information.

use samtools and some [R] code to compute the distribution of insert distances between paired read

The following code was adapted from an online tutorial files <http://www.embl.de/~rausch> and is presented as a prototype for [R] analysis of NGS data. Note that Picard CollectInsertSizeMetrics does something very similar and produces a nice histogram plot.

#! /usr/bin/env bash
## script: 'comp_insert-distance.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-04-08
 
# required: 
# Samtools: Version: 0.1.19-96b5f2294a
 
# extract mate distances from BAM mappings
# run both commands in terminal before running the following R code
 
# fetch data pre-computed by the trainer
infolder="Final_Results/hg19_bwa-mapping"
 
# create a new folder for your own results
outfolder="hg19_bwa-mapping"
mkdir -p ${outfolder}
 
## full aln data
#infolder="Final_Results/hg19_bwa-mapping/full_chromosome"
#infile=shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam
#pfx="all"
#outfolder="hg19_bwa-mapping/full_chromosome"
 
# 10% aln sample
infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam
pfx="sample"
 
## the following piped command will:
# keep first in pair from paired reads from bam 
# cut the 9th column TLEN='observed Template LENgth' (with a minus sign when mapped to the reverse strand)
# remove leading dash when in reverse direction
# store the obtained numbers to a file
samtools view -f 66 ${infolder}/${infile} | \
	cut -f 9 | sed "s/^-//" > ${outfolder}/${pfx}_aln_inserts.txt
 
## full mem data
#infolder="Final_Results/hg19_bwa-mapping/full_chromosome"
#infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam
#pfx="all"
 
# 10% mem sample
infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam
pfx="sample"
 
samtools view -f 66 ${infolder}/${infile} | \
	cut -f 9 | sed "s/^-//" > ${outfolder}/${pfx}_mem_inserts.txt

The first samtools commands parse the results and generate summary text files with counts for 'read mapped in proper pair' & 'first in pair' (filter flag=66, see http://picard.sourceforge.net/explain-flags.html[1], BITS mirrored copy).


The following [R] script takes the prepared summary files and generates plots.

#!/usr/bin/Rscript
# script: scripts/mate-distance.R
 
# required:
# 'comp_insert-distance.sh' run before this script
 
# please modify the path to fit your setup
# trainer
# base="/home/splaisan/Desktop/NGS_DNASeq-training/hg19_bwa-mapping/"
# full chromosome samples
# base="/home/bits/NGS_DNASeq-training/hg19_bwa-mapping/full_chromosome/"
 
# 10 pct samples
# training laptops, running from the $BASE folder
base="hg19_bwa-mapping/"
setwd(base)
 
# limit plotting to max 500x coverage
maxcvg=500
 
#pfx="all_"
pfx="sample_"
 
xfile=paste(base, pfx, "aln_inserts.txt", sep="")
pngfile=paste(base, pfx, "aln_inserts.png", sep="")
sumfile=paste(base, pfx, "aln_stats.txt", sep="")
 
x2file=paste(base, pfx, "mem_inserts.txt", sep="")
png2file=paste(base, pfx, "mem_inserts.png", sep="")
sum2file=paste(base, pfx, "mem_stats.txt", sep="")
 
x = scan(xfile)
png(file=pngfile, bg="white")
op<-par(mfrow=c(2,1))
hist(x[x < maxcvg], xlim=c(0, maxcvg), main="insert sizes for BWA-aln", xlab="insert size (bps)")
boxplot (x, ylim=c(0, maxcvg), horizontal=T)
dev.off()
 
sink(file=sumfile)
print("# quartiles")
summary(x)
print("# standard deviation")
sd(x)
print("# median absolute deviation")
mad(x)
sink()
 
x2 = scan(x2file)
png(file=png2file, bg="white")
op<-par(mfrow=c(2,1))
hist(x2[x2 < maxcvg], xlim=c(0, maxcvg), main="insert sizes for BWA-mem", xlab="insert size (bps)")
boxplot (x2[x2 < maxcvg], ylim=c(0, maxcvg), horizontal=T)
dev.off()
 
sink(file=sum2file)
print("# quartiles")
summary(x2)
print("# standard deviation")
sd(x2)
print("# median absolute deviation")
mad(x)
sink()

Results for the two sample runs are shown below

comparison for sample between aln and mem methods
sample_aln_inserts.png
sample_mem_inserts.png
[1] "# quartiles"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  100.0   301.0   312.0   311.4   324.0   426.0 
[1] "# standard deviation"
[1] 19.49675
[1] "# median absolute deviation"
[1] 16.3086
[1] "# quartiles"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0      301      312      374      324 38170000 
[1] "# standard deviation"
[1] 37085.36
[1] "# median absolute deviation"
[1] 16.3086
comparison for all between aln and mem methods
all_aln_inserts.png
all_mem_inserts.png
[1] "# quartiles"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0      301      312      374      324 38170000 
[1] "# standard deviation"
[1] 37085.36
[1] "# median absolute deviation"
[1] 16.3086
[1] "# quartiles"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
       0      301      312      374      324 38170000 
[1] "# standard deviation"
[1] 37085.36
[1] "# median absolute deviation"
[1] 16.3086

summarise mapping results and perform mapping QC using Picard tools

As shown before, the samtools flagstat command identifies library issues leading to a lot of unmapped reads (obviously not the case for this pre-filtered dataset). for more complex issues, other types of control are necessary adn can be performed using specific picard tools.

apply Picard CollectGcBiasMetrics

Monotonous GC/AT rich regions lead to ambiguous mapping and eventually to no-mapping at all if they exceed the mapper capabilities. This is a well known problem in short read sequencing and although paired end reads limit this bias, it is, together together with repeats still responsible for the absence of coverage in highly repetitive regions, telomers, and centromers. The following QC run will show the limits of mappability associated with GC content for a given BAM file and possibly identify a technical issue.

CollectGcBiasMetrics is a High level metrics that capture how biased the coverage in a certain BAM file is.

#! /usr/bin/env bash
## script: 'picard_CollectGcBiasMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-05-23
# Picard Version: 1.112+
 
#full chromosome sample
#infolder=Final_Results/hg19_bwa-mapping/full_chromosome
#outfolder=bwa-mappingQC/full chromosome
 
#10 pct sample
infolder=Final_Results/hg19_bwa-mapping
outfolder=hg19_bwa-mapping/bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# for each '_mdup.bam' file in folder
for b in ${infolder}/*_mdup.bam; do
outname=$(basename ${b} .bam)
 
echo "# processing ${b}"
 
# send errors log file with '2>'
java -Xmx4g -jar $PICARD/picard.jar CollectGcBiasMetrics \
	VERBOSITY=ERROR \
	R=${ref} \
	I=${b} \
	O=${outfolder}/${outname}_GCbias.txt \
	CHART=${outfolder}/${outname}_GCbias.pdf \
	S=${outfolder}/${outname}_GCbias_summary.txt \
	2>${outfolder}/${outname}_CollectGcBiasMetrics.err
done

what did we learn here?

Cli tools.png basename - strip directory and suffix from filenames

  • 'basename ${b} .bam' returns the name of ${b} without path and without .bam
  • we could use 'basename' on multiple files too
    • with '-a' works on all provided files
    • with '-s .bam' removes all '.bam' suffices
      (eg basename -s .bam -a myfolder/*.bam # will return the file names without .bam for all bam files in myfolder)

The CollectGcBiasMetrics program produces 2 text files and a corresponding graph in pdf format.

comparison for sample & all between aln and mem methods
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_GCbias.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_GCbias.png
shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_GCbias.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_GCbias.png

apply Picard CollectAlignmentSummaryMetrics

This collects High level metrics about the alignment of reads within a SAM file. The result is a table that can be transposed for better readability (below)

#! /usr/bin/env bash
## script: 'picard_CollectAlignmentSummaryMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-05-23
# Picard Version: 1.112+
# transpose
 
#full chromosome sample
#infolder=Final_Results/hg19_bwa-mapping/full_chromosome
#outfolder=bwa-mappingQC/full chromosome
 
#10 pct sample
# process all bam files in 'hg19_bwa-mapping'
#infolder=hg19_bwa-mapping
infolder=Final_Results/hg19_bwa-mapping
outfolder=bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
for b in ${infolder}/*_mdup.bam; do
outname=$(basename ${b} .bam)
 
echo "# processing ${b}"
 
# collect information
(java -Xmx4g -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \
	R=${ref} \
	I=${b} \
	O=${outfolder}/${outname}_AlignmentSummaryMetrics.txt \
	2>${outfolder}/${outname}_CollectAlignmentSummaryMetrics.err)
 
# ignore header and empty lines and rotate the remaining content
grep -v -E '(^#|^$)' \
	${outfolder}/${outname}_AlignmentSummaryMetrics.txt | \
	transpose -t | \
	column -t > \
	${outfolder}/tr_${outname}_AlignmentSummaryMetrics.txt
 
# end of loop
done

what did we learn here?

Cli tools.png 'grep -E' (--extended-regexp) allows complex querries

  • eg: grep -v -E '(^#|^$)' will ignore '-v' lines that start with either '#' OR are empty '^$' lines
  • '^$' means here starting '^' and ending '$' with nothing in between

more regular expression symbols

Cli tools.png this list is not exhaustive but already good to know
^ (Caret)        =    match expression at the start of a line, as in ^A.
$ (Question)     =    match expression at the end of a line, as in A$.
\ (Back Slash)   =    turn off the special meaning of the next character, as in \^.
[ ] (Brackets)   =    match any one of the enclosed characters, as in [aeiou].
                      Use Hyphen "-" for a range, as in [0-9].
[^ ]             =    match any one character except those enclosed in [ ], as in [^0-9].
. (Period)       =    match a single character of any value, except end of line.
* (Asterisk)     =    match zero or more of the preceding character or expression.
\{x,y\}          =    match x to y occurrences of the preceding.
\{x\}            =    match exactly x occurrences of the preceding.
\{x,\}           =    match x or more occurrences of the preceding.

As often, text results are obfuscating and more clearly shown using the bash utility command column -t after transposition

Counts for sample and all and bwa aln and bwa mem analyses

## for f in ${outfolder}/tr*; do echo "# "$f; cat $f; echo; done
# tr_shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_AlignmentSummaryMetrics.txt
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 749121         749121          1498242
PF_READS                    749121         749121          1498242
PCT_PF_READS                1              1               1
PF_NOISE_READS              3              0               3
PF_READS_ALIGNED            740309         738330          1478639
PCT_PF_READS_ALIGNED        0.988237       0.985595        0.986916
PF_ALIGNED_BASES            73821765       73278262        147100027
PF_HQ_ALIGNED_READS         732782         731014          1463796
PF_HQ_ALIGNED_BASES         73080734       72573054        145653788
PF_HQ_ALIGNED_Q20_BASES     71927567       70594488        142522055
PF_HQ_MEDIAN_MISMATCHES     76             75              76
PF_MISMATCH_RATE            0.788667       0.788507        0.788587
PF_HQ_ERROR_RATE            0.788738       0.788588        0.788664
PF_INDEL_RATE               0.000252       0.000251        0.000252
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      737426         737426          1474852
PCT_READS_ALIGNED_IN_PAIRS  0.996106       0.998776        0.997439
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.499779       0.500205        0.499992
PCT_CHIMERAS                0.001745       0.001745        0.001745
PCT_ADAPTER                 0              0.000001        0.000001
SAMPLE
LIBRARY
READ_GROUP
 
# tr_shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_AlignmentSummaryMetrics.txt
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 749121         749121          1498242
PF_READS                    749121         749121          1498242
PCT_PF_READS                1              1               1
PF_NOISE_READS              0              0               0
PF_READS_ALIGNED            748909         747479          1496388
PCT_PF_READS_ALIGNED        0.999717       0.997808        0.998763
PF_ALIGNED_BASES            74537188       73945391        148482579
PF_HQ_ALIGNED_READS         740170         738549          1478719
PF_HQ_ALIGNED_BASES         73753309       73165657        146918966
PF_HQ_ALIGNED_Q20_BASES     72540468       71162933        143703401
PF_HQ_MEDIAN_MISMATCHES     76             75              75
PF_MISMATCH_RATE            0.788767       0.788523        0.788646
PF_HQ_ERROR_RATE            0.788689       0.788462        0.788576
PF_INDEL_RATE               0.000279       0.000271        0.000275
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      747327         747327          1494654
PCT_READS_ALIGNED_IN_PAIRS  0.997888       0.999797        0.998841
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.500136       0.500712        0.500424
PCT_CHIMERAS                0.007445       0.007445        0.007445
PCT_ADAPTER                 0              0               0
SAMPLE
LIBRARY
READ_GROUP
 
# tr_shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_AlignmentSummaryMetrics.txt
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 7495097        7495097         14990194
PF_READS                    7495097        7495097         14990194
PCT_PF_READS                1              1               1
PF_NOISE_READS              28             23              51
PF_READS_ALIGNED            7408248        7388942         14797190
PCT_PF_READS_ALIGNED        0.988413       0.985837        0.987125
PF_ALIGNED_BASES            738774576      733365277       1472139853
PF_HQ_ALIGNED_READS         7332524        7315616         14648140
PF_HQ_ALIGNED_BASES         731325109      726297327       1457622436
PF_HQ_ALIGNED_Q20_BASES     719764321      706528172       1426292493
PF_HQ_MEDIAN_MISMATCHES     76             75              76
PF_MISMATCH_RATE            0.788711       0.788536        0.788624
PF_HQ_ERROR_RATE            0.78878        0.78862         0.7887
PF_INDEL_RATE               0.000252       0.000254        0.000253
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      7379933        7379933         14759866
PCT_READS_ALIGNED_IN_PAIRS  0.996178       0.998781        0.997478
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.500208       0.499733        0.499971
PCT_CHIMERAS                0.001717       0.001717        0.001717
PCT_ADAPTER                 0              0               0
SAMPLE
LIBRARY
READ_GROUP
 
# tr_shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_AlignmentSummaryMetrics.txt
CATEGORY                    FIRST_OF_PAIR  SECOND_OF_PAIR  PAIR
TOTAL_READS                 7495097        7495097         14990194
PF_READS                    7495097        7495097         14990194
PCT_PF_READS                1              1               1
PF_NOISE_READS              0              0               0
PF_READS_ALIGNED            7492960        7478700         14971660
PCT_PF_READS_ALIGNED        0.999715       0.997812        0.998764
PF_ALIGNED_BASES            745771802      739837082       1485608884
PF_HQ_ALIGNED_READS         7406255        7389885         14796140
PF_HQ_ALIGNED_BASES         737997945      732085200       1470083145
PF_HQ_ALIGNED_Q20_BASES     725876220      712113681       1437989901
PF_HQ_MEDIAN_MISMATCHES     76             75              75
PF_MISMATCH_RATE            0.788805       0.788566        0.788686
PF_HQ_ERROR_RATE            0.788749       0.788513        0.788632
PF_INDEL_RATE               0.000278       0.000274        0.000276
MEAN_READ_LENGTH            100            100             100
READS_ALIGNED_IN_PAIRS      7477196        7477196         14954392
PCT_READS_ALIGNED_IN_PAIRS  0.997896       0.999799        0.998847
BAD_CYCLES                  0              0               0
STRAND_BALANCE              0.50058        0.500195        0.500388
PCT_CHIMERAS                0.007323       0.007323        0.007323
PCT_ADAPTER                 0              0               0
SAMPLE
LIBRARY
READ_GROUP

apply Picard CollectMultipleMetrics

Alternatively, one can run 4 of the 5 Picard metric QC commands in one go by using the global Picard command CollectMultipleMetrics that will generate PDF plots using R (CollectGcBiasMetrics.jar should be run separately and with a coordinate sorted BAM as input).

#! /usr/bin/env bash
# script: 'picard_CollectMultipleMetrics.sh'
## ©SP-BITS, 2013 v1.0
# last edit: 2014-05-23
# Picard Version: 1.112+
 
#full chromosome sample
#infolder=Final_Results/hg19_bwa-mapping/full_chromosome
#outfolder=bwa-mappingQC/full chromosome
 
#10 pct sample
#infolder=hg19_bwa-mapping
infolder=Final_Results/hg19_bwa-mapping
outfolder=bwa-mappingQC
mkdir -p ${outfolder}
 
ref=ref/HiSeq_UCSC_hg19.fa
 
# OPT: PROGRAM={CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, 
#                       QualityScoreDistribution, MeanQualityByCycle} 
 
for b in ${infolder}/*_mdup.bam; do
outname=$(basename ${b} .bam)
 
echo "# processing ${b}"
 
# run in parallel if the computer has enough capacity
( java -Xmx4g -jar $PICARD/CollectMultipleMetrics.jar \
	R=${ref} \
	I=${b} \
	O=${outfolder}/${outname} \
	PROGRAM=CollectAlignmentSummaryMetrics \
	PROGRAM=CollectInsertSizeMetrics \
	PROGRAM=QualityScoreDistribution \
	PROGRAM=MeanQualityByCycle \
	2>${outfolder}/${outname}_CollectMultipleMetrics.err )
	# rem: you could add '&' after the closing ')' to run such lighter command in parallel
done

Handicon.png Picard results are stored in separate PDF files (tip: use 'evince filename.pdf' in terminal to open PDF files)

how can I convert all those PDF files to lightweight PNG?

By using the great ImageMagick[2] convert command in a simple little loop
for p in *.pdf; do 
     convert -density 300 -depth 8 -quality 100 -resize 25% $p ${p%%.pdf}.png; 
done

Even better, create a custom function to ease this and add it to your .myfunctions home file (done!)

pdf2png ()
{
# convert PDF to PNG (all pages with suffix) using ImageMagick.
# take PDF file from 1
# use PDF file name for PNG with extension swap
# or take PNG file name from $2 
 
if [ $# -ge 1 ]
then
    convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png};    
else
    echo "usage: pdf2png <input.pdf> <opt:output.png>"
fi
}
comparison for 10pc samples between aln and mem methods
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.insert_size_histogram.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.insert_size_histogram.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.quality_by_cycle.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.quality_by_cycle.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.quality_distribution.png
shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.quality_distribution.png
comparison for all between aln and mem methods
shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.insert_size_histogram.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.insert_size_histogram.png
shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.quality_by_cycle.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.quality_by_cycle.png
shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.quality_distribution.png
shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.quality_distribution.png

As seen above, both sample and full data give very similar results for all metrics.

Perform integrated BAM QC with Qualimap

start_analysis.png

Like what FastQC does for FastQ data, Qualimap can integrate several quality control steps of your BAM data in one go. Qualimap[3] is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI, online help page) for the shy and hurried among you and a command-line interface (online help page) to facilitate the quality control of alignment sequencing data (PDF manual link). We next will show the use of the CLI version for basic BAM QC but as seen in the hidden documentation below, Qualimap proposes many more very attractive tools that are left for you to explore.

full qualimap command line help

#################################
## To show available tools use command:
 
qualimap --help
 
QualiMap v.0.8
Built on 2014-03-05 17:17
 
usage: qualimap <tool> [options]
 
To launch GUI leave <tool> empty.
 
Available tools:
 
    bamqc            Evaluates NGS mapping to a reference genome
    counts           Counts data analysis (RNA-seq data evaluation)
    clustering       Clustering epigenomic signals
    comp-counts      Compute feature counts
 
Special arguments: 
 
    --java-mem-size  Use this argument to set Java memory heap size. Example:
                     qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G
 
### BAM QC #######################################
## The following command allows to perform BAM QC analysis:
 
usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-nr <arg>] [-nt
       <arg>] [-nw <arg>] [-os] [-outdir <arg>] [-outformat <arg>]
 -bam <arg>                     input mapping file
 -c,--paint-chromosome-limits   paint chromosome limits inside charts
 -gd <arg>                      compare with genome distribution (possible
                                values: HUMAN or MOUSE)
 -gff <arg>                     region file (in GFF/GTF or BED format)
 -hm <arg>                      minimum size for a homopolymer to be considered
                                in indel analysis (default is 3)
 -nr <arg>                      number of reads in the chunk (default is 500)
 -nt <arg>                      number of threads (default equals the number of cores)
 -nw <arg>                      number of windows (default is 400)
 -os,--outside-stats            compute region outside stats (only with -gff
                                option)
 -outdir <arg>                  output folder
 -outformat <arg>               output report format (PDF or HTML, default is
                                HTML)
 -p <arg>                       specify protocol to calculate correct strand
                                reads (works only with -gff option, possible
                                values are STRAND-SPECIFIC-FORWARD or
                                STRAND-SPECIFIC-REVERSE, default is
                                NON-STRAND-SPECIFIC)
The only required parameter is bam – the input mapping file.
If outdir is not provided, it will be created automatically in the same folder where BAM file is located.
 
Detailed explanation of available options can be found here.
Example (data available here):
 
qualimap bamqc -bam ERR089819.bam -c
 
###  Counts QC  ##########################################################
## To perform counts QC analysis (evaluation of RNA-seq data) use the following command:
 
usage: qualimap counts -d1 <arg> [-d2 <arg>] [-i <arg>] [-k <arg>] [-n1 <arg>]
       [-n2 <arg>] [-outdir <arg>] [-outformat <arg>] [-s <arg>]
 -d1,--data1 <arg>      first file with counts
 -d2,--data2 <arg>      second file with counts
 -i,--info <arg>        info file
 -k,--threshold <arg>   threshold for the number of counts
 -n1,--name1 <arg>      name for the first sample
 -n2,--name2 <arg>      name for second sample
 -outdir <arg>          output folder
 -outformat <arg>       output report format (PDF or HTML, default is HTML)
 -s,--species <arg>     use default file for the given species [human | mouse]
Detailed explanation of available options can be found here.
Example (data available here):
 
qualimap counts -d1 kidney.counts -d2 liver.counts -s human -outdir results
 
### Clustering ##############################################
## To perform clustering of epigenomic signals use the following command:
 
usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>]
       [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outformat <arg>]
       [-r <arg>] -regions <arg> -sample <arg> [-viz <arg>]
 -b,--bin-size <arg>          size of the bin (default is 100)
 -c,--clusters <arg>          comma-separated list of cluster sizes
 -control <arg>               comma-separated list of control BAM files
 -expr <arg>                  name of the experiment
 -f,--fragment-length <arg>   smoothing length of a fragment
 -l <arg>                     upstream offset (default is 2000)
 -name <arg>                  comma-separated names of the replicates
 -outdir <arg>                output folder
 -outformat <arg>             output report format (PDF or HTML, default is
                              HTML)
 -r <arg>                     downstream offset (default is 500)
 -regions <arg>               path to regions file
 -sample <arg>                comma-separated list of sample BAM files
 -viz <arg>                   visualization type: heatmap or line
Detailed explanation of available options can be found here.
Example (data available here):
 
qualimap clustering -sample clustering/hmeDIP.bam -control clustering/input.bam 
    -regions annotations/transcripts.human.64.bed -outdir clustering_result
 
### Compute counts ######################################
## To compute counts from mapping data use the following command:
 
usage: qualimap comp-counts [-algorithm <arg>] -bam <arg> -gtf <arg> [-id <arg>]
       [-out <arg>] [-protocol <arg>] [-type <arg>]
 -algorithm <arg>   uniquely-mapped-reads(default) or proportional
 -b                 calculate 5' and 3' coverage bias
 -bam <arg>         mapping file in BAM format)
 -gtf <arg>         region file in GTF format
 -id <arg>          attribute of the GTF to be used as feature ID. Regions with
                    the same ID will be aggregated as part of the same feature.
                    Default: gene_id.
 -out <arg>         path to output file
 -protocol <arg>    forward-stranded,reverse-stranded or non-strand-specific
 -type <arg>        Value of the third column of the GTF considered for
                    counting. Other types will be ignored. Default: exon
Detailed explanation of available options can be found here.
Example (data available here):
 
qualimap comp-counts -bam kidney.bam -gtf ../annotations/human.64.gtf  -out kidney.counts

The Qualimap GUI is rather simple to use and returns integrated results similarly to FastQC.


output.png

A QC on the sample BAM files can be performed in the GUI or with a simple command


Current human GRCh37.75 GTF annotations were downloaded from ensEMBL and converted to hg19 notation

# run the following code from the base folder (~/bits/NGS_DNASeq-training)
# get the data from the ensEMBL FTP and save it in the ref folder
wget -P ref ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
 
# add chr and change MT to chrM and recompress
zcat ref/Homo_sapiens.GRCh37.75.gtf.gz | awk '{print "chr"$0}'  | sed 's/chrMT/chrM/g' | bgzip -c > ref/Homo_sapiens.hg19.gtf.gz
 
# subset chr21 for our purpose and keep it in uncompressed
zgrep "^chr21\b" ref/Homo_sapiens.hg19.gtf.gz > ref/Homo_sapiens.hg19_chr21.gtf

We now combine the chr21 GTF subset and both '_mdup.bam' files with Qualimap bamqc.

#! /usr/bin/env bash
## script: 'qualimap_bamqc.sh'
## ©SP-BITS, 2014 v1.0
# last edit: 2014-04-09
 
# required: 
# qualimap 0.8
# R dependencies
 
#full chromosome sample
#infolder=Final_Results/hg19_bwa-mapping/full_chromosome
#outfolder=bwa-mapping-qualimapQC/full chromosome
 
#10 pct sample
# process all bam files in 'hg19_bwa-mapping'
infolder=Final_Results/hg19_bwa-mapping
outfolder=bwa-mapping-qualimapQC
mkdir -p ${outfolder}
 
mygtf=ref/Homo_sapiens.hg19_chr21.gtf
 
for b in ${infolder}/*_mdup.bam; do
outname=$(basename ${b} .bam)
 
echo "# processing ${b}"
 
# read the full help to find more about other parameters
# we output here to HTML but PDF is also available for portability and compactness
# we restrict the analysis to chr21 using annotation data from the Qualimap site
## Human genome annotations from Ensembl database (v.64) 
qualimap bamqc --java-mem-size=4G \
	-bam ${b} \
	-gd HUMAN \
	-c \
	-gff ${mygtf} \
	-outdir ${outfolder}/${outname}_qualimap-results \
	-outformat HTML
 
# for PDF, replace last line by
#	-outformat PDF
#	-outfile ${outname}_chr21_report.pdf \
 
done

extract all reads mapping to chr21

Because the re-mapping was done against the full human genome, some of the reads may now map to other chromosomes. For each method, we extract all reads mapping at least by one end to chr21 to a new bam file using samtools to get a clear chr21-specific picture out of our results.

#! /usr/bin/env bash
# script: samtools_extract-chr21.sh
 
# required:
# samtools Version: 0.1.19-96b5f2294a
 
infolder=Final_Results/hg19_bwa-mapping
 
# process all files in 'infolder' whose name ends with '_mdup.bam'
## note that the relative path to the file is included in ${b}
for b in ${infolder}/*_mdup.bam; do
 
echo "# processing ${b}"
 
# subset and index ONLY chr21 calls
# also produce a text summary with samtools flagstat
samtools view -b ${b} chr21 > ${b%%.bam}_chr21.bam && \
 	samtools index ${b%%.bam}_chr21.bam && \
 	samtools flagstat ${b%%.bam}_chr21.bam >  ${b%%.bam}_chr21.flagstat.txt
done

samtools flagstat results for the chr21 subset (full data)

# flagstat results for: shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam
 
14770127 + 0 in total (QC-passed reads + QC-failed reads)
14124 + 0 duplicates
14740138 + 0 mapped (99.80%:-nan%)
14770127 + 0 paired in sequencing
7385096 + 0 read1
7385031 + 0 read2
14690200 + 0 properly paired (99.46%:-nan%)
14710149 + 0 with itself and mate mapped
29989 + 0 singletons (0.20%:-nan%)
11843 + 0 with mate mapped to a different chr
11667 + 0 with mate mapped to a different chr (mapQ>=5)
 
# flagstat results for: shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam
 
14861665 + 0 in total (QC-passed reads + QC-failed reads)
13154 + 0 duplicates
14845463 + 0 mapped (99.89%:-nan%)
14861665 + 0 paired in sequencing
7431627 + 0 read1
7430038 + 0 read2
14700219 + 0 properly paired (98.91%:-nan%)
14829215 + 0 with itself and mate mapped
16248 + 0 singletons (0.11%:-nan%)
46189 + 0 with mate mapped to a different chr
42441 + 0 with mate mapped to a different chr (mapQ>=5)

samtools flagstat results for the 10% chr21 subset

# flagstat results for: shuffled_NA18507_GAIIx_100_chr21_aln_mdup_chr21.bam
 
1475974 + 0 in total (QC-passed reads + QC-failed reads)
184 + 0 duplicates
1472936 + 0 mapped (99.79%:-nan%)
1475974 + 0 paired in sequencing
737967 + 0 read1
738007 + 0 read2
1467878 + 0 properly paired (99.45%:-nan%)
1469898 + 0 with itself and mate mapped
3038 + 0 singletons (0.21%:-nan%)
1210 + 0 with mate mapped to a different chr
1194 + 0 with mate mapped to a different chr (mapQ>=5)
 
# flagstat results for: shuffled_NA18507_GAIIx_100_chr21_mem_mdup_chr21.bam
 
1485431 + 0 in total (QC-passed reads + QC-failed reads)
178 + 0 duplicates
1483795 + 0 mapped (99.89%:-nan%)
1485431 + 0 paired in sequencing
742775 + 0 read1
742656 + 0 read2
1469154 + 0 properly paired (98.90%:-nan%)
1482155 + 0 with itself and mate mapped
1640 + 0 singletons (0.11%:-nan%)
4663 + 0 with mate mapped to a different chr
4291 + 0 with mate mapped to a different chr (mapQ>=5)

what did we learn here?

Cli tools.png variable content can be manipulated like in the 'indexing part' above

  • ${b%%.bam} # removes the final .bam from the variable name
  • ${b##.bam} # would keep ONLY the final .bam from the name (not very useful here!)
  • look for more on the provided Bash_Quick_Reference.pdf

Quickly compare the BWA aln and BWA mem algorithm results

The Picard tool CompareSAMs compares the headers of the two input SAM or BAM files, and, if possible, the SAMRecords. For SAMRecords, compares only the readUnmapped flag, reference name, start position and strand. Reports the number of SAMRecords that match, differ in alignment, are mapped in only one input, or are missing in one of the files.

Handicon.png Only the subset of hg19 mappings attributed to chr21 is analyzed here using data obtained by the trainer

#! /usr/bin/env bash
# script: picard_CompareSAMs.sh
 
# required: 
# Picard Version: 1.112+
 
infolder=Final_Results/hg19_bwa-mapping/
outfolder=hg19_bwa-mapping
 
# full
#name=shuffled_PE_NA18507_GAIIx_100_chr21
 
#sample
name=shuffled_10pc_PE_NA18507_GAIIx_100_chr21
 
java -Xmx4g -jar $PICARD/CompareSAMs.jar \
	${infolder}/${name}_aln-pe_mdup_chr21.bam \
	${infolder}/${name}_mem-pe_mdup_chr21.bam >\
	${outfolder}/CompareSAMs_${name}.txt \
	2>${outfolder}/CompareSAMs_${name}.err

comparison results for sample & all and both bwa methods

# CompareSAMs_shuffled_10pc_PE_NA18507_GAIIx_100_chr21.txt 
Sample for read group NA18507 differs.
hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln
hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.mem
Number of program records differs.
hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: 3
hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 2
Match	1454476
Differ	12746
Unmapped_both	16
Unmapped_left	2387
Unmapped_right	214
Missing_left	8682
Missing_right	2351
SAM files differ.
 
# CompareSAMs_shuffled_PE_NA18507_GAIIx_100_chr21.txt 
Sample for read group NA18507 differs.
hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln
hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.mem
Number of program records differs.
hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: 3
hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 2
Match	14555662
Differ	127423
Unmapped_both	153
Unmapped_left	23500
Unmapped_right	2131
Missing_left	85465
Missing_right	23604

The comparison shows >97% agreement between both results but apparently the mem algorithm maps more reads than the aln method to the chr21 target based on comparing the resulting BAM files.

download exercise files

Download exercise files here

Use the right application to open the files present in ex3b-files

References:
  1. http://picard.sourceforge.net/explain-flags.html
  2. http://www.imagemagick.org/script/index.php
  3. http://qualimap.bioinfo.cipf.es

[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3c | NGS Exercise.4 ]