NGS Exercise.3b
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3c | NGS Exercise.4 ]
# updated 2014 version
Perform QC on the BWA alignment data
Contents
- 1 Introduction
- 2 use samtools to rapidely estimate the quality of BWA mapping results
- 3 summarise mapping results and perform mapping QC using Picard tools
- 4 Perform integrated BAM QC with Qualimap
- 5 extract all reads mapping to chr21
- 6 Quickly compare the BWA aln and BWA mem algorithm results
- 7 download exercise files
Introduction
Each mapping is unique and can be biased by many unknown factors. Performing QC is primordial in each experiment to avoid carryover of errors to the downstream analyses.
use samtools to rapidely estimate the quality of BWA mapping results
Review duplicate counts in both mapping runs
duplicate metrics for sample & all and both bwa methods
infolder="Final_Results/hg19_bwa-mapping" for metrics in ${infolder}/*_duplicate_metrics.txt; do echo -e "# counts for: \n# "${metrics} # keep only lines starting ('^') with 'lib-' or 'LIBRARY' grep -E "^lib-|LIBRARY" ${metrics} | transpose -t | column -t echo done
what did we learn here?
# counts for: # hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_duplicate_metrics.txt LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 3787 READ_PAIRS_EXAMINED 737426 UNMAPPED_READS 19603 UNPAIRED_READ_DUPLICATES 146 READ_PAIR_DUPLICATES 29 READ_PAIR_OPTICAL_DUPLICATES 10 PERCENT_DUPLICATION 0.000138 ESTIMATED_LIBRARY_SIZE 14309816538 # counts for: # hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_duplicate_metrics.txt LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 1734 READ_PAIRS_EXAMINED 747327 UNMAPPED_READS 1854 UNPAIRED_READ_DUPLICATES 105 READ_PAIR_DUPLICATES 60 READ_PAIR_OPTICAL_DUPLICATES 12 PERCENT_DUPLICATION 0.00015 ESTIMATED_LIBRARY_SIZE 5817247850 # counts for: hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_duplicate_metrics.txt LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 37324 READ_PAIRS_EXAMINED 7379933 UNMAPPED_READS 193004 UNPAIRED_READ_DUPLICATES 8673 READ_PAIR_DUPLICATES 3099 READ_PAIR_OPTICAL_DUPLICATES 1058 PERCENT_DUPLICATION 0.001005 ESTIMATED_LIBRARY_SIZE 13336049883 # counts for: hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_duplicate_metrics.txt LIBRARY lib-NA18507 UNPAIRED_READS_EXAMINED 17268 READ_PAIRS_EXAMINED 7477196 UNMAPPED_READS 18534 UNPAIRED_READ_DUPLICATES 4725 READ_PAIR_DUPLICATES 5428 READ_PAIR_OPTICAL_DUPLICATES 1310 PERCENT_DUPLICATION 0.001041 ESTIMATED_LIBRARY_SIZE 6783431981
Alternatively, changing to 'REMOVE_DUPLICATES=TRUE' physically removes the duplicates from the output. This physical removal is not required for variant calling and is only added here for information.
use samtools and some [R] code to compute the distribution of insert distances between paired read
The following code was adapted from an online tutorial files <http://www.embl.de/~rausch> and is presented as a prototype for [R] analysis of NGS data. Note that Picard CollectInsertSizeMetrics does something very similar and produces a nice histogram plot.
#! /usr/bin/env bash ## script: 'comp_insert-distance.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-04-08 # required: # Samtools: Version: 0.1.19-96b5f2294a # extract mate distances from BAM mappings # run both commands in terminal before running the following R code # fetch data pre-computed by the trainer infolder="Final_Results/hg19_bwa-mapping" # create a new folder for your own results outfolder="hg19_bwa-mapping" mkdir -p ${outfolder} ## full aln data #infolder="Final_Results/hg19_bwa-mapping/full_chromosome" #infile=shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam #pfx="all" #outfolder="hg19_bwa-mapping/full_chromosome" # 10% aln sample infile=shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam pfx="sample" ## the following piped command will: # keep first in pair from paired reads from bam # cut the 9th column TLEN='observed Template LENgth' (with a minus sign when mapped to the reverse strand) # remove leading dash when in reverse direction # store the obtained numbers to a file samtools view -f 66 ${infolder}/${infile} | \ cut -f 9 | sed "s/^-//" > ${outfolder}/${pfx}_aln_inserts.txt ## full mem data #infolder="Final_Results/hg19_bwa-mapping/full_chromosome" #infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam #pfx="all" # 10% mem sample infile=shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam pfx="sample" samtools view -f 66 ${infolder}/${infile} | \ cut -f 9 | sed "s/^-//" > ${outfolder}/${pfx}_mem_inserts.txt
The first samtools commands parse the results and generate summary text files with counts for 'read mapped in proper pair' & 'first in pair' (filter flag=66, see http://picard.sourceforge.net/explain-flags.html[1], BITS mirrored copy).
The following [R] script takes the prepared summary files and generates plots.
#!/usr/bin/Rscript # script: scripts/mate-distance.R # required: # 'comp_insert-distance.sh' run before this script # please modify the path to fit your setup # trainer # base="/home/splaisan/Desktop/NGS_DNASeq-training/hg19_bwa-mapping/" # full chromosome samples # base="/home/bits/NGS_DNASeq-training/hg19_bwa-mapping/full_chromosome/" # 10 pct samples # training laptops, running from the $BASE folder base="hg19_bwa-mapping/" setwd(base) # limit plotting to max 500x coverage maxcvg=500 #pfx="all_" pfx="sample_" xfile=paste(base, pfx, "aln_inserts.txt", sep="") pngfile=paste(base, pfx, "aln_inserts.png", sep="") sumfile=paste(base, pfx, "aln_stats.txt", sep="") x2file=paste(base, pfx, "mem_inserts.txt", sep="") png2file=paste(base, pfx, "mem_inserts.png", sep="") sum2file=paste(base, pfx, "mem_stats.txt", sep="") x = scan(xfile) png(file=pngfile, bg="white") op<-par(mfrow=c(2,1)) hist(x[x < maxcvg], xlim=c(0, maxcvg), main="insert sizes for BWA-aln", xlab="insert size (bps)") boxplot (x, ylim=c(0, maxcvg), horizontal=T) dev.off() sink(file=sumfile) print("# quartiles") summary(x) print("# standard deviation") sd(x) print("# median absolute deviation") mad(x) sink() x2 = scan(x2file) png(file=png2file, bg="white") op<-par(mfrow=c(2,1)) hist(x2[x2 < maxcvg], xlim=c(0, maxcvg), main="insert sizes for BWA-mem", xlab="insert size (bps)") boxplot (x2[x2 < maxcvg], ylim=c(0, maxcvg), horizontal=T) dev.off() sink(file=sum2file) print("# quartiles") summary(x2) print("# standard deviation") sd(x2) print("# median absolute deviation") mad(x) sink()
Results for the two sample runs are shown below
comparison for sample between aln and mem methods | |
---|---|
[1] "# quartiles" Min. 1st Qu. Median Mean 3rd Qu. Max. 100.0 301.0 312.0 311.4 324.0 426.0 [1] "# standard deviation" [1] 19.49675 [1] "# median absolute deviation" [1] 16.3086 |
[1] "# quartiles" Min. 1st Qu. Median Mean 3rd Qu. Max. 0 301 312 374 324 38170000 [1] "# standard deviation" [1] 37085.36 [1] "# median absolute deviation" [1] 16.3086 |
comparison for all between aln and mem methods | |
---|---|
[1] "# quartiles" Min. 1st Qu. Median Mean 3rd Qu. Max. 0 301 312 374 324 38170000 [1] "# standard deviation" [1] 37085.36 [1] "# median absolute deviation" [1] 16.3086 |
[1] "# quartiles" Min. 1st Qu. Median Mean 3rd Qu. Max. 0 301 312 374 324 38170000 [1] "# standard deviation" [1] 37085.36 [1] "# median absolute deviation" [1] 16.3086 |
summarise mapping results and perform mapping QC using Picard tools
As shown before, the samtools flagstat command identifies library issues leading to a lot of unmapped reads (obviously not the case for this pre-filtered dataset). for more complex issues, other types of control are necessary adn can be performed using specific picard tools.
apply Picard CollectGcBiasMetrics
Monotonous GC/AT rich regions lead to ambiguous mapping and eventually to no-mapping at all if they exceed the mapper capabilities. This is a well known problem in short read sequencing and although paired end reads limit this bias, it is, together together with repeats still responsible for the absence of coverage in highly repetitive regions, telomers, and centromers. The following QC run will show the limits of mappability associated with GC content for a given BAM file and possibly identify a technical issue.
CollectGcBiasMetrics is a High level metrics that capture how biased the coverage in a certain BAM file is.
#! /usr/bin/env bash ## script: 'picard_CollectGcBiasMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-05-23 # Picard Version: 1.112+ #full chromosome sample #infolder=Final_Results/hg19_bwa-mapping/full_chromosome #outfolder=bwa-mappingQC/full chromosome #10 pct sample infolder=Final_Results/hg19_bwa-mapping outfolder=hg19_bwa-mapping/bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # for each '_mdup.bam' file in folder for b in ${infolder}/*_mdup.bam; do outname=$(basename ${b} .bam) echo "# processing ${b}" # send errors log file with '2>' java -Xmx4g -jar $PICARD/picard.jar CollectGcBiasMetrics \ VERBOSITY=ERROR \ R=${ref} \ I=${b} \ O=${outfolder}/${outname}_GCbias.txt \ CHART=${outfolder}/${outname}_GCbias.pdf \ S=${outfolder}/${outname}_GCbias_summary.txt \ 2>${outfolder}/${outname}_CollectGcBiasMetrics.err done
what did we learn here?
basename - strip directory and suffix from filenames
- 'basename ${b} .bam' returns the name of ${b} without path and without .bam
- we could use 'basename' on multiple files too
- with '-a' works on all provided files
- with '-s .bam' removes all '.bam' suffices
(eg basename -s .bam -a myfolder/*.bam # will return the file names without .bam for all bam files in myfolder)
The CollectGcBiasMetrics program produces 2 text files and a corresponding graph in pdf format.
comparison for sample & all between aln and mem methods | |
---|---|
apply Picard CollectAlignmentSummaryMetrics
This collects High level metrics about the alignment of reads within a SAM file. The result is a table that can be transposed for better readability (below)
#! /usr/bin/env bash ## script: 'picard_CollectAlignmentSummaryMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-05-23 # Picard Version: 1.112+ # transpose #full chromosome sample #infolder=Final_Results/hg19_bwa-mapping/full_chromosome #outfolder=bwa-mappingQC/full chromosome #10 pct sample # process all bam files in 'hg19_bwa-mapping' #infolder=hg19_bwa-mapping infolder=Final_Results/hg19_bwa-mapping outfolder=bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa for b in ${infolder}/*_mdup.bam; do outname=$(basename ${b} .bam) echo "# processing ${b}" # collect information (java -Xmx4g -jar $PICARD/picard.jar CollectAlignmentSummaryMetrics \ R=${ref} \ I=${b} \ O=${outfolder}/${outname}_AlignmentSummaryMetrics.txt \ 2>${outfolder}/${outname}_CollectAlignmentSummaryMetrics.err) # ignore header and empty lines and rotate the remaining content grep -v -E '(^#|^$)' \ ${outfolder}/${outname}_AlignmentSummaryMetrics.txt | \ transpose -t | \ column -t > \ ${outfolder}/tr_${outname}_AlignmentSummaryMetrics.txt # end of loop done
what did we learn here?
more regular expression symbols
^ (Caret) = match expression at the start of a line, as in ^A. $ (Question) = match expression at the end of a line, as in A$. \ (Back Slash) = turn off the special meaning of the next character, as in \^. [ ] (Brackets) = match any one of the enclosed characters, as in [aeiou]. Use Hyphen "-" for a range, as in [0-9]. [^ ] = match any one character except those enclosed in [ ], as in [^0-9]. . (Period) = match a single character of any value, except end of line. * (Asterisk) = match zero or more of the preceding character or expression. \{x,y\} = match x to y occurrences of the preceding. \{x\} = match exactly x occurrences of the preceding. \{x,\} = match x or more occurrences of the preceding.
As often, text results are obfuscating and more clearly shown using the bash utility command column -t after transposition
Counts for sample and all and bwa aln and bwa mem analyses
## for f in ${outfolder}/tr*; do echo "# "$f; cat $f; echo; done # tr_shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_AlignmentSummaryMetrics.txt CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 749121 749121 1498242 PF_READS 749121 749121 1498242 PCT_PF_READS 1 1 1 PF_NOISE_READS 3 0 3 PF_READS_ALIGNED 740309 738330 1478639 PCT_PF_READS_ALIGNED 0.988237 0.985595 0.986916 PF_ALIGNED_BASES 73821765 73278262 147100027 PF_HQ_ALIGNED_READS 732782 731014 1463796 PF_HQ_ALIGNED_BASES 73080734 72573054 145653788 PF_HQ_ALIGNED_Q20_BASES 71927567 70594488 142522055 PF_HQ_MEDIAN_MISMATCHES 76 75 76 PF_MISMATCH_RATE 0.788667 0.788507 0.788587 PF_HQ_ERROR_RATE 0.788738 0.788588 0.788664 PF_INDEL_RATE 0.000252 0.000251 0.000252 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 737426 737426 1474852 PCT_READS_ALIGNED_IN_PAIRS 0.996106 0.998776 0.997439 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.499779 0.500205 0.499992 PCT_CHIMERAS 0.001745 0.001745 0.001745 PCT_ADAPTER 0 0.000001 0.000001 SAMPLE LIBRARY READ_GROUP # tr_shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_AlignmentSummaryMetrics.txt CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 749121 749121 1498242 PF_READS 749121 749121 1498242 PCT_PF_READS 1 1 1 PF_NOISE_READS 0 0 0 PF_READS_ALIGNED 748909 747479 1496388 PCT_PF_READS_ALIGNED 0.999717 0.997808 0.998763 PF_ALIGNED_BASES 74537188 73945391 148482579 PF_HQ_ALIGNED_READS 740170 738549 1478719 PF_HQ_ALIGNED_BASES 73753309 73165657 146918966 PF_HQ_ALIGNED_Q20_BASES 72540468 71162933 143703401 PF_HQ_MEDIAN_MISMATCHES 76 75 75 PF_MISMATCH_RATE 0.788767 0.788523 0.788646 PF_HQ_ERROR_RATE 0.788689 0.788462 0.788576 PF_INDEL_RATE 0.000279 0.000271 0.000275 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 747327 747327 1494654 PCT_READS_ALIGNED_IN_PAIRS 0.997888 0.999797 0.998841 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.500136 0.500712 0.500424 PCT_CHIMERAS 0.007445 0.007445 0.007445 PCT_ADAPTER 0 0 0 SAMPLE LIBRARY READ_GROUP # tr_shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_AlignmentSummaryMetrics.txt CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 7495097 7495097 14990194 PF_READS 7495097 7495097 14990194 PCT_PF_READS 1 1 1 PF_NOISE_READS 28 23 51 PF_READS_ALIGNED 7408248 7388942 14797190 PCT_PF_READS_ALIGNED 0.988413 0.985837 0.987125 PF_ALIGNED_BASES 738774576 733365277 1472139853 PF_HQ_ALIGNED_READS 7332524 7315616 14648140 PF_HQ_ALIGNED_BASES 731325109 726297327 1457622436 PF_HQ_ALIGNED_Q20_BASES 719764321 706528172 1426292493 PF_HQ_MEDIAN_MISMATCHES 76 75 76 PF_MISMATCH_RATE 0.788711 0.788536 0.788624 PF_HQ_ERROR_RATE 0.78878 0.78862 0.7887 PF_INDEL_RATE 0.000252 0.000254 0.000253 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 7379933 7379933 14759866 PCT_READS_ALIGNED_IN_PAIRS 0.996178 0.998781 0.997478 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.500208 0.499733 0.499971 PCT_CHIMERAS 0.001717 0.001717 0.001717 PCT_ADAPTER 0 0 0 SAMPLE LIBRARY READ_GROUP # tr_shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_AlignmentSummaryMetrics.txt CATEGORY FIRST_OF_PAIR SECOND_OF_PAIR PAIR TOTAL_READS 7495097 7495097 14990194 PF_READS 7495097 7495097 14990194 PCT_PF_READS 1 1 1 PF_NOISE_READS 0 0 0 PF_READS_ALIGNED 7492960 7478700 14971660 PCT_PF_READS_ALIGNED 0.999715 0.997812 0.998764 PF_ALIGNED_BASES 745771802 739837082 1485608884 PF_HQ_ALIGNED_READS 7406255 7389885 14796140 PF_HQ_ALIGNED_BASES 737997945 732085200 1470083145 PF_HQ_ALIGNED_Q20_BASES 725876220 712113681 1437989901 PF_HQ_MEDIAN_MISMATCHES 76 75 75 PF_MISMATCH_RATE 0.788805 0.788566 0.788686 PF_HQ_ERROR_RATE 0.788749 0.788513 0.788632 PF_INDEL_RATE 0.000278 0.000274 0.000276 MEAN_READ_LENGTH 100 100 100 READS_ALIGNED_IN_PAIRS 7477196 7477196 14954392 PCT_READS_ALIGNED_IN_PAIRS 0.997896 0.999799 0.998847 BAD_CYCLES 0 0 0 STRAND_BALANCE 0.50058 0.500195 0.500388 PCT_CHIMERAS 0.007323 0.007323 0.007323 PCT_ADAPTER 0 0 0 SAMPLE LIBRARY READ_GROUP
apply Picard CollectMultipleMetrics
Alternatively, one can run 4 of the 5 Picard metric QC commands in one go by using the global Picard command CollectMultipleMetrics that will generate PDF plots using R (CollectGcBiasMetrics.jar should be run separately and with a coordinate sorted BAM as input).
#! /usr/bin/env bash # script: 'picard_CollectMultipleMetrics.sh' ## ©SP-BITS, 2013 v1.0 # last edit: 2014-05-23 # Picard Version: 1.112+ #full chromosome sample #infolder=Final_Results/hg19_bwa-mapping/full_chromosome #outfolder=bwa-mappingQC/full chromosome #10 pct sample #infolder=hg19_bwa-mapping infolder=Final_Results/hg19_bwa-mapping outfolder=bwa-mappingQC mkdir -p ${outfolder} ref=ref/HiSeq_UCSC_hg19.fa # OPT: PROGRAM={CollectAlignmentSummaryMetrics, CollectInsertSizeMetrics, # QualityScoreDistribution, MeanQualityByCycle} for b in ${infolder}/*_mdup.bam; do outname=$(basename ${b} .bam) echo "# processing ${b}" # run in parallel if the computer has enough capacity ( java -Xmx4g -jar $PICARD/CollectMultipleMetrics.jar \ R=${ref} \ I=${b} \ O=${outfolder}/${outname} \ PROGRAM=CollectAlignmentSummaryMetrics \ PROGRAM=CollectInsertSizeMetrics \ PROGRAM=QualityScoreDistribution \ PROGRAM=MeanQualityByCycle \ 2>${outfolder}/${outname}_CollectMultipleMetrics.err ) # rem: you could add '&' after the closing ')' to run such lighter command in parallel done
Picard results are stored in separate PDF files (tip: use 'evince filename.pdf' in terminal to open PDF files)
how can I convert all those PDF files to lightweight PNG?
for p in *.pdf; do convert -density 300 -depth 8 -quality 100 -resize 25% $p ${p%%.pdf}.png; done
Even better, create a custom function to ease this and add it to your .myfunctions home file (done!)
pdf2png () { # convert PDF to PNG (all pages with suffix) using ImageMagick. # take PDF file from 1 # use PDF file name for PNG with extension swap # or take PNG file name from $2 if [ $# -ge 1 ] then convert -density 300 -depth 8 -quality 100 -resize 25% ${1} ${2:-${1%%.pdf}.png}; else echo "usage: pdf2png <input.pdf> <opt:output.png>" fi }
comparison for 10pc samples between aln and mem methods | |
---|---|
comparison for all between aln and mem methods | |
---|---|
As seen above, both sample and full data give very similar results for all metrics.
Perform integrated BAM QC with Qualimap
Like what FastQC does for FastQ data, Qualimap can integrate several quality control steps of your BAM data in one go. Qualimap[3] is a platform-independent application written in Java and R that provides both a Graphical User Inteface (GUI, online help page) for the shy and hurried among you and a command-line interface (online help page) to facilitate the quality control of alignment sequencing data (PDF manual link). We next will show the use of the CLI version for basic BAM QC but as seen in the hidden documentation below, Qualimap proposes many more very attractive tools that are left for you to explore.
full qualimap command line help
################################# ## To show available tools use command: qualimap --help QualiMap v.0.8 Built on 2014-03-05 17:17 usage: qualimap <tool> [options] To launch GUI leave <tool> empty. Available tools: bamqc Evaluates NGS mapping to a reference genome counts Counts data analysis (RNA-seq data evaluation) clustering Clustering epigenomic signals comp-counts Compute feature counts Special arguments: --java-mem-size Use this argument to set Java memory heap size. Example: qualimap bamqc -bam very_large_alignment.bam --java-mem-size=4G ### BAM QC ####################################### ## The following command allows to perform BAM QC analysis: usage: qualimap bamqc -bam <arg> [-c] [-gd <arg>] [-gff <arg>] [-nr <arg>] [-nt <arg>] [-nw <arg>] [-os] [-outdir <arg>] [-outformat <arg>] -bam <arg> input mapping file -c,--paint-chromosome-limits paint chromosome limits inside charts -gd <arg> compare with genome distribution (possible values: HUMAN or MOUSE) -gff <arg> region file (in GFF/GTF or BED format) -hm <arg> minimum size for a homopolymer to be considered in indel analysis (default is 3) -nr <arg> number of reads in the chunk (default is 500) -nt <arg> number of threads (default equals the number of cores) -nw <arg> number of windows (default is 400) -os,--outside-stats compute region outside stats (only with -gff option) -outdir <arg> output folder -outformat <arg> output report format (PDF or HTML, default is HTML) -p <arg> specify protocol to calculate correct strand reads (works only with -gff option, possible values are STRAND-SPECIFIC-FORWARD or STRAND-SPECIFIC-REVERSE, default is NON-STRAND-SPECIFIC) The only required parameter is bam – the input mapping file. If outdir is not provided, it will be created automatically in the same folder where BAM file is located. Detailed explanation of available options can be found here. Example (data available here): qualimap bamqc -bam ERR089819.bam -c ### Counts QC ########################################################## ## To perform counts QC analysis (evaluation of RNA-seq data) use the following command: usage: qualimap counts -d1 <arg> [-d2 <arg>] [-i <arg>] [-k <arg>] [-n1 <arg>] [-n2 <arg>] [-outdir <arg>] [-outformat <arg>] [-s <arg>] -d1,--data1 <arg> first file with counts -d2,--data2 <arg> second file with counts -i,--info <arg> info file -k,--threshold <arg> threshold for the number of counts -n1,--name1 <arg> name for the first sample -n2,--name2 <arg> name for second sample -outdir <arg> output folder -outformat <arg> output report format (PDF or HTML, default is HTML) -s,--species <arg> use default file for the given species [human | mouse] Detailed explanation of available options can be found here. Example (data available here): qualimap counts -d1 kidney.counts -d2 liver.counts -s human -outdir results ### Clustering ############################################## ## To perform clustering of epigenomic signals use the following command: usage: qualimap clustering [-b <arg>] [-c <arg>] -control <arg> [-expr <arg>] [-f <arg>] [-l <arg>] [-name <arg>] [-outdir <arg>] [-outformat <arg>] [-r <arg>] -regions <arg> -sample <arg> [-viz <arg>] -b,--bin-size <arg> size of the bin (default is 100) -c,--clusters <arg> comma-separated list of cluster sizes -control <arg> comma-separated list of control BAM files -expr <arg> name of the experiment -f,--fragment-length <arg> smoothing length of a fragment -l <arg> upstream offset (default is 2000) -name <arg> comma-separated names of the replicates -outdir <arg> output folder -outformat <arg> output report format (PDF or HTML, default is HTML) -r <arg> downstream offset (default is 500) -regions <arg> path to regions file -sample <arg> comma-separated list of sample BAM files -viz <arg> visualization type: heatmap or line Detailed explanation of available options can be found here. Example (data available here): qualimap clustering -sample clustering/hmeDIP.bam -control clustering/input.bam -regions annotations/transcripts.human.64.bed -outdir clustering_result ### Compute counts ###################################### ## To compute counts from mapping data use the following command: usage: qualimap comp-counts [-algorithm <arg>] -bam <arg> -gtf <arg> [-id <arg>] [-out <arg>] [-protocol <arg>] [-type <arg>] -algorithm <arg> uniquely-mapped-reads(default) or proportional -b calculate 5' and 3' coverage bias -bam <arg> mapping file in BAM format) -gtf <arg> region file in GTF format -id <arg> attribute of the GTF to be used as feature ID. Regions with the same ID will be aggregated as part of the same feature. Default: gene_id. -out <arg> path to output file -protocol <arg> forward-stranded,reverse-stranded or non-strand-specific -type <arg> Value of the third column of the GTF considered for counting. Other types will be ignored. Default: exon Detailed explanation of available options can be found here. Example (data available here): qualimap comp-counts -bam kidney.bam -gtf ../annotations/human.64.gtf -out kidney.counts
The Qualimap GUI is rather simple to use and returns integrated results similarly to FastQC.
A QC on the sample BAM files can be performed in the GUI or with a simple command
Current human GRCh37.75 GTF annotations were downloaded from ensEMBL and converted to hg19 notation
# run the following code from the base folder (~/bits/NGS_DNASeq-training) # get the data from the ensEMBL FTP and save it in the ref folder wget -P ref ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz # add chr and change MT to chrM and recompress zcat ref/Homo_sapiens.GRCh37.75.gtf.gz | awk '{print "chr"$0}' | sed 's/chrMT/chrM/g' | bgzip -c > ref/Homo_sapiens.hg19.gtf.gz # subset chr21 for our purpose and keep it in uncompressed zgrep "^chr21\b" ref/Homo_sapiens.hg19.gtf.gz > ref/Homo_sapiens.hg19_chr21.gtf
We now combine the chr21 GTF subset and both '_mdup.bam' files with Qualimap bamqc.
#! /usr/bin/env bash ## script: 'qualimap_bamqc.sh' ## ©SP-BITS, 2014 v1.0 # last edit: 2014-04-09 # required: # qualimap 0.8 # R dependencies #full chromosome sample #infolder=Final_Results/hg19_bwa-mapping/full_chromosome #outfolder=bwa-mapping-qualimapQC/full chromosome #10 pct sample # process all bam files in 'hg19_bwa-mapping' infolder=Final_Results/hg19_bwa-mapping outfolder=bwa-mapping-qualimapQC mkdir -p ${outfolder} mygtf=ref/Homo_sapiens.hg19_chr21.gtf for b in ${infolder}/*_mdup.bam; do outname=$(basename ${b} .bam) echo "# processing ${b}" # read the full help to find more about other parameters # we output here to HTML but PDF is also available for portability and compactness # we restrict the analysis to chr21 using annotation data from the Qualimap site ## Human genome annotations from Ensembl database (v.64) qualimap bamqc --java-mem-size=4G \ -bam ${b} \ -gd HUMAN \ -c \ -gff ${mygtf} \ -outdir ${outfolder}/${outname}_qualimap-results \ -outformat HTML # for PDF, replace last line by # -outformat PDF # -outfile ${outname}_chr21_report.pdf \ done
Results of Qualimap analysis of 'all' chr21 mappings
extract all reads mapping to chr21
Because the re-mapping was done against the full human genome, some of the reads may now map to other chromosomes. For each method, we extract all reads mapping at least by one end to chr21 to a new bam file using samtools to get a clear chr21-specific picture out of our results.
#! /usr/bin/env bash # script: samtools_extract-chr21.sh # required: # samtools Version: 0.1.19-96b5f2294a infolder=Final_Results/hg19_bwa-mapping # process all files in 'infolder' whose name ends with '_mdup.bam' ## note that the relative path to the file is included in ${b} for b in ${infolder}/*_mdup.bam; do echo "# processing ${b}" # subset and index ONLY chr21 calls # also produce a text summary with samtools flagstat samtools view -b ${b} chr21 > ${b%%.bam}_chr21.bam && \ samtools index ${b%%.bam}_chr21.bam && \ samtools flagstat ${b%%.bam}_chr21.bam > ${b%%.bam}_chr21.flagstat.txt done
samtools flagstat results for the chr21 subset (full data)
# flagstat results for: shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup.bam 14770127 + 0 in total (QC-passed reads + QC-failed reads) 14124 + 0 duplicates 14740138 + 0 mapped (99.80%:-nan%) 14770127 + 0 paired in sequencing 7385096 + 0 read1 7385031 + 0 read2 14690200 + 0 properly paired (99.46%:-nan%) 14710149 + 0 with itself and mate mapped 29989 + 0 singletons (0.20%:-nan%) 11843 + 0 with mate mapped to a different chr 11667 + 0 with mate mapped to a different chr (mapQ>=5) # flagstat results for: shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup.bam 14861665 + 0 in total (QC-passed reads + QC-failed reads) 13154 + 0 duplicates 14845463 + 0 mapped (99.89%:-nan%) 14861665 + 0 paired in sequencing 7431627 + 0 read1 7430038 + 0 read2 14700219 + 0 properly paired (98.91%:-nan%) 14829215 + 0 with itself and mate mapped 16248 + 0 singletons (0.11%:-nan%) 46189 + 0 with mate mapped to a different chr 42441 + 0 with mate mapped to a different chr (mapQ>=5)
samtools flagstat results for the 10% chr21 subset
# flagstat results for: shuffled_NA18507_GAIIx_100_chr21_aln_mdup_chr21.bam 1475974 + 0 in total (QC-passed reads + QC-failed reads) 184 + 0 duplicates 1472936 + 0 mapped (99.79%:-nan%) 1475974 + 0 paired in sequencing 737967 + 0 read1 738007 + 0 read2 1467878 + 0 properly paired (99.45%:-nan%) 1469898 + 0 with itself and mate mapped 3038 + 0 singletons (0.21%:-nan%) 1210 + 0 with mate mapped to a different chr 1194 + 0 with mate mapped to a different chr (mapQ>=5) # flagstat results for: shuffled_NA18507_GAIIx_100_chr21_mem_mdup_chr21.bam 1485431 + 0 in total (QC-passed reads + QC-failed reads) 178 + 0 duplicates 1483795 + 0 mapped (99.89%:-nan%) 1485431 + 0 paired in sequencing 742775 + 0 read1 742656 + 0 read2 1469154 + 0 properly paired (98.90%:-nan%) 1482155 + 0 with itself and mate mapped 1640 + 0 singletons (0.11%:-nan%) 4663 + 0 with mate mapped to a different chr 4291 + 0 with mate mapped to a different chr (mapQ>=5)
what did we learn here?
variable content can be manipulated like in the 'indexing part' above
- ${b%%.bam} # removes the final .bam from the variable name
- ${b##.bam} # would keep ONLY the final .bam from the name (not very useful here!)
- look for more on the provided Bash_Quick_Reference.pdf
Quickly compare the BWA aln and BWA mem algorithm results
The Picard tool CompareSAMs compares the headers of the two input SAM or BAM files, and, if possible, the SAMRecords. For SAMRecords, compares only the readUnmapped flag, reference name, start position and strand. Reports the number of SAMRecords that match, differ in alignment, are mapped in only one input, or are missing in one of the files.
Only the subset of hg19 mappings attributed to chr21 is analyzed here using data obtained by the trainer
#! /usr/bin/env bash # script: picard_CompareSAMs.sh # required: # Picard Version: 1.112+ infolder=Final_Results/hg19_bwa-mapping/ outfolder=hg19_bwa-mapping # full #name=shuffled_PE_NA18507_GAIIx_100_chr21 #sample name=shuffled_10pc_PE_NA18507_GAIIx_100_chr21 java -Xmx4g -jar $PICARD/CompareSAMs.jar \ ${infolder}/${name}_aln-pe_mdup_chr21.bam \ ${infolder}/${name}_mem-pe_mdup_chr21.bam >\ ${outfolder}/CompareSAMs_${name}.txt \ 2>${outfolder}/CompareSAMs_${name}.err
comparison results for sample & all and both bwa methods
# CompareSAMs_shuffled_10pc_PE_NA18507_GAIIx_100_chr21.txt Sample for read group NA18507 differs. hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.mem Number of program records differs. hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: 3 hg19_bwa-mapping/shuffled_10pc_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 2 Match 1454476 Differ 12746 Unmapped_both 16 Unmapped_left 2387 Unmapped_right 214 Missing_left 8682 Missing_right 2351 SAM files differ. # CompareSAMs_shuffled_PE_NA18507_GAIIx_100_chr21.txt Sample for read group NA18507 differs. hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: GAIIx-chr21-BWA.aln hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: GAIIx-chr21-BWA.mem Number of program records differs. hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_aln-pe_mdup_chr21.bam: 3 hg19_bwa-mapping/shuffled_PE_NA18507_GAIIx_100_chr21_mem-pe_mdup_chr21.bam: 2 Match 14555662 Differ 127423 Unmapped_both 153 Unmapped_left 23500 Unmapped_right 2131 Missing_left 85465 Missing_right 23604
The comparison shows >97% agreement between both results but apparently the mem algorithm maps more reads than the aln method to the chr21 target based on comparing the resulting BAM files.
download exercise files
Download exercise files here
References:
- ↑ http://picard.sourceforge.net/explain-flags.html
- ↑ http://www.imagemagick.org/script/index.php
- ↑ http://qualimap.bioinfo.cipf.es
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.3 | NGS Exercise.3c | NGS Exercise.4 ]