NGS RNASeq DE Exercise.2
Map paired-end chr22 reads to the human hg19 reference genome using Tophat2
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.3 ]
Contents
Introduction
Mapping cDNA reads to a reference genome can be done against the full reference genome (unbiased) or by selecting a transcriptome model to map the reads that correspond to known transcripts (faster but biased). The alignment is done using Tophat that makes use of the Bowtie2 aligner while allowing splice-broken alignment across exon junctions. One additional feature of Tophat is the optional discovery of new splice events. The full experiment alignment was run on a high end server and the code is provided to you, the mapping results are not fully analyzed in this general protocol but rather used to extract a one-chromosome full read set that is processed during the session.
Analyze paired-end fastq read data with fastQC
As done on the full data, we perform a QC on the freshly extracted reads
run_chr22-fastqc.sh script
#!/bin/bash # run_fastqc.sh # full data #export RAWDATA=$(pwd)/SRP033351_fastq #export READQC=$(pwd)/SRP033351_read_qc-Results # trimmed data #export RAWDATA=$(pwd)/trimmed-SRP033351_fastq #export READQC=$(pwd)/trimmed-SRP033351_read_qc-Results # chr22 data export RAWDATA=$(pwd)/SRR1039509-chr22_fastq export READQC=$(pwd)/SRR1039509-chr22_read_qc-Results mkdir -p $READQC # create an empty error log cat /dev/null > $READQC/error.log for fq in $RAWDATA/*.fastq.gz; do name=${fq##*/} prefix=${name%%.fastq.gz} # perform a full QC control on the reads (-q for quiet, 4 threads) # then convert results to PDF echo "# run fastqc on ${fq}" cmd="fastqc -f fastq --noextract -t 4 -o $READQC -q ${fq}" echo "# ${cmd}" eval ${cmd} ## add conversion to PDF # htmldoc --webpage -f ${prefix}_fastqc.pdf ${prefix}_fastqc.html # generate statistics for plots # secret option '-Q 33' was added to fit with the phred scale echo "# checking: ${name}" zcat ${fq} | fastx_quality_stats -Q 33 -o $READQC/${name}_stats.txt \ 2>> $READQC/error.log # plot from the text summary file fastq_quality_boxplot_graph.sh \ -i $READQC/${name}_stats.txt \ -o $READQC/${prefix}_boxplot.png fastx_nucleotide_distribution_graph.sh \ -i $READQC/${name}_stats.txt \ -o $READQC/${prefix}_nuclplot.png # also plot normalized base frequency using R (script code in appendix) scripts/avgQdist2linePlot.R $READQC/${name}_stats.txt $READQC done # convert results to a nice PDF file # requires htmldoc and dependencies installed on your machine # htmldoc --webpage \ # --browserwidth 800 \ # --fontsize 7 \ # -f ${outfolder}/fastqc_report.pdf \ # ${outfolder}/fastqc_report.html
FastQC plots for SRR1039509_1 reads PDF | ||
---|---|---|
FastQC plots for SRR1039509_2 reads PDF | ||
---|---|---|
FastQC plots for chr22-SRR1039509_1 reads PDF | ||
---|---|---|
FastQC plots for chr22-SRR1039509_2 reads PDF | ||
---|---|---|
Map the paired FastQ reads to the human genome & transcriptome
The raw FastQ files obtained from the SRA data were mapped directly using Tophat2 and the associated Bowtie2 program without read trimming and cleaning. The effect of trimming on read mapping was analyzed on one sample and found to be relative as presented in NGS_RNASeq_DE_Exercise.1c.
We used here the latest versions of Tophat2 and bowtie2 as opposed to what is often done using Galaxy or reported in relatively outdated pages, our results will therefore slightly differ from those published originally by the authors
The mapping was done both against the entire hg19 genome sequence (run#1) or against the defined transcriptome as present in the Ensembl GTF model (run#2). Both methods are equally valid when users ultimately consider known genes and their documented transcripts (which is what we will do in this session) but the first method may present additional advantages when the user wish to discover new splice variants (not covered in this training) or exon junctions between separate genes as a result of gene fusions (not covered in this training).
The following scripts take each pair of FastQ files and performs mapping either against the full genome ('all') or against the transcript models ('gtf') and using the maximal capacity of the BITS laptops (1 thread and 4GB RAM). Note that if you use this code on a higher-end machine, you will need to edit the parameters controlling for memory usage and number of threads accordingly
tophat_map-all.sh script
#!/bin/bash # tophat_map-all.sh # the script should be called from the base folder # the base folder includes the 'reads' folder named below ## full data # reads="SRP033351_fastq" ## one sample #reads="SRR1039509_fastq" ## one sample chr22 subset reads="SRR1039509-chr22_fastq" results="tophat2.0.13_results" mkdir -p ${results} # folder and file options fasta=$BOWTIE2_INDEXES/hg19.fa index=$BOWTIE2_INDEXES/hg19 gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf gtfindex=$BOWTIE2_INDEXES/hg19_ensGene # how hard can your computer work? (max cpu# that may be used) nthr=1 # using 'nthr' processors maxram="4G" # on a stronger, this will run faster # nthr=24 # using 'nthr' processors # maxram="64G" # map single ends in (un-guided|guided) modes (not|using) the gtf file for fq1 in $(ls ${reads}/*_1.fastq.gz); do # get the base-name without end_suffix name=$(basename "${fq1}" "_1.fastq.gz") echo echo "# mapping the ${name} pair against the whole genome" # deduce second read file name fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/') # unguided mode # execute the command and write summary results in a 'log' file cmd="tophat -p ${nthr} \ -o ${results}/${name}_all-mappings \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-all-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results # run in background with good resource while starting the next loop on 1 thread if [ $? = 0 ]; then # use 4 threads and up to 4GB RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_all-mappings/accepted_hits.bam \ ${results}/${name}_all-tophat \ && samtools index ${results}/${name}_all-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_all-tophat.bam fi echo # print separator A=$(printf "%50s\n") echo ${A// /#} echo # abort after first loop debug-mode # exit 0 done
tophat_map-gtf.sh script
#!/bin/bash # tophat_map-all.sh # the script should be called from the base folder # the base folder includes the 'reads' folder named below ## one sample chr22 subset reads="SRR1039509-chr22_fastq" results="tophat2.0.13_results_gtf" mkdir -p ${results} # folder and file options fasta=$BOWTIE2_INDEXES/hg19.fa index=$BOWTIE2_INDEXES/hg19 gtffile=$BOWTIE2_INDEXES/hg19_ensGene.gtf gtfindex=$BOWTIE2_INDEXES/hg19_ensGene # how hard can your computer work? (max cpu# that may be used) nthr=1 # using 'nthr' processors maxram="4G" # on a stronger, this will run faster # nthr=24 # using 'nthr' processors # maxram="64G" # map single ends in (un-guided|guided) modes (not|using) the gtf file for fq1 in $(ls ${reads}/*_1.fastq.gz); do # get the base-name without end_suffix name=$(basename "${fq1}" "_1.fastq.gz") echo echo "# mapping the ${name} pair against the whole genome" # deduce second read file name fq2=$(echo "$fq1" | sed -r 's/_1.fastq.gz/_2.fastq.gz/') # guided mode with 'ensGene' as model # execute the command and write summary results in a 'log' file # first time -G ${gtffile} then --transcriptome-index ${gtfindex} cmd="tophat \ -p ${nthr} \ -o ${results}/${name}_gtf-mappings \ --transcriptome-index ${gtfindex} \ --no-coverage-search \ ${index} \ ${fq1} ${fq2} \ 2>&1 | tee -a ${results}/tophat-gtf-log_${name}.txt" echo "# ${cmd}" eval ${cmd} # if tophat was a success, sort, rename, and index mapping results # run in background with good resource while starting the next loop on 1 thread if [ $? = 0 ]; then # use 4 threads and up to 4GB RAM (default are 1 thread and 768M) samtools sort -@ ${nthr} -m ${maxram} \ ${results}/${name}_gtf-mappings/accepted_hits.bam \ ${results}/${name}_gtf-tophat \ && samtools index ${results}/${name}_gtf-tophat.bam # also perform basic BAM QC on mappings # the code is provided in a separate script 'mapping_metrics.sh' scripts/mapping_metrics.sh ${results}/${name}_gtf-tophat.bam fi echo # print separator A=$(printf "%50s\n") echo ${A// /#} echo done
Quick evaluation of the SRR1039509 mapping results for both reference models
Several rapid QC steps were added in the mapping loop to count mapped and mated read obtained with chr22 data from the N61311_Dex (SRR1039509) sample.
mapping_metrics.sh script
#! /usr/bin/env bash ## script: 'mapping_metrics.sh' ## ©SP-BITS, 2015-01-05 # required: # samtools 1.2 # Picard Version (broad build): 1.129 maxmem=4G infile=$1 filename=$(basename ${infile}) outbase="${infile:0:${#infile} - ${#filename}}" ## samtools flagstats ## cmd="samtools flagstat \ ${infile} \ >${infile%%.bam}_flagstat.txt" echo echo "# ${cmd}" eval ${cmd} ## samtools stats ## cmd="samtools stats \ ${infile} \ >${infile%%.bam}_stats.txt" echo echo "# ${cmd}" eval ${cmd} ## samtools plot-bamstats ## cmd="plot-bamstats \ ${infile%%.bam}_stats.txt \ -p ${infile%%.bam}_plots/" echo echo "# ${cmd}" eval ${cmd} ## plot mapping quality score distribution cmd="java -Xmx${maxmem} -jar $PICARD/picard.jar \ QualityScoreDistribution \ CHART=${infile%%.bam}-QS_distribution.pdf \ I=${infile} \ O=${infile%%.bam}-mappings_QS.txt" echo echo "# ${cmd}" eval ${cmd}
samtools plot-bamstats plots for N61311_Dex (SRR1039509) and full genome mapping | |||
---|---|---|---|
samtools plot-bamstats plots for N61311_Dex (SRR1039509) and ensGene mapping | |||
---|---|---|---|
Other tools are available to analyze and compare BAM data and are covered in the next exercise. Furthermore, a better evaluation will be possible after converting mapping data (BAM) to gene counts as detailed in a later exercise.
download exercise files
Download exercise files here.
References:
[ Main_Page | Hands-on_introduction_to_NGS_RNASeq_DE_analysis | NGS_RNASeq_DE_Exercise.1 | NGS_RNASeq_DE_Exercise.3 ]