Extract fastQ reads from BAM and perform basic read quality control

This exercise was removed from the session program but is left to teach you how to recycle public NGS data from the internet.

Introduction

This step was inspired by feedback from the GATK team. Please read Geraldine's comments on the GATK forum. Quoted: 'The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked'. ^[1]

shuffle a BAM file & extract reads back to fastQ format

Extracting reads from a sorted BAM file is not recommended when the reads need to be remapped to the reference genome. In such case, the reads should first be shuffled to remove positional bias in the resulting fastQ data as shown below.

hstcmd bamshuf used in a former training was part of a previous version of htslib/htscmd/bcftools that is no longer available. A replacement code can be obtained when using the latest samtools (1.x) which comes with a 'collate' command

shuffle bam records in pairs

# with full data
#infolder="Final_Results/ori-Illumina_hg18_mapping_results"
#outfile="shuffled_PE_NA18507_GAIIx_100_chr21.bam"
 
# if using the 10% subset, use the following lines
infolder="Final_Results/Illumina_hg18_mapping"
outfile="shuffled_NA18507_GAIIx_100_chr21.bam"
 
outfolder="Illumina_hg18_mapping"
mkdir -p ${outfolder}
 
infile="NA18507_10pc_GAIIx_100_chr21.bam"
 
# store the shuffled pieces to the /tmp folder
tmpfolder="/tmp"
 
# shuffle BAM and keep records in pairs
# create 128 intermediate files for shuffling with prefix 'shuffled_piece'
# reassemble the shuffled data to one new file
htscmd bamshuf -O -n128 ${infolder}/${infile} \
	${tmpfolder}/shuffled_piece > \
	${outfolder}/${outfile} \
	2>${outfolder}/htscmd_bamshuf-${infile%%.bam}.err

collate bam records in pairs using samtools 1.x

infolder="Illumina_hg18_mapping"
infile="NA18507_10pc_GAIIx_100_chr21.bam"
outfile="shuffled_NA18507_GAIIx_100_chr21.bam"
 
# collate BAM using 1000 intermediate files for shuffling
samtools collate -n1000 ${infolder}/${infile} ${infolder}/${infile%%.bam} \
	2>${infolder}/samtools_collate-${infile%%.bam}.err

create two separate fastQ files from shuffled BAM records

# write reads to the read folder with prefix
 
# with full data
# infolder="Final_Results/ori-Illumina_hg18_mapping_results"
# infile="shuffled_PE_NA18507_GAIIx_100_chr21.bam"
# outprefix="shuffled_NA18507_GAIIx_100_chr21"
 
# using the 10% subset
infolder="Final_Results/Illumina_hg18_mapping"
infile="shuffled_NA18507_GAIIx_100_chr21.bam"
outprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21"
 
# create new folder
outfolder=reads
mkdir -p ${outfolder}
 
# note the on the fly compression for the error log to save disk space with two '>' signs
( java -Xmx4g -jar $PICARD/picard.jar SamToFastq \
	I=${infolder}/${infile} \
	F=${outfolder}/${outprefix}_2_1.fq \
	F2=${outfolder}/${outprefix}_2_2.fq \
	VALIDATION_STRINGENCY=LENIENT ) \
	2>${outfolder}/SamToFastq-${infile%%.bam}.err
 
# We could have bee even smarter and compress the large error file on the fly
# to do so we replace the simple path to the error file by a nested command as shown below
#	2> >(gzip > ${outfolder}/SamToFastq-${infile%%.bam}.err.gz)
# where '>( ...)' redirects the large error log to a compression process between '()'
 
# gzip files if last command did exit without error ($? -eq 0)
# beware, there needs to be a space left and right from '$? -eq 0' to have a correct syntax
if [ $? -eq 0 ] then
	gzip ${outfolder}/${outprefix}_2_1.fq 
	gzip ${outfolder}/${outprefix}_2_2.fq
fi

what did we learn here?

Two new bash goodies here:

when a lot of errors are expected, the error log can be compressed on the fly with a kind of pipe of the form '>(command)'
we can test for the success of a command to decide what to do with its output (in the example, the big fastq data is replaced by the compressed .gz format ONLY when 'Picard SamToFastq' succeeded)

Two error types are reported in SamToFastq.err and can be safely ignored in our context

Ignoring SAM validation error: ERROR: Record 5, Read name EAS51_0210:6:93:16628:4680, Mapped mate should have mate reference name
Ignoring SAM validation error: ERROR: Record 5, Read name EAS51_0210:6:93:16628:4680, MAPQ should be 0 for unmapped read.

perform QC on the FastQ data

# do QC on the trainer full read files
# infolder="Final_Results/reads_results"
# inprefix="shuffled_PE_NA18507_GAIIx_100_chr21"
 
# do QC on the trainer 10%-sample read files
# infolder="Final_Results/reads_results"
# inprefix ="shuffled_10pc_PE_NA18507_GAIIx_100_chr21"
 
# do QC on your own files
infolder="reads"
inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21"
 
# create folder for QC-results inside the read folder
outfolder=readQC
mkdir -p ${infolder}/${outfolder}
 
# on each reads file
for fq in ${infolder}/${inprefix}*.fq.gz; do
	# perform QC test using 4 threads and zip results
	fastqc -t 4 -o ${infolder}/${outfolder} --noextract ${fq}
done

what did we learn here?

Using for-loops saves typing when multiple input exist:

do not forget the ';do' and 'done' markers required to delimit the for-loop block

inspect QC results

Shown here for all left-end reads.

# for all trainer reads
# infolder="Final_Results/reads_results"
# inprefix="shuffled_PE_NA18507_GAIIx_100_chr21"
# infile=${inprefix}"_2_1.fq_fastqc.zip"
# for the other end, replace by 
# infile=infile=${inprefix}"_2_2.fq_fastqc.zip"
 
# for the 10pc trainer sample
infolder="Final_Results/reads_results/readQC"
inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21"
infile=${inprefix}"_2_1.fq_fastqc.zip"
# for the other end, replace by 
# infile=infile=${inprefix}"_2_2.fq_fastqc.zip"
 
# for Your 10pc sample
# infolder="reads/readQC"
# inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21"
# infile=${inprefix}"_2_1_fastqc.zip"
# for the other end, replace by 
# infile=${inprefix}"_2_2_fastqc.zip"
 
# decompress one of the files
unzip ${infolder}/${infile} -d ${infolder}/
 
# inspect the content using the local web-browser
cd ${infolder}/${infile%%.zip}/ && firefox fastqc_report.html &

what did we learn here?

Another way to test for success

the 'firefox' call placed after '&&' is executed only if the left command succeeds
the closing '&' tells firefox to start in background and give us control back on the terminal (otherwise busy as long as firefox runs)

You should get something like this.

download exercise files

Download exercise files here

Use the right application to open the files present in ex2-files

References:

↑ http://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format

[1] ttp://gatkforums.broadinstitute.org/discussion/2908/howto-revert-a-bam-file-to-fastq-format

[1]

NGS Exercise.2

Contents

Introduction

shuffle a BAM file & extract reads back to fastQ format

shuffle bam records in pairs

collate bam records in pairs using samtools 1.x

create two separate fastQ files from shuffled BAM records

perform QC on the FastQ data

inspect QC results

download exercise files

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Resources

Toolbox