NGS Exercise.2
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.1 | NGS Exercise.3 ]
Extract fastQ reads from BAM and perform basic read quality control
This exercise was removed from the session program but is left to teach you how to recycle public NGS data from the internet.
Contents
Introduction
This step was inspired by feedback from the GATK team. Please read Geraldine's comments on the GATK forum. Quoted: 'The aligner uses blocks of paired reads to estimate the insert size. If you don’t shuffle your original bam, the blocks of insert size will not be randomly distributed across the genome, rather they will all come from the same region, biasing the insert size calculation. This is a very important step which is unfortunately often overlooked'. [1]
shuffle a BAM file & extract reads back to fastQ format
Extracting reads from a sorted BAM file is not recommended when the reads need to be remapped to the reference genome. In such case, the reads should first be shuffled to remove positional bias in the resulting fastQ data as shown below.
hstcmd bamshuf used in a former training was part of a previous version of htslib/htscmd/bcftools that is no longer available. A replacement code can be obtained when using the latest samtools (1.x) which comes with a 'collate' command
shuffle bam records in pairs
# with full data #infolder="Final_Results/ori-Illumina_hg18_mapping_results" #outfile="shuffled_PE_NA18507_GAIIx_100_chr21.bam" # if using the 10% subset, use the following lines infolder="Final_Results/Illumina_hg18_mapping" outfile="shuffled_NA18507_GAIIx_100_chr21.bam" outfolder="Illumina_hg18_mapping" mkdir -p ${outfolder} infile="NA18507_10pc_GAIIx_100_chr21.bam" # store the shuffled pieces to the /tmp folder tmpfolder="/tmp" # shuffle BAM and keep records in pairs # create 128 intermediate files for shuffling with prefix 'shuffled_piece' # reassemble the shuffled data to one new file htscmd bamshuf -O -n128 ${infolder}/${infile} \ ${tmpfolder}/shuffled_piece > \ ${outfolder}/${outfile} \ 2>${outfolder}/htscmd_bamshuf-${infile%%.bam}.err
collate bam records in pairs using samtools 1.x
infolder="Illumina_hg18_mapping" infile="NA18507_10pc_GAIIx_100_chr21.bam" outfile="shuffled_NA18507_GAIIx_100_chr21.bam" # collate BAM using 1000 intermediate files for shuffling samtools collate -n1000 ${infolder}/${infile} ${infolder}/${infile%%.bam} \ 2>${infolder}/samtools_collate-${infile%%.bam}.err
create two separate fastQ files from shuffled BAM records
# write reads to the read folder with prefix # with full data # infolder="Final_Results/ori-Illumina_hg18_mapping_results" # infile="shuffled_PE_NA18507_GAIIx_100_chr21.bam" # outprefix="shuffled_NA18507_GAIIx_100_chr21" # using the 10% subset infolder="Final_Results/Illumina_hg18_mapping" infile="shuffled_NA18507_GAIIx_100_chr21.bam" outprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21" # create new folder outfolder=reads mkdir -p ${outfolder} # note the on the fly compression for the error log to save disk space with two '>' signs ( java -Xmx4g -jar $PICARD/picard.jar SamToFastq \ I=${infolder}/${infile} \ F=${outfolder}/${outprefix}_2_1.fq \ F2=${outfolder}/${outprefix}_2_2.fq \ VALIDATION_STRINGENCY=LENIENT ) \ 2>${outfolder}/SamToFastq-${infile%%.bam}.err # We could have bee even smarter and compress the large error file on the fly # to do so we replace the simple path to the error file by a nested command as shown below # 2> >(gzip > ${outfolder}/SamToFastq-${infile%%.bam}.err.gz) # where '>( ...)' redirects the large error log to a compression process between '()' # gzip files if last command did exit without error ($? -eq 0) # beware, there needs to be a space left and right from '$? -eq 0' to have a correct syntax if [ $? -eq 0 ] then gzip ${outfolder}/${outprefix}_2_1.fq gzip ${outfolder}/${outprefix}_2_2.fq fi
what did we learn here?
- when a lot of errors are expected, the error log can be compressed on the fly with a kind of pipe of the form '>(command)'
- we can test for the success of a command to decide what to do with its output (in the example, the big fastq data is replaced by the compressed .gz format ONLY when 'Picard SamToFastq' succeeded)
Two error types are reported in SamToFastq.err and can be safely ignored in our context
- Ignoring SAM validation error: ERROR: Record 5, Read name EAS51_0210:6:93:16628:4680, Mapped mate should have mate reference name
- Ignoring SAM validation error: ERROR: Record 5, Read name EAS51_0210:6:93:16628:4680, MAPQ should be 0 for unmapped read.
perform QC on the FastQ data
# do QC on the trainer full read files # infolder="Final_Results/reads_results" # inprefix="shuffled_PE_NA18507_GAIIx_100_chr21" # do QC on the trainer 10%-sample read files # infolder="Final_Results/reads_results" # inprefix ="shuffled_10pc_PE_NA18507_GAIIx_100_chr21" # do QC on your own files infolder="reads" inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21" # create folder for QC-results inside the read folder outfolder=readQC mkdir -p ${infolder}/${outfolder} # on each reads file for fq in ${infolder}/${inprefix}*.fq.gz; do # perform QC test using 4 threads and zip results fastqc -t 4 -o ${infolder}/${outfolder} --noextract ${fq} done
what did we learn here?
inspect QC results
Shown here for all left-end reads.
# for all trainer reads # infolder="Final_Results/reads_results" # inprefix="shuffled_PE_NA18507_GAIIx_100_chr21" # infile=${inprefix}"_2_1.fq_fastqc.zip" # for the other end, replace by # infile=infile=${inprefix}"_2_2.fq_fastqc.zip" # for the 10pc trainer sample infolder="Final_Results/reads_results/readQC" inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21" infile=${inprefix}"_2_1.fq_fastqc.zip" # for the other end, replace by # infile=infile=${inprefix}"_2_2.fq_fastqc.zip" # for Your 10pc sample # infolder="reads/readQC" # inprefix="shuffled_10pc_PE_NA18507_GAIIx_100_chr21" # infile=${inprefix}"_2_1_fastqc.zip" # for the other end, replace by # infile=${inprefix}"_2_2_fastqc.zip" # decompress one of the files unzip ${infolder}/${infile} -d ${infolder}/ # inspect the content using the local web-browser cd ${infolder}/${infile%%.zip}/ && firefox fastqc_report.html &
what did we learn here?
You should get something like this.
download exercise files
Download exercise files here
References:
[ Main_Page | Hands-on introduction to NGS variant analysis | NGS-formats |
| NGS Exercise.1 | NGS Exercise.3 ]