Clean adaptor containing reads from FastQ data at command line
[ Main_Page ]
This workflow is intimately linked to a greater scope BITS training session dedicated to ChIP-Seq data analysis (Feb 24, 2014).
Contents
Aim
This short tutorial aims at demonstrating the effect of adaptor contamination, we present there a simple workflow including:
- download a fastQ file related to a ChIP-Seq experiment used for training.
- perform QC on the fastQ file with fastQC
- remove reads with adaptor contamination using cutadapt
- perform a second fastQC run to control for improvement
Analysis Workflow
The workflow presented here is by no mean complete but only aimed at identifying adaptor contamination in a real dataset and showing how removing this contamination will affect the overall quality of the read collection.
download SRR576933 from the European Nucleotide Archive (ENA|EBI)
We choose the European member of the consortium for obvious bandwidth reasons. The file is located by searching for its ID and the ftp link to the reads copied from the result page (http://www.ebi.ac.uk/ena/data/view/SRX189778)
reads=fastq_reads # get the 126Mb compressed fastq-file wget -P ${reads} --no-check-certificate ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR576/SRR576933/SRR576933.fastq.gz # preview the data for the first 3 reads (always safer) zcat SRR576933.fastq.gz | head -12 @SRR576933.1 HWUSI-EAS1789_0000:2:20:1269:14140/1 AAGCATGGAATAACCGCCTGGTGAATGCTCGCCATA + dcd`\dddddaeacecdac`c\cca`bTbbdddYd_ @SRR576933.2 HWUSI-EAS1789_0000:2:20:1270:19579/1 TGGAGGCTGACCACGATAAGCTGCCGCTGGTGGTGC + dceYc^\cddd^dddTccc`daYdbdaad`]``XTU @SRR576933.3 HWUSI-EAS1789_0000:2:20:1270:17351/1 AGTGCGATGCCGTTCACCCGGTTTTCTTTATCATTA + dddddc\cc^`c\ccddadcdaadbbc]]]aa^ddT
run fastQC test on the data
Provided fastQC is installed on your machine, you can proceed copy-padsting the code below. If not, please install the fastQC command line version from ()
reads=fastq_reads results=fastqc_results mkdir -p ${results} fastqc -o fastqc_results --noextract -f ${reads}/SRR576933.fastq.gz
Inspect the results and identify the issue with adaptor contamination. Many issues are present in the report and could be linked to adaptor contamination as well (we will check this at the end of this tutorial).
fastQC details in presence of adaptors
- per_base_quality.png
- per_sequence_quality.png
- per_base_sequence_content.png
- per_base_gc_content.png
- per_sequence_gc_content.png
- per_base_n_content.png
- sequence_length_distribution.png
- duplication_levels.png
- kmer_profiles.png
FastQC figures (full results) | |||||||||
---|---|---|---|---|---|---|---|---|---|
|
As seen here, one sequence is present in more than 29% of the reads. Such abundance cannot come from a true bacterial sequence and has to be a primer contamination, left over from the library construction process or from a PCR amplification gone wild. In the current case, the primer is identified as "TruSeq Adapter, Index 5" and reveals a library cleanup problem.
We now use the cutadapt utility written by Marcel Martin to remove reads containing the TruSeq Adapter, Index 5 identified by its sequence "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA".
cutadapt([1]) will clip or simply filter-out reads that contain a provided linker sequence. It can be tuned to be fault-tolerant and can also be used in reverse-mode to keep only linker-containing reads if this makes sense in your workflow.
# process from the gzipped file and save into a new gzip file cutadapt -e 0.1 \ -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC \ --discard ${reads}/SRR576933.fastq.gz | \ bgzip -c > ${reads}/SRR576933-cutadapt.fastq.gz
detailed terminal cutadapt output
cutadapt -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC --discard SRR576933.fastq.gz > SRR576933-cutadapt.fastq cutadapt version 1.3 Command line parameters: -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC --discard SRR576933.fastq.gz Maximum error rate: 10.00% No. of adapters: 1 Processed reads: 3603544 Processed bases: 129727584 bp (129.7 Mbp) Trimmed reads: 1197923 (33.2%) Trimmed bases: 41076056 bp (41.1 Mbp) (31.66% of total) Too short reads: 0 (0.0% of processed reads) Too long reads: 0 (0.0% of processed reads) Total time: 169.78 s Time per read: 0.047 ms === Adapter 1 === Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCC', length 36, was trimmed 1197923 times. No. of allowed errors: 0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-36 bp: 3 Overview of removed sequences length count expect max.err error counts 3 45620 56305.4 0 45620 4 10550 14076.3 0 10550 5 2717 3519.1 0 2717 6 655 879.8 0 655 7 146 219.9 0 146 8 39 55.0 0 39 9 100 13.7 0 54 46 10 166 3.4 1 102 64 11 37 0.9 1 14 23 12 31 0.2 1 21 10 13 12 0.1 1 12 14 2 0.0 1 0 2 16 1233 0.0 1 1163 70 17 862 0.0 1 29 833 18 206 0.0 1 197 8 1 19 27 0.0 1 27 20 3 0.0 2 1 2 21 1244 0.0 2 1191 48 5 22 4 0.0 2 4 23 1642 0.0 2 1568 69 5 24 131 0.0 2 124 7 25 27 0.0 2 11 15 1 26 60 0.0 2 54 6 28 6 0.0 2 2 3 1 35 72 0.0 3 0 63 1 8 36 1132331 0.0 3 1470 6868 1068935 55058
run fastQC test on the cleaned data
reads=fastq_reads results=fastqc_results mkdir -p fastqc_results fastqc -o fastqc_results --noextract -f fastq ${reads}/SRR576933-cutadapt.fastq.gz
Inspect the results and look at the different sections of the report. Many problems seem to have vanished after removing the adaptor-containig reads, only the low quality issue with the last (3') base and some kmer noise remain.
fastQC details after adaptor removal
- per_base_quality.png
- per_sequence_quality.png
- per_base_sequence_content.png
- per_base_gc_content.png
- per_sequence_gc_content.png
- per_base_n_content.png
- sequence_length_distribution.png
- duplication_levels.png
- kmer_profiles.png
FastQC figures (full results) | |||||||||
---|---|---|---|---|---|---|---|---|---|
|
Conclusion
A simple removal of contaminating adaptors can make a huge difference in the overall quality control of fastq reads. Most mappers should not be affected by contaminating adaptors that should not map to the reference genome and are discarded during that step. However, it can be a good practice to remove such reads prior to mapping if the user want to prevent any unexpected behavior during NGS analysis.
Download Howto files here
References:
[ Main_Page ]