Cutadapt
Remove contaminant adapter sequences from your reads prior to other NGS processing
Aims
Written by Marcel Martin, cutadapt [1] clips adapter (linker) sequences from reads, or filters out the reads that contain them. It can be tuned to tolerate errors in the match, and it can also be run in reverse mode to keep only linker-containing reads if that suits your workflow.
Documentation and download link [2]
Download and install
cutadapt (v1.4.1) is a command-line tool that finds adapter sequences in short reads and handles them as the user chooses (trimming the adapter or filtering the read). The source package can be downloaded from [3], and the tool was described in a short EMBnet.journal publication [4].
Installation steps and an example command to clip adapters from contaminated reads, leaving the remaining sequence untouched. Please read the command help for the full list of options.
# simply install/upgrade with pip if you have it
pip install cutadapt --upgrade
# OR
# download
cd ${BIOWARE}/download/
wget --no-check-certificate https://pypi.python.org/packages/source/c/cutadapt/cutadapt-1.4.1.tar.gz
# decompress it
tar -xzvf cutadapt-1.4.1.tar.gz
# the result is a folder named <cutadapt-1.4.1>
# install the python package
cd cutadapt-1.4.1
python2.7 setup.py install
# empty run to get command details
cutadapt
# run example syntax
cutadapt -e ERROR-RATE -a ADAPTER-SEQUENCE input.fastq > output.fastq
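When the trimming step is part of a scripted pipeline, the example syntax above can be assembled programmatically. The following is a minimal Python sketch, not part of cutadapt itself: the helper names build_cutadapt_command and run_cutadapt are hypothetical, and only options listed in the command help (-e, -O, -a, -o) are used. It assumes that cutadapt is installed and on the PATH.

```python
import subprocess


def build_cutadapt_command(adapter, input_fastq, output_fastq,
                           error_rate=0.1, min_overlap=3):
    """Return the argument list for a 3' adapter trimming run.

    Hypothetical helper: it only assembles options documented in the
    cutadapt help (-e, -O, -a, -o); the defaults mirror the help text.
    """
    return [
        "cutadapt",
        "-e", str(error_rate),    # maximum allowed error rate
        "-O", str(min_overlap),   # minimum overlap between read and adapter
        "-a", adapter,            # adapter ligated to the 3' end
        "-o", output_fastq,       # trimmed reads go to this file
        input_fastq,
    ]


def run_cutadapt(command):
    """Run the command and return the summary report as text.

    Per the help text, when -o is given the summary report is sent to
    standard output, so it is captured here. Raises CalledProcessError
    if cutadapt exits with a non-zero status.
    """
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout


# Example (requires cutadapt to be installed; file names are placeholders):
# report = run_cutadapt(build_cutadapt_command(
#     "ADAPTER-SEQUENCE", "input.fastq", "output.fastq"))
```

Passing the arguments as a list avoids shell quoting issues with file names, at the cost of not supporting the shell redirection used in the example syntax; this is why the sketch relies on -o instead of ">".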
Command arguments
Reads a FASTA or FASTQ file, finds and removes adapters,
and writes the changed sequence to standard output.
When finished, statistics are printed to standard error.
Use a dash "-" as file name to read from standard input
(FASTA/FASTQ is autodetected).
If two file names are given, the first must be a .fasta or .csfasta
file and the second must be a .qual file. This is the file format
used by some 454 software and by the SOLiD sequencer.
If you have color space data, you still need to provide the -c option
to correctly deal with color space!
If the name of any input or output file ends with '.gz' or '.bz2', it is
assumed to be gzip-/bzip2-compressed.
If you want to search for the reverse complement of an adapter, you must
provide an additional adapter sequence using another -a, -b or -g parameter.
If the input sequences are in color space, the adapter
can be given in either color space (as a string of digits 0, 1, 2, 3) or in
nucleotide space.
EXAMPLE
Assuming your sequencing data is available as a FASTQ file, use this
command line:
$ cutadapt -e ERROR-RATE -a ADAPTER-SEQUENCE input.fastq > output.fastq
See the README file for more help and examples.
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-f FORMAT, --format=FORMAT
Input file format; can be either 'fasta', 'fastq' or
'sra-fastq'. Ignored when reading csfasta/qual files
(default: auto-detect from file name extension).
Options that influence how the adapters are found:
Each of the following three parameters (-a, -b, -g) can be used
multiple times and in any combination to search for an entire set of
adapters of possibly different types. All of the given adapters will
be searched for in each read, but only the best matching one will be
trimmed (but see the --times option).
-a ADAPTER, --adapter=ADAPTER
Sequence of an adapter that was ligated to the 3' end.
The adapter itself and anything that follows is
trimmed.
-b ADAPTER, --anywhere=ADAPTER
Sequence of an adapter that was ligated to the 5' or
3' end. If the adapter is found within the read or
overlapping the 3' end of the read, the behavior is
the same as for the -a option. If the adapter overlaps
the 5' end (beginning of the read), the initial
portion of the read matching the adapter is trimmed,
but anything that follows is kept.
-g ADAPTER, --front=ADAPTER
Sequence of an adapter that was ligated to the 5' end.
If the adapter sequence starts with the character '^',
the adapter is 'anchored'. An anchored adapter must
appear in its entirety at the 5' end of the read (it
is a prefix of the read). A non-anchored adapter may
appear partially at the 5' end, or it may occur within
the read. If it is found within a read, the sequence
preceding the adapter is also trimmed. In all cases,
the adapter itself is trimmed.
-e ERROR_RATE, --error-rate=ERROR_RATE
Maximum allowed error rate (no. of errors divided by
the length of the matching region) (default: 0.1)
--no-indels Do not allow indels in the alignments, that is, allow
only mismatches. This option is currently only
supported for anchored 5' adapters ('-g ^ADAPTER')
(default: both mismatches and indels are allowed)
-n COUNT, --times=COUNT
Try to remove adapters at most COUNT times. Useful
when an adapter gets appended multiple times (default:
1).
-O LENGTH, --overlap=LENGTH
Minimum overlap length. If the overlap between the
read and the adapter is shorter than LENGTH, the read
is not modified. This reduces the no. of bases trimmed
purely due to short random adapter matches (default:
3).
--match-read-wildcards
Allow 'N's in the read as matches to the adapter
(default: False).
-N, --no-match-adapter-wildcards
Do not treat 'N' in the adapter sequence as wildcards.
This is needed when you want to search for literal 'N'
characters.
Options for filtering of processed reads:
--discard-trimmed, --discard
Discard reads that contain the adapter instead of
trimming them. Also use -O in order to avoid throwing
away too many randomly matching reads!
--discard-untrimmed, --trimmed-only
Discard reads that do not contain the adapter.
-m LENGTH, --minimum-length=LENGTH
Discard trimmed reads that are shorter than LENGTH.
Reads that are too short even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted (default: 0).
-M LENGTH, --maximum-length=LENGTH
Discard trimmed reads that are longer than LENGTH.
Reads that are too long even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted (default: no limit).
--no-trim Match and redirect reads to output/untrimmed-output as
usual, but don't remove the adapters. (default: False.
Remove the adapters)
Options that influence what gets output to where:
-o FILE, --output=FILE
Write the modified sequences to this file instead of
standard output and send the summary report to
standard output. The format is FASTQ if qualities are
available, FASTA otherwise. (default: standard output)
--info-file=FILE Write information about each read and its adapter
matches into FILE. Currently experimental: Expect the
file format to change!
-r FILE, --rest-file=FILE
When the adapter matches in the middle of a read,
write the rest (after the adapter) into a file. Use -
for standard output.
--wildcard-file=FILE
When the adapter has wildcard bases ('N's) write
adapter bases matching wildcard positions to FILE. Use
- for standard output. When there are indels in the
alignment, this may occasionally not be quite
accurate.
--too-short-output=FILE
Write reads that are too short (according to length
specified by -m) to FILE. (default: discard reads)
--too-long-output=FILE
Write reads that are too long (according to length
specified by -M) to FILE. (default: discard reads)
--untrimmed-output=FILE
Write reads that do not contain the adapter to FILE,
instead of writing them to the regular output file.
(default: output to same file as trimmed)
-p FILE, --paired-output=FILE
Write reads from the paired end input to FILE.
Additional modifications to the reads:
-q CUTOFF, --quality-cutoff=CUTOFF
Trim low-quality ends from reads before adapter
removal. The algorithm is the same as the one used by
BWA (Subtract CUTOFF from all qualities; compute
partial sums from all indices to the end of the
sequence; cut sequence at the index at which the sum
is minimal) (default: 0)
--quality-base=QUALITY_BASE
Assume that quality values are encoded as
ascii(quality + QUALITY_BASE). The default (33) is
usually correct, except for reads produced by some
versions of the Illumina pipeline, where this should
be set to 64. (default: 33)
-x PREFIX, --prefix=PREFIX
Add this prefix to read names
-y SUFFIX, --suffix=SUFFIX
Add this suffix to read names
--strip-suffix=STRIP_SUFFIX
Remove this suffix from read names if present. Can be
given multiple times.
-c, --colorspace Colorspace mode: Also trim the color that is adjacent
to the found adapter.
-d, --double-encode
When in color space, double-encode colors (map
0,1,2,3,4 to A,C,G,T,N).
-t, --trim-primer When in color space, trim primer base and the first
color (which is the transition to the first
nucleotide)
--strip-f3 For color space: Strip the _F3 suffix of read names
--maq, --bwa MAQ- and BWA-compatible color space output. This
enables -c, -d, -t, --strip-f3, -y '/1' and -z.
--length-tag=TAG Search for TAG followed by a decimal number in the
name of the read (description/comment field of the
FASTA or FASTQ file). Replace the decimal number with
the correct length of the trimmed read. For example,
use --length-tag 'length=' to correct fields like
'length=123'.
-z, --zero-cap Change negative quality values to zero (workaround to
avoid segmentation faults in old BWA versions)
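The quality trimming rule described under -q in the help above can be sketched in a few lines of Python. This is an illustrative reading of the help text (subtract CUTOFF, sum from the 3' end, cut where the sum is minimal), not cutadapt's actual source code, which may differ in details such as stopping early. The function names are hypothetical.

```python
def decode_qualities(quality_string, quality_base=33):
    """Convert a FASTQ quality string to integers (ascii - QUALITY_BASE)."""
    return [ord(ch) - quality_base for ch in quality_string]


def quality_trim_index(qualities, cutoff):
    """Return the index at which to cut the read (keep qualities[:index]).

    Walks from the 3' end towards the 5' end, accumulating
    (quality - cutoff), and remembers the position where the running
    sum is smallest. If the sum never drops below zero, nothing is cut.
    """
    running_sum = 0
    min_sum = 0
    cut_index = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum < min_sum:
            min_sum = running_sum
            cut_index = i
    return cut_index


def quality_trim(sequence, quality_string, cutoff, quality_base=33):
    """Trim the low-quality 3' end of one read; returns (sequence, qualities)."""
    cut = quality_trim_index(decode_qualities(quality_string, quality_base), cutoff)
    return sequence[:cut], quality_string[:cut]


# Four good bases (quality 40) followed by two poor ones (quality 2)
print(quality_trim("ACGTAC", "IIII##", cutoff=20))  # ('ACGT', 'IIII')
```

With the default cutoff of 0 and non-negative qualities, the running sum never becomes negative, so the read is left unchanged, which matches the documented default behavior.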
Example run
Here is an example run on a bacterial FASTQ file (SRR576933.fastq) containing the linker "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA".
The FastQC report tells us that this adapter is present in about 30% of the reads.
The run below trims this adapter sequence from the reads, with a 10% error tolerance (-e 0.1). Its report:
cutadapt version 1.3
Command line parameters: -e 0.1 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA SRR576933.fastq
Maximum error rate: 10.00%
No. of adapters: 1
Processed reads: 3603544
Processed bases: 129727584 bp (129.7 Mbp)
Trimmed reads: 1200971 (33.3%)
Trimmed bases: 41185913 bp (41.2 Mbp) (31.75% of total)
Too short reads: 0 (0.0% of processed reads)
Too long reads: 0 (0.0% of processed reads)
Total time: 139.56 s
Time per read: 0.039 ms
=== Adapter 1 ===
Adapter 'GATCGGAAGAGCACACGTCTGAACTCCAGTCACACA', length 36, was trimmed 1200971 times.
No. of allowed errors:
0-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-36 bp: 3
Overview of removed sequences
length count expect max.err error counts
3 45617 56305.4 0 45617
4 10549 14076.3 0 10549
5 2717 3519.1 0 2717
6 655 879.8 0 655
7 146 219.9 0 146
8 39 55.0 0 39
9 100 13.7 0 54 46
10 166 3.4 1 102 64
11 37 0.9 1 14 23
12 31 0.2 1 21 10
13 12 0.1 1 12
14 2 0.0 1 0 2
16 1233 0.0 1 1163 70
17 862 0.0 1 29 833
18 206 0.0 1 197 8 1
19 27 0.0 1 27
20 3 0.0 2 1 2
21 1244 0.0 2 1191 48 5
22 4 0.0 2 4
23 1642 0.0 2 1568 69 5
24 131 0.0 2 124 7
25 27 0.0 2 11 15 1
26 60 0.0 2 54 6
28 6 0.0 2 2 3 1
35 74 0.0 3 63 9 1 1
36 1135381 0.0 3 1060621 67369 6345 1046
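cutadapt trimmed the full-length (36 bp) adapter from 1135381 reads: 1060621 exact matches, in line with the roughly 30% reported by FastQC, plus 74760 imperfect matches carrying one to three errors. Counting the shorter partial matches as well, 1200971 reads (33.3%) were trimmed in total.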
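The "No. of allowed errors" line and the max.err column of the report are consistent with truncating ERROR_RATE times the match length to an integer. The Python sketch below reproduces those ranges under that assumption; the function names are hypothetical, and the rule is inferred from the report rather than taken from cutadapt's source.

```python
def allowed_errors(match_length, error_rate=0.1):
    """Errors tolerated for a match of the given length (assumed rule:
    error_rate * length, truncated to an integer)."""
    return int(error_rate * match_length)


def error_ranges(adapter_length, error_rate=0.1):
    """Group match lengths 1..adapter_length by their allowed error count.

    Returns a list of (first_length, last_length, allowed_errors) tuples.
    """
    ranges = []
    for length in range(1, adapter_length + 1):
        k = allowed_errors(length, error_rate)
        if ranges and ranges[-1][2] == k:
            # Extend the current range while the allowance is unchanged
            ranges[-1] = (ranges[-1][0], length, k)
        else:
            ranges.append((length, length, k))
    return ranges


for first, last, k in error_ranges(36):
    print("%d-%d bp: %d" % (first, last, k))

# Error counts of the length-36 row in the report (0, 1, 2 and 3 errors)
full_length_counts = [1060621, 67369, 6345, 1046]
print(sum(full_length_counts))  # 1135381, the count reported for length 36
```

The first range printed starts at 1 bp rather than the 0 bp shown in the report, because the sketch only enumerates non-empty matches.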
References:
- [1] https://code.google.com/p/cutadapt/wiki/documentation
- [2] http://journal.embnet.org/index.php/embnetjournal/article/view/200
- [3] https://pypi.python.org/packages/source/c/cutadapt/cutadapt-1.4.1.tar.gz
- [4] http://journal.embnet.org/index.php/embnetjournal/article/view/200