Blat
Align short read sequence to a reference genome
, Bfast
BLAT - BLAST-like Alignment Tool
This page was copied from http://hpcwiki.it.okstate.edu/index.php/BLAT ([1])
BLAT Suite Program Specifications and User Guide
General: Blat produces two major classes of alignments: at the DNA level between two sequences that are of 95% or greater identity, but which may include large inserts, and at the protein or translated DNA level between sequences that are of 80% or greater identity and may also include large inserts. The output of BLAT is flexible. By default it is a simple tab-delimited file which describes the alignment, but which does not include the sequence of the alignment itself. Optionally it can produce BLAST and WU-BLAST compatable output as well as a number of other formats.
The main programs in the blat suite are: gfServer – a server that maintains an index of the genome in memory and uses the index to quickly find regions with high levels of sequence similarity to a query sequence. gfClient – a program that queries gfServer over the network, and then does a detailed alignment of the query sequence with regions found by gfServer. blat –combines client and server into a single program, first building the index, then using the index, and then exiting. webBlat – a web based version of gfClient that presents the alignments in an interactive fashion. Building an index of the genome typically takes 10 or 15 minutes. Typically for interactive applications one uses gfServer to build a whole genome index. At that point gfClient or webBlat can align a single query within few seconds. If one is aligning a lot of sequences in a batch mode then blat can be more efficient, particularly if run on a cluster of computers. Each blat run is typically done against a single chromosome, but with a large number of query sequences.
Other programs in the blat suite are: pslSort – combines and sorts the output of multiple blat runs. (The blat default output format is .psl). pslReps – selects the best alignments for a particular query sequence, using a ‘near best in genome’ approach. pslPretty – converts alignments from the psl format, which is tab-delimited format and does not include the bases themselves, to a more readable alignment format. faToTwoBit – convert Fasta format sequence files to a dense randomly-accessable .2bit format that gfClient can use. twoBitToFa – convert from the .2bit format back to fasta faToNib – convert from Fasta to a somewhat less dense randomly accessible format that predates .2bit. Note each .nib file can only contain a single sequence. nibFrag – convert portions of a nib file back to fasta.
In addition you may be interested in the following programs which are not part of the BLAT suite: In Silico PCR – given two primers quickly find the sequence between them. Available from Kent Informatics. This includes webPCR, an interface similar to webBlat. The Genome Browser – display annotations as a series of tracks on top of the genome. Available from the University of California Santa Cruz. See http://genome.ucsc.edu/license/.
Running the Programs
The command line options of each of the programs is described below. Similar summaries of usage are printed when a command is run with no arguments. See the next section for info on installing webBlat.
blat
blat - Standalone BLAT sequence search command line tool usage:
blat database query [-ooc=11.ooc] output.psl
where:
database and query are each either a .fa , .nib or .2bit file, or a list these files one file name per line. -ooc=11.ooc tells the program to load over-occurring 11-mers from and external file. This will increase the speed by a factor of 40 in many cases, but is not required output.psl is where to put the output. Subranges of nib and .2bit files may specified using the syntax: /path/file.nib:seqid:start-end or /path/file.2bit:seqid:start-end or /path/file.nib:start-end With the second form, a sequence id of file:start-end will be used.
options:
-t=type Database type. Type is one of: dna - DNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein The default is dna -q=type Query type. Type is one of: dna - DNA sequence rna - RNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein rnax - DNA sequence translated in three frames to protein The default is dna -prot Synonymous with -t=prot -q=prot -ooc=N.ooc Use overused tile file N.ooc. N should correspond to the tileSize -tileSize=N sets the size of match that triggers an alignment. Usually between 8 and 12 Default is 11 for DNA and 5 for protein. -stepSize=N spacing between tiles. Default is tileSize. -oneOff=N If set to 1 this allows one mismatch in tile and still triggers an alignments. Default is 0. -minMatch=N sets the number of tile matches. Usually set from 2 to 4 Default is 2 for nucleotide, 1 for protein. -minScore=N sets minimum score. This is the matches minus the mismatches minus some sort of gap penalty. Default is 30 -minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches. -maxGap=N sets the size of maximum gap between tiles in a clump. Usually set from 0 to 3. Default is 2. Only relevent for minMatch > 1. -noHead suppress .psl header (so it's just a tab-separated file) -makeOoc=N.ooc Make overused tile file. Target needs to be complete genome. -repMatch=N sets the number of repetitions of a tile allowed before it is marked as overused. Typically this is 256 for tileSize 12, 1024 for tile size 11, 4096 for tile size 10. Default is 1024. Typically only comes into play with makeOoc. Also affected by stepSize. When stepSize is halved repMatch is doubled to compensate. -mask=type Mask out repeats. Alignments won't be started in masked region but may extend through it in nucleotide searches. Masked areas are ignored entirely in protein or translated searches. Types are lower - mask out lower cased sequence upper - mask out upper cased sequence out - mask according to database.out RepeatMasker .out file file.out - mask database according to RepeatMasker file.out -qMask=type Mask out repeats in query sequence. Similar to -mask above but for query rather than target sequence. -repeats=type Type is same as mask types above. Repeat bases will not be masked in any way, but matches in repeat areas will be reported separately from matches in other areas in the psl output. -minRepDivergence=NN - minimum percent divergence of repeats to allow them to be unmasked. Default is 15. Only relevant for masking using RepeatMasker .out files. -dots=N Output dot every N sequences to show program's progress -trimT Trim leading poly-T -noTrimA Don't trim trailing poly-A -trimHardA Remove poly-A tail from qSize as well as alignments in psl output -fastMap Run for fast DNA/DNA remapping - not allowing introns, requiring high %ID -out=type Controls output file format. Type is one of: psl - Default. Tab separated format, no sequence pslx - Tab separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format sim4 - similar to sim4 format wublast - similar to wublast format blast - similar to NCBI blast format blast8- NCBI blast tabular format blast9 - NCBI blast tabular format with comments -fine For high quality mRNAs look harder for small initial and terminal exons. Not recommended for ESTs -maxIntron=N Sets maximum intron size. Default is 750000 -extendThroughN - Allows extension of alignment through large blocks of N's Common usage scenarios:
1) Mapping ESTs to the genome within the same species
-ooc=11.ooc
2) Mapping full length mRNAs to the genome in the same species
-ooc=11.ooc -fine -q=rna
3) Mapping ESTs to the genome across species
-q=dnax -t=dnax
4) Mapping mRNA to the genome across species
-q=rnax -t=dnax
5) Mapping proteins to the genome
-q=prot -t=dnax
6) Mapping DNA to DNA in the same species
-ooc=11.ooc -fastMap
7) Mapping DNA from one species to another species
-q=dnax -t=dnax When mapping DNA from one species to another the query side of the alignment should be cut up into chunks of 25kb or less for best performance.
gfServer gfServer - Make a server to quickly find where DNA occurs in genome. To set up a server:
gfServer start host port file(s) Where the files are in .nib or .2bit format
To remove a server:
gfServer stop host port
To query a server with DNA sequence:
gfServer query host port probe.fa
To query a server with protein sequence:
gfServer protQuery host port probe.fa
To query a server with translated dna sequence:
gfServer transQuery host port probe.fa
To process one probe fa file against a .nib format genome (not starting server):
gfServer direct probe.fa file(s).nib
To figure out usage level
gfServer status host port
To get input file list
gfServer files host port
Options:
-tileSize=N size of n-mers to index. Default is 11 for nucleotides, 4 for proteins (or translated nucleotides). -stepSize=N spacing between tiles. Default is tileSize. -minMatch=N Number of n-mer matches that trigger detailed alignment Default is 2 for nucleotides, 3 for protiens. -maxGap=N Number of insertions or deletions allowed between n-mers. Default is 2 for nucleotides, 0 for protiens. -trans Translate database to protein in 6 frames. Note: it is best to run this on RepeatMasked data in this case. -log=logFile keep a log file that records server requests. -seqLog Include sequences in log file (not logged with -syslog) -syslog Log to syslog -logFacility=facility log to the specified syslog facility - default local0. -mask Use masking from nib file. -repMatch=N Number of occurrences of a tile (nmer) that trigger repeat masking the tile. Default is 1024. -maxDnaHits=N Maximum number of hits for a dna query that are sent from the server. Default is 100. -maxTransHits=N Maximum number of hits for a translated query that are sent from the server. Default is 200. -maxNtSize=N Maximum size of untranslated DNA query sequence Default is 40000 -maxAsSize=N Maximum size of protein or translated DNA queries Default is 8000 -canStop If set then a quit message will actually take down the server
gfClient gfClient - A client for the genomic finding program usage:
gfClient host port nibDir in.fa out.psl
where
host is the name of the machine running the gfServer port is the same as you started the gfServer with nibDir is the path of the nib files relative to the current dir (note these are needed by the client as well as the server) in.fa a fasta format file. May contain multiple records out.psl where to put the output
options:
-t=type Database type. Type is one of: dna - DNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein The default is dna -q=type Query type. Type is one of: dna - DNA sequence rna - RNA sequence prot - protein sequence dnax - DNA sequence translated in six frames to protein rnax - DNA sequence translated in three frames to protein -dots=N Output a dot every N query sequences -nohead Suppresses psl five line header -minScore=N sets minimum score. This is twice the matches minus the mismatches minus some sort of gap penalty. Default is 30 -minIdentity=N Sets minimum sequence identity (in percent). Default is 90 for nucleotide searches, 25 for protein or translated protein searches. -out=type Controls output file format. Type is one of: psl - Default. Tab separated format without actual sequence pslx - Tab separated format with sequence axt - blastz-associated axt format maf - multiz-associated maf format wublast - similar to wublast format blast - similar to NCBI blast format -maxIntron=N Sets maximum intron size. Default is 750000
File Formats
.psl files A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.
matches int unsigned , # Number of bases that match that aren't repeats misMatches int unsigned , # Number of bases that don't match repMatches int unsigned , # Number of bases that match but are part of repeats nCount int unsigned , # Number of 'N' bases qNumInsert int unsigned , # Number of inserts in query qBaseInsert int unsigned , # Number of bases inserted in query tNumInsert int unsigned , # Number of inserts in target tBaseInsert int unsigned , # Number of bases inserted in target strand char(2) , # + or - for query strand, optionally followed by + or – for target strand qName varchar(255) , # Query sequence name qSize int unsigned , # Query sequence size qStart int unsigned , # Alignment start position in query qEnd int unsigned , # Alignment end position in query tName varchar(255) , # Target sequence name tSize int unsigned , # Target sequence size tStart int unsigned , # Alignment start position in target tEnd int unsigned , # Alignment end position in target blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps. blockSizes longblob , # Size of each block in a comma separated list qStarts longblob , # Start of each block in query in a comma separated list tStarts longblob , # Start of each block in target in a comma separated list
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes). 0 1 2 3 tens position in query 0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query -------- ---------- minus strand alignment on query
Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd qEnd = qSize - revQStart
.2bit files A .2bit file can store multiple DNA sequence (up to 4 gig total) in a compact randomly accessible format. The two bit files contain masking information as well as the DNA itself. The file begins with a 16 byte header containing the following fields: 1) signature – the number 0x1A412743 in the architecture of the machine that created the file. 2) version – zero for now. Readers should abort if they see a version number higher than 0. 3) sequenceCount – the number of sequences in the file 4) reserved – always zero for now. All fields are 32 bits unless noted. If the signature value is not as given, the reader program should byte swap the signature and see if the swapped version matches. If so all multiple-byte entities in the file will need to be byte-swapped. This enables these binary files to be used unchanged on different architectures.
The header is followed by a file index. There is one entry in the index for each sequence. Each index entry contains three fields: 1) nameSize – a byte containing the length of the name field 2) name – this contains the sequence name itself, and is variable length depending on nameSize. 3) offset – 32 bit offset of the sequence data relative to the start of the file
The index is followed by the sequence records. These contain 9 fields: 1) dnaSize – number of bases of DNA in the sequence. 2) nBlockCount – the number of blocks of N’s in the file (representing unknown sequence). 3) nBlockStarts – a starting position for each block of N’s 4) nBlockSizes – the size of each block of N’s 5) maskBlockCount – the number of masked (lower case) blocks 6) maskBlockStarts – starting position for each masked block 7) maskBlockSizes – the size of each masked block 8) packedDna – the dna packed to two bits per base as so: 00 – T, 01 – C, 10 – A, 11 – G. The first base is in the most significant 2 bits byte, and the last base in the least significant 2 bits, so that the sequence TCAG would be represented as 00011011. The packedDna field will be padded with 0 bits as necessary so that it takes an even multiple of 32 bit in the file, as this improves i/o performance on some machines.
.nib files A .nib file describes a DNA sequence packing two bases into each byte. Each nib file contains only a single sequence. A nib file begins with a 32 bit signature which is 0x6BE93D3A in the architecture of the machine that created the file, and possibly a byte-swapped version of the same number on another machine. This is followed by a 32 bit number in the same format which describes the number of bases in the file. This is followed by the bases themselves packed two bases to the byte. The first base is packed in the high order 4 bits (nibble), the second base in the low order four bits. In C code: byte = (base1<<4) + base2 The numerical values for the bases are: 0 – T, 1 – C, 2 – A, 3 – G, 4 – N (unknown) The most significant bit in a nibble is set if the base is masked. Limits The gfServer program requires approximately 1 byte for every 3 bases in the genome it is indexing in DNA mode, and 1.5 bytes for each unmasked base in translated mode. The blat program requires approximately two bytes for each base in the genome in DNA mode, and three bytes for each base in translated mode. The other programs use relatively little memory.
References: