National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.
J Mol Biol 215: 403-10 (1990)
Abstract
A new approach to rapid sequence comparison, basic local alignment
search tool (BLAST), directly approximates alignments that optimize a
measure of local similarity, the maximal segment pair (MSP) score.
Recent mathematical results on the stochastic properties of MSP scores
allow an analysis of the performance of this method as well as the
statistical significance of alignments it generates. The basic algorithm
is simple and robust; it can be implemented in a number of ways and
applied in a variety of contexts including straightforward DNA and
protein sequence database searches, motif searches, gene identification
searches, and in the analysis of multiple regions of similarity in long
DNA sequences. In addition to its flexibility and tractability to
mathematical analysis, BLAST is an order of magnitude faster than
existing sequence comparison tools of comparable sensitivity.
Mesh Headings
Unique Identifier: 91039304
Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that in Dayhoff et al., 1978), while for DNA sequence comparisons we score identities +5, and mismatches -4; other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.
Given these rules, we define a Maximal Segment Pair (MSP) to be the highest scoring pair of identical length segments chosen from two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length. The MSP score, which BLAST heuristically attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist, however, may be interested in all conserved regions shared by two proteins, not only in their highest scoring pair. We therefore define a segment pair to be locally maximal if its score can not be improved either by extending or by shortening both segments (Sellers, 1984). BLAST can seek all locally maximal segment pairs with scores above some cutoff.
Like many other similarity measures, the MSP score for two sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model (Karlin & Altschul, 1990; Karlin et al., 1990). Furthermore, for any particular scoring matrix (e.g. PAM-120) one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.
Recent results (Karlin & Altschul, 1990; Karlin et al., 1990) allow us to estimate the highest MSP score S at which chance similarities are likely to appear. To accelerate database searches, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of exceeding this score. Let a word pair be a segment pair of fixed length w. BLAST's main strategy is to seek only segment pairs that contain a word pair with score at least T. Scanning through a sequence, one can quickly determine whether it contains a word of length w that can pair with the query sequence to produce a word pair with score greater than or equal to the threshold T. Any such hit is extended to determine if it is contained within a segment pair whose score is greater than or equal to S. The lower the threshold T, the greater the chance that a segment pair with score at least S will contain a word pair with score at least T. A small T, however, increases the number of hits and therefore the execution time of the algorithm. Random simulation permits us to select a threshold T that balances these considerations.
The scanning phase raises a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated two approaches. Simplified, the first works as follows. Suppose that w = 4 and map each word to an integer between 1 and 20 sup 4, so a word can be used as an index into an array of size 20 sup 4 = 160,000. Let the i sup th entry of such an array point to the list of all occurrences in the query sequence of the i sup th word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically only a few thousand of the 20 sup 4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20 sup 4 pointers.
The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/second.
Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis (e.g. for protein comparisons the default distance is 20, and the probability of missing a higher scoring extension is about 0.001).
For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n - w + 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing four nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w >= 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed four-fold. For each 8-mer hit, we check for an enclosing w-mer hit; if found, we extend as before. Running on a SUN4, with a query of typical length (e.g. several kilobases), BLAST scans at approximately 2 megabases/second. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords a substantial savings in subsequent search times.
It should be noted that DNA sequences are highly nonrandom, with locally biased base composition (e.g. AT rich regions), and repeated sequence elements (e.g. Alu sequences) and this has important consequences for the design of a DNA database search tool. If a given query sequence has, for example, an AT rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed a somewhat ad hoc but effective means of dealing with these two problems. The program which produces the compressed version of the DNA database tabulates the frequencies of all 8-tuples. Those which occur much more frequently than expected by chance (controllable by parameter) are stored and used to filter "uninformative" words from the query word list. Also, preceding full database searches, a search of a sublibrary of repetitive elements is performed, and the locations in the query of significant matches are stored. Words generated by these regions are removed from the query word list for the full search. Matches to the sublibrary, however, are reported in the final output. These two filters allow alignments to regions with biased composition, or to regions containing repetitive elements to be reported, as long as adjacent regions not containing such features share significant similarity to the query sequence.
The BLAST strategy admits numerous variations. We implemented a version of BLAST that uses dynamic programming to extend hits so as to allow gaps in the resulting alignments. Needless to say, this greatly slows the extension process. While the sensitivity of amino acid searches was improved in some cases, the selectivity was reduced as well. Given the tradeoff of speed and selectivity for sensitivity, it is questionable whether the gap version of BLAST constitutes an improvement. We also implemented the alternative of making a table of all occurrences of w-mers in the database , then scanning the query sequence and processing hits. The disk space requirements are considerable - approximately two computer words for every residue in the database. More damaging was that for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach slower, on the computer systems we used, than scanning the entire database.
Dayhoff, M. O., Schwartz R. M. & Orcutt, B. C. (1978).In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.). 5(3):345-352 ,Natl. Biomed. Res. Found., Washington, DC.
Goad, W. B. & Kanehisa, M. I. (1982). Nucl. Acids Res. 10:247-263.
Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation , 42-45, Addison-Wesley, Reading, MA.
Karlin, S. & Altschul, S. F. (1990). Proc. Natl. Acad. Sci. USA 87:2264-2268.
Karlin, S., Dembo, A. & Kawabata, T. (1990). Ann. Stat. , 18:571-581.
Mealy, G. H. (1955). Bell System Technical J. 34:1045-1079.
Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48:443-453.
Sellers, P. H. (1974). SIAM J. Appl. Math. 26:787-793.
Smith, T. F. & Waterman, M. S. (1981). Adv. Appl. Math. 2:482-489.