Basic local alignment search tool.

Altschul SF; Gish W; Miller W; Myers EW; Lipman DJ

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894.

J Mol Biol 215: 403-10 (1990)

Abstract
A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

Mesh Headings

Algorithms
Amino Acid Sequence
Base Sequence*
Databases, Bibliographic
Mutation*
Sensitivity and Specificity
Sequence Homology, Nucleic Acid
Software*
Support, U.S. Gov't, P.H.S.

Unique Identifier: 91039304


BLAST Theory and Algorithm


Note: The following text is extracted from Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

BLAST Theory

a- The Maximal Segment Pair Measure

Sequence similarity measures generally can be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity (Needleman & Wunsch, 1970). Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments; unconserved regions do not contribute to the measure of similarity (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). Local similarity measures are generally preferred for database searches, where cDNA's may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site.

Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that in Dayhoff et al., 1978), while for DNA sequence comparisons we score identities +5, and mismatches -4; other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.

Given these rules, we define a Maximal Segment Pair (MSP) to be the highest scoring pair of identical length segments chosen from two sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length. The MSP score, which BLAST heuristically attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist, however, may be interested in all conserved regions shared by two proteins, not only in their highest scoring pair. We therefore define a segment pair to be locally maximal if its score can not be improved either by extending or by shortening both segments (Sellers, 1984). BLAST can seek all locally maximal segment pairs with scores above some cutoff.

Like many other similarity measures, the MSP score for two sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model (Karlin & Altschul, 1990; Karlin et al., 1990). Furthermore, for any particular scoring matrix (e.g. PAM-120) one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.

b- Rapid Approximation of MSP Scores

In searching a database of thousands of sequences, generally only a handful, if any, will be homologous to the query sequence. The scientist is therefore interested in identifying only those sequence entries with MSP scores over some cutoff score S. These sequences include those sharing highly significant similarity with the query as well as some sequences with borderline scores. This latter set of sequences may include high scoring random matches as well as sequences distantly related to the query. The biological significance of the high scoring sequences may be inferred almost solely on the basis of the similarity score, while the biological context of the borderline sequences may be helpful in distinguishing biologically interesting relationships.

Recent results (Karlin & Altschul, 1990; Karlin et al., 1990) allow us to estimate the highest MSP score S at which chance similarities are likely to appear. To accelerate database searches, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of exceeding this score. Let a word pair be a segment pair of fixed length w. BLAST's main strategy is to seek only segment pairs that contain a word pair with score at least T. Scanning through a sequence, one can quickly determine whether it contains a word of length w that can pair with the query sequence to produce a word pair with score greater than or equal to the threshold T. Any such hit is extended to determine if it is contained within a segment pair whose score is greater than or equal to S. The lower the threshold T, the greater the chance that a segment pair with score at least S will contain a word pair with score at least T. A small T, however, increases the number of hits and therefore the execution time of the algorithm. Random simulation permits us to select a threshold T that balances these considerations.


BLAST Algorithm

In our implementations of this approach, details of the three algorithmic steps (namely compiling a list of high-scoring words, scanning the database for hits, and extending hits) vary somewhat depending on whether the database contains proteins or DNA sequences. For proteins, the list consists of all words (w-mers) that score at least T when compared to some word in the query sequence. Thus, a query word may be represented by no words in the list (e.g., for common w-mers using PAM-120 scores) or by many. (One may, of course, insist that every w-mer in the query sequence be included in the word list, irrespective of whether pairing the word with itself yields a score of at least T.) For values of w and T that we have found most useful (see below), there are typically on the order of fifty words in the list for every residue in the query sequence, e.g. 12500 words for a sequence of length 250. If a little care is taken in programming, the list of words can be generated in time essentially proportional to the length of the list.

The scanning phase raises a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated two approaches. Simplified, the first works as follows. Suppose that w = 4 and map each word to an integer between 1 and 20 sup 4, so a word can be used as an index into an array of size 20 sup 4 = 160,000. Let the i sup th entry of such an array point to the list of all occurrences in the query sequence of the i sup th word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically only a few thousand of the 20 sup 4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20 sup 4 pointers.

The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/second.

Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis (e.g. for protein comparisons the default distance is 20, and the probability of missing a higher scoring extension is about 0.001).

For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n - w + 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing four nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w >= 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed four-fold. For each 8-mer hit, we check for an enclosing w-mer hit; if found, we extend as before. Running on a SUN4, with a query of typical length (e.g. several kilobases), BLAST scans at approximately 2 megabases/second. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords a substantial savings in subsequent search times.

It should be noted that DNA sequences are highly nonrandom, with locally biased base composition (e.g. AT rich regions), and repeated sequence elements (e.g. Alu sequences) and this has important consequences for the design of a DNA database search tool. If a given query sequence has, for example, an AT rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed a somewhat ad hoc but effective means of dealing with these two problems. The program which produces the compressed version of the DNA database tabulates the frequencies of all 8-tuples. Those which occur much more frequently than expected by chance (controllable by parameter) are stored and used to filter "uninformative" words from the query word list. Also, preceding full database searches, a search of a sublibrary of repetitive elements is performed, and the locations in the query of significant matches are stored. Words generated by these regions are removed from the query word list for the full search. Matches to the sublibrary, however, are reported in the final output. These two filters allow alignments to regions with biased composition, or to regions containing repetitive elements to be reported, as long as adjacent regions not containing such features share significant similarity to the query sequence.

The BLAST strategy admits numerous variations. We implemented a version of BLAST that uses dynamic programming to extend hits so as to allow gaps in the resulting alignments. Needless to say, this greatly slows the extension process. While the sensitivity of amino acid searches was improved in some cases, the selectivity was reduced as well. Given the tradeoff of speed and selectivity for sensitivity, it is questionable whether the gap version of BLAST constitutes an improvement. We also implemented the alternative of making a table of all occurrences of w-mers in the database , then scanning the query sequence and processing hits. The disk space requirements are considerable - approximately two computer words for every residue in the database. More damaging was that for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach slower, on the computer systems we used, than scanning the entire database.


BLAST References

Dayhoff, M. O., Schwartz R. M. & Orcutt, B. C. (1978).In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.). 5(3):345-352 ,Natl. Biomed. Res. Found., Washington, DC.

Goad, W. B. & Kanehisa, M. I. (1982). Nucl. Acids Res. 10:247-263.

Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation , 42-45, Addison-Wesley, Reading, MA.

Karlin, S. & Altschul, S. F. (1990). Proc. Natl. Acad. Sci. USA 87:2264-2268.

Karlin, S., Dembo, A. & Kawabata, T. (1990). Ann. Stat. , 18:571-581.

Mealy, G. H. (1955). Bell System Technical J. 34:1045-1079.

Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48:443-453.

Sellers, P. H. (1974). SIAM J. Appl. Math. 26:787-793.

Smith, T. F. & Waterman, M. S. (1981). Adv. Appl. Math. 2:482-489.