FASTA Algorithm

This text is an excerpt from Pearson W.R 1990 Rapid and Sentive Sequence Comparison with PASTP and FASTA. Methods Enzymol 183: 63-98.
FASTA uses four steps to calculate three scores that characterize sequence similarity. These steps are outlined below. A representation of these steps is reported in a postscript format figure drawn from Barton (1994) Protein Sequence Alignment and Database Scanning. Step 1 : Identify regions shared by the two sequences with the highest density of identities (ktup=1) or pairs of identities (ktup=2). The first step uses a rapid technique for finding identities shared between two sequences; the method is similar to an earlier technique described by Wilbur and Lipman. FASTA achieves much of its speed and selectivity in this step by using a lookup table to locate all identities or groups of identities between two DNA or amino acid sequences during the first step of the comparison. The ktup parameter determines how many consecutive identities are required in a match. A ktup value of 2 is frequrntly used for protein sequence comparison, which means that the program examines only those portions of the two sequences being compared that have at least two adjacent identical residues in both sequences. More sensitive searches can be done using ktup = 1. For DNA sequence comparisons, the ktup parameter can range from 1 to 6; values between 4 and 6 are recommanded. When the query sequence is a short oliginucleotide of oligopeptude, ktup = 1 should be used. In conjunction with the lookup table, we use the "diagonal" method to find all regions of similarity between the two sequences, counting ktup matches and penalizing for intervening mismatches. This method identified regions of a diagonal that have the highest densitu of ktup matches. The term diagonal refers to the diagonal line that is seen on a dot matrix plot when a sequence is compared with itself, and it denotes an alignment between two sequenves without gaps. FASTA uses a formula for scoring ktup matches that incorporates the actual PAM250 values for the aligned residues. Thus, groups of identities with high similarity scores contribute more to the local diagonal score than to identities with low similarity scores. This more sensitive formula is used for protein sequence comparisons; the constant value for ktup matches is used for DNA sequence comparisons. FASTA saves the 10 best local regions, regardless of whether they are on the same of different diagonals. Step 2 : Rescan the 10 regions with the highest density of identities using the PAM250 matrix. Trim the ends of the region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps. After the 10 best local regions are found in the first step, they are rescored using a scoring matrix that allows runs of identities shorter than ktup residues and conservative replacements to contribute to the similarity score. For protein sequences, this score is usually caculated using the PAM250 matrix, although scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity, can also be used with FASTA. The PAM250 scoring matrix was derived from the analysis of the amino acid replacements occuring among related proteins, and it specifies a range of positive scores for replacements that commonly occur among related proteins and negative scores for unlikely replacements. FASTA can also be used for DNA sequence comparisons, and matrices can be constructed that allow separate penalties for transitions and transversions. For each of the best diagonal regions rescanned with the scoring matrix, a subregion with the maximal score is identified. Initial scores are used to rank the library sequences. These scores are referred to as init1 score. Step 3 : If there are several initial regions with scores greater than the CUTOFF value, check to see whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty (usually 20) for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1). FASTA checks, during a library search, to see whether several initial regions can be joined together in a single alignment to increase the initial score. FASTA calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidily calculated using a dynamic programming algorithm. FASTA uses the resulting score, referred to as the initn score, to rank the library sequences. The third "joining" step in the computation of the initial score increases the sensitivity of the search method because it allows for insertions and deletions as well as conservative replacements. The modification does, however, decrease selectivity. The degradation selectivity is limited by including in the optimization step only those initial regions whose scores are above an empirically determined threshold : FASTA joins an initial region only if its similarity score is greater than the cutoff value, a value that is approximately one standard deviation above the average score expected from unrelated sequences in the library. For a 200-residue query sequence and ktup-2, this value is 28. Step 4 : constructs NWS (Needleman-Wunch-Sellers algorithm) optimal alignment of the query sequence and the library sequence, considering only those residues that lie in a band 32 residues wide centered on the best initial region found in Step 2. FASTA reports this score as the optimized (opt) score. After a complete search of the library, FASTA plots the initial scores of each library sequence in a histogram, calculates the mean similarity score for the query sequence against each sequence in the library, and determines the standard deviation of the distribution of initial scores. The initial scores are used to rank the library sequences, and, in the fourth and final step of the comparison, the highest scoring library sequences are aligned using a modification of the standard NWS optimization method. The optimization employs the same scoring matrix used in determining the initial regions; the resulting optimized alignments are calculated for further analysis of potential relationships, and the optimized similarity score is reported.
up Lookup table A lookup table is a rapid method for finding the position of a residue in a sequence. One way to find the "A" in the sequence "NDAPL" is to compare "A" to each residue in the sequence. A faster way, is to make a table of all possible residues (23 for proteins) so that the computer representation for the residue (i.e "A" is 1, "R" is 2, "N" is 3) is the same as its position in the table. A value is then placed in the table that indicates whether the residue is present in the sequence and, if it is, where it is present. For this example the table has the value 1 at position 3, 2 at position 4, 3 at position 1, 4 at 15, 5 at 11, and the remainning 18 positions are 0. The position of the "A" in the sequence can then be determined in a single step by looking it up at position 1 in the table.
Comments and suggestions are welcome. Fredj Tekaia tekaia@pasteur.fr.