From Karlin and Altschul (1990), the principal equation relating the score of an HSP to its expected frequency of chance occurrence is:

$$E = K N e^{-\lambda S}$$

where $E$ is the expected frequency of chance occurrence of an HSP having score $S$ (or one scoring higher); $K$ and $\lambda$ are Karlin-Altschul parameters; $N$ is the product of the query and database sequence lengths, or the size of the search space; and $e$ has a value of approximately 2.718. $\lambda$ may be thought of as the expected increase in reliability of an alignment associated with a unit increase in alignment score. Reliability in this case is expressed in units of information, such as bits or nats, with one nat being equivalent to 1/ln(2) (roughly 1.44) bits.

The expectation $E$ (range 0 to infinity) calculated for an alignment between the query sequence and a database sequence can be extrapolated to an expectation over the entire database search by converting the pairwise expectation to a probability (range 0-1) and multiplying the result by the ratio of the entire database size (expressed in residues) to the length of the matching database sequence. In detail:

$$E_{database} = \frac{D \left(1 - e^{-E}\right)}{d}$$

where $D$ is the size of the database; $d$ is the length of the matching database sequence; and the quantity $(1 - e^{-E})$ is the probability, $P$, corresponding to the pairwise expectation $E$. Note that in the limit of infinite $E$, $P$ approaches 1; and in the limit as $E$ approaches 0, $E$ and $P$ approach equality. Due to inaccuracy in the statistical methods as they are applied in the BLAST programs, whenever $E$ and $P$ are less than about 0.05, the two values can be practically treated as being equal.
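
The two formulas above can be sketched in a few lines of Python. This is only an illustration, not the code used by the BLAST programs: the function names are hypothetical, and the values of K, lambda, the sequence lengths, the score and the database size are placeholders chosen for the example rather than parameters reported by any real search.

```python
import math


def expected_hsps(K, lam, N, S):
    """Expected number of chance HSPs scoring at least S: E = K*N*exp(-lam*S)."""
    return K * N * math.exp(-lam * S)


def e_to_p(E):
    """Probability of at least one chance HSP: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is tiny, where P and E nearly coincide.
    return -math.expm1(-E)


def database_expectation(E, D, d):
    """Extrapolate a pairwise expectation to the whole database: D*P/d."""
    return D * e_to_p(E) / d


# Placeholder inputs for illustration only (not taken from any BLAST run).
K, lam = 0.1, 0.3
query_len, subject_len = 250, 400
N = query_len * subject_len  # size of the pairwise search space
E = expected_hsps(K, lam, N, S=60)
P = e_to_p(E)
E_db = database_expectation(E, D=10_000_000, d=subject_len)
print(f"E = {E:.3g}, P = {P:.3g}, E_database = {E_db:.3g}")
```

For small pairwise expectations the printed E and P agree to several digits, which is the near-equality described above.
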
In contrast to the random sequence model used by Karlin-Altschul statistics, biological sequences are often short, so an HSP may involve a relatively large fraction of the query or database sequence, which reduces the effective size of the two-dimensional search space defined by the two sequences. To obtain more accurate significance estimates, the BLAST programs compute effective lengths for the query and database sequences that are their real lengths minus the expected length of the HSP, where the expected length for an HSP is computed from its score. In no event is an effective length for the query or database sequence permitted to go below 1. Thus, the effective length of either the query or the database sequence is computed according to the following:

$$\mathrm{Length}_{effective} = \max\left(\mathrm{Length}_{real} - \frac{\lambda S}{H},\ 1\right)$$

where $H$ is the relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990), one of the statistics reported by the BLAST programs. $H$ may be thought of as the information expected to be obtained from each pair of aligned residues in a real alignment that distinguishes the alignment from a random one.

Presented by Fredj Tekaia (tekaia@pasteur.fr)