KARLIN-ALTSCHUL STATISTICS
     From Karlin and  Altschul  (1990),  the  principal  equation
     relating  the  score  of an HSP to its expected frequency of
     chance occurrence is:

                   E = K * N * exp(-lambda * S)

     where E is the expected frequency of chance occurrence of an
     HSP having score S (or one scoring higher); K and lambda are
     Karlin-Altschul parameters; N is the product of the query
     and database sequence lengths, that is, the size of the
     search space; and exp denotes the exponential function, whose
     base e has a value of approximately 2.718.
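
     As an illustration, the equation above can be evaluated
     directly. The following is a minimal sketch, not code taken
     from the BLAST programs; the function name expected_hsps and
     its parameter names are hypothetical, and the K and lambda
     values in the usage line are placeholders chosen only for
     demonstration.

```python
import math


def expected_hsps(score, k, lam, query_len, db_seq_len):
    """Return E = K * N * exp(-lambda * S).

    N is taken to be the product of the query and database
    sequence lengths, as described in the text.
    """
    search_space = query_len * db_seq_len  # N
    return k * search_space * math.exp(-lam * score)


# Placeholder parameters, for illustration only.
e_value = expected_hsps(score=50, k=0.13, lam=0.32,
                        query_len=250, db_seq_len=400)
```

     Because S appears in a negative exponent, each unit increase
     in score reduces E by the constant factor exp(-lambda).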

     lambda may be thought of as the expected increase in
     reliability of an alignment associated with a unit increase
     in alignment score. Reliability in this case is expressed in
     units of information, such as bits or nats, with one nat
     being equivalent to 1/ln(2) (roughly 1.44) bits.
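
     The unit conversion can be sketched as follows. This example
     assumes that lambda is expressed in nats per unit of score,
     which is consistent with the use of the natural exponential
     in the principal equation; the function names are
     hypothetical.

```python
import math


def score_to_nats(score, lam):
    """Information associated with a score, assuming lambda is in
    nats per unit of score (an assumption of this sketch)."""
    return lam * score


def nats_to_bits(nats):
    """Convert nats to bits: one nat equals 1/ln(2) bits."""
    return nats / math.log(2)


bits_per_nat = nats_to_bits(1.0)  # roughly 1.44
```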

     The expectation E (range 0 to infinity)  calculated  for  an
     alignment between the query sequence and a database sequence
     can be  extrapolated  to  an  expectation  over  the  entire
     database search, by converting the pairwise expectation to a
     probability (range 0-1) and multiplying the  result  by  the
     ratio of the entire database size (expressed in residues) to
     the length of the matching database sequence.  In detail:

          E_database = D * (1 - exp(-E)) / d

     where D is the size of the database; d is the length of the
     matching database sequence; and the quantity (1 - exp(-E))
     is the probability, P, corresponding to the pairwise
     expectation E. Note that in the limit of infinite E, P
     approaches 1; and in the limit as E approaches 0, E and P
     approach equality. Due to inaccuracy in the statistical
     methods as they are applied in the BLAST programs, whenever
     E and P are less than about 0.05, the two values can be
     practically treated as being equal.
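
     The conversion from a pairwise expectation to a probability,
     and the extrapolation to the whole database, might be written
     as below. This is a sketch under the definitions given above,
     not the BLAST implementation; the function names are
     hypothetical. math.expm1 is used only because it keeps
     precision when E is very small.

```python
import math


def pairwise_probability(e_pair):
    """P = 1 - exp(-E) for a pairwise expectation E."""
    # -expm1(-E) equals 1 - exp(-E) but avoids cancellation
    # when E is close to 0.
    return -math.expm1(-e_pair)


def database_expectation(e_pair, db_size, db_seq_len):
    """E_database = D * (1 - exp(-E)) / d.

    db_size (D) is the database size in residues; db_seq_len (d)
    is the length of the matching database sequence.
    """
    return db_size * pairwise_probability(e_pair) / db_seq_len
```

     For small E the two quantities nearly coincide, so
     pairwise_probability(0.01) differs from 0.01 by well under
     one percent, while for large E the probability saturates
     at 1.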

     In contrast to the random sequence  model  used  by  Karlin-
     Altschul statistics, biological sequences are often short in
     length -- an HSP may involve a relatively large fraction  of
     the  query or database sequence, which reduces the effective
     size of the 2-dimensional search space defined  by  the  two
     sequences.   To obtain more accurate significance estimates,
     the BLAST programs compute effective lengths for  the  query
     and database sequences that are their real lengths minus the
     expected length of the HSP, where the expected length for an
     HSP is computed from its score. In no event is an effective
     length for the query or database sequence permitted to go
     below 1. Thus, the effective length of either the query or
     the database sequence is computed as follows:

          Length_effective = max(Length_real - lambda * S / H, 1)

     where H is the relative entropy of the target and background
     residue  frequencies (Karlin and Altschul, 1990), one of the
     statistics reported by the BLAST programs.  H may be thought
     of as the information expected to be obtained from each pair
     of aligned residues in a real alignment  that  distinguishes
     the alignment from a random one.
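
     The effective length formula above can be sketched as a
     one-line function. The name effective_length and its
     parameter names are hypothetical, and whether the BLAST
     programs round the result to an integer is not stated here,
     so the sketch leaves it as a real number.

```python
def effective_length(real_len, score, lam, h):
    """Length_effective = max(Length_real - lambda * S / H, 1).

    lambda * S / H is the expected length of an HSP with score S;
    the result is never allowed to fall below 1.
    """
    expected_hsp_len = lam * score / h
    return max(real_len - expected_hsp_len, 1)
```

     The same function applies to either the query or the
     database sequence; a short sequence with a high-scoring HSP
     is simply clamped to an effective length of 1.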

Presented by Fredj Tekaia tekaia@pasteur.fr