This Web site allows users to locate and rank 8-mer, 9-mer, or 10-mer peptides
that contain peptide-binding motifs for HLA class I molecules. The
rankings employ amino acid/position coefficient tables deduced from
the literature by Dr. Kenneth Parker of the National Institute of
Allergy and Infectious Diseases (NIAID) at the National Institutes of
Health (NIH) in Bethesda, Maryland. The Web site was created by
Ronald Taylor of the Bioinformatics and Molecular Analysis Section
(BIMAS), Computational Bioscience and Engineering Laboratory
(CBEL), Division of Computer Research & Technology (CIT),
National Institutes of Health, in collaboration with Dr. Parker.
The instructions below follow the order of buttons and entry boxes on the
page: top-to-bottom, and within that order, left-to-right:
- Choose which HLA molecule is of interest via the "Molecule"
menu button. The default value is "A_0201". (The choice of the molecule
determines which coefficient table the scoring program uses on your
sequence.)
- Choose the length of the subsequences that the program extracts
from your input sequence and then scores and ranks. Use the
"mers" menu button for this. Currently three lengths are available:
eight, nine, or ten residues. The default value is nine. If this default
value is chosen, then the appropriate 20-by-9 coefficient matrix for the
selected HLA molecule will be used for the scoring. If the subsequence
length is chosen to be ten residues, the same matrix will be used, but the
fifth residue will be ignored in scoring. If eight residues are chosen, a
different 20-by-8 matrix will be employed.
- Choose the method you wish to use to limit the number of
results returned. Do this via the "Results limited by" set of two toggle buttons.
There are two possibilities:
- "Explicit Number" (the default). If you choose this method, then
proceed to the menu directly below this button to select which number of results to
return. The menu default for the number of values to return is set at 20.
- "Predicted T(1/2) >=". If you choose this method, you are requesting
that only those subsequences whose scores (predicted half-lives) are at or
above a given value be returned. Proceed to the menu directly below this
button to select the cutoff score. The menu default cutoff score is set at 100.
- Enter the sequence of the protein to be searched. This is required
input. Either type in the sequence or paste it in from another window. Many
formats can be used: Raw/Plain, EMBL, Pearson/Fasta, etc.
- Choose whether you wish to display your input sequence
on the output page using the "Echo input sequence" button.
Echoing is recommended, and the default value is set to "echo".
If echoed, your sequence will be shown on the output page as numbered
lines of 50 residues each. (The numbering makes it easy to locate, in
the input sequence, the subsequences returned in the results table.)
- Submit the job by clicking on the "submit" button.
The current maximum size of the input sequence is arbitrarily set
to 5000 residues. The program will truncate the sequence at that
point, no matter how long the original sequence entered is, and
will only work with the first 5000 residues. This is a safeguard
to prevent the Web site from being overwhelmed.
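The truncation rule above, together with the echo layout described earlier (numbered lines of 50 residues each), can be sketched roughly as follows. This is only an illustration, not the program's actual code: the function names are hypothetical, and the exact label format (each line prefixed with the 1-based position of its first residue) is an assumption.

```python
MAX_RESIDUES = 5000   # current maximum input size stated on this page
LINE_WIDTH = 50       # residues per echoed line stated on this page


def truncate(sequence, limit=MAX_RESIDUES):
    """Keep only the first `limit` residues; anything beyond is dropped."""
    return sequence[:limit]


def echo_lines(sequence, width=LINE_WIDTH):
    """Split the sequence into numbered lines of `width` residues.

    Assumption: each line is labeled with the 1-based position of its
    first residue. The real output page may number lines differently.
    """
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        lines.append(f"{start + 1:>5}  {chunk}")
    return lines


if __name__ == "__main__":
    demo = truncate("MKTAYIAKQR" * 12)   # 120 residues, well under the limit
    for line in echo_lines(demo):
        print(line)
```

With a 120-residue input this yields three lines, the last holding the remaining 20 residues.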
After you submit your job, a set of scores will be calculated for all
8-mer, 9-mer, or 10-mer subsequences contained in your input sequence,
depending on which option you selected. Based on the scores, the
subsequences will be ranked. This task should be completed within a
few seconds, unless your input sequence is extremely large, in which
case somewhat more time will be required. A display page will then be
returned that shows
- a short table of the user-entered parameters (for verification and
later recall), along with some scoring data and other useful items
(such as the number of scores calculated, the number of scores
requested, the number of scores actually reported back in the output
table, and the length of the user's input sequence)
- the scoring output table, where the results of calculations on the
subsequences will be displayed. This table is described in more detail below.
- a listing of your input peptide sequence
(if you have asked for the sequence to be echoed back)
Each row of the scoring output table will consist of four columns.
The values in these columns represent
- the ranking of the subsequence
- the starting position in your input protein sequence
of the first amino acid residue of the subsequence
- a residue listing of the 8-mer, 9-mer, or 10-mer peptide subsequence itself
- an estimated numerical score for the subsequence (upon which the
rank in the first column is based). In the case of HLA-A2,
this score corresponds to the estimated half-time of dissociation of complexes
containing the peptide at 37 °C at pH 6.5. For other molecules, the estimate is
based on the observed anchor residue preferences, as published in the reference(s)
listed.
The number of rows (entries) in the scoring output table will be
limited by whatever method you chose (cutoff score or explicit
number).
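As a rough sketch of how the four-column output table and the two limiting methods fit together, consider the following. The function names are hypothetical, the scoring function is passed in as a parameter, and two details are assumptions rather than documented behavior: start positions are taken to be 1-based, and tied scores keep their order of appearance in the input sequence.

```python
def rank_subsequences(sequence, length, score_fn):
    """Return rows of (rank, start position, subsequence, score).

    Every window of `length` residues is scored and the windows are
    sorted by descending score. Start positions are 1-based here
    (an assumption).
    """
    rows = []
    for start in range(len(sequence) - length + 1):
        sub = sequence[start:start + length]
        rows.append((start + 1, sub, score_fn(sub)))
    # sort() is stable, so ties keep their order of appearance
    rows.sort(key=lambda row: row[2], reverse=True)
    return [(rank, start, sub, score)
            for rank, (start, sub, score) in enumerate(rows, start=1)]


def limit_results(ranked, explicit_number=20, min_half_life=None):
    """Apply one of the two limits: a score cutoff or an explicit count."""
    if min_half_life is not None:
        return [row for row in ranked if row[3] >= min_half_life]
    return ranked[:explicit_number]


if __name__ == "__main__":
    # Toy scoring function for demonstration only: counts leucines.
    toy_score = lambda sub: float(sub.count("L"))
    table = rank_subsequences("ALLAAL", 3, toy_score)
    for row in limit_results(table, explicit_number=2):
        print(row)
```

The defaults mirror the menu defaults described above: 20 results when limiting by explicit number, with the cutoff used only when one is supplied.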
A link back to the HLA motif search page, along with a date/time stamp,
is placed at the bottom of the output page.
The algorithm used to score each 8-mer, 9-mer, or 10-mer peptide
subsequence is simple. It runs as follows:
- The initial (running) score is set to 1.0.
- For each residue position, the program examines
which amino acid appears at that position. The running score is
then multiplied by the coefficient for that amino acid type,
at that position, for the chosen HLA molecule.
These coefficients have been pre-calculated
(see the Background section below) and are stored for use by the scoring algorithm
in a separate directory as a collection of HLA coefficient files.
(To view these coefficient values, see the next section below.)
- Using 9-mers, nine multiplications are performed. Using 10-mers,
nine multiplications are again performed, because the residue lying at
the fifth position in the subsequence is skipped. The resulting
running score is multiplied by a final constant to yield an estimate
of the half-time of dissociation. (This constant is stored at
the end of the coefficient file for the HLA molecule. It can have a
different value for each HLA molecule.) The final multiplication
yields the score reported in the output table. Using 8-mers, eight
multiplications are performed instead of nine, with a different coefficient
matrix employed (20-by-8 rather than 20-by-9).
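The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program's source: the function name and the dictionary layout of the coefficient table are hypothetical, the coefficient values in the demo are invented and do not come from any real table, and we assume that for a 10-mer the nine remaining residues map in order onto the nine matrix columns and that 8-mer scores are also multiplied by a final constant.

```python
def score_peptide(peptide, coefficients, final_constant):
    """Multiply per-position coefficients, then the final constant.

    `coefficients` maps (amino_acid, column_index) to a coefficient,
    with column_index starting at 0. Missing entries count as 1.0,
    which leaves the running score unchanged.
    """
    if len(peptide) not in (8, 9, 10):
        raise ValueError("peptide must be an 8-mer, 9-mer, or 10-mer")

    # For a 10-mer, the fifth residue (index 4) is skipped, leaving nine.
    residues = peptide[:4] + peptide[5:] if len(peptide) == 10 else peptide

    score = 1.0  # initial running score
    for column, amino_acid in enumerate(residues):
        score *= coefficients.get((amino_acid, column), 1.0)
    return score * final_constant


if __name__ == "__main__":
    # Invented values for demonstration only.
    toy_table = {("L", 1): 2.0, ("V", 8): 3.0}
    print(score_peptide("ALAAAAAAV", toy_table, 0.5))   # 9-mer
    print(score_peptide("ALAAGAAAAV", toy_table, 0.5))  # 10-mer, fifth residue skipped
```

Both demo peptides receive the same score, because the only difference between them is the skipped fifth residue of the 10-mer.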
For each HLA molecule, the coefficient values discussed above are
stored in a file in our separate directory of coefficient files. When
the user selects the HLA molecule type (or, on our restricted Web page
for advanced users, selects the actual filename from the coefficient
filename menu) and selects "9-mer" or "10-mer" as the length of the
subsequence, the "standard" file containing the 20-by-9 coefficient
matrix for the HLA molecule is read on-the-fly, with the 181 values
(180 coefficient values plus one final constant) being read into an
internal array for use by the scoring program. If the user selected
"8-mer" as the subsequence length, then the program proceeds in the
same fashion, but reads in a different coefficient file appropriate
for 8-mer searches on that selected HLA molecule type.
A given HLA molecule can have multiple coefficient files constructed
for it in our coefficient file directory, but only the two "standard"
files for the molecule are available from this site for scoring. For
example, if the user selects "A3" in the HLA molecule scrollable menu
and "9" or "10" as the subsequence length, then the "A3_standard"
coefficient file would be used. If the user selects "A3" in the HLA
molecule scrollable menu and "8" as the subsequence length, then the
"A3_8mer_standard" coefficient file would be used. (The other files
are available for use/modification on the restricted advanced site.)
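The file selection just described amounts to a small lookup. The sketch below uses the two file names given in the example above; the function name is hypothetical, and extending the same naming pattern to molecules other than A3 is an assumption.

```python
def standard_coefficient_file(molecule, length):
    """Return the name of the "standard" coefficient file to read.

    9-mer and 10-mer searches share the 20-by-9 file; 8-mer searches
    use a separate 20-by-8 file.
    """
    if length in (9, 10):
        return f"{molecule}_standard"
    if length == 8:
        return f"{molecule}_8mer_standard"
    raise ValueError("subsequence length must be 8, 9, or 10")


if __name__ == "__main__":
    for n in (8, 9, 10):
        print(n, standard_coefficient_file("A3", n))
```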
To view the coefficient values in a file for a selected HLA molecule,
go to our coefficient viewing page.
All non-alphabetic characters are filtered out from the input
sequence. That is, they are completely removed before any subsequence
extraction and scoring are performed.
The 26 alphabetic characters are handled as follows:
- There are twenty unambiguous alphabetic characters that can be employed
in the input sequence. These twenty characters are simply the
standard single-letter abbreviations for the amino acids given in the
standard amino acid table.
- Six alphabetic characters are treated as ambiguous. These
characters are B, J, O, U, X, and Z. J, O,
and U are meaningless. B, X, and Z do
have standard definitions of asparagine or aspartate, unknown, and
glutamine or glutamate, respectively, but
these three definitions are not used in the coefficient tables used
by the scoring algorithm. The six alphabetic ambiguous characters are
thus treated as follows: in each case, the ambiguous character is given a coefficient of
1.00. Since the scoring algorithm, as described above, simply
multiplies the running score by the coefficient for a given amino acid
type at a given position, the effect of multiplying by the
coefficient of the ambiguous character is to leave the score
unchanged. That is, the ambiguous character can be said to be ignored
when the score for the subsequence containing it is calculated.
However, the ambiguous character is NOT discarded; it still occupies
its position in the subsequence. An alternative solution to the problem
of ambiguous characters would be to throw them out entirely and take
another residue to fill out the subsequence for each ambiguous character
thrown out. In effect, that is what we do for all non-alphabetic
characters. For the six ambiguous alphabetic characters, however, we
keep the character and give it a coefficient of 1.00, as described above.
Regarding display in the output: the 20 non-ambiguous alphabetic characters
representing the 20 amino acid types will be displayed in uppercase in
the subsequence matches in the scoring table. Ambiguous alphabetic
characters will be shown in the subsequence matches as periods
(dots). As stated above, all non-alphabetic characters will be stripped
out, not used in scoring, and hence not displayed.
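The character handling described in this section can be sketched as follows. The function names are hypothetical, and converting lowercase input letters to uppercase before scoring is an assumption (this page only states that the output is shown in uppercase).

```python
# The twenty standard single-letter amino acid codes.
UNAMBIGUOUS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Remove every non-alphabetic character and uppercase the rest.

    Ambiguous letters (B, J, O, U, X, Z) are kept: they still occupy a
    position in each subsequence and are scored with a coefficient of 1.00.
    """
    return "".join(ch.upper() for ch in raw if ch.isascii() and ch.isalpha())


def display_form(subsequence):
    """Show ambiguous letters as periods, as in the scoring table."""
    return "".join(ch if ch in UNAMBIGUOUS else "." for ch in subsequence)


if __name__ == "__main__":
    cleaned = clean_sequence("ac-d 12\nxB*")
    print(cleaned)                # digits, spaces, and punctuation are gone
    print(display_form(cleaned))  # X and B appear as dots
```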
Note that the coefficient tables used at this Web site have not been published
elsewhere (except for HLA-A2). We intend to update the values in the
tables as we deem appropriate, based on new information,
correspondence, or judgment calls. To cite these tables
appropriately, note the date the program at this Web site was used, and obtain a
printout of the table that was employed. Because the tables
necessarily contain a large number of judgment calls, in the future
we may explore allowing the user to modify a table, or to employ a table
entirely of the user's own devising. Users are encouraged to check
the tables with the references cited, so as to understand the basis
for the numbers in the tables.
Depending on correspondence received, especially from other scientists
in the field who have helped determine anchor residue preferences, we
are likely to extend these tables to include additional class I
molecules, as well as class II molecules.
Principle of the calculations: The idea behind these tables is the
assumption that, to the first approximation, each amino acid in the
peptide contributes independently to binding to the class I molecule.
Dominant anchor residues, which are critical for binding, have
coefficients in the tables that are significantly different from 1.
Highly favorable amino acids have coefficients substantially greater
than 1, and unfavorable amino acids have positive coefficients that
are less than 1. Auxiliary anchor residues have coefficients that
also differ from 1, but by less than those of dominant anchor
residues. When any amino acid has been found to be enriched or
depleted in the endogenous peptides associated with a class I
molecule, the coefficients have been adjusted to take that fact into
account, whether or not the enrichment or depletion has an obvious
structural basis. In some cases, such coefficients may reflect the
peptide-binding properties of other proteins in the peptide / class I
complex pathway. In all cases, many amino acids have coefficients
that have the default value of exactly 1.0. That means that the amino
acid at that position is not known to make either a favorable or
unfavorable contribution to the binding of the peptide. There are
several reported instances where combinations of amino acids appear to
make contributions to peptide binding that are greater than or less
than expected from each amino acid considered separately. These
complications are to be expected, and are best dealt with using this
program by re-sorting the initial output manually. Perhaps later
versions of this Web site will be able to account for such
considerations. I (Ken Parker) believe that at present a more serious
problem is that many of the coefficients have not been determined
accurately enough. There must be many instances where auxiliary
anchor effects have not been determined yet, and there are probably
even more instances where unfavorable amino acid preferences have not
been elucidated.
Links to references are provided on a separate References Page.
Links to useful Web sites can be found on our Associated Web Site Page.
Lastly, we provide a listing of concatenated peptide sequences that
are known to bind to MHC class I molecules. This is the best test
sequence for the coefficient tables. You can find it on our page for
Concatenated MHC class I peptide sequences from Rammensee et al.,
Immunogenetics 41:178.
For comments or feedback regarding the implementation of this Web site, contact
Ronald Taylor at rtaylor@helix.nih.gov.
For comments or questions regarding the research underlying the motif search algorithm
and the coefficient values for the different HLA molecules
used by this Web site, contact Dr. Kenneth Parker at
KPARKER@atlas.niaid.nih.gov.