R. Overbeek, Mathematics and Computer Science Division, Argonne National Laboratory, 9000 S. Cass Ave. Argonne, IL 60439 USA, overbeek@mcs.anl.gov
C.R. Woese, Ribosomal Database Project, University of Illinois, 407 Goodwin Ave. Urbana IL. 61801 USA, woese@ninja.life.uiecu.edu
W. Gilbert, Harvard University Biolabs, 16 Divinity Ave. Cambridge, MA 12138 USA, gilbert@nucleus.harvard.edu
P.M. Gillevet, Mycoplasma capricolum Genome Project George Mason University, Fairfax, VA 22030-4444, USA, gillevet@uranus.nchgr.nih.gov
The best and most useful algorithms are often incorporated into commercial software packages for use by the biological community (1,2). These packages tend to add a defined, common user interface to a set of programs in order to make them more palatable to end users. During this process, programs are often converted from one programming language to another. File formats may also require conversion. The net result of these conversions is a set of user-friendly, although sometimes costly, systems for genetic data analysis. The primary drawback of these commercial packages is the time it may take for novel analyses to be incorporated into the systems. Successful algorithms developed today take months, if not years, to find their way into commercial packages. Because of this, the average researcher does not immediately benefit from these novel algorithms. Furthermore, in many instances it is impossible to expand these systems with custom-built algorithms, as the source code for these packages may not be freely distributed.
A versatile system for genetic data analysis should incorporate a simple-to-use interface as well as allow the addition of novel analytical tools. Attempts have been made (3) and approaches have been suggested (4) to incorporate all types of genetic data analysis in a single generic environment for computational biology, but to date none have developed to the extent that they are utilized by the general biological community. Thus the key goals of any general-purpose analysis environment should be ease of use and ease of expansion.
DNA and protein sequence analysis are arguably the most common forms of computer analysis used by the biological community and most current methods involve the comparison of two or more sequences at one time. Several multiple-sequence editors have been developed (5,6,7,8) that are used for searching for sequence motifs; others have been developed for phylogenetic analysis (9,10). As they were tailored for a specific field of interest, they are difficult to expand without significant modification to the entire package/program.
We present the paradigm of an expandable graphic user interface and demonstrate its utility with a multiple sequence editor called the Genetic Data Environment (GDE). The GDE is a modular software environment that will enable the user community to immediately benefit from the latest developments in computational biology and allow them to customize a system to meet their specific needs.
The second goal to be addressed was the development of a graphical system for DNA and protein sequence display and manipulation. There is little argument that a menu-and-mouse style interface greatly shortens the learning curve associated with sequence analysis software. However, the building of such interfaces is of little interest to most programmers, as these interfaces do not improve the quality of the actual algorithm but merely simplify its use by others. Thus, any integrated system for analysis must take responsibility for much of the graphic user interface and the representation of various views of the data.
The third goal addressed was the development of the hooks to accommodate end-user expandability. If an individual wishes to incorporate his or her own analysis into the system, he or she should not be required to modify, or even have access to, the source code for the environment. It should instead be possible to describe to the environment how it should access a new program, and the environment should handle all details of executing this new function and presenting the analysis to the end user. This would then allow each user to customize the environment to suit his or her particular needs.
By meeting these design goals, we hoped to minimize the work required to move a program from the programmer's test-bed to the end user's hands. In the process, we would produce an ever-expanding system for biomolecular sequence analysis in which all could benefit from others' expertise in various fields. It would benefit software developers to remain compatible with the system, because the requirements of compatibility are minimal and the benefits of compatibility are great. It is hoped that this paradigm for user interfaces will prompt programmers to establish a common repository where external programs are accumulated and freely shared within the user community, and that this repository will be expanded to encompass all areas of computational biology.

The system consists of an internal data representation module, a module that interprets the expandable menu configuration file, and a module that executes externally defined programs. These external extensions include basic analytical functions distributed with GDE and any function that can be run with command line arguments.
User-interface issues are handled by a simple menu definition file which describes the external analysis function in terms of what raw data it requires, what parameters it uses, and what form the returned results will take. By interpreting this file at run time instead of compile time, each individual user at a site can have his or her own customized setup. This definition file uses a simple language that allows easy modification by end users.
In summary, the environment retrieves user input, passes required data in required formats, executes all needed analysis programs, and returns all results to the user. Thus the external analysis programs appear to be part of one integrated system, with all behind-the-scenes file conversion, parameter passing, and file cleanup being handled automatically without the intervention of the user.
The prototype implementation of the above paradigm is the Genetic Data Environment (GDE), which represents amino acid and DNA sequence data in a multiple-sequence alignment format. The primary display/editor and menu system were written for the MIT X Window System using the XView user interface toolkit developed by Sun Microsystems. Using this toolkit greatly improved the esthetic quality of the interface and sped its development. We are currently running the system on Sun workstations using the Unix operating system, the flexibility of which was essential to the development of the system.
The display editor accepts five types of sequence information: nucleic acid sequence, amino acid sequence, text, sequence masks, and color mask information. It allows for sequence entry/editing under a four-stage protection scheme which prevents accidental data corruption by allowing or restricting alignment gap modification, ambiguous character modification, standard character modification, and sequence translation/reversal. Once a set of sequences is aligned, they can be grouped so that modifications in one propagate to all, and sequences or sub-sequences can be duplicated, removed, moved, translated, complemented, and reversed.
The editor supports five coloring schemes: a monochrome mode, color by sequencing direction, a character-to-color mapping, an alignment-wide color mask, and a sequence-by-sequence color mask. The character-to-color mapping aids visual evaluation of alignments: each nucleic acid character is mapped to one color, and amino acid characters are grouped into seven categories based on size and charge. The alignment-wide color mask gives each column of the alignment a specific color, which is useful in representing position-by-position scoring of alignments. The sequence color mask gives a position-by-position color coding for a given sequence and is designed for representing position-by-position scoring of individual sequences. Other features of the primary display editor include "split screen" editing, tactile feedback, a checking mode, a reduced-scale view, variable font sizes, and extended sequence information annotation.
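The protection scheme can be pictured as a set of independent permissions that are checked before each edit. The following Python sketch is purely illustrative and is not taken from the GDE source: the names Protect, char_class, and edit_allowed, the choice of gap characters, and the restriction to a nucleic acid alphabet are all assumptions.

```python
from enum import Flag, auto


class Protect(Flag):
    """Hypothetical permission flags, one per protection stage."""
    NONE = 0
    GAPS = auto()        # alignment gap characters may be changed
    AMBIGUOUS = auto()   # ambiguity codes (e.g. N, R, Y) may be changed
    STANDARD = auto()    # unambiguous characters may be changed
    TRANSFORM = auto()   # translation/reversal of a sequence is permitted


GAP_CHARS = set("-.~")          # assumed gap symbols
STANDARD_NUC = set("ACGTU")     # unambiguous nucleic acid characters


def char_class(ch):
    """Return the permission needed to touch a single character."""
    if ch in GAP_CHARS:
        return Protect.GAPS
    if ch.upper() in STANDARD_NUC:
        return Protect.STANDARD
    return Protect.AMBIGUOUS


def edit_allowed(old, new, allowed):
    """Check a one-character edit; use '' for old on insert, '' for new on delete."""
    needed = Protect.NONE
    for ch in (old, new):
        if ch:
            needed |= char_class(ch)
    # Every class of character involved in the edit must be unlocked.
    return (needed & allowed) == needed
```

With only Protect.GAPS enabled, for instance, a gap can be inserted or removed to adjust an alignment, but replacing a base with a gap is refused because it would also alter a standard character.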
The description language for the expandable menu file is fairly simple; an example of how one might tie in a hypothetical Unix program called db_search which performs a database search and retrieval follows. Assume the command line for this program is
db_search -bank DataBankName -field Field Keyword
where DataBankName might be one of GENBANK, PIR, or EMBL; Field might be one of AUTHOR, PUB (for publication) or DESC (for description); Keyword would be some descriptive text for the chosen search field. Figure 2 depicts how such a function would be described in the .GDEmenus configuration file and this would create a new menu item labeled "Data base search" (see Figure 3). Once the user clicks the OK button, program db_search is run with the user's chosen parameters, the results are written out to a temporary file, and the sequences contained in that file are loaded into the GDE editing window.
item:Data base search
itemmethod:db_search -bank $BANK -field $FIELD $KEYWORD > OutputFile
arg:BANK
argtype:chooser
arglabel:Which database to search
argchoice:Genbank:GENBANK
argchoice:PIR:PIR
argchoice:EMBL:EMBL
arg:FIELD
argtype:choice_menu
arglabel:What field to search on?
argchoice:Author:AUTHOR
argchoice:Publication:PUB
argchoice:Description:DESC
arg:KEYWORD
argtype:text
arglabel:Keyword to search for?
out:OutputFile
outformat:genbank
The pop-up window label is defined by "item:" and the external Unix command line that is executed is defined by "itemmethod:". Command line arguments are defined in each of the "arg:" sections, and their values are substituted into the "itemmethod:" line upon its execution. The results of the database search are written to the file named in the "out:" section, in the genbank format declared by "outformat:".
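To make the substitution step concrete, the following Python sketch reads a menu description of the kind shown above and builds the command line from a user's choices. It is a minimal illustration under stated assumptions, not the GDE implementation: the function names parse_menu and build_command, the dictionary layout, and the use of shell quoting are hypothetical, and only the keywords appearing in the example are handled.

```python
import shlex

MENU_TEXT = """\
item:Data base search
itemmethod:db_search -bank $BANK -field $FIELD $KEYWORD > OutputFile
arg:BANK
argtype:chooser
arglabel:Which database to search
argchoice:Genbank:GENBANK
argchoice:PIR:PIR
argchoice:EMBL:EMBL
arg:FIELD
argtype:choice_menu
arglabel:What field to search on?
argchoice:Author:AUTHOR
argchoice:Publication:PUB
argchoice:Description:DESC
arg:KEYWORD
argtype:text
arglabel:Keyword to search for?
out:OutputFile
outformat:genbank
"""


def parse_menu(text):
    """Parse 'key:value' lines into a list of menu item dictionaries."""
    items = []
    item = None   # menu item currently being filled in
    arg = None    # argument currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key == "item":
            item = {"label": value, "method": "", "args": [],
                    "out": None, "outformat": None}
            items.append(item)
            arg = None
        elif item is None:
            continue  # ignore anything before the first item
        elif key == "itemmethod":
            item["method"] = value
        elif key == "arg":
            arg = {"name": value, "type": None, "label": "", "choices": []}
            item["args"].append(arg)
        elif key == "argtype" and arg is not None:
            arg["type"] = value
        elif key == "arglabel" and arg is not None:
            arg["label"] = value
        elif key == "argchoice" and arg is not None:
            shown, passed = value.split(":", 1)  # label shown : value passed
            arg["choices"].append((shown, passed))
        elif key == "out":
            item["out"] = value
        elif key == "outformat":
            item["outformat"] = value
    return items


def build_command(item, values):
    """Substitute the user's values for the $NAME placeholders."""
    command = item["method"]
    # Longest names first, so one name cannot clobber a longer one.
    for name in sorted(values, key=len, reverse=True):
        command = command.replace("$" + name, shlex.quote(values[name]))
    return command


if __name__ == "__main__":
    search = parse_menu(MENU_TEXT)[0]
    print(build_command(search, {"BANK": "GENBANK", "FIELD": "AUTHOR",
                                 "KEYWORD": "Woese"}))
```

Because the description is parsed when the program starts rather than compiled in, adding a function in this sketch amounts to appending another "item:" block to the text.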

Sequence data in the editor may also be passed as input to an external function by adding a field to the configuration file describing the input format. The user selects sequences by highlighting the sequence names before selecting a menu function, and the selected data are then written out to a temporary file in the appropriate format before being passed as input to the external program. The menu file also allows for handing out data in specific regions by means of a selection mask, as well as controlling temporary file cleanup after a program executes.
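The cycle of writing the selection to a temporary file, running the external program, reading back its results, and cleaning up can be sketched as follows. This Python fragment is an illustration only: the function names, the INPUT and OUTPUT placeholders, the FASTA-like file layout, and the convention that a mask character of '1' keeps a column are all assumptions and do not describe the actual GDE formats.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def apply_mask(residues, mask):
    """Keep only the columns whose mask character is '1' (assumed convention)."""
    return "".join(ch for ch, keep in zip(residues, mask) if keep == "1")


def run_external(command, sequences, mask=None, keep_files=False):
    """Hand selected sequences to an external program and return its output.

    command   : argument list; the items "INPUT" and "OUTPUT" are replaced
                by the temporary file names (hypothetical placeholders)
    sequences : list of (name, residues) pairs selected by the user
    """
    workdir = tempfile.mkdtemp(prefix="gde_sketch_")
    in_path = os.path.join(workdir, "input.txt")
    out_path = os.path.join(workdir, "output.txt")
    try:
        # Write the selection in a simple FASTA-like stand-in format.
        with open(in_path, "w") as handle:
            for name, residues in sequences:
                if mask is not None:
                    residues = apply_mask(residues, mask)
                handle.write(f">{name}\n{residues}\n")
        argv = [in_path if part == "INPUT"
                else out_path if part == "OUTPUT"
                else part
                for part in command]
        subprocess.run(argv, check=True)
        with open(out_path) as handle:
            return handle.read()
    finally:
        # Temporary files are removed unless the caller asks to keep them.
        if not keep_files:
            shutil.rmtree(workdir, ignore_errors=True)


# Stand-in for an analysis program: copies its input to its output in upper case.
UPPERCASE_TOOL = [
    sys.executable, "-c",
    "import sys; src, dst = sys.argv[1:3]; "
    "open(dst, 'w').write(open(src).read().upper())",
    "INPUT", "OUTPUT",
]

if __name__ == "__main__":
    print(run_external(UPPERCASE_TOOL, [("seq1", "acgt-acgt")], mask="111100000"))
```

The external program here sees only two file names, which reflects the design point of the paradigm: a tool needs no knowledge of the editor in order to be tied into it.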
The second release of GDE has several external analysis functions tied in through its expandable menu system; these are distributed with the core editor. Most of these programs were incorporated with no changes to the original source code, and they include programs for multiple sequence alignment (11), similarity search (12,13), contig assembly (14), phylogenetic analysis (16,17,18), RNA secondary structure (18), and other custom-built functions useful in comparative analysis.
GDE 2.0 currently runs on workstations from Sun Microsystems, including the Sun 3, Sun 4, and SPARCstation lines. System software requirements are SunOS 4.1 or later and either OpenWindows 2.0 or the X11 window system from MIT along with the XView 2.0 toolkit. We have also implemented the system on the DECstation from Digital Equipment Corporation, and informal ports of GDE have been accomplished on the Silicon Graphics Iris as well as on the Cray X-MP. The package is available to Internet users via anonymous ftp from golgi.harvard.edu.
The Genetic Data Environment was originally conceived as a phylogenetic tool at the University of Illinois and incorporated many salient features of previously developed multiple sequence editors (9,10). The need for an expandable system became obvious very early in the effort, and the project quickly evolved into GDE 1.0, first released in February 1991. This version of the editor had an expandable menu system and integrated external programs through input and output files. The base editor was subsequently expanded to include sparse matrix storage, cut and paste, and a flat database using tagged fields that is attached to the sequence entry. This later version, GDE 2.0, was released in May 1992.
Recently, we incorporated several database features and tools to handle data from the Mycoplasma project at the Harvard Genome Lab (19) and are expanding the environment further to include high-level query capabilities (20). This model has proven extremely flexible, allowing direct access to remote procedure calls on high-performance computers and network-based servers, and it can be easily modified on a daily basis as the need arises. The latter point is critical when one is developing new technology and cannot predict a priori the needs of the system.
In conclusion, as the technical revolutions in DNA mapping and sequencing technology are setting the stage for the complete analysis of entire genomes, it is imperative that an evolutionary perspective be established to provide a systematic and coherent framework for the comparative analysis of such data. The initial tool required for comparative analysis is an expandable multiple sequence editor and, as such, GDE represents the first step in developing this framework. It is hoped that the paradigm outlined for the integration of multiple sequence analytical tools into a common graphic user interface can be extended to other views and perspectives of biology, allowing the development of an encompassing environment for computational biology.
The authors thank Jordan Konisky for the initial support of this effort at the Center for Prokaryotic Genome Analysis at the University of Illinois and Gary Olsen for his initial suggestions on system requirements for multiple sequence editors. We would also like to thank Don Gilbert of Indiana University, Mike Maciukenas at the University of Illinois, and Chunwei Wang at the Harvard Genome Laboratory for their programming efforts and contributions. C. Woese is supported in part by a Ribosomal Database Project grant from NSF DIR 891-17863 at the University of Illinois. Further development of this interface is being supported by the Harvard Genome Laboratory, Harvard University, which is funded by NIH grant R01 HG00124.