The Genetic Data Environment:

An Expandable GUI for Multiple Sequence Analysis

S. Smith, R. Overbeek, C.R. Woese, W. Gilbert and P. Gillevet


S. Smith, Harvard Genome Laboratory, Harvard University, 16 Divinity Ave. Cambridge, MA 02138 USA, smith@nucleus.harvard.edu

R. Overbeek, Mathematics and Computer Science Division, Argonne National Laboratory, 9000 S. Cass Ave. Argonne, IL 60439 USA, overbeek@mcs.anl.gov

C.R. Woese, Ribosomal Database Project, University of Illinois, 407 Goodwin Ave. Urbana IL. 61801 USA, woese@ninja.life.uiecu.edu

W. Gilbert, Harvard University Biolabs, 16 Divinity Ave. Cambridge, MA 02138 USA, gilbert@nucleus.harvard.edu

P.M. Gillevet, Mycoplasma capricolum Genome Project George Mason University, Fairfax, VA 22030-4444, USA, gillevet@uranus.nchgr.nih.gov


Abstract

An X Windows-based graphic user interface is presented which allows the seamless integration of numerous existing biomolecular programs into a single analysis environment. This environment is based on a core multiple sequence editor that is linked to external programs by a user-expandable menu system and is supported on Sun(TM) and DEC(TM) workstations. There is no limitation to the number of external functions that can be linked to the interface. The length and number of sequences that can be handled are limited only by the size of virtual memory present on the workstation. The sequence data itself is used as the reference point from which analysis is done, and scalable graphic views are supported. It is suggested that future software development utilizing this expandable, user-defined menu system and the I/O linkage of external programs will allow biologists to easily integrate expertise from disparate fields into a single environment.

Introduction

The field of computational molecular biology involves the analysis of a broad spectrum of biological data using various algorithms for molecular modeling, comparative and phylogenetic analysis, database management, genetic mapping and sequence analysis. New programs and algorithms for the analysis of these types of genetic data are constantly being developed using various computer languages, interface methods and file formats. As such, these programs require a wide range of computer expertise to use.

The best and most useful algorithms are often incorporated into commercial software packages for use by the biological community (1,2). These packages tend to add a defined, common user interface to a set of programs in order to make them more palatable to end users. During this process, programs are often converted from one programming language to another. File formats may also require conversion. The net result of these conversions is a user-friendly, although sometimes costly, system for genetic data analysis. The primary drawback to these commercial packages is the time it may take for novel analyses to be incorporated into the systems. Successful algorithms developed today take months, if not years, to find their way into commercial packages. Because of this, the average researcher does not immediately benefit from these novel algorithms. Furthermore, in many instances it is impossible to expand these systems with custom-built algorithms, as the source code for these packages may not be freely distributed.

A versatile system for genetic data analysis should incorporate a simple-to-use interface as well as allow the addition of novel analytical tools. Attempts have been made (3) and approaches have been suggested (4) to incorporate all types of genetic data analysis in a single generic environment for computational biology, but to date none has developed to the extent that it is utilized by the general biological community. Thus the key goals of any general-purpose analysis environment should be ease of use and ease of expansion.

DNA and protein sequence analysis are arguably the most common forms of computer analysis used by the biological community and most current methods involve the comparison of two or more sequences at one time. Several multiple-sequence editors have been developed (5,6,7,8) that are used for searching for sequence motifs; others have been developed for phylogenetic analysis (9,10). As they were tailored for a specific field of interest, they are difficult to expand without significant modification to the entire package/program.

We present the paradigm of an expandable graphic user interface and demonstrate its utility with a multiple sequence editor called the Genetic Data Environment (GDE). The GDE is a modular software environment that will enable the user community to immediately benefit from the latest developments in computational biology and allow them to customize a system to meet their specific needs.

Design Goals

GDE was designed to remove three bottlenecks encountered in the integration of sequence analysis software. The first goal was the development of a flexible system that would allow the incorporation of algorithms written in various programming languages. Existing programs in languages such as FORTRAN, Pascal, C, BASIC, Lisp, and PROLOG are generally rewritten in a common language before they are incorporated under a single user interface. This requires a significant amount of code writing, with the net result being a system that does little more than the components did. Any truly integrated system must accommodate the use of a mixture of languages in order to avoid constant rewrites of working software and allow this existing body of analytical software to be incorporated into the system with minimal effort.

The second goal was the development of a graphical system for DNA and protein sequence display and manipulation. There is little argument over the fact that the menu and mouse style interface greatly shortens the learning curve associated with sequence analysis software. However, the building of such interfaces is of little interest to most programmers, as these interfaces do not improve the quality of the actual algorithm but merely simplify its use by others. Thus, any integrated system for analysis must take responsibility for much of the graphic user interface and the representation of various views of the data.

The third goal was the development of hooks to accommodate end-user expandability. If an individual wishes to incorporate his or her own analysis into the system, he or she should not be required to modify, or even have access to, the source code for the environment. It should instead be possible to describe to the environment how it should access a new program, and the environment should handle all details of executing this new function and presenting the analysis to the end user. This would allow each user to customize the environment to suit his or her particular needs.

By meeting these design goals, we hoped to minimize the work required to move a program from the programmer's test-bed to the end user's hands. In the process, we would produce an ever-expanding system for biomolecular sequence analysis in which all could benefit from others' expertise in various fields. It would benefit software developers to remain compatible with the system, because the requirements of compatibility are minimal and the benefits of compatibility are great. It is hoped that this paradigm for user interfaces will prompt programmers to establish a common repository where external programs are accumulated and freely shared within the user community, and that this repository will be expanded to encompass all areas of computational biology.

Implementation

We divided the environment into three primary parts (see Figure 1) to meet these goals: data representation and manipulation, an expandable user-defined menu system, and external program execution. Data representation is the graphic display and editing of the data. In the case of protein and nucleic acid sequence data, the representation follows along the lines of a text editor for multiple sequences. The emphasis of sequence display is always on flexibility of representation and manipulation. Color highlighting is a modifiable trait, not a fixed scheme, so that analysis functions can change color attributes as easily as they change actual sequence data. Alignment-specific editing functions such as group insertion/deletion and data locking are needed.

Figure 1

The logical organization of the Genetic Data Environment.

The system consists of an internal data representation module, a module that interprets the expandable menu configuration file and a module that executes externally defined programs. These external extensions include basic analytical functions distributed with GDE and any function that can be run with command line arguments.

User-interface issues are handled by a simple menu definition file which describes the external analysis function in terms of what raw data it requires, what parameters it uses, and what form the returned results will take. By interpreting this file at run time instead of compile time, each individual user at a site can have his or her own customized setup. This definition file uses a simple language that allows easy modification by end users.

In summary, the environment retrieves user input, passes required data in required formats, executes all needed analysis programs, and returns all results to the user. Thus the external analysis programs appear to be part of one integrated system, with all behind-the-scenes file conversion, parameter passing, and file cleanup being handled automatically without the intervention of the user.

The Prototype System

The prototype implementation of the above paradigm is the Genetic Data Environment (GDE), which represents amino acid and DNA sequence data in a multiple-sequence alignment format. The primary display/editor and menu system were written using the MIT X Windows system under the XView user interface toolkit developed by Sun Microsystems. This greatly improved the esthetic quality of the interface, as well as speeding its development. We are currently running the system on Sun workstations using the Unix operating system, the flexibility of which was essential to the development of the system.

The display editor accepts five types of sequence information: nucleic acid sequence, amino acid sequence, text, sequence masks, and color mask information. It allows for sequence entry/editing under a four-stage protection scheme, which prevents accidental data corruption by separately allowing or restricting alignment gap modification, ambiguous character modification, standard character modification, and sequence translation/reversal. Once a set of sequences is aligned, the sequences can be grouped so that modifications in one propagate to all, and sequences or sub-sequences can be duplicated, removed, moved, translated, complemented, and reversed.
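The protection scheme can be pictured as a set of independent permissions that every edit must pass. The following Python sketch is only an illustration of that idea and is not taken from the GDE source; the names Allow, edit_class, insert_char and reverse, as well as the particular gap symbols and the set of "standard" nucleotide characters, are assumptions made for the example.

```python
from enum import Flag, auto


class Allow(Flag):
    """Edit classes that a sequence's protection setting may permit."""
    GAPS = auto()        # alignment gap modification
    AMBIGUOUS = auto()   # ambiguous character modification
    STANDARD = auto()    # standard character modification
    TRANSFORM = auto()   # sequence translation / reversal


GAP_CHARS = set("-.~")          # assumed gap symbols (hypothetical)
STANDARD_BASES = set("ACGTU")   # assumed unambiguous nucleotide codes


def edit_class(ch):
    """Return the permission needed to insert or change character ch."""
    if ch in GAP_CHARS:
        return Allow.GAPS
    if ch.upper() in STANDARD_BASES:
        return Allow.STANDARD
    return Allow.AMBIGUOUS


def insert_char(sequence, position, ch, allowed):
    """Insert ch at position, refusing if its edit class is locked."""
    needed = edit_class(ch)
    if not (needed & allowed):
        raise PermissionError(f"{needed.name} edits are locked")
    return sequence[:position] + ch + sequence[position:]


def reverse(sequence, allowed):
    """Reverse a sequence only when transformations are permitted."""
    if not (Allow.TRANSFORM & allowed):
        raise PermissionError("TRANSFORM edits are locked")
    return sequence[::-1]


if __name__ == "__main__":
    # With only gap edits enabled, an aligned sequence can be padded
    # but its residues cannot be changed by accident.
    print(insert_char("ACGT", 2, "-", Allow.GAPS))
```

Treating the four protections as combinable flags means a sequence can, for example, be opened for gap edits during alignment while its residues stay locked.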

The editor supports five coloring schemes: a monochrome mode, color by sequencing direction, a character-to-color mapping, an alignment-wide color mask, and a sequence-by-sequence color mask. The character-to-color mapping, in which each nucleic acid character is mapped to one color, aids in visual evaluation of alignments. Amino acid characters are grouped into seven categories based on size and charge. The alignment-wide color mask gives each column of the alignment a specific color, which is useful in representing position-by-position scoring of alignments. The sequence color mask gives a position-by-position color coding for a given sequence and is designed for representing position-by-position scoring of individual sequences. Other features of the primary display editor include the ability to do "split screen" editing, tactile feedback, a checking mode, a reduced-scale view, variable font sizes, and extended sequence information annotation.
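The difference between a character-to-color mapping and an alignment-wide color mask can be sketched in a few lines of Python. This is an illustrative sketch only: the palette, the function names and the column score used here (how many sequences share the most common non-gap character) are assumptions for the example, not the scheme used inside GDE.

```python
from collections import Counter

# Hypothetical palette: one color per nucleic acid character.
BASE_COLORS = {"A": "red", "C": "blue", "G": "black", "T": "green", "U": "green"}


def color_by_character(sequence, default="grey"):
    """Character-to-color mapping: each position is colored by its residue."""
    return [BASE_COLORS.get(ch.upper(), default) for ch in sequence]


def column_color_mask(alignment, palette=("red", "yellow", "green")):
    """Alignment-wide mask: one color per column, shared by all sequences.

    The column score is an assumed stand-in for any position-by-position
    scoring: the count of the most common non-gap character in the column.
    """
    mask = []
    for column in zip(*alignment):
        counts = Counter(ch.upper() for ch in column if ch != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        # Scale the count onto the palette using integer arithmetic.
        span = max(len(column) - 1, 1)
        index = max(top - 1, 0) * (len(palette) - 1) // span
        mask.append(palette[index])
    return mask


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "AC-C"]
    print(color_by_character(alignment[0]))
    print(column_color_mask(alignment))
```

Because the mask is data rather than a fixed scheme, an external scoring program could produce it just as it produces sequence data.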

Expandable Menu System

External functions are tied in by means of the menu configuration file ".GDEmenus". This file resides in a shared help directory, or can also be placed in an individual user's home directory for easier customization. This configuration file defines how to access remote functions and how to present "dialog boxes" to the user. It specifies the data format for external programs as well as the format for data that the functions might pass back to GDE.

The description language for the expandable menu file is fairly simple; an example of how one might tie in a hypothetical Unix program called db_search which performs a database search and retrieval follows. Assume the command line for this program is

db_search -bank DataBankName -field Field Keyword

where DataBankName might be one of GENBANK, PIR, or EMBL; Field might be one of AUTHOR, PUB (for publication) or DESC (for description); and Keyword would be some descriptive text for the chosen search field. Figure 2 depicts how such a function would be described in the .GDEmenus configuration file; this entry would create a new menu item labeled "Data base search" (see Figure 3). Once the user clicks the OK button, program db_search is run with the user's chosen parameters, the results are written out to a temporary file, and the sequences contained in that file are loaded into the GDE editing window.

Figure 2:

Example of a menu configuration definition hooking an external function into GDE:

item:Data base search
itemmethod:db_search -bank $BANK -field $FIELD $KEYWORD > OutputFile

arg:BANK
argtype:chooser
arglabel:Which database to search
argchoice:Genbank:GENBANK
argchoice:PIR:PIR
argchoice:EMBL:EMBL

arg:FIELD
argtype:choice_menu
arglabel:What field to search on?
argchoice:Author:AUTHOR
argchoice:Publication:PUB
argchoice:Description:DESC

arg:KEYWORD
argtype:text
arglabel:Keyword to search for?

out:OutputFile
outformat:genbank

The pop up window label is defined by "item:" and the external Unix command line that is executed is defined by "itemmethod:". Command line arguments are defined in each of the "arg:" sections, and their values are substituted into the "itemmethod:" line upon its execution. The results of the database search are written to the file named in the "out:" section, in the GenBank format specified by "outformat:".
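To make the substitution step concrete, the following Python sketch reads a definition like the one in Figure 2 and fills the user's choices into the command line. It is a simplified illustration, not the parser used by GDE: the function names parse_menu_item and build_command and the dictionary layout are assumptions, and only the keywords that appear in Figure 2 are handled.

```python
FIGURE_2 = """
item:Data base search
itemmethod:db_search -bank $BANK -field $FIELD $KEYWORD > OutputFile
arg:BANK
argtype:chooser
arglabel:Which database to search
argchoice:Genbank:GENBANK
argchoice:PIR:PIR
argchoice:EMBL:EMBL
arg:FIELD
argtype:choice_menu
arglabel:What field to search on?
argchoice:Author:AUTHOR
argchoice:Publication:PUB
argchoice:Description:DESC
arg:KEYWORD
argtype:text
arglabel:Keyword to search for?
out:OutputFile
outformat:genbank
"""


def parse_menu_item(text):
    """Collect one menu definition into a dictionary (illustrative layout)."""
    item = {"args": []}
    current = None  # the "arg:" section currently being filled in
    for raw in text.splitlines():
        line = raw.strip()
        if ":" not in line:
            continue  # skip blank or unrecognized lines
        key, value = line.split(":", 1)
        if key == "arg":
            current = {"name": value, "choices": []}
            item["args"].append(current)
        elif key in ("argtype", "arglabel") and current is not None:
            current[key] = value
        elif key == "argchoice" and current is not None:
            shown, passed = value.split(":", 1)  # label shown : value passed
            current["choices"].append((shown, passed))
        else:
            item[key] = value  # item, itemmethod, out, outformat
    return item


def build_command(item, values):
    """Substitute $NAME placeholders in the itemmethod line."""
    command = item["itemmethod"]
    # Longest names first, so one name cannot clobber a longer one.
    for name in sorted(values, key=len, reverse=True):
        command = command.replace("$" + name, values[name])
    return command


if __name__ == "__main__":
    menu = parse_menu_item(FIGURE_2)
    chosen = {"BANK": "GENBANK", "FIELD": "AUTHOR", "KEYWORD": "Woese"}
    print(build_command(menu, chosen))
```

The sketch substitutes the values verbatim; a real implementation would also need to quote free-text values such as the keyword before handing the line to a shell.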

Figure 3:

The pop up menu created by GDE after interpreting the entry in the menu configuration that is described in Figure 2.

Sequence data in the editor may also be passed as input to an external function by adding a field to the configuration file describing the input format. The user selects sequences by highlighting the sequence names before selecting a menu function, and the selected data are then written out to a temporary file in the appropriate format before being passed as input to the external program. The menu file also allows for handing out data in specific regions by means of a selection mask, as well as controlling temporary file cleanup after a program executes.
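The hand-off described above (write the selected data to a temporary file, run the external program, collect its output, clean up) can be sketched as follows. This is a minimal Python illustration under several assumptions: the function names, the INPUT placeholder, the FASTA-like layout of the temporary file, and the use of '1' characters in the selection mask are all hypothetical and are not taken from GDE.

```python
import os
import shlex
import subprocess
import tempfile


def apply_mask(sequence, mask):
    """Keep only the columns where the selection mask holds '1'."""
    return "".join(ch for ch, keep in zip(sequence, mask) if keep == "1")


def run_on_selection(sequences, selected, command, mask=None, cleanup=True):
    """Write the selected sequences to a temporary file and run a command.

    The placeholder INPUT in the command is replaced by the temporary file
    name. The program's standard output is returned as text.
    """
    fd, path = tempfile.mkstemp(suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            for name in selected:
                residues = sequences[name]
                if mask is not None:
                    residues = apply_mask(residues, mask)
                handle.write(f">{name}\n{residues}\n")  # assumed FASTA-like layout
        result = subprocess.run(
            command.replace("INPUT", shlex.quote(path)),
            shell=True, check=True, capture_output=True, text=True,
        )
        return result.stdout
    finally:
        if cleanup:
            os.remove(path)  # temporary file cleanup after the program runs


if __name__ == "__main__":
    editor_data = {"seq1": "ACGTAC", "seq2": "TTAAGG"}
    # "cat" stands in for an external analysis program on a Unix system.
    print(run_on_selection(editor_data, ["seq1"], "cat INPUT", mask="111000"))
```

The external program needs to know nothing about the editor; it only sees a file in a format it already understands.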

The second release of GDE has several external analysis functions tied in through its expandable menu system; these are distributed with the core editor. Most of these programs were incorporated with no changes made to the original source code and include programs for multiple sequence alignment (11), similarity search (12,13), contig assembly (14), phylogenetic analysis (15,16,17), RNA secondary structure (18), and other custom-built functions useful in comparative analysis.

System Requirements

GDE 2.0 currently runs on workstations from Sun Microsystems, including Sun 3, Sun 4, and the SparcStation line of workstations. System software requirements are SunOS 4.1 or later and either OpenWindows 2.0 or the X11 window system from MIT along with the XView 2.0 toolkit. We have also implemented the system on the DECstation from Digital Equipment Corporation, and informal ports of GDE have been accomplished on the Silicon Graphics Iris as well as the Cray X-MP. The package is available to internet users via anonymous ftp to golgi.harvard.edu.

Discussion

The Genetic Data Environment was originally conceived as a phylogenetic tool at the University of Illinois and incorporated many salient features of previously developed multiple sequence editors (9,10). The need for an expandable system became obvious very early in the effort, and the tool quickly evolved into GDE 1.0, first released in February 1991. This version of the editor had an expandable menu system and integrated external programs through input and output files. The base editor was subsequently expanded to include sparse matrix storage, cut and paste, and a flat database using tagged fields that is attached to the sequence entry. This later version, GDE 2.0, was released in May 1992.

Recently, we incorporated several database features and tools to handle data from the Mycoplasma project at the Harvard Genome Lab (19) and are expanding the environment further to include high level query capabilities (20). This model has proven extremely flexible, allowing direct access to remote procedure calls on high-performance computers and network-based servers, and can be easily modified on a daily basis as the need arises. The latter point is critical when one is developing new technology and cannot predict a priori the needs of the system.

In conclusion, as the technical revolutions in DNA mapping and sequencing technology are setting the stage for the complete analysis of entire genomes, it is imperative that an evolutionary perspective be established to provide a systematic and coherent framework for the comparative analysis of such data. The initial tool required for comparative analysis is an expandable multiple sequence editor and, as such, GDE represents the first step in developing this framework. It is hoped that the paradigm outlined for the integration of multiple sequence analytical tools into a common graphic user interface can be extended to other views and perspectives of biology, allowing the development of an encompassing environment for computational biology.

Acknowledgements

The authors thank Jordan Konisky for the initial support of this effort at the Center for Prokaryotic Genome Analysis at the University of Illinois and Gary Olsen for his initial suggestions on system requirements for multiple sequence editors. We would also like to thank Don Gilbert of Indiana University, Mike Maciukenas at the University of Illinois, and Chunwei Wang at the Harvard Genome Laboratory for their programming efforts and contributions. C. Woese is supported in part by a Ribosomal Database Project grant from NSF DIR 891-17863 at the University of Illinois. Further development of this interface is being supported by the Harvard Genome Laboratory, Harvard University, which is funded by NIH grant R01 HG00124.


References

  1. B. Roe (1988) Biotechniques 6(6), 560-562

  2. K. Ahern (1991) Genetic Engineering News January

  3. R.J. Douthart, J.J. Thomas, S.D. Rossier, J.E. Schmatz and J.W. West (1986) Nucl. Acids Res. 14(1), 285-297

  4. J. Ostell (1989) ToolBox Design Doc NCBI, personal communication

  5. J. Jurka (1987) in Report of the Matrix of Biological Knowledge Workshop (H.J. Morowitz and T. Smith eds) Santa Fe Institute, Santa Fe, NM.

  6. D.G. George and W.C. Barker (1986) Macromolecules, Genes and Computers, MBCRR of the Dana Farber Cancer Institute, Harvard University and GENBANK-Los Alamos National Laboratory.

  7. J. Devereux, P. Haeberli and O. Smithies (1984) Nucl. Acids Res. 12(1), 387-395

  8. A.M. Barber and J.V. Maizel Jr. (1990) Gene Anal. Techn. Appl. 7, 39-45

  9. G. Olsen, personal communication

  10. T. Macke, personal communication

  11. D.G. Higgins, A.J. Bleasby and R. Fuchs (1991) submitted to CABIOS

  12. S.F. Altschul, W. Gish, W. Miller, E.W. Myers and D.J. Lipman (1990) J. Mol. Biol. 215, 403-410

  13. W.R. Pearson and D.J. Lipman (1988) PNAS 85, 2444-2448

  14. X. Huang (1991) submitted to Genomics

  15. J. Felsenstein (1989) Cladistics 5, 164-166

  16. G. De Soete (1984) Psychometrika C implementation by M. Maciukenas, University of Illinois

  17. M. Maciukenas, University of Illinois personal communication

  18. M. Zuker (1989) Science 244, 48-52

  19. S.W. Smith, C. Wang, P.M. Gillevet and W. Gilbert (1992) Genetic Data Environment and the Harvard Genome Database. Genome Mapping and Sequencing, Cold Spring Harbor Laboratory

  20. A. Baehr, R. Hagstrom, D. Joerg and R. Overbeek (1991) Argonne Technical Report ANL/MCS-TM-155, Argonne National Laboratory, September