BIOINFORMATICS<-->STRUCTURE
Jerusalem, Israel, November 17-21, 1996

Abstract


Structural neighbors and structural alignments for Entrez

Tom Madej, Jean-Francois Gibrat, Chris Hogue, Hitomi Ohkawa and Stephen Bryant

Computational Biology Branch, National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894 USA

bryant@melanie.nlm.nih.gov


Entrez is an Internet tool for retrieval of information on the structure and function of biological macromolecules [1,2] (http://www3.ncbi.nlm.nih.gov/Entrez/). It provides daily-updated databases of molecular sequences, the Medline citations pertaining to molecular genetics, and three-dimensional structures from the latest release of the Protein Data Bank [3]. With simple term-matching queries one may easily retrieve information on a molecule of interest from any of these sources. One may also easily "link" between databases, to retrieve, for example, the Medline citations contained within a molecular sequence or structure report.

Entrez's most powerful source of information on molecular structure and function, however, is its "neighbor" database. The neighbors of a sequence are its homologs, as identified by a significant similarity score using the Blast algorithm [4]. The neighbors of a Medline citation are articles that use surprisingly similar terms in their title and abstract [5]. Since biological functions are often conserved among members of a homology group, and/or described in the associated Medline abstracts, one may easily explore the structure-function relationships of an entire protein family by traversing these neighbor relationships.

Structural neighbors in Entrez are identified by a direct comparison of 3D structure. All of the roughly 10,000 protein domain substructures in the current Protein Data Bank have been compared to one another using the VAST algorithm [6,7], and the resulting structure-structure alignments and superpositions recorded. The VAST algorithm (for "Vector Alignment Search Tool") places great emphasis on the definition of the threshold of significant structural similarity. By focusing on similarities that are surprising in a statistical sense, one does not waste time examining the many similarities of small substructures that occur by chance in structure comparison. The remaining similarities are largely examples of remote homology, many of them undetectable by sequence comparison. As such they may provide a broader view of the structure, function and evolution of a protein family.

At the heart of VAST's significance calculation is the definition of the "unit" of tertiary structure similarity as a pair of secondary structure elements (SSEs) that have similar type, relative orientation, and connectivity. In comparing two protein domains, the most surprising substructure similarity is the one for which the superposition score summed across these "units" is greatest. The likelihood that this similarity would be seen by chance is given as a simple product: the probability that one would obtain this score in drawing so many "units" at random, times the number of alternative SSE-pair alignments possible in a given domain comparison. In practice one finds that the VAST significance threshold identifies similarities that span a sizable fraction of the domain structures compared, and it would appear that this theory corresponds to the subjective criteria long employed by crystallographers.
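The product described above can be illustrated with a small sketch. This is not the VAST implementation: the function names, the Monte Carlo estimate of the tail probability, and the list of background "unit" scores are all assumptions introduced for illustration. The sketch estimates the probability of reaching a summed score when drawing a given number of units at random, then multiplies it by a caller-supplied count of alternative SSE-pair alignments.

```python
import random


def tail_probability(score, n_units, background, trials=10000, seed=0):
    """Estimate P(sum of n_units random unit scores >= score).

    `background` is a hypothetical list of per-unit superposition scores
    observed for unrelated structures; draws are made with replacement.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(rng.choice(background) for _ in range(n_units))
        if total >= score:
            hits += 1
    return hits / trials


def chance_likelihood(score, n_units, background, n_alternatives,
                      trials=10000, seed=0):
    """Tail probability times the number of alternative alignments.

    `n_alternatives` (the count of alternative SSE-pair alignments in the
    domain comparison) is supplied by the caller; the result is capped at 1.
    """
    p = tail_probability(score, n_units, background, trials, seed)
    return min(1.0, p * n_alternatives)
```

A comparison would be called significant when `chance_likelihood` falls below a chosen threshold; the threshold itself is not specified here.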

To calculate residue-by-residue structural alignments, VAST examines alternatives using a Gibbs sampling algorithm, starting from the "seed" SSE-pair alignment. The optimal alignment is defined as that which is most surprising relative to a background distribution of alpha-carbon superposition residuals, obtained by drawing structural fragments at random from like-sized protein domains. This definition provides an objective criterion with which to balance the well-known trade-off of lower superposition residuals versus more aligned residues. In practice refined alignments from VAST appear conservative, choosing a highly similar "core" substructure. In this superposition one easily identifies regions where protein evolution has modified the structure.
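The idea of refining an alignment outward from a seed by Gibbs sampling can be sketched in a deliberately simplified form. The following is an illustrative toy, not VAST's procedure: it assumes a single ungapped aligned segment, and it assumes each position already carries a hypothetical per-residue "surprise" value (positive where the superposition residual is smaller than the background would suggest, negative otherwise). The sampler alternately redraws the segment start and end, each conditioned on the other, and remembers the highest-scoring segment visited.

```python
import math
import random


def gibbs_refine(surprise, seed_index, sweeps=200, rng_seed=0):
    """Return (start, end), a half-open segment containing seed_index.

    `surprise` is a hypothetical list of per-residue log-odds values;
    the segment score is their sum over [start, end).
    """
    rng = random.Random(rng_seed)
    n = len(surprise)

    # Prefix sums make every segment score an O(1) lookup.
    prefix = [0.0]
    for value in surprise:
        prefix.append(prefix[-1] + value)

    def seg_score(start, end):
        return prefix[end] - prefix[start]

    def sample(candidates, score_of):
        # Draw one candidate with probability proportional to exp(score).
        scores = [score_of(c) for c in candidates]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        return rng.choices(candidates, weights=weights, k=1)[0]

    starts = list(range(0, seed_index + 1))
    ends = list(range(seed_index + 1, n + 1))

    start, end = seed_index, seed_index + 1  # begin with the seed alone
    best = (start, end)
    for _ in range(sweeps):
        start = sample(starts, lambda s: seg_score(s, end))
        end = sample(ends, lambda e: seg_score(start, e))
        if seg_score(start, end) > seg_score(*best):
            best = (start, end)
    return best
```

Because negative surprise values penalize extension into poorly superposed regions, the sketch shows in miniature how a scoring criterion of this kind balances lower residuals against more aligned residues and tends toward a conservative "core".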

Entrez retrieves structural data from MMDB [2,8] (http://www.ncbi.nlm.nih.gov/Structure/), the "Molecular Modeling DataBase". MMDB maps Protein Data Bank information to an ASN.1 specification that lists explicitly both spatial and chemical-graph descriptions of macromolecular structure. Structural neighbors are presented via molecular graphic images using the CN3D viewer, provided with the Entrez client software. CN3D operates on many computer platforms, including Macintosh, Windows and Unix, and supports a variety of algorithmic rendering schemes. Structure superposition data may also be easily exported from Entrez, most simply by writing PDB-format files rotated to the reference frame of a "neighbor" structure. In this way Entrez may serve as a starting point for detailed comparative analysis by structural biologists using other software to examine the patterns of structural conservation within a protein family.
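For readers working with such exported files, the rotation into a neighbor's reference frame amounts to applying a rigid-body transform to each coordinate record. The sketch below is not Entrez code; the function names and the convention x' = R x + t are assumptions. It relies only on the fixed-column PDB layout, in which the x, y and z coordinates of ATOM and HETATM records occupy columns 31-38, 39-46 and 47-54.

```python
def transform_pdb_line(line, rotation, translation):
    """Apply x' = R x + t to one PDB record; other records pass through.

    `rotation` is a 3x3 nested list and `translation` a length-3 list;
    both are hypothetical inputs taken from a superposition.
    """
    if not line.startswith(("ATOM", "HETATM")):
        return line
    xyz = [float(line[30:38]), float(line[38:46]), float(line[46:54])]
    moved = [
        sum(rotation[i][j] * xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Rewrite only the coordinate columns, keeping the rest of the record.
    coords = "".join("%8.3f" % value for value in moved)
    return line[:30] + coords + line[54:]


def transform_pdb_text(text, rotation, translation):
    """Transform every coordinate record in a block of PDB-format text."""
    return "\n".join(
        transform_pdb_line(line, rotation, translation)
        for line in text.splitlines()
    )
```

Writing the transformed text to a new file gives a structure that other viewers can overlay directly on the reference structure.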

[1] Schuler GD, Epstein JA, Ohkawa H, Kans JA: Entrez: Molecular biology database and retrieval system. Methods Enzymol. 1996, 266:141-162.
[2] Hogue CWV, Ohkawa H, Bryant SH: A dynamic look at structures: WWW-Entrez and the molecular modeling database. Trends Biochem. Sci. 1996, 21:226-229.
[3] Abola EE, Bernstein FC, Bryant SH, Koetzle TF, Weng JC: Protein data bank. In Crystallographic databases: information content, software systems, scientific applications. Edited by Allen FH, Bergerhoff G, Sievers R. Bonn, Chester, Cambridge: International Union of Crystallography; 1987:107-132.
[4] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215:403-410.
[5] Wilbur WJ, Yang Y: An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med. 1996, 26:209-222.
[6] Madej T, Gibrat J-F, Bryant SH: Threading a database of protein cores. Proteins 1995, 23:356-369.
[7] Gibrat J-F, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr. Opin. Struct. Biol. 1996, 6:377-385.
[8] Ohkawa H, Ostell J, Bryant S: MMDB: An ASN.1 specification for macromolecular structure. ISMB 1995, 3:259-267.

