Complexity Analysis of a Self-Assembling vs. a Template-Directed System.
Gad Yagil
Dept. of Cell Biology, The Weizmann Institute, Rehovot, Israel 76100.
Abstract: The structural biocomplexity of two viral structures is evaluated: that of the small RNA tobacco mosaic virus (TMV) and that of the larger dsDNA bacteriophage T4. Tobacco mosaic virus was chosen as a paradigm of a self-assembling biostructure, while T4 represents biostructures where genome-directed instructions are essential for the achievement of the correct virion structure. A large difference in complexity values is found: C = 4 for the TMV virion versus C = 117 for the tail part of the T4 virion. The considerable difference in these values indicates a correlation between the structural biocomplexity as defined and the pattern coding requirements of these organisms. It is proposed to utilize complexity analysis for the evaluation of the expected genomic contribution to structural specification: the higher the complexity of a structure, the more genomic directions, of known or unknown nature, are likely to be required. The elements of biological complexity evaluation (Yagil, 1985; 1993b) are briefly summarized. A quantitative measure of order (0.93 for the T4 tail fiber) is an additional outcome of the formalism employed.
1. T4 vs. TMV
A major issue concerning pattern formation in living entities is whether biological structures are formed solely by spontaneous, self-assembling processes or whether additional, genome encoded signals are required for the specification of biostructures. In a classical demonstration of a self-assembling process, Fraenkel-Conrat (1963) dissociated the small RNA virus TMV (Tobacco Mosaic Virus) into its two components, RNA and coat protein, and showed that when these two components are remixed, fully infective virus particles are regained.
A different picture emerges when more complicated viruses such as the double-stranded DNA bacteriophages are examined. Bacteriophage T4, for example, is composed of two principal parts, a head and a tail. The tail structure is made of a sheath, a baseplate and six fibers attached to the baseplate (see Figure), arranged in close to perfect hexagonal symmetry. Bacteriophage T4 has a genome of about 166000 nucleotides (168699 nt in the T4 phage database as of 6/1994), compared with 6394 nt for TMV, and contains at least 130 genes (Mosig and Eiserling, 1988). The correct expression of at least 49 of these genes is required for the assembly of the intact virion structure. Most of these genes code for the proteins composing the structure, but at least four genes code for products that are not found in the final structure: gene 57 codes for an enzymatic activity required for tail fiber assembly; gene 51, although essential, does not code for an identified protein and its role is so far obscure; the product of gene 63 ("gp63") is responsible for chaperoning the correct joining of long tail fibers to the tail part (Wood and Crowther, 1983); gene 38 directs the correct assembly of tail fiber components (see Table 1 and Casjens and Hendrix, 1988). Gp29 and/or gp48 are proposed to serve as a tape or jig determining the correct length of the tail tube (Berget and King, 1983; Casjens and Hendrix, 1988). Other gene products have a role in directing the correct assembly of the head structure: gene 21 codes for a protease needed to cleave some of the head proteins before they can be properly assembled. Gp22 appears in the prohead to serve as a scaffold for the assembly of a correct head shell; only its degradation products appear in the final head structure (Black and Showe, 1983). It is clear that a simplistic principle of self-assembly is not applicable to bacteriophage T4. Rather, specific genomic instructions, some acting via as yet unknown molecular pathways, have to be implemented before an infective virion particle can be produced in the infected cell. Direction by such genomic signals may be dominant in higher organisms.
In this paper we apply a previously formulated theory of biocomplexity (Yagil, 1985; 1993a,b) to compare the structural complexity of two virions: the self-assembling TMV virion and the instruction-directed T4 phage. It will be shown that the numerical complexities of the two structures differ greatly, leading to the proposition that self-organization can work well mainly for structures of low complexity, while template-generated instructions are required for the generation of more highly complex structures.
2. Structural Biocomplexity
Structural biocomplexity has been defined as the length of the shortest list of numerical and symbolic specifications necessary to describe a given structure (Yagil, 1985). This definition can be regarded as a form of the Kolmogorov algorithmic complexity (cf. Li and Vitanyi, 1993) adapted to real, physical structures. The length of the list of specifications describing a physical system is determined by the number of regularities (i.e. identical numerical specifications) in the system: the more regularities that can be identified in the elements of the system (e.g. in the phage particles), the shorter the specification list required. In quantitative form (Yagil, 1985):
$$C = \sum_k \frac{c(k)}{k} - c' \qquad (1)$$

where C denotes the structural complexity of the system, c(k) the number of point coordinates sharing a k-fold regularity, and c' the 5-6 coordinates necessary to place a rigid structure in the external world. We have previously shown how the complexity C is evaluated for small molecules such as methane, ethane and adenine (Yagil, 1985), for proteins (1993a) and for hydrocarbon isomers (1993b). Here we apply the procedure to evaluate the structural complexities of the TMV virion and of the bacteriophage T4 tail structure.
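As an illustrative sketch of Eq. (1), and not part of the published formalism, the sum can be written in a few lines of Python. The function name `structural_complexity` and the dictionary layout are assumptions made here for illustration. The example input uses the TMV coat figures of Table 3 (2100 identical subunits with four coordinates each, all read as sharing a 2100-fold regularity), with c' set to zero because Table 3 quotes its total without that subtraction.

```python
from fractions import Fraction


def structural_complexity(regularities, c_prime=0):
    """Evaluate C = sum_k c(k)/k - c' as in Eq. (1).

    `regularities` maps each regularity k to c(k), the number of point
    coordinates that share a k-fold regularity. `c_prime` is the number
    of coordinates needed to place the rigid structure in the external
    world (5-6 in the text; 0 here by default, an assumption).
    """
    total = sum(Fraction(c_k, k) for k, c_k in regularities.items())
    return total - c_prime


# Hypothetical encoding of Table 3: 2100 coat-protein units, 4 coordinates
# each (type, r, phi, z), every coordinate sharing a 2100-fold regularity.
tmv_coat = {2100: 4 * 2100}
c_tmv = structural_complexity(tmv_coat)
print(c_tmv)  # 4
```

Exact fractions are used so that a sum of many c(k)/k terms does not pick up floating-point rounding error.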
The first step in the analysis is to decide upon the hierarchical level of interest, e.g. whether virion complexity is to be evaluated in terms of its structural proteins, in terms of their constituent amino acids, or possibly in terms of the atoms constituting each amino acid. The decision is an arbitrary one and depends on the kind of information needed. When the coding capacity of a genome is of interest, complexity in terms of gene products, i.e. proteins, is called for. A simple level rule (Yagil, in preparation) connects the complexities of the different hierarchical levels.
Table 1. The Long Tail Fibers of Phage T4 Virion - Specification Table (a)

| No. | n (i=0,n-1) | ε (protein) | r (nm) | φ (f) | θ | Function |
|---|---|---|---|---|---|---|
| 1 | 6 x 2 (b) | gp34 | 695/2 (c,e) | Φ34 + iπ/3 | Any 1 | Proximal part, main |
| 2 | 6 | gp35 | R35 (d) | Φ35 + iπ/3 | Any 2 | Distal part |
| 3 | 6 x 2 (b) | gp36 | 690/2 (d) | Φ36 + iπ/3 | Any 2 | Distal part |
| 4 | 6 x 2 (b) | gp37 | 690/2 (d) | Φ37 + iπ/3 | Any 2 | Distal part, main |
| - | 0 | gp38 | - | - | - | catalytic |
| - | 0 | gp63 | - | - | - | catalytic |
| Specifications: | | 4 | 4+3 (b) | 4+3 | 0+3 | |
Comments to Tables 1 and 2:
a. Source: Wood and Crowther, 1983.
b. These proteins are di-, tri- or tetrameric. The extra 3 or 4 coordinates are for the positions of the monomers within multimers, assuming a symmetric arrangement.
c. A coordinate transformation to an origin at the edge of the baseplate, e.g. at the center of gp9, applies. A transformation with previously defined parameters does not add to the complexity (Rule 6, Yagil, 1985).
d. The origin is here on gp34. A transformation with the parameters for gp34 applies.
e. Numerical values are given where known, to illustrate that complexity analysis is based on observed values. The numbers were extracted from journal figures, and can differ somewhat from the measured values.
f. Some of the φ coordinates may have the same value because of the hexagonal symmetry, but no firm data are available.
g. Sources: 1. Eiserling, F.A., Compreh. Virol. 13: p. 558 (1979). 2. Berget, P. and King, J. (1983). 3. Casjens and Hendrix (1988).
h. The numbers are for the hexagon form and are taken from Fig. 4 of Crowther et al. (1977).
j. From Amos and Klug (1977).
k. Note that the helically arranged gp18 and gp19 need two numerical z specifications each.
m. See Kikuchi and King, 1975.
n. Converts gp12, 34, 37 (King and Laemmli, 1973).
p. The coordinates in small print are for the positions of monomers within multimers.
The next two steps are to set up a specification table and to choose a coordinate system. In the specification table all components of the system and their coordinates are listed (cf. e.g. Table 1). Each protein component is to be specified by four coordinates: one symbolic coordinate (ε) for its "type" or "color" (for H2O, for instance, this would be: ε1 = O; ε2,3 = H) and three numerical coordinates for its position within the virion structure (the polar coordinates for H2O would be: r1, φ1, θ1 = 0; r2,3 = rOH; φ2 = 0; φ3 = π; θ2,3 = ±52.25°). A unique way to find the coordinate system which gives the shortest list of specifications cannot be offered at present, and several coordinate systems may have to be examined. For the analyses of the T4 tail and of TMV a cylindrical coordinate system (r, φ, z) turns out to be the most suitable one, because of the hexagonal and helical symmetries in their structures. For the tail fibers (Table 1), the polar system (r, φ, θ) is the more suitable one.

In the fourth step it should be determined whether each of the listed coordinates is a random or an ordered one. An ordered coordinate is one for which a fixed numerical value applies to all members of an ensemble (e.g. to all phage particles in a test tube), while a disordered, or random, coordinate is one which takes a different value in each member of the ensemble. For instance, each tail fiber of T4 has two subparts whose lengths are constant ("ordered coordinates") but whose angles are variable, depending on the extent to which each fiber is drawn in or extended (see Figure). These angles are "disordered coordinates", because they take a different value in every phage in an ensemble of phage particles, and are assigned values of "Any" in Table 1. The angles of freely rotating amino acid chains are other examples of random coordinates, while neighboring carbon-carbon distances are ordered features of these molecules. A coordinate is ordered only when it takes the same value in each particle in the ensemble. The distinction between disordered (random) and ordered coordinates is important, because only ordered coordinates, which can be assigned defined numerical values, contribute to the complexity of the system. For disordered coordinates it cannot be stated whether they are complex or simply arranged; they therefore have to be excluded from the analysis. This is a major feature of the approach presented here.
As a bonus, data like those in Table 1 enable the assignment of a numerical value to the degree of order in a system in a quite natural way. The degree of order Ω has been defined (Yagil, 1985, p. 19) as the ratio of ordered to total coordinates. The tail fiber components have altogether 168 coordinates (7 proteins of 6 units each, 4 coordinates each); of these 12 are disordered (2 angles of 6 proteins), leaving 156 ordered coordinates. The ratio of ordered to total coordinates is therefore 156/168, yielding a degree of order of Ω = 0.93 for the tail fibers. There are no "Any" coordinates in Table 2; consequently Ω = 1 for the main tail structure.

3. The Complexity of the T4 Tail
The final step is to determine the minimal number of coordinates required to describe the ordered part of the system. This is done in Tables 1-2 for the tail fibers and the main tail structures of bacteriophage T4. The tail fiber is composed of the four gene products listed in Table 1 (Wood and Crowther, 1983); the two additional genes listed must be active for correct tail assembly but their products are absent from the final structure. Twenty-two T4 gene products are listed in the specification table of the main tail structure (Table 2), divided into a sheath and tube, baseplate hub and six baseplate arms (or "wedges"). Almost all of the proteins listed are present in multiple units, up to 144 copies for the sheath and tube protein units (Berget and King, 1983). A numerical value has been entered whenever an experimental value was found in the literature. Almost all proteins listed share a six-fold radial symmetry around the main axis of the phage, chosen here as the z axis. The z coordinates are the distances along the z axis to an arbitrarily chosen origin, e.g. to the base of the sheath structure. The radial arrangement of most units implies that only one r value need be listed for each protein, reducing the number of specifications needed for each six-fold subunit by 5. The radial symmetry also dictates the φ values for each component, so that one can write φ = 2πi/6 when six protein units are found in the structure. A slightly more complicated relation, involving a second numerical value, is necessary to specify the φ coordinate of the tube and sheath protein subunits (gp18, gp19), because of the helical displacement of successive rings.
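How a handful of specifications fixes the position of every subunit can be sketched in Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the origin values Φ and Z are set to zero, the factor 17 in the gp18 entry of Table 2 is assumed to be in degrees, and the gp18 formulas (Φ18 + 17i, Z18 + 4.1 int(i/6)) are applied literally as printed in that table. The radial value 15.3 nm is the gp9 entry of Table 2, used only as an example.

```python
import math


def ring_positions(n, r_nm, phi0_deg=0.0, z_nm=0.0):
    """Cylindrical coordinates (r, phi, z) of n units with n-fold radial
    symmetry: one r, one phi origin and one z describe all n units."""
    step_deg = 360.0 / n
    return [(r_nm, (phi0_deg + i * step_deg) % 360.0, z_nm) for i in range(n)]


def sheath_positions(n=144, r_nm=12.0, phi0_deg=0.0, twist_deg=17.0,
                     z0_nm=0.0, rise_nm=4.1):
    """Literal reading of the Table 2 entries for gp18:
    phi_i = Phi18 + 17*i and z_i = Z18 + 4.1*int(i/6).
    Units of the twist (degrees) and zero origins are assumptions."""
    return [
        (r_nm, (phi0_deg + twist_deg * i) % 360.0, z0_nm + rise_nm * (i // 6))
        for i in range(n)
    ]


ring = ring_positions(6, r_nm=15.3)
sheath = sheath_positions()

# Six units from three numbers; 144 units from five numbers.
print(len(ring), len(sheath))
# Number of distinct z levels (rings of six) in the sheath.
print(len({z for _, _, z in sheath}))
```

The point of the sketch is only that the length of the parameter list, not the number of subunits, is what enters the complexity count.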
Table 2. The Tail Part of Phage T4 Virion - Specification Table (g)

| No. | n (i=0,n-1) | ε (protein) | r (nm) | φ (f) | z (nm) | Location and Function |
|---|---|---|---|---|---|---|
| 1 | 144 | gp18 | 12.0 (e) | Φ18 + 17i (e) | Z18 + 4.1 int(i/6) (e) | Sheath (j) |
| 2 | 144 | gp19 | 4.5 | Φ19 + 17i | Z19 + 4.1 int(i/6) | Tube (j) |
| 3 | 1-6 | gp15 | R15 | Φ15 + iπ/3 | Z15 | Sheath cap |
| 4 | 1-6 | gp3 | R3 | Φ3 + iπ/3 | Z3 | Tube cap |
| 5 | 6 | gp48 | 0 | Φ48 + iπ/3 | Z48 | Sheath initiation, jig? |
| 6 | 6 | gp54 | R54 | Φ54 + iπ/3 | Z54 | Tube initiation |
| 8 | 6 | gp29 | R29 | Φ29 + iπ/3 | Z29 | Hub center, folate synthase, tape (m) |
| 7 | 6 | gp5 | R5 | Φ5 + iπ/3 | Z5 | Hub, lysozyme |
| 9 | 6 | gp27 | R27 | Φ27 + iπ/3 | Z27 | Hub |
| 10 | 3 (e) | gp26 | R26 | Φ26 + 2iπ/3 | Z26 | Hub assembly, folate synthase |
| 11 | 3 (e) | gp28 | R28 | Φ28 + 2iπ/3 | Z28 | Hub assembly, pteroyl-hexaglutamate synthase |
| 12 | 6 | frd | Rfrd | Φfrd + iπ/3 | Zfrd | Hub, dihydrofolate reductase |
| 13 | 3 | td | Rtd | Φtd + 2iπ/3 | Ztd | Hub center, dT synthase |
| 14 | 12 | gp6 | R6 | Φ6 + iπ/6 | Z6 | Arm, main inner |
| 15 | 6 | gp7 | R7 | Φ7 + iπ/3 | Z7 | Arm, to hub |
| 16 | 6 | gp8 | R8 | Φ8 + iπ/3 | Z8 | Arm, main inner |
| 17 | 6 x 4 (b,e) | gp9 | 15.3 (h) + r9 (p) | Φ9 + iπ/3 + φ9 (p) | Z9 + z9 (p) | Arm, long fiber attachment site |
| 18 | 6 x 2 (b,e) | gp10 | 18.0 (h) + r10 | Φ10 + iπ/3 + φ10 | Z10 + z10 | Arm, spike or vertex |
| 19 | 6 x 2 (b,e) | gp11 | 19.1 (h) + r11 | Φ11 + iπ/3 + φ11 | Z11 + z11 | Arm, spike knob |
| 20 | 6 x 3 (b,e) | gp12 | 19.1 (h) + r12 | Φ12 + iπ/3 + φ12 | Z12 + z12 | Arm, spike fiber |
| 21 | 6 | gp53 | R53 | Φ53 + iπ/3 | Z53 | Arm, hexamer joiner? |
| 22 | 6 | gp25 | R25 | Φ25 + iπ/3 | Z25 | Arm, lysozyme |
| - | (6) | gp51 | - | - | - | Hub assembly, folate synthase |
| - | - | gp57 | - | - | - | Not in capsid, catalytic (n) |
| Specifications: | | 22 | 22+4 (b) | 22+4 (b) | 24 (k) + 4 (b) | |

C = 102 - 6 = 96
The total number of specifications for each coordinate is listed at the bottom of the ε, r, φ and z (or θ) columns in the tables. The total number for all four coordinates is the structural biocomplexity of the tail part of the virion, in terms of constituent gene products (proteins). The complexities which result are C = 21 for the tail fibers and C = 102 - 6 = 96 for the main tail structure (6 is subtracted for c' in eq. (1), i.e. for placement in the external world). The structural biocomplexity of the complete tail can thus be assigned the value of C = 117. This is about half of the complexity of the entire virion, head and neck structures included. The full details of these parts will be described elsewhere.

Structural complexity necessarily assumes larger values when evaluated in terms of lower hierarchical levels. The next lower level is that of the amino acids composing each protein. The total MW of the coded tail proteins is 1061 kD. If we take the average MW of an amino acid as 130, we obtain 8162 separate amino acids; each amino acid involves four specifications (there are few regularities on this level), which yields an overall minimal complexity of C = 8162 x 4 + 14 - 22 = 32640 (14 for multimers and helical arrangements; minus 22 for the 22 gene products, now respecified in terms of their component amino acids).
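The arithmetic behind these totals can be re-checked with a short Python sketch. The column totals are copied from the "Specifications" rows of Tables 1 and 2 and the remaining numbers from the text above; the variable names are chosen here for illustration only.

```python
# Column totals from the "Specifications" rows of Tables 1 and 2.
TAIL_FIBER_SPECS = {"epsilon": 4, "r": 4 + 3, "phi": 4 + 3, "theta": 0 + 3}
MAIN_TAIL_SPECS = {"epsilon": 22, "r": 22 + 4, "phi": 22 + 4, "z": 24 + 4}
C_PRIME = 6  # coordinates for placement in the external world, eq. (1)

c_fibers = sum(TAIL_FIBER_SPECS.values())
c_main_tail = sum(MAIN_TAIL_SPECS.values()) - C_PRIME
c_tail = c_fibers + c_main_tail

# Amino acid level: total tail protein mass over the mean residue mass.
TAIL_PROTEIN_MASS_DA = 1_061_000
MEAN_RESIDUE_MASS_DA = 130
n_residues = round(TAIL_PROTEIN_MASS_DA / MEAN_RESIDUE_MASS_DA)
# Four specifications per residue, +14 for multimers and helical
# arrangements, -22 for the gene products now respecified as residues.
c_amino_acid_level = n_residues * 4 + 14 - 22

print(c_fibers, c_main_tail, c_tail)   # 21 96 117
print(n_residues, c_amino_acid_level)  # 8162 32640
```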
Table 3. Complexity of the Tobacco Mosaic Virus (TMV) Shell

| No. | n (i=0,n-1) | ε (protein) | r | φ | z |
|---|---|---|---|---|---|
| 1 | 2100 | Coat protein | R1 | 2πi/16⅓ | 330 i/2100 |
| Specifications: | | 1 | 1 | 1 | 1 |

C = 4
Source: Caspar et al., Adv. Protein Chem. 18: 37-88.
Comment: In the present analysis proteins were considered as point elements; the analysis can also be performed with the proteins as rigid bodies. Three additional coordinates for the orientation of each of the bodies within the virion would be needed. For an analysis of the complete virion (including the RNA part) as rigid bodies see Table 6, Yagil, 1985. RNA has been left out here for comparison with the unfilled T4 structure.
4. The Complexity of the TMV Virion
The value of C = 117 is to be compared with the complexity of the TMV virion (Table 3). TMV has only a single protein component ("ε = coat protein") of n = 2100 units, arranged in 330 helical turns. The three helical parameters (R1, 2πi/16⅓ and 330i/2100) are sufficient to describe the position of each subunit within the structure. The structural complexity of the TMV coat, in terms of this single component, is therefore just C = 4. The complexity of the entire T4 is thus more than 50-fold higher than that of TMV. This large increase in complexity parallels the transition from a strictly self-assembling system to one which is heavily instruction-directed. The comparison of the two organisms illustrates the utility of complexity analysis in predicting the type of organization to be expected.

5. Remarks and Conclusion
A more detailed discussion of the assumptions made in this analysis can be found in the three previous publications cited. These publications also discuss the relation between complexity as used here and common thermodynamic and information-theoretic quantities. In brief, biocomplexity can be regarded as a zero-point entropy, which exists already at 0 K and does not contribute, by the third law, to the thermodynamically measured entropy of the system. The DNA and RNA of the viruses, really the heaviest contributors to the complexity of the two viruses (166000 nt and 6394 nt), have not been included in the present analysis; this has been done in order to highlight the structural contribution to biocomplexity. The contribution of the nucleotide base sequence is nevertheless decisive in determining both shape and function of the viruses. This is a major point, as the increase in structural and functional complexity during evolution can hardly be imagined without a parallel increase in template complexity (certain aspects of brain function may be an exception). It is the emergence of templates capable of specifying organizing instructions which confers on living entities their high complexities, with all their manifestations. The complexities reached by the genetic templates remain unmatched in the inanimate world.
The difference between the two simple semi-organisms thus highlights the limitation of self-organization in producing complex biological structures. The requirement for template-specified directions should by no means be regarded as a conceptual hurdle: all highly evolved organisms have a vast repertoire of DNA with ample space for as yet unaccounted-for functions (Hood et al., 1993). A search for novel organizational signals encoded in DNA, via small ORFs or otherwise, could be rewarding. The main object of this presentation is to illustrate, by way of an example, the relation between biocomplexity and morphogenetic elements, and to point to the potential of complexity analysis to shed light on the emergence of the intricate patterns of life.
References
1. Amos, L.A. and Klug, A. (1977). Three dimensional image reconstruction of the contractile tail of the T4 bacteriophage. J. Mol. Biol. 99, 51-73.
2. Berget, P. and King, J. (1983). T4 tail morphogenesis. In: "Bacteriophage T4" (C. Mathews, E. Kutter, G. Mosig and P. Berget, Eds.), ASM publications, Washington, pp. 246-258.
3. Black, L. and Showe, M. (1983). Morphogenesis of the T4 head. In: "Bacteriophage T4" (C. Mathews, E. Kutter, G. Mosig and P. Berget, Eds.), ASM publications, Washington, pp. 219-245.
4. Casjens, S. and Hendrix, R. (1988). Control mechanisms in dsDNA bacteriophage assembly. In: "The Bacteriophages", R. Calendar, Ed., Plenum Press, N.Y., pp. 15-92.
5. Crowther, R.A., Lenk, E.V., Kikuchi, Y. and King, J. (1977). Molecular reorganization in the hexagon to star transition of the baseplate of bacteriophage T4. J. Mol. Biol. 116, 489-623.
6. Eiserling, F. (1983). Structure of the T4 virion. In: "Bacteriophage T4" (C. Mathews, E. Kutter, G. Mosig and P. Berget, Eds.), ASM publications, Washington, pp. 1-24.
7. Fraenkel-Conrat, H. (1963). "Design and function on the threshold of life", Academic Press, N.Y.
8. Hood, L., Koop, B.F., Rowen, L. and Wang, K. (1993). Human and mouse T cell loci: The importance of comparative large-scale DNA sequence analyses. Cold Spring Harbor Symposia, 57, 339-348.
9. Kikuchi, Y. and King, J. (1975). Genetic control of bacteriophage T4 baseplate morphogenesis. I. Sequential assembly of the major precursor, in vivo and in vitro. J. Mol. Biol. 99, 645-716.
10. King, J. and Laemmli, U.K. (1973). Bacteriophage T4 tail assembly: structural proteins and their genetic identification. J. Mol. Biol. 75, 315-337.
11. Li, M. and Vitanyi, P. (1993). "An introduction to Kolmogorov complexity and its applications". Springer Verlag, New York, Inc.
12. Mosig, G. and Eiserling, F. (1988). Phage T4 structure and metabolism. In: "The Bacteriophages", R. Calendar, Ed., Vol. 2, pp. 521-606.
13. Wood, H.B. and Crowther, R.A. (1983). Long tail fibers: Genes, proteins, assembly and structure. In: "Bacteriophage T4" (C. Mathews, E. Kutter, G. Mosig and P. Berget, Eds.), ASM publications, Washington, pp. 259-269.
14. Yagil, G. (1985). On the structural complexity of simple biosystems. J. Theor. Biol. 112, 1-23.
15. Yagil, G. (1993a). Complexity analysis of a protein molecule. In: "Mathematics applied to Biology and medicine", J. Demongeot and V. Capasso, Eds., Wuerz publishing, pp. 305-313.
16. Yagil, G. (1993b). On the structural complexity of templated systems. In: "1992 lectures in complex systems", L. Nadel and D. Stein, Eds., The Santa Fe Institute and Addison-Wesley, N.Y.