Long W tracts are over-represented in the E. coli and
H. influenzae Genomes.
Benny Shomer and Gad Yagil*
The European Bioinformatics Institute, Hinxton, UK and the Dept. of Molecular Cell Biology, The Weizmann Institute of Science, Rehovot, Israel 76100.
*To whom correspondence should be addressed, at the WIS, Rehovot, Israel.
Tel 972-89-342-275
Fax 972-89-344-125
e-mail lcyagil@wiccmail.weizmann.ac.il
Keywords: W tracts; AT rich; E. coli; H. influenzae; Complete genomes; DNA unwinding
Comment: The symbol "Greater or Equal" is translated by Netscape as a "?" or some other odd character, please note when an odd character appears.
Abstract
The occurrence of binary DNA tracts of the three combinations: R.Y, K.M and W;S has been mapped in the complete genomes of H. influenzae and E. coli. A highly significant over-representation of W tracts is observed in both bacteria. The excess of W tracts is particularly striking in the 10% intercoding regions. Subdivision into divergent, convergent and sequential intercoding regions shows that the excess of W tracts is concentrated in divergent, i.e. promoter regions. A particularly high excess of W tracts is observed in the first 200 bases 5' upstream of coding start sites. The data suggest that W tracts have a role in promoter function. A function as unwinding centers, analogous to the role of R.Y tracts in eukaryotes, is proposed. R.Y and K.M tracts are only modestly over-represented in the two bacteria.
Introduction
Oligonucleotide tracts consisting of only two bases ("binary tracts") can be formed in three pairs: tracts made of purines on one strand and pyrimidines on the complementary one ("R.Y tracts"); tracts made of G,T on one strand and A,C on the other ("K.M tracts"); and the W;S pair, which consists of A,T tracts and of G,C tracts, each complementing itself. It has long been known that R.Y tracts are highly over-represented in higher eukaryotic DNA (1-6). More recently, we documented that R.Y tracts are also over-represented in a lower eukaryote, Saccharomyces cerevisiae (7).
The over-representation of R.Y tracts in eukaryotes was found to be particularly high in regulatory regions. Thus, in chromosome III of yeast, intercoding regions contain 32 times as many R.Y tracts longer than 15 nt as expected in uniform (random) DNA of the same composition. When only intercoding regions up to 200 bases upstream from a gene are considered, this excess increases to 46-fold (6)! This observation suggests that the excess of R.Y tracts may be connected to promoter and terminator functions.
In earlier work, Kowalski and coworkers (8,9) demonstrated that DNA in A,T rich yeast regions can easily unwind and serve as DNA Unwinding Elements (DUEs). Experimental work from our lab (10) showed that A,T rich elements are not the only potential DUEs. We found that two S. cerevisiae promoters containing long R.Y tracts (CYC1 and DED1) are attacked by single-strand-specific nucleases in the supercoiled but not in the linear state. These observations, supported by 2D topoisomer analysis, indicate that in yeast promoters R.Y tracts have a similar tendency to assume an unwound (paranemic) state. It thus seems that in yeast both W tracts and R.Y tracts can serve as DUEs, which supports the notion that these binary tracts can readily form unwinding elements.
Escherichia coli has long been known to be free of the excessive R.Y tracts present in the higher eukaryotes (1), while its promoters are rich in A,T tracts (11-13). Studies by Blattner, Kornberg, Kowalski and their colleagues (14-16) indicated that unwinding elements may play a regulatory role in bacteria, and that these elements are A,T rather than R.Y rich. Classical DNA melting theory actually suggests that A,T rich tracts are the first ones to unwind. Kowalski et al. proposed an algorithm to predict unwinding centers based on their A,T content (17,18). This approach has been expanded to include the effect of superhelicity (19). The availability of the complete sequences of E. coli and Haemophilus influenzae now makes it possible to map the occurrence of the binary tracts in the entire genome of these prokaryotic organisms.
In this paper we applied our previous program TRACTS, as well as the newly written programs TRACDIS, ANEX and DIVCON, to analyze the occurrence and distribution of long binary tracts in E. coli and H. influenzae. It is found that W tracts are in as large an excess in these two prokaryotes as are the R.Y tracts in eukaryotes. R.Y and K.M tracts are in only moderate excess. W tracts are thus the dominating binary theme in both bacteria. It is further shown that the over-representation of W tracts, and to some extent also of R.Y and K.M tracts, is particularly high in promoter regions. This observation strengthens the proposition that in prokaryotes W tracts serve as the principal unwinding elements and may thus play a crucial role in prokaryotic gene regulation.
Methods
Program TRACDIS, written in Python, generates a list of start and stop sites from annotation tables and permits plotting tract frequencies according to their distance from a nearby coding start or stop site. The program also provides links to the literature of the gene closest to each tract.
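The distance binning that TRACDIS performs can be illustrated with a minimal sketch. This is not the TRACDIS source: the function names, the 20 nt default bin and the choice to ignore strand (all starts are treated as if coded on the listed strand) are assumptions made for illustration only.

```python
from bisect import bisect_left
from collections import Counter

def distance_to_nearest_start(tract_pos, starts):
    """Signed distance (tract_pos - start) to the closest coding start.

    `starts` must be a non-empty list sorted in ascending order.
    Negative values mean the tract lies upstream of the start,
    assuming (for simplicity) a gene coded on the listed strand.
    """
    i = bisect_left(starts, tract_pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = starts[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda s: abs(tract_pos - s))
    return tract_pos - nearest

def bin_distances(tract_positions, starts, bin_size=20):
    """Histogram of signed distances, keyed by the lower edge of each bin."""
    counts = Counter()
    for pos in tract_positions:
        d = distance_to_nearest_start(pos, starts)
        counts[(d // bin_size) * bin_size] += 1  # floor to the bin's lower edge
    return dict(counts)
```

A full implementation would also have to handle genes on the complementary strand, where "upstream" lies at higher coordinates.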
Program ANEX, written in FORTRAN, parses the GenBank annotation file (flat file) of a gene or a whole organism, and generates a file of gene start and stop sites. The file also lists the designation, length and a 50 letter description of each gene.
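As an illustration of the kind of parsing ANEX performs, the following Python sketch extracts start and stop sites from simple CDS lines of a GenBank feature table. It is not the FORTRAN program itself; the function name `parse_cds`, the strand labels "w" and "c", and the decision to skip compound locations such as `join(...)` are assumptions of this sketch.

```python
import re

# Matches simple feature-table lines such as
#   "     CDS             190..255"
#   "     CDS             complement(5683..6459)"
# Compound locations (join, order) are deliberately not handled here.
CDS_LINE = re.compile(r"^\s{5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$")

def parse_cds(flatfile_text):
    """Return a list of (start, stop, strand) tuples for simple CDS features.

    `start` is the first coding position and `stop` the last one, so for
    genes on the complementary strand ("c") start > stop.
    """
    genes = []
    for line in flatfile_text.splitlines():
        m = CDS_LINE.match(line)
        if not m:
            continue
        lo, hi = int(m.group(2)), int(m.group(3))
        if m.group(1):                 # complement(...): opposite strand
            genes.append((hi, lo, "c"))
        else:                          # listed strand
            genes.append((lo, hi, "w"))
    return genes
```

The coordinates used in any test input are made up; a real annotation file also carries gene names and descriptions on qualifier lines, which this sketch ignores.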
Program TRACTS calculates and lists the frequencies of tracts of each length, and lists all tracts above a certain length. Version 6.1 of TRACTS (formerly PUR, 5) has been extended to calculate separate tract frequencies in coding and non-coding regions. This is performed by reading the output of ANEX and determining which bases are within ORFs and which are intercoding (mostly intergenic, but, as transcription start sites are presently mostly unavailable, 5' UTS are scored as intercoding). rRNA and tRNA regions are counted in both versions. Overlapping coding regions are identified.
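The core counting step of TRACTS, tallying how many tracts of each length occur for a given binary pair, can be sketched as follows. This is an illustrative reimplementation, not the TRACTS code: the names `BINARY_CLASSES` and `tract_length_counts` are hypothetical, and the coding/intercoding split is left out.

```python
from collections import Counter
from itertools import groupby

# The three binary pairs; each maps to the two base sets that define its tracts.
BINARY_CLASSES = {
    "W;S": ("AT", "GC"),   # weak / strong
    "R.Y": ("AG", "CT"),   # purine / pyrimidine
    "K.M": ("GT", "AC"),   # keto / amino
}

def tract_length_counts(seq, pair="W;S"):
    """Count tracts of each length for both members of a binary pair.

    Returns {base_set: Counter({length: number_of_tracts})}.
    Characters outside A, C, G, T (e.g. N) interrupt a tract.
    """
    first, second = BINARY_CLASSES[pair]

    def label(base):
        if base in first:
            return first
        if base in second:
            return second
        return None

    counts = {first: Counter(), second: Counter()}
    for cls, run in groupby(seq.upper(), key=label):
        if cls is not None:
            counts[cls][sum(1 for _ in run)] += 1
    return counts
```

For example, in the sequence AATTGCAAAT the W;S pair gives two W tracts of 4 nt and one S tract of 2 nt.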
Program DIVCON reads the lists of all tracts longer than a given length l, as generated by TRACTS, as well as the annotation data generated by ANEX, and assigns each intercoding region to one of four classes: divergent, convergent, sequential when coding on the GenBank listed strand ("www"), or sequential when coding on the complementary strand ("ccc"). DIVCON then calculates the number of tracts for each binary subclass and lists the cumulative number of bases in these tracts.
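The four-way classification can be expressed compactly: the class of an intercoding region follows from the strands of the two flanking genes. The sketch below is a hypothetical illustration of that rule, not the DIVCON program; the input layout (left coordinate, right coordinate, strand) and the function name are assumptions.

```python
def classify_intercoding(genes):
    """Classify the region between each pair of adjacent genes.

    `genes` is a list of (left, right, strand) tuples sorted by `left`,
    with left < right and strand "w" (listed strand) or "c"
    (complementary strand). Returns (region_start, region_end, class).
    """
    classes = {
        ("c", "w"): "divergent",   # both gene starts face the region
        ("w", "c"): "convergent",  # both gene ends face the region
        ("w", "w"): "www",         # sequential, listed strand
        ("c", "c"): "ccc",         # sequential, complementary strand
    }
    regions = []
    for (l1, r1, s1), (l2, r2, s2) in zip(genes, genes[1:]):
        if l2 > r1 + 1:            # skip overlapping or abutting genes
            regions.append((r1 + 1, l2 - 1, classes[(s1, s2)]))
    return regions
```

A gene on the complementary strand is transcribed toward lower coordinates, which is why a "c" gene on the left followed by a "w" gene on the right marks a divergent region.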
The frequencies of tracts equal to or longer than length l expected in randomized DNA (with fraction p of e.g. purines, p+q=1) are calculated by
$$ n(\geq l) = L\,(p\,q^{l} + q\,p^{l}) \qquad (1) $$
The number of bases in tracts ≥l, N(≥l), is:
$$ N(\geq l) = L\,\{(p + lq)\,p^{l} + (q + lp)\,q^{l}\} \qquad (2) $$
Controls: As control, two random DNA sequences of the length and composition of H. influenzae were generated, using IMSL routine GGUD. The average ratios of found over expected tracts were: For W;S: 0.98; 1.05. For K.M: 0.96; 0.97. For R.Y: 1.03; 0.99 (the ratio expected for randomized DNA is of course unity).
These averages are for tracts from 10 nt to the longest consecutive tract found in the randomized genome (19; 20 for R.Y and K.M; 25 for W;S); the randomized sequence had the same base composition as the studied sequence, e.g. p(W)=0.62 for H. influenzae.
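Equations (1) and (2) and the randomized control can be sketched in Python as follows. The function names are hypothetical, L is taken to be the sequence length, and Python's standard `random` module stands in for the IMSL routine GGUD used in the paper; equal frequencies of A and T (and of G and C) within each class are a simplifying assumption.

```python
import random

def expected_tracts(L, p, l):
    """Eq. (1): expected number of tracts of length >= l in random DNA."""
    q = 1.0 - p
    return L * (p * q**l + q * p**l)

def expected_bases(L, p, l):
    """Eq. (2): expected number of bases contained in tracts of length >= l."""
    q = 1.0 - p
    return L * ((p + l * q) * p**l + (q + l * p) * q**l)

def random_sequence(L, p_w, seed=0):
    """Random DNA of length L with W (A,T) fraction p_w.

    Uses Python's Mersenne Twister, not IMSL GGUD; A/T and G/C are
    drawn with equal probability within each class (an assumption).
    """
    rng = random.Random(seed)
    return "".join(
        rng.choice("AT") if rng.random() < p_w else rng.choice("GC")
        for _ in range(L)
    )
```

A quick consistency check: for l = 1 every base belongs to some tract, and Eq. (2) indeed reduces to N(≥1) = L, since (p + q)p + (q + p)q = 1.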
Results
The distribution of long W tracts between intercoding and coding regions of E. coli and H. influenzae is shown in Figs. 1A and 1B, respectively. The number of tracts of each length, from 12 nt upwards, in bins of 20 nt along the sequence, is plotted against the distance of that bin from the first coding position (ATG) of the closest gene. The different colors represent tracts of increasing lengths, in steps of 3, e.g. the red color represents tracts of 15-17 nt. A distinct peak is observed between positions -200 and -1, which makes it evident that a significant concentration of long W tracts (≥12) is present in the first 200-250 bases upstream of the first ATG, in both E. coli and H. influenzae. This peak is relative to a background frequency averaging, for tracts ≥15 nt (red color), 4 nt for E. coli or 8 nt for H. influenzae. The excess of long W tracts over the background is increasingly evident as tract size increases. However, statistics become less significant as tract size increases, as discussed in greater detail below. A similar but much less significant excess was observed for R and Y tracts (not shown).
Since no systematic data on transcription start sites are available for either bacterium, part of the intercoding regions can be transcribed and are not truly intergenic. Also, many of the intercoding regions, especially the shorter ones, reside within operons, and therefore are probably not transcription promoters. Salgado et al. (20) list 292 operons in E. coli. If we assume that each operon contains on average two intercoding regions then 20-25% of the intercoding regions are within operons. Transcribed regions may contribute to the somewhat lesser concentration between positions -100 and -1 relative to positions -200 to -100 in E. coli. Altogether, one can expect most of the long W tracts to occur in promoter regions. This is a strong indication that the long tracts may have a role in promoting transcription.
To obtain more quantitative information about the excessive W tracts, as well as of other binary tracts, program TRACTS was applied to the GenBank listed sequences of E. coli (22, U00096) and of H. influenzae (21, HIL42023). The results are shown in Tables 1A and 1B and are plotted in Figs. 2A and 2B. The number of bases in binary tracts of every length found in the genomes of E. coli and H. influenzae is listed in the Tables; the number expected in randomized DNA of the same composition is also shown (columns 3, 6, 9). Also listed are the ratios between these two values (columns 4, 7, 10); these ratios give a direct measure of the over-representation at each length, and are plotted in Fig. 2 against the respective tract lengths. W tracts of every length up to 30 nt are found in both E. coli (Table 1A) and H. influenzae (Table 1B). It is seen that in both bacteria, isolated W and S bases (l = 1) are under-represented (r = 0.89; 0.82), while W tracts of every length above 4 nt are over-represented to an increasing extent, up to enormous excesses for the longest tracts. Thus in E. coli, W tracts of l = 25 (100 bases, 4 tracts) are found at a 54-fold excess over the average number expected in random DNA.
The most over-represented binary pair is clearly W;S. A consideration of the full output of TRACTS shows, however, that only W tracts are involved; the longest S tract is a single 22 nt tract, while seven W tracts of that length are found. The longest W tract is of 30 nt, expected only 0.07/30 times in the entire E. coli genome of 4,639,221 nt, a 423-fold excess! The longest W tract expected in a random genome of that length is of 21 bases (24/21 = 1.14 tracts, see Table). The detailed outputs of TRACTS can be seen on the web site http://www.weizmann.ac.il/~lcyagil.
R.Y tracts of every length up to 22 nt are found in E. coli, at 3.62 times the number expected in random DNA. Two isolated tracts of 28 and 29 nt are also present. R.Y tracts up to 10 nt are actually under-represented (ratio below unity). As for K.M tracts, a moderate over-representation is observed (22-fold for the longest tract), but it is continuous from 5 nt up. In brief, a moderate excess of R.Y and K.M tracts is observed, much less pronounced than for the W tracts. W tracts are thus the dominant excessive binary motif in E. coli.
A similar situation is evident in H. influenzae (Table 1B, Fig. 2B). W tracts of every length up to 30 nt are found. The 30 nt W tract is only 6.8 times over-represented, due to the high A,T content of H. influenzae, 62%. In spite of this high A,T content, W tracts are continuously over-represented from 4 nt up. Up to 21 nt the over-representation of R.Y tracts is marginal. K.M tracts are in a continuous high excess, also up to 21 nt. Five extremely long K.M tracts, of 68-151 nt, are found, as often encountered in mammalian genomes. The composition of all these long tracts is (AACC)n on one strand and (GGTT)n on the other. They are thus true microsatellites and require a special explanation. We should emphasize that the great majority of the tracts mapped by TRACTS have no particular repetitive or other symmetric feature; most of them are composed of just any mixture of the two bases, as can be seen when all E. coli tracts ≥25 are inspected (Table 5). A more detailed analysis is planned.
Is over-representation evenly distributed over the genome, or is there a difference between coding and noncoding regions? Coding regions compose 89% of the E. coli genome and 87% of the H. influenzae genome. In Table 2A and Table 2B we see that W tracts ≥12 are slightly less over-represented in the coding regions than in the total genome (see ratios). However, in the intercoding regions (11% and 13% of the genomes) W tracts are over-represented to a much higher degree than in the whole genomes: tracts of 15 nt and longer (≥15) reach in E. coli a 17.63-fold excess over the value expected in uniform DNA (Table 2A). The over-representation in H. influenzae (6.38) is of a somewhat lesser magnitude, but still highly significant (Table 2B). The high excess of long W tracts in intercoding regions is a further indication that W binary tracts may have a regulatory function in the bacterial genome. The excess of W tracts is evident whether one examines tracts ≥12 nt or ≥15 nt (there are 497 such tracts in E. coli; Table 2 lists the number of bases in these tracts). K.M and R.Y tracts also show a significant excess in the intercoding regions of both bacteria. The excess of K.M tracts in H. influenzae (7.84 for K.M ≥15; 794 such tracts) is particularly notable.
Finally, in order to determine whether the excess of long tracts is connected to promoter function, the 4398 intercoding regions of E. coli, as well as the 1818 of H. influenzae, were divided into four subclasses: divergent intercoding regions, which are promoting in both directions (on opposite strands); convergent regions, which are terminating in both directions; and consecutive (sequential) ones, comprising "www" regions, which lie between two ORFs coded on the analyzed (GenBank listed) strand, and "ccc" regions, which lie between ORFs coded on the opposite strand. Consecutive regions can have both promoter and terminator elements. The division was done with program DIVCON, which parses the data in the gene list produced by ANEX. DIVCON assigns each tract to the proper class, and counts tracts, as well as bases within tracts, in each class in two ways, as follows.
The first way is to consider the entire intercoding region as a potential promoter or terminator region. The data in Table 3A show that the E. coli genome has 645 divergent and 645 convergent intercoding regions. While 27% of the divergent regions contain at least one W tract of ≥12 nt, only 9.5% of the convergent regions contain at least one of these longer tracts. Similarly low percentages (10.2; 9.9%) are observed for the www and ccc regions. Only 27% of all promoter regions include a long tract, but it should be borne in mind that many intercoding regions are quite short, often only a few bases, and often lie within operons, where no promoting features should be expected. If only tracts ≥15 are considered (next three rows), the excess in divergent regions becomes even more pronounced. However, the percentage of intercoding regions now having W tracts is smaller, indicating that a tract of length 12 nt, or possibly shorter (or incomplete), may already fill the functional role, whatever it may be. It should be added that the number of tracts expected in random DNA is in all subclasses about 20% of those found, so that we are dealing with excessive, subclass-specific tracts.
The second way to assess subclass distribution is to assume that promoting or terminating regions can extend into the preceding or following genes. To examine this possibility, tract frequencies 200 bases upstream from each ATG, whether extending into an upstream gene or not, as well as downstream from each terminating codon, were counted. The results are also shown in Tables 3A and 3B (rows 6-10 in each half table). In that case the percentage of convergent regions having tracts was somewhat increased (16%; 3.6% for ≥12 or ≥15 nt in E. coli), but was still significantly less than in the divergent regions.
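The fixed-window count described here can be sketched as follows. This is a hypothetical illustration rather than the DIVCON code: the function names, the tuple layouts and the rule that a tract counts when it overlaps the window (rather than lying wholly inside it) are assumptions.

```python
def upstream_window(start, strand, size=200):
    """Coordinates of the `size` bases 5' of a coding start.

    For a gene on the listed strand ("w") the window lies at lower
    coordinates; for the complementary strand ("c") at higher ones.
    """
    if strand == "w":
        return (start - size, start - 1)
    return (start + 1, start + size)

def fraction_with_tract(windows, tracts, min_len=12):
    """Fraction of windows overlapped by at least one tract >= min_len.

    `windows` is a list of (lo, hi); `tracts` is a list of (start, length).
    """
    if not windows:
        return 0.0
    long_tracts = [(s, s + n - 1) for s, n in tracts if n >= min_len]
    hits = sum(
        1
        for w_lo, w_hi in windows
        if any(t_lo <= w_hi and t_hi >= w_lo for t_lo, t_hi in long_tracts)
    )
    return hits / len(windows)
```

The same `fraction_with_tract` function applies to the first counting scheme if whole intercoding regions are passed in as the windows.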
As to H. influenzae (Table 3B), divergent regions have a very high percentage (56%) of W tracts. However, an even higher percentage of convergent regions also contain these tracts, so that the case in favor of promoters as an unwinding sink is less strong than for E. coli, but still significant, in particular when the ±200 nt range is considered. The data indicate, nevertheless, that long W tracts are present in terminator regions as well, often at the 3' end of the RNA, or just beyond, at the polyadenylation site. These regions are well known to contain W-rich elements, which have been proposed to control polyadenylation and mRNA stability (19,23).
Concerning R.Y tracts, it was previously noted (5) that both the lac operon and pBR322, an E. coli derived plasmid, tend to have their few R.Y tracts concentrated in regulatory regions. Divergent intercoding regions contain three times as many R.Y tracts as convergent ones (Table 4). With K.M tracts, divergent regions have nearly twice as many long tracts as convergent ones when examining all intercoding bases, but not beyond. It may be summarized that a certain amount of excessive R.Y and K.M tracts are present in E. coli promoters and also in terminators, but the significance is less obvious than with W tracts. H. influenzae also shows a certain excess of R.Y and K.M tracts, mainly in the divergent regions (not shown). Many promoters contain more than one binary tract, e.g. the ilv promoter (24), which has a W18 tract at -110 and an R11 tract at -155 from the first codon.
Discussion
The three main findings described are:
1) A very high over-representation of long W (A,T) tracts occurs in the E. coli and H. influenzae genomes, as compared with random DNA.
2) W tract over-representation is particularly high in promoter regions and, to a certain extent, in terminator and other intercoding regions.
3) A high fraction of all promoter regions contain one or several binary tracts.
What could the function of these W tracts be? If the excessive W tracts had no function, they would have been eliminated in the evolutionary process. Previous work on eukaryotic genomes, computational and experimental, has suggested that the binary tracts may serve as DNA unwinding centers in both transcription and replication control. The seminal study in this direction was by Larsen and Weintraub (25), who detected single-strand-specific DNA cleavage in active chick globin promoters. Many other susceptible promoter regions have been detected since, the theme common to most of them being the binary homopurine.homopyrimidine theme, although other binary themes do occur (summarized in ref. 26).
The two bacteria studied here differ significantly from the eukaryotic genomes previously studied by us and others (3,4,9,10,25). In higher eukaryotic genomes, R.Y tracts were found to be the dominating binary theme, while W and S tracts were at a marginal excess at most. Yeast occupied an intermediate position, with all three binary motifs (except S) being in a large excess (6). As to Archaea, the data for M. jannaschii at least (unpub.) behave like eukaryotes rather than like the prokaryotes described here. Further organisms will have to be analyzed to verify the conclusion that an excess of W tracts characterizes prokaryotes in general.
May the W tracts serve as unwinding centers as well? A,T rich regions are well known to be the most readily melting form of DNA. Evidence in favor of melting of W tracts as a factor in gene activation in both E. coli and yeast exists: susceptibility to cleavage by single-strand-specific nucleases showed that A,T rich regions in E. coli associated elements (phage lambda and pBR322) can serve as DNA unwinding elements (14,25). Studies concerning the oriC replication origin of E. coli (15,16) led to the same conclusion. oriC unwinding occurs in preparation for replication, a second major cellular process requiring a certain degree of DNA unwinding. In yeast, Umek and Kowalski (9,17,18) have demonstrated that A,T rich regions tend to form DNA unwinding elements (DUEs) in autonomous replication sequences (ARS) and in several yeast gene promoter regions.
The propensity of W tracts to unwind in the two bacteria would thus be the parallel of the propensity of R.Y tracts to unwind in higher eukaryotes (25, reviewed in 26). As seen in Table 4, in E. coli the R.Y tracts are in a certain, yet small, excess, a situation paralleling that of W tracts in the higher eukaryotes. As for K.M tracts (A,C.G,T), these show almost random ratios in E. coli (found over expected ratio ≈ 1), but are systematically over-represented in H. influenzae. This raises the possibility that all DNA sequences made of only two bases have a propensity to unwind into a paranemic state (10,26).
A structural basis for a propensity of binary tracts to unwind is not clear in general, but in the case of W tracts it is in line with classical melting theory (30,31), which leads us to expect the W tracts to separate readily. Recent procedures to include the effect of supercoiling (19,28) strengthen that view, and the presence of W-rich unwinding centers in certain bacterial promoters, such as the ilv promoter, has been experimentally documented (24). A structural basis for ready melting of R.Y tracts is less obvious and their melting under supercoiling tension deserves further investigation. It should be added that other functions have been proposed for A,T rich elements, including signals that control mRNA degradation or polyadenylation (when at the 3' end of the gene, 23), nuclear matrix attachment sites (MARs/SARs, 28) or even preferred nucleosome attachment sites (4). These possibilities may explain some of the observed excessive tracts. The preponderance of the tracts in divergent (promoter) regions (Table 3) speaks nevertheless in favor of DUE activity as a major function of the W tracts.
Margalit et al. (32) found no region of particular helix instability in E. coli promoters beyond the -35 to -10 sites. The explanation could be that the W tracts need not reside at a particular distance from the origin. Application of TRACTS to the first 75 bases upstream (database of ref. 29, results not shown) shows an abundance of longer W tracts in many of the well studied genes of E. coli; this abundance is thus not a special feature of a functionally yet unidentified set of ORFs. An unwinding region need not be located at an exact site or orientation. A linking deficiency may be formed in a remote location, far from the initiation site proper, and be transiently stabilized by single-strand-specific proteins. Upon a proper signal the linking deficiency can first be transformed into negative superhelicity distributed along an entire constrained chromosome loop and finally, upon arrival of a second set of factors, reconcentrate at the transcription/replication initiation site and permit unwinding where needed for entry of the copying machinery. Thus, in the lac operon, one W tract of 17 nt is found at the very end of the operon, i.e. at the termination of the lacA gene. This raises the possibility that a torsional sink may exist also at the end of a transcribing unit, remigrating to the initiation site by the supercoiling/decoiling mechanism just mentioned. All these inferences can be readily put to experimental examination. A regulatory role associated with unwinding sites may open new avenues for expression control mechanisms.
Acknowledgments
This work was initiated at the European Bioinformatics Institute, Hinxton, UK, when G.Y. was the recipient of a visiting Fellowship. The authors are indebted to Drs. M. Ashburner and C. Sander for hospitality, and to many members of the EBI for helpful discussions and assistance. We thank Dr. E. Yagil for his comments on the manuscript.
Legend to Figures.
Fig. 1. The number of long W tracts plotted according to their distance x (in bins of 20, from x to x-20) from the first translated nucleotide (A when ATG). The length of the tracts is color coded, as listed on the plot.
A. E. coli . B. H. influenzae .
Fig. 2. The log ratio of bases found in W tracts of a particular length over the expected base number, plotted against tract length. A. E. coli B. H. influenzae .
References:
1. Chargaff, E. (1963) Essays in Nucleic Acids. Elsevier, Amsterdam 1: 126ff.
2. Birnboim, H.C., Sederoff, R.R. and Paterson, M.C. (1979) Eur. J. Biochem. 98: 301-307.
3. Behe, M.J. (1987) Biochemistry 26: 7870-7875.
4. Behe, M.J. (1995) Nucl. Acid Res. 23: 689-695.
5. Bucher, P. and Yagil, G. (1991) DNA Sequence 1: 27-43.
6. Yagil, G. (1993) J. Mol. Evol. 37: 123-130.
7. Yagil, G. (1994) Yeast 10: 603-611.
8. Umek, R.M., Eddy, M.J. and Kowalski, D. (1988) Cancer Cells 6: 473-478.
9. Umek, R.M. and Kowalski, D. (1990) Nucl. Acid Res. 18: 6601-6605.
10. Yagil, G., Shimron, F. and Tal, M. (1998) Gene 225: 152-163.
11. Nussinov, R. (1980) J. Theor. Biol. 85: 285-291.
12. Hawley, D.K. and McClure, W.R. (1983) Nucl. Acid Res. 11: 2237-2255.
13. Burge, C., Campbell, A.M. and Karlin, S. (1993) Proc. Natl. Acad. Sci. (USA) 89: 1358-1362.
14. Schnos, M., Zahn, K., Inman, R.B. and Blattner, F.R. (1988) Cell 52: 385-395.
15. Bramhill, D. and Kornberg, A. (1988) Cell 5: 915-917.
16. Kowalski, D. and Eddy, M.J. (1989) EMBO J. 8: 4335-4339.
17. Kowalski, D., Natale, D.R. and Eddy, M.J. (1988) Proc. Natl. Acad. Sci. (USA) 85: 9464-9468.
18. Natale, D.R., Umek, R.M. and Kowalski, D. (1993) Nucl. Acid Res. 21: 555-560.
19. Benham, C.J. (1993) Proc. Natl. Acad. Sci. (USA) 90: 2999-3003.
20. Salgado, H., Santos, A., Garza-Ramos, U., van Helden, J., Diaz, E. and Collado-Vides, J. (1999) Nucl. Acid Res. 27: 59-60.
21. Fleischmann, R.D. and 39 others (1995) Science 269: 504-512.
22. Blattner, F.R. et al. (1997) Science 277: 1453-1462.
23. Zubiaga, A.M., Belasco, J.G. and Greenberger, M.E. (1995) Mol. Cell. Biol. 15: 2219-2230.
24. Sheridan, S.D., Benham, C.J. and Hatfield, G.W. (1998) J. Biol. Chem. 273: 21298-21308.
25. Larsen, A. and Weintraub, H. (1982) Cell 29: 609-616.
26. Yagil, G. (1991) Crit. Revs. Biochem. Mol. Biol. 26: 475-559.
27. Shapiro, H.S., Rudner, R., Miura, K.-I. and Chargaff, E. (1965) Nature 205: 1068-1070.
28. Sheflin, L.G. and Kowalski, D. (1985) Nucl. Acid Res. 13: 6137-6153.
29. Orenstein, R.L. and Fresco, J.R. (1983) Biopolymers 22: 1979-2000.
30. Breslauer, K.J., Frank, R., Blocker, H. and Marky, L.A. (1986) Proc. Natl. Acad. Sci. (USA) 83: 3748-3750.
31. Benham, C., Kohwi-Shigematsu, T. and Bode, J. (1997) J. Mol. Biol. 274: 181-196.
33. Lisser, S. and Margalit, H. (1993) Nucl. Acid Res. 21: 1507-1516.