By: Gad Yagil
Dept. of Cell Biology, The Weizmann Institute of Science,
Rehovot, Israel 76100
lcyagil@wiccmail.weizmann.ac.il
Fax: 00-972-8-344125
Telephone: 00-972-8-342775
Program TRACTS is employed to map the occurrence of base tracts composed of only two bases in S. cerevisiae chromosome III. The observed frequencies are compared with those expected in random DNA. A vast excess of long base tracts of all three possible two-base combinations, namely purine.pyrimidine (R.Y), keto.imino (K.M) and weak;strong (S;W, mainly A,T rich), is documented. This excess places yeast in the same category as other eukaryotic and organelle genomes analyzed. The excess of two-base tracts is considerably larger in the noncoding 1/3 of the chromosome, in particular proximal to coding initiation and termination sites. A functional role for the excessive tracts, possibly as unwinding centers of particular genes, is proposed. Multiple occurrence of long two-base tracts within an ORF is offered as another diagnostic to determine whether the ORF, or a subregion of it, is actually translated.
One of the more prominent deviations of DNA composition from randomness observed so far is the extremely high frequency of base tracts composed of only two bases. This high frequency was observed very early for oligopyrimidine.oligopurine (R.Y) tracts by Chargaff and collaborators (Tamm et al., 1952; Shapiro et al., 1965) and further elaborated by several authors since (Case and Baker, 1975; Birnboim et al., 1976; 1979; Behe, 1987; Bucher and Yagil, 1991). More recently, we have been able to document a similar excess of the other two-"letter" sequences (Karlin and Ghandour, 1985), namely of K.M tracts (oligo(G,T).oligo(A,C) tracts), as well as a modest excess of S;W tracts, mainly oligo(A,T) (Yagil, 1993). The excess of all three possible two-base tracts is most pronounced in vertebrates, but is found in many genes of plants, invertebrates and vertebrates, their cellular organelles (generally thought to be of prokaryotic origin) and their viruses as well. In prokaryotes, only a barely significant excess of R.Y tracts is observed (Chargaff, 1963; Bucher and Yagil, 1991).
The situation in yeast is less clear. While the analysis of the 2μ plasmid and several coding genes revealed a situation similar to prokaryotes, i.e. no particular excess of R.Y tracts, other gene regions, like the CYC1 promoter (McNeill and Smith, 1986) or the centromere region (Ng and Carbon, 1987), contain several long R.Y or A,T tracts. With the availability of the first complete sequence of an entire yeast chromosome (Oliver et al., 1992), the frequency of the three types of two-base tracts (R.Y, K.M and S;W) in coding and noncoding regions of the yeast genome can be systematically evaluated. This is done here by applying the previously written program TRACTS to the published sequence of the chromosome. The results demonstrate that two-base tracts are as extensively overrepresented in the yeast chromosome as they are in the other eukaryotic genomes analyzed. Overrepresentation is found to be particularly high in intercoding regions, especially in the vicinity of translation start or termination sites. One-letter tracts, mainly oligo A and oligo T, and strictly alternating A-T tracts have recently been mapped by Karlin et al. (1993).
Results:
In Table 1 the frequency of R.Y tracts of every length found in yeast chromosome III is recorded. Table 1 is an abbreviated output of the previously described program TRACTS (Yagil, 1993). Oligopyrimidine (Y) tracts can be seen to be generally as frequent as oligopurine (R) tracts (columns 2 and 3). The data are for the strand listed in the EMBL database. Every tract length up to 26 bases is represented in the chromosome; additional tracts, up to 36 purines in a row, are present. In column 5 the number of bases expected in randomized DNA of the same length and base composition as chromosome III (315,357 nt, 50% A,G) is shown. The expected frequencies e of bases in tracts of length l are calculated by the simple expression $e = l\,(p^{l} q^{2} + p^{2} q^{l})\,L$ (Eqs. (1),(4) of Bucher and Yagil, 1991), where p is the fraction of one of the two base classes in the analyzed sequence, q that of the other class (p+q=1), and L is the length of the DNA analyzed. Here $p^{l} q^{2}$ is the probability that a tract of exactly l bases of the first class, flanked on both sides by bases of the second class, begins at a given position; the second term is the same quantity for the other class. This formula has been repeatedly verified by running many randomized DNA sequences through program TRACTS (see previous publications).
The difference between the found and expected bases (column 6) makes it clear that very short tracts, i.e. isolated purines or pyrimidines, their doublets, triplets and quadruplets (l=1-4), are highly underrepresented in the chromosome. On the other hand, tracts of 5 bases and longer are overrepresented to a rapidly increasing extent. This is most evident when the ratio r of bases found to bases expected, listed in the last column, is examined. For example, 12 R.Y tracts of 18 nt (216 bases) are observed, while only 0.6 tracts (10.8 bases) of that length are expected in a random DNA of 315,357 bases. This means that 18-base tracts are overrepresented by the factor of 20 listed in the last column. The overrepresentation increases continuously with tract length and reaches very high values for tracts of 25-36 nt.
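The expected-frequency expression and the ratio r can be checked numerically. The following is a minimal sketch, not the TRACTS program itself; the function and variable names are illustrative assumptions, while the inputs (p = 0.5, L = 315,357, and 12 observed 18-nt tracts) are taken from the text and Table 1.

```python
def expected_bases(l: int, p: float, L: int) -> float:
    """Expected number of bases in tracts of exactly length l in random DNA.

    Implements e = l * (p**l * q**2 + p**2 * q**l) * L with q = 1 - p.
    """
    q = 1.0 - p
    return l * (p**l * q**2 + p**2 * q**l) * L


L_CHR3 = 315_357   # length of chromosome III (from the text)
P_PURINE = 0.5     # fraction of A,G (from Table 1)

# Expected bases for a few tract lengths.
for l in (1, 2, 3, 18):
    print(l, round(expected_bases(l, P_PURINE, L_CHR3), 1))

# Ratio r = found / expected for the 18-nt example: 12 tracts of 18 nt.
found_18 = 12 * 18
ratio_18 = found_18 / expected_bases(18, P_PURINE, L_CHR3)
print(round(ratio_18))
```

With these inputs the sketch gives about 78,839 expected bases for l = 1 and about 10.8 for l = 18, in line with the values quoted for Table 1.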
A similar situation is observed for K.M tracts (Table 2): every tract length up to 21 bases is represented and tracts up to 37 nt are observed. In addition, a single huge A,C tract of 363 nt is present, representing the one telomere region sequenced; yeast telomeres are well known to terminate with a long repetitive (CA)n.CCC.(GT)nGGG tract, which may also assume a special structure (Zakian, 1989). Again, isolated K (G or T) and M (C or A) bases, their doublets and triplets (l=1-3) are significantly underrepresented and compensate for the excess of long tracts. The huge excess of the longer K.M tracts is evident when compared with the number expected in a randomized DNA (last column). Looking again at 18 nt tracts, 3 K and 3 M tracts are observed, 108 nt together, while only 11.2 nt are expected, i.e. a 9.6-fold excess.
Overrepresentation of two-base tracts is also apparent for the S;W family, but to a less significant extent (Table 3). Overrepresentation begins only at the 15 nt level, but then continues up to 24 nt; tracts up to 32 nt are present. The high ratios are mainly for W (A,T) tracts because of the high A,T content of the yeast chromosome (61.5%); no S tracts as long as 16 nt are expected or present.
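The tract counts reported for the three families can be illustrated by a run-length tally over a two-class relabeling of the sequence. This is a minimal sketch under stated assumptions, not the FORTRAN program TRACTS: the names `CLASSES` and `count_tracts` and the toy sequence are hypothetical, and the input is assumed to contain only A, C, G and T.

```python
from collections import Counter
from itertools import groupby

# Two-class alphabets for the three tract families discussed in the text.
CLASSES = {
    "RY": {"A": "R", "G": "R", "C": "Y", "T": "Y"},  # purine / pyrimidine
    "KM": {"G": "K", "T": "K", "A": "M", "C": "M"},  # G,T / A,C
    "SW": {"A": "W", "T": "W", "G": "S", "C": "S"},  # A,T / G,C
}


def count_tracts(seq: str, family: str) -> Counter:
    """Count maximal single-class runs, keyed by (class letter, run length)."""
    mapping = CLASSES[family]
    labels = (mapping[base] for base in seq.upper())
    counts = Counter()
    for letter, run in groupby(labels):
        counts[(letter, sum(1 for _ in run))] += 1
    return counts


toy = "AAGGACTTCTAG"  # hypothetical 12-nt sequence, for illustration only
ry = count_tracts(toy, "RY")
km = count_tracts(toy, "KM")
sw = count_tracts(toy, "SW")

# Bases found in tracts of a given length: f = (count of both classes) * l.
bases_in_5nt_ry = 5 * (ry[("R", 5)] + ry[("Y", 5)])
```

In the toy sequence the R.Y relabeling yields one R tract of 5 nt, one Y tract of 5 nt and one R tract of 2 nt, so the tallied bases add up to the sequence length, as they must for each family.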
The availability of a complete chromosome sequence makes it possible, for the first time, to compare compositional parameters of coding with noncoding gene regions. In the published sequence, 104,662 nt out of the total 315,357 are outside known or surmised open reading frames (ORFs), i.e. close to 1/3 of the whole chromosome is noncoding. It is not known exactly which parts of the noncoding regions are actually transcribed (Yoshikawa and Isono, 1990); the noncoding regions are at least partly transcribed. In Table 4 the density of all two-base tracts of 15 nt and longer in coding and noncoding regions is compared. It is evident that the overrepresented long tracts of all three classes are not equally distributed, but tend to concentrate in the 1/3 of the chromosome that is noncoding. Thus R.Y tracts of 15 bases and longer (145 of these tracts are found, while only 9.6 tracts are expected by equation (1) of Bucher and Yagil) are present at an overall 17-fold excess, but this excess increases to 32-fold in the noncoding regions. There is still a large overrepresentation in the coding regions (x9.7), and we shall consider its significance in the Discussion.
The excess of two-base tracts is even more pronounced when the distribution within the noncoding regions is examined: 48,997 noncoding bases (47%) are within ±200 nt of the boundary of a coding region, i.e. of a translation start or termination site. These proximal regions contain 1100 bases within long (≥15 nt) R.Y tracts (Table 4). In other words, R.Y tracts tend to concentrate in promoter/termination regions. Intercoding K.M and W tracts are also concentrated within ±200 bases of coding regions. These regions are often composed predominantly of two-base tracts. One particular example is the divergent initiation (promoter?) region between ORFs R59c and R60w (168 nt), containing a 30 nt Y tract, a 34 nt A,T tract, a 13 nt G,T tract, a 13 nt R tract, and several shorter ones. Another is the convergent termination region between ORFs L17w (LEU2) and L18c. On the other hand, long intergenic regions like the centromere region (3284 bases), or the region between ORFs R95c and R96c (287,992 - 291,912; 3921 bases), contain only few R.Y and K.M tracts. The concentration of the long two-base tracts near gene initiation sites is highly suggestive of a major role for these tracts in gene control.
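The partition of the chromosome into coding, proximal noncoding (within ±200 nt of a coding boundary) and remaining noncoding bases can be sketched as follows. This is only an illustration of the bookkeeping, assuming ORFs are given as 1-based inclusive coordinate pairs; the function name and the toy coordinates are hypothetical and are not taken from the chromosome III annotation.

```python
def classify_positions(length, orfs, window=200):
    """Return (coding, proximal noncoding, distal noncoding) base counts.

    orfs   : list of (start, end) pairs, 1-based inclusive (assumed format).
    window : distance from a coding boundary that counts as proximal.
    """
    coding = [False] * length
    near = [False] * length
    for start, end in orfs:
        # Mark the coding bases of this ORF (0-based indices).
        for i in range(start - 1, end):
            coding[i] = True
        # Mark the flanks upstream of the start and downstream of the end.
        for lo, hi in ((start - 1 - window, start - 1), (end, end + window)):
            for i in range(max(lo, 0), min(hi, length)):
                near[i] = True
    n_coding = sum(coding)
    n_proximal = sum(1 for c, n in zip(coding, near) if n and not c)
    return n_coding, n_proximal, length - n_coding - n_proximal


# Hypothetical toy layout: a 2000-nt sequence with two 300-nt ORFs.
toy_counts = classify_positions(2000, [(501, 800), (1001, 1300)], window=200)
```

Flanks that fall inside a neighboring ORF are counted as coding rather than proximal, and overlapping flanks of adjacent ORFs are counted once. Whether the original analysis treated such cases the same way is not stated in the text.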
In Table 5 all R.Y tracts 20 nt or longer are listed. Oligo R and oligo Y tracts are nearly equally represented. To what extent can these tracts be considered repetitions of simpler motifs? The 38 tracts listed in the table have been classified according to whether a repetitive simple motif can be discerned in the tract. A tract is scored as "rand" if no repetitive motif is evident. A tract is scored as 1/2 "rand" when the repetitive motif listed covers 33-66% of the tract. It can be seen that the majority of the tracts (22/38) show no obvious repeated motif. It cannot therefore be stated that R.Y overrepresentation is due to repetition of simple, or cryptically simple (Tautz et al., 1986; Tautz, 1989), sequence motifs, such as those found as "microsatellites" in many genes (Beckman and Weber, 1992). Overrepresentation is not eliminated when strictly repetitive tracts are excluded: even if all 16 "simple" or semisimple sequences were absent, R.Y tracts would still be in a 22/38 x 17 = 9.8-fold excess over random DNA. Overrepresentation thus seems to be a property of any mixture of A's and G's, and not of a particularly ordered subclass. Detailed examination of the longer tracts shows that a similar situation exists for W tracts: of the 40 tracts ≥20 nt, 20.5 can be classified as simple and 19.5 as random. As for K.M tracts, only 2 out of 14 tracts ≥20 nt are fully random, but the regular, simple subsequences contain a high proportion of long An or Tn stretches (data not shown). Among the simple tracts, 10 have (AT)n subsequences with n≥5, far above the expected frequency. Karlin et al. (1993) have recently shown, using an r-scan statistical procedure, that these An, Tn and (AT)n subsequences are evenly distributed along the chromosome. To sum up this section, if there is an essential role for the excessive tracts, it is not due to a particular base sequence, nor is it due to a particular two-base combination.
The data presented demonstrate that the three two-base tract families (R.Y, K.M and S;W) are as highly overrepresented in the first sequenced yeast chromosome as in most other eukaryotic genomes (Yagil, 1993). This is of particular interest for two reasons:
1) In a previous publication (Bucher and Yagil, 1991) it was suggested that yeast may resemble prokaryotes, in that R.Y tracts are only modestly overrepresented, if at all (cf. Chargaff, 1963). The larger database analyzed here shows that the excess of two-base tracts can be as large in yeast as in plants and other higher eukaryotes.
2) The sequencing of the entire chromosome makes it possible for the first time to make statements on tract frequencies in intercoding regions. It may therefore be of general significance that two-base tract overrepresentation is much higher in noncoding regions than in strictly coding regions (Table 4).
The somewhat reduced occurrence of long two-base tracts in coding regions (still far above random DNA, Table 5) may partly be due to the restrictions imposed on protein composition by two-letter codons. A long R tract can, for instance, code for only Gly (GGG, GGA), Lys (AAA, AAG), Arg (AGG, AGA) and Glu (GAG, GAA). Thus the 20 nt R tract found within ORF R13c, G.GGGAAAGAG.AAAAGAAAAAA.A, codes for Gly.Lys.Glu.Lys.Arg.Lys.Lys, which may well be a nuclear localization signal. In other cases, the coding for so many charged amino acids may be a burden, e.g. the 36 nt R.Y tract at 233616 is within ORF R67c and codes for Ala.(Ser)10.Phe, an unlikely composition. The overrepresentation of R.Y tracts in coding regions therefore points to some general role for these tracts.
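The restriction that a two-letter alphabet places on codons can be enumerated directly from the standard genetic code. The sketch below is illustrative only: the helper names are assumptions, the standard code is used throughout, and the reading frame for the R13c tract (skipping the first G) is chosen simply because it reproduces the peptide quoted in the text.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def residues_for(alphabet: str) -> dict:
    """Map each amino acid to the codons built only from bases in `alphabet`."""
    out = {}
    for codon in map("".join, product(alphabet, repeat=3)):
        out.setdefault(CODON_TABLE[codon], []).append(codon)
    return out


def translate(seq: str) -> str:
    """Translate complete codons of `seq` in frame 0 (one-letter code)."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


purine_only = residues_for("AG")      # codons made of A and G only
pyrimidine_only = residues_for("CT")  # codons made of C and T only

# R tract from ORF R13c as printed in the text, with the dots removed.
r13c_tract = "GGGGAAAGAGAAAAGAAAAAAA"
peptide = translate(r13c_tract[1:])   # assumed frame: skip the first G
```

Purine-only codons give only Gly, Lys, Arg and Glu, as stated above, and pyrimidine-only codons give only Phe, Ser, Leu and Pro. In the assumed frame the R13c tract reads GKEKRKK, i.e. Gly.Lys.Glu.Lys.Arg.Lys.Lys.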
An example of a long K.M tract within a coding region is the 39 nt M tract, with one 2 nt interruption, within ORF R93w, identified as coding for the regulatory protein CDC39 (Collart and Struhl, 1993). This M tract codes for glutamine (CAA). The 11 glutamines coded for are part of one of two glutamine-rich regions of the CDC39 protein. The region contains 21 glutamines, 18 of them coded by CAA (incl. the above 11) and only 3 by CAG. The average codon usage ratio for glutamine in yeast is 10/29. The unusually high proportion of CAA raises the question whether the high concentration of CAA is due to a requirement for Gln, or whether the presence of many Gln residues may not rather be due to a requirement for a K.M tract (a second Gln-rich coding region has nevertheless only 11/18 CAA).
Two-base tracts may serve as a diagnostic for untranslated regions. For instance, ORF L7w contains within its 393 nt one 38 (-1) nt Y tract, one 17 nt Y tract, followed by a 22 nt K tract, another 12 nt K tract, a 10 nt W tract, and several shorter ones. These would produce a rather restricted protein sequence, so that L7w is not likely to be translated. L10c, L58w and R16w are further examples.
The high concentration of the two-base tracts in the noncoding regions, particularly within 200 bases of translation initiation or termination sites (Table 4), is strongly indicative of a special function for these tracts. The extent to which these proximal regions are actually transcribed is so far only partly established (Yoshikawa and Isono, 1990), so that a role on the RNA level, possibly by triplex formation (Maher et al., 1992), cannot be excluded. It seems, however, that a reasonable role to consider is that two-base tracts can serve as DNA unwinding elements in preparation for DNA replication, transcription and other template-directed processes (Kowalski and Eddy, 1990; Palecek, 1991; Yagil, 1991). Umek, Kowalski and coworkers have shown that several important yeast regulatory gene regions which are rich in A,T are susceptible to single-strand DNA cleaving enzymes, like mung bean nuclease and P1 (Umek and Kowalski, 1988; Natale et al., 1993). In order to be cleaved, these sequences must be partly unwound, and they have been termed DNA unwinding elements (DUE). In recent experiments (Yagil, Tal and Shimron, 1993) we were able to show that the A,T rich elements of the centromere of chromosome IV (CEN4) are highly susceptible to the single-strand specific nuclease P1 and to the conformation-specific reagent KMnO4, in support of an unwinding function for the W tracts involved. As to R.Y tracts, evidence from numerous laboratories has shown that DNA in many eukaryotic genes, in intact nuclei as well as in supercoiled plasmids, is also susceptible to cleavage by single-strand specific nucleases, like S1 and P1 nuclease (Larsen and Weintraub, 1982; reviewed in Wells et al., 1988; Yagil, 1991). This suggests that R.Y tracts, and possibly also K.M tracts, can also act as unwinding elements. This is somewhat counterintuitive, because numerous studies show that A,T (W) rich sequences are the most readily melting ones and therefore the most likely to be in a strand-separated state.
In chromosome III, R.Y and K.M tracts are nevertheless as excessive as W tracts. It is therefore possible that in vivo conditions, incl. negative supercoiling, may confer early strand separation on these sequences as well. Preliminary experiments indeed reveal a P1 nuclease sensitive site which maps within a highly purine-rich tract of the yeast CYC1 promoter, in support of an unwinding role for that region (Yagil, Tal and Shimron, 1993). These experiments encourage the further exploration of an unwinding role for all excessive two-base tracts.
Table 1. The frequency of R.Y tracts in yeast chromosome III
          Sequences           Bases
Length    Pyr      Pur        Found       Expected   Difference   Ratio
l         Y        R          f=(Y+R)l    e          f-e          f/e
1         35978    36129      72107       78839      -6732        0.91
2         18384    18369      73506       78839      -5333        0.93
3         8575     8330       50715       59129      -8414        0.86
4         4682     4762       37776       39419      -1643        0.96
5         2787     2759       27730       24637      3093         1.13
6         1446     1504       17700       14782      2918         1.20
7         802      839        11487       8623       2863         1.33
8         516      501        8136        4927       3208         1.65
9         278      257        4815        2771       2043         1.74
10        169      169        3380        1539       1840         2.20
11        90       97         2057        846        1210         2.43
12        66       65         1572        461        1110         3.40
13        38       37         975         250        724          3.90
14        23       30         742         134        607          5.51
15        20       14         510         72.2       437.8        7.07
16        15       14         464         38.5       425.5        12.1
17        11       13         408         20.4       387.5        20
18        8        4          216         10.8       205.2        20
19        5        3          152         5.7        146.3        27
20        2        7          180         3.0        177.0        60
21        2        2          84          1.6        82.4         53
22        5        1          132         0.83       131.2        159
23        1        2          69          0.43       68.6         159
24        2        0          48          0.23       47.8         213
25        3        1          100         0.12       99.9         851
26        3        0          78          0.06       78           1276
29        0        2          58          0.01       58           6808
30        2        0          60          ≤0.01      60           13615
31        0        1          31          ≤0.01      31           13615
33        1        0          33          ≤0.01      33           54458
36        0        1          36          ≤0.01      36           435632
SUM: 147827 tracts 315357 bases
%A,G = 0.500
The numbers listed are the output of FORTRAN program TRACTS. The input was EMBL entry NUC:X59720, S. cerevisiae chromosome III, complete DNA sequence (Oliver et al., 1992). TRACTS is described as an earlier version, PUR, in Bucher and Yagil, 1992. VMS and VM/CMS versions are available from the author.
a. For calculation of e, the number of bases expected in randomized DNA, see text.
Table 2. The frequency of K.M tracts in yeast chromosome III
          Sequences           Bases
Length    G,T      A,C        Found       Expected   Difference   Ratio
l         K        M          f=(K+M)l    e          f-e          f/e
1         34902    35704      70606       78815      -8209        0.90
2         18042    18425      72934       78792      -5858        0.93
3         9460     9328       56364       59094      -2730        0.95
4         5222     4930       40608       39407      1200         1.03
5         2907     2654       27805       24644      3160         1.13
6         1547     1340       17322       14800      2521         1.17
7         821      723        10808       8643       2164         1.25
8         463      362        6600        4946       1653         1.33
9         250      221        4239        2787       1451         1.52
10        147      105        2520        1551       968          1.62
11        83       76         1749        855        893          2.04
12        56       46         1224        467        756          2.62
13        24       20         572         254        317          2.25
14        19       11         420         137        282          3.06
15        10       13         345         73         271          4.67
16        3        4          112         39         72           2.83
17        9        6          255         21         234          12.09
18        3        3          108         11.2       96.8         9.63
19        3        1          76          5.9        70.1         12.78
20        2        5          140         3.1        137.9        44.53
21        1        0          21          1.6        19.4         12.66
23        1        0          23          0.4        22.4         50
24        1        1          48          0.2        48.8         199
26        0        1          26          0.07       25.93        393
32        1        0          32          ≤0.01      32           24061
37        1        0          37          ≤0.01      37           737307
SUM: 147957 tracts 314994 bases
%G,T = 0.491
Table 3. The frequency of S;W tracts in yeast chromosome III
          Sequences           Bases
Length    A,T      G,C        Found       Expected   Difference   Ratio
l         W        S          f=(S+W)l    e          f-e          f/e
1         28422    45905      74327       74700      -373         1.00
2         19630    19001      77262       70779      6483         1.09
3         10507    6496       51009       53083      -2074        0.96
4         6585     2477       36248       37247      -999         0.97
5         4068     943        25055       25602      -547         0.98
6         2241     320        15366       17488      -2122        0.88
7         1439     119        10906       11912      -1006        0.92
8         889      61         7600        8091       -491         0.94
9         588      15         5427        5474       -47          0.99
10        348      11         3590        3687       -97          0.97
11        218      3          2431        2472       -41          0.98
12        124      1          1500        1647       -147         0.91
13        95       0          1235        1093       142          1.13
14        51       0          714         721        -7           0.99
15        43       0          645         475        170          1.36
16        21       0          336         311        25           1.08
17        16       0          272         203        69           1.34
18        9        0          162         130        30           1.23
19        18       0          342         86         256          4.00
20        8        0          160         56         104          2.89
21        12       0          252         36         216          7.05
22        4        0          88          23         65           3.83
23        4        0          92          15         77           6.23
24        3        0          72          9.5        62.5         7.60
26        1        0          26          3.9        22.1         6.71
29        5        0          145         1.0        144.0        144.5
31        1        0          31          0.4        30.6         76.5
32        2        0          64          0.26       63.74        249
SUM: 150704 tracts 315357 bases
%G,C = 0.385
Table 4. Two-base tract frequencies in coding and non coding regions.
Tracts of 15 bases and longer are counted.
Region                   Total bases    ---- Bases in tracts 15 nt and longer ----
                                        Found      Expected   Ratio
R.Y
Entire chromosome        315,357        2639       154        x 17
Coding                   210,695        971        103        x 9.7
Non coding:              104,662        1668       51         x 32
Non coding, ±200         48,942a        1100       24         x 46
K.M
Entire chromosome        315,357        1213       158        x 7.0
Coding                   210,695        442        105        x 4.2
Non coding:              104,662        781        53         x 14.7
Non coding, ±200         48,942a        468        25         x 18.7
Telomere                                363
S;W
Entire chromosome        315,357        2697       1987       x 1.97
Coding                   210,695        350        1325       x 0.26
Non coding:              104,662        2287       662        x 3.45
Non coding, ±200         48,942a        1380       309        x 4.46
Other (Intron, joint)                   60
a. Within ±200 bases of a translation initiation or termination site.
Table 5. R.Y tracts longer than 19 nt