The frequency of Oligopurine.Oligopyrimidine and other two-base tracts in yeast chromosome III.

By: Gad Yagil
Dept. of Cell Biology, The Weizmann Institute of Science,
Rehovot, Israel 76100
lcyagil@wiccmail.weizmann.ac.il
Fax: 00-972-8-344125
Telephone: 00-972-8-342775

Reference:

Yeast, Vol 10, pp 603-611 (1994)
2. Greek letters in the text are spelled out: p = pi; f = phi; e = epsilon; SIGMA = S; THETA = Q; theta = q.
3. Numbers in chemical compounds (e.g. H2O) will be subscripts in the final version.

Abstract

Program TRACTS is employed to map the occurrence of base tracts composed of only two bases in S. cerevisiae chromosome III. The observed frequencies are compared with those expected in random DNA. A vast excess of long base tracts of all three possible two-base combinations, namely purine.pyrimidine (R.Y), keto.imino (K.M) and weak;strong (S;W, mainly A,T rich), is documented. This excess places yeast in the same category as other eukaryotic and organelle genomes analyzed. The excess of the two-base tracts is considerably larger in the noncoding 1/3 of the chromosome, in particular proximal to coding initiation and termination sites. A functional role for the excessive tracts, possibly as unwinding centers of particular genes, is proposed. Multiple occurrence of long two-base tracts within an ORF is offered as another diagnostic to determine whether the ORF, or a subregion of it, is actually translated.

Introduction

One of the more prominent deviations of DNA composition from randomness observed so far is the extremely high frequency of base tracts composed of only two bases. This high frequency was observed very early for oligopyrimidine.oligopurine (R.Y) tracts by Chargaff and collaborators (Tamm et al., 1952; Shapiro et al., 1965) and has been further elaborated by several authors since (Case and Baker, 1975; Birnboim et al., 1976; 1979; Behe, 1987; Bucher and Yagil, 1991). More recently, we have been able to document a similar excess of the other "two-letter" sequences (Karlin and Ghandour, 1985), namely of K.M tracts (oligo(G,T).oligo(A,C) tracts), as well as a modest excess of S;W tracts, mainly oligo(A,T) (Yagil, 1993). The excess of all three possible two-base tracts is most pronounced in vertebrates, but is found in many genes of plants, invertebrates, vertebrates, their cellular organelles (generally thought to be of prokaryotic origin) and their viruses as well. In prokaryotes, only a barely significant excess of R.Y tracts is observed (Chargaff, 1963; Bucher and Yagil, 1991).

The situation in yeast is less clear. While the analysis of the 2 micron plasmid and several coding genes revealed a situation similar to that in prokaryotes, i.e. no particular excess of R.Y tracts, other gene regions, like the CYC1 promoter (McNeill and Smith, 1986) or the centromere region (Ng and Carbon, 1987), contain several long R.Y or A,T tracts. With the availability of the first complete sequence of an entire yeast chromosome (Oliver et al., 1992), the frequency of the three types of two-base tracts (R.Y, K.M and S;W) in coding and noncoding regions of the yeast genome can be systematically evaluated. This is done here by applying the previously written program TRACTS to the published sequence of the chromosome. The results demonstrate that two-base tracts are as extensively overrepresented in the yeast chromosome as they are in the other eukaryotic genomes analyzed. Overrepresentation is found to be particularly high in intercoding regions, especially in the vicinity of translation start or termination sites. One-letter tracts, mainly oligo A and oligo T, and strictly alternating A-T tracts have recently been mapped by Karlin et al. (1993).

Results

In Table 1 the frequency of R.Y tracts of every length found in yeast chromosome III is recorded. Table 1 is an abbreviated output of the previously described program TRACTS (Yagil 1993). Oligopyrimidine (Y) tracts can be seen to be generally as frequent as oligopurine (R) tracts (columns 2 and 3). The data are for the strand listed in the EMBL database. Every tract length up to 26 bases is represented in the chromosome; additional tracts, up to 36 purines in a row, are present. In column 5 the numbers of bases expected in randomized DNA of the same length and base composition as chromosome III (315,357 nt, 50% A,G) are shown. The expected number e of bases in tracts of length l is calculated by the simple expression

    e = l (p^l q^2 + p^2 q^l) L

(Eqs. (1),(4) of Bucher and Yagil 1991), where p is the fraction of one of the two base types in the analyzed sequence, q that of the other (p+q=1), and L is the length of the DNA analyzed. This formula has been repeatedly verified by running many randomized DNA sequences through program TRACTS; see previous publications.
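As a quick check of the expression above, the following minimal Python sketch evaluates e for a given tract length. It is not the FORTRAN program TRACTS; the function name expected_bases and its argument names are chosen here for illustration only. With p = 0.5 and L = 315,357 it should give values close to the Expected column of Table 1.

```python
def expected_bases(l, p, L):
    """Expected number of bases in tracts of exactly length l in random DNA.

    l: tract length; p: fraction of one base type (q = 1 - p is the other);
    L: length of the sequence analyzed.
    Implements e = l (p^l q^2 + p^2 q^l) L as quoted in the text.
    """
    q = 1.0 - p
    return l * (p ** l * q ** 2 + p ** 2 * q ** l) * L


if __name__ == "__main__":
    chromosome_length = 315357  # yeast chromosome III, from the text
    for length in (1, 5, 18):
        # Compare with the Expected column of Table 1 (50% A,G).
        print(length, round(expected_bases(length, 0.5, chromosome_length), 1))
```

The term p^l q^2 is the chance that a position starts a run of exactly l bases of the first type flanked by the second type on both sides; the second term is the mirror case.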

The difference between the found and expected bases (column 6) makes it clear that very short tracts, i.e. isolated purines or pyrimidines, their doublets, triplets and quadruplets (l = 1-4), are highly underrepresented in the chromosome. On the other hand, tracts of 5 bases and longer are overrepresented to a rapidly increasing extent. This is most evident when the ratio r of bases found to bases expected, listed in the last column, is examined. For example, 12 R.Y tracts of 18 nt (216 bases) are observed, while only 0.6 tracts (10.8 bases) of that length are expected in a random DNA of 315,357 bases. This means that 18-base tracts are overrepresented by the factor of 20 listed in the last column. The overrepresentation increases continuously with tract length and reaches very high values for tracts of 25-36 nt.
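The found counts in the tables come from program TRACTS, whose source is not reproduced here. The sketch below is only an illustration, under the assumption that a tract is a maximal run of bases belonging to one class of a two-letter alphabet; the names ALPHABETS, count_tracts, bases_found and ratio are hypothetical, and sequences are assumed to contain only A, C, G and T.

```python
from collections import Counter
from itertools import groupby

# The three two-letter alphabets named in the text.
ALPHABETS = {
    "R.Y": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "K.M": {"G": "K", "T": "K", "A": "M", "C": "M"},
    "S;W": {"G": "S", "C": "S", "A": "W", "T": "W"},
}


def count_tracts(seq, alphabet="R.Y"):
    """Return {class letter: Counter({tract length: number of tracts})}.

    Assumption: seq contains only A, C, G, T; any other symbol raises KeyError.
    """
    mapping = ALPHABETS[alphabet]
    counts = {cls: Counter() for cls in set(mapping.values())}
    classes = (mapping[base] for base in seq.upper())
    for cls, run in groupby(classes):
        counts[cls][sum(1 for _ in run)] += 1  # one maximal run = one tract
    return counts


def bases_found(counts, length):
    """f = (tracts of this length in both classes) * length."""
    return length * sum(c[length] for c in counts.values())


def ratio(found, expected):
    """r = f / e, the last column of the tables."""
    return found / expected


if __name__ == "__main__":
    demo = count_tracts("AAGGCTTCAG", "R.Y")
    print(demo["R"], demo["Y"], bases_found(demo, 4))
    # 18 nt row of Table 1: 8 Y tracts + 4 R tracts, 10.8 bases expected.
    print(ratio(18 * (8 + 4), 10.8))
```

Running the same function with "K.M" or "S;W" gives the counts for the other two tract families.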

A similar situation is observed for K.M tracts (Table 2): every tract length up to 21 bases is represented and tracts up to 37 nt are observed. In addition, a single huge A,C tract of 363 nt is present, representing the one telomere region sequenced; yeast telomeres are well known to terminate with a long repetitive (CA)n.CCC.(GT)nGGG tract, which may also assume a special structure (Zakian, 1989). Again, isolated K (G or T) and M (C or A), their doublets and triplets (l = 1-3) are significantly underrepresented and compensate for the excess of long tracts. The huge excess of the longer K.M tracts is evident when compared with the number expected in a randomized DNA (last column). Looking again at 18 nt tracts, 3 K and 3 M tracts are observed, 108 nt together, while only 11.2 nt are expected, i.e. a 9.6 fold excess.

Overrepresentation of two-base tracts is also apparent for the S;W family, but to a less significant extent (Table 3). Overrepresentation begins only at the 15 nt level, but then continues up to 24 nt; tracts up to 32 nt are present. The high ratios are mainly for W (A,T) tracts because of the high proportion of A,T in the yeast chromosome (61.5%); no S tracts as long as 16 nt are expected or present.

The availability of a complete chromosome sequence makes it possible, for the first time, to compare compositional parameters of coding with noncoding gene regions. In the published sequence, 104,662 nt out of the total 315,357 are outside known or surmised open reading frames (ORFs), i.e. close to 1/3 of the whole chromosome is noncoding. It is not known exactly which parts of the noncoding regions are actually transcribed (Yoshikawa and Isono, 1990), although the noncoding regions are at least partly transcribed. In Table 4 the density of all two-base tracts of 15 nt and longer in coding and noncoding regions is compared. It is evident that the overrepresented long tracts of all three classes are not equally distributed, but tend to concentrate in the 1/3 noncoding regions. Thus R.Y tracts of 15 bases and longer (145 of these tracts are found, while only 9.6 tracts are expected by equation (1) of Bucher and Yagil) are present at an overall 17 fold excess, but this excess increases to 32 fold in the noncoding regions. There is still a large overrepresentation in the coding regions (x9.7), and its significance is considered in the Discussion.
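Table 4 reports the comparison as found/expected ratios. The same contrast can also be read directly as bases in long tracts per kilobase of region, as in the small sketch below. The figures are copied from the R.Y rows of Table 4; the per-kilobase measure and the names RY_TABLE4, density_per_kb and relative_density are introduced here for illustration and are not part of the original analysis.

```python
# region: (total bases, bases found in R.Y tracts of 15 nt and longer),
# copied from the R.Y rows of Table 4.
RY_TABLE4 = {
    "entire chromosome": (315357, 2639),
    "coding": (210695, 971),
    "noncoding": (104662, 1668),
    "noncoding, within 200 nt": (48942, 1100),
}


def density_per_kb(region):
    """Bases lying in long R.Y tracts per 1000 bases of the region."""
    total_bases, bases_in_tracts = RY_TABLE4[region]
    return 1000.0 * bases_in_tracts / total_bases


def relative_density(region, reference="coding"):
    """How many times denser long tracts are in region than in reference."""
    return density_per_kb(region) / density_per_kb(reference)


if __name__ == "__main__":
    for name in RY_TABLE4:
        print(f"{name:28s} {density_per_kb(name):6.2f} bases per kb")
    print("noncoding vs coding:", round(relative_density("noncoding"), 2))
```

Because the base composition is taken to be the same throughout, the expected density is the same in every region, so a ratio of densities between two regions mirrors the ratio of their fold excesses.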

The excess of two-base tracts is even more pronounced when the distribution within the noncoding regions is examined: 48,997 noncoding bases (47%) are within ±200 nt of the boundary of a coding region, i.e. of a translation start or termination site. These proximal regions contain 1100 bases within long (≥15 nt) R.Y tracts (Table 4). In other words, R.Y tracts tend to concentrate in promoter/termination regions. Intercoding K.M and W tracts are also concentrated within ±200 bases of coding regions. These regions are often composed predominantly of two-base tracts. A particular example is the divergent initiation (promoter?) region between ORFs R59c and R60w, 168 nt long, which contains a 30 nt Y tract, a 34 nt A,T tract, a 13 nt G,T tract, a 13 nt R tract, and several shorter ones. Another is the convergent termination region between ORFs L17w (LEU2) and L18c. On the other hand, long intergenic regions like the centromere region (3284 bases), or the region between ORFs R95c and R96c (287,992 - 291,912; 3921 bases), contain only a few R.Y and K.M tracts. The concentration of the long two-base tracts near gene initiation sites is highly suggestive of a major role for these tracts in gene control.
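How the proximal noncoding bases were tallied is not described in detail. One plausible way, sketched below, is to mark every position inside an ORF as coding and every position within the window around an ORF boundary as proximal, then count noncoding positions carrying the proximal mark. The function name classify_noncoding and the 1-based inclusive (start, end) representation of ORFs are assumptions made for this illustration.

```python
def classify_noncoding(chrom_len, orfs, window=200):
    """Count noncoding bases and those within `window` nt of an ORF boundary.

    chrom_len: sequence length; orfs: list of (start, end), 1-based inclusive.
    Returns (number of noncoding bases, number of proximal noncoding bases).
    """
    coding = [False] * (chrom_len + 1)  # index 0 is unused
    near = [False] * (chrom_len + 1)
    for start, end in orfs:
        for pos in range(start, end + 1):
            coding[pos] = True
        # Mark the window around the start and the end of the ORF.
        for boundary in (start, end):
            lo = max(1, boundary - window)
            hi = min(chrom_len, boundary + window)
            for pos in range(lo, hi + 1):
                near[pos] = True
    noncoding = [pos for pos in range(1, chrom_len + 1) if not coding[pos]]
    proximal = [pos for pos in noncoding if near[pos]]
    return len(noncoding), len(proximal)


if __name__ == "__main__":
    # Toy example: one ORF at 301-600 on a 1000 nt sequence.
    print(classify_noncoding(1000, [(301, 600)]))
```

Tract bases falling on the proximal positions can then be summed to give the "Noncoding, ±200" rows of Table 4.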

In Table 5 all R.Y tracts of 20 nt or longer are listed. Oligo R and oligo Y tracts are nearly equally represented. To what extent can these tracts be considered repetitions of simpler motifs? The 38 tracts listed in the table have been classified according to whether a repetitive simple motif can be discerned in the tract. A tract is scored as "rand" if no repetitive motif is evident. A tract is scored as 1/2 "rand" when the repetitive motif listed covers 33-66% of the tract. It can be seen that the majority of the tracts (22/38) show no obvious repeated motif. It cannot therefore be stated that R.Y overrepresentation is due to repetition of simple, or cryptically simple (Tautz et al., 1986; Tautz, 1989), sequence motifs, such as those found as "microsatellites" in many genes (Beckman and Weber, 1992). Overrepresentation is not eliminated when strictly repetitive tracts are excluded: even if all 16 "simple" or semisimple sequences were absent, R.Y tracts would still be in a (22/38) x 17 = 9.8 fold excess over random DNA. Overrepresentation thus seems to be a property of any mixture of A's and G's, and not of a particularly ordered subclass. Detailed examination of the longer tracts shows that a similar situation exists for W tracts: of the 40 tracts ≥20 nt, 20.5 can be classified as simple, 19.5 as random. As for K.M tracts, only 2 out of 14 tracts ≥20 nt are fully random, but the regular, simple subsequences contain a high proportion of long An or Tn stretches (data not shown). Among the simple tracts, 10 have (AT)n subsequences with n≥5, way above the expected frequency. Karlin et al. (1993) have recently shown, using an r-scan statistical procedure, that these An, Tn and (AT)n subsequences are evenly distributed along the chromosome. To sum up this section, if there is an essential role for the excessive tracts, it is not due to a particular base sequence, nor is it due to a particular two-base combination.

Discussion

The data presented demonstrate that the three two-base tract families (R.Y, K.M and S;W) are as highly overrepresented in the first sequenced yeast chromosome as in most other eukaryotic genomes (Yagil 1993). This is of particular interest for two reasons:

1) In a previous publication (Bucher and Yagil, 1991) it was suggested that yeast may resemble prokaryotes, in that R.Y tracts are only modestly overrepresented, if at all (cf. Chargaff, 1963). The larger data base analyzed here shows that the excess of two-base tracts can be as large in yeast as in plants and other higher eukaryotes.

2) The sequencing of the entire chromosome makes it possible for the first time to make statements on tract frequencies in intercoding regions. It may therefore be of general significance that two-base tract overrepresentation is much higher in noncoding regions than in strictly coding regions (Table 4).

The somewhat reduced occurrence of long two-base tracts in coding regions (still way above random DNA, Table 5) may partly be due to the restrictions imposed on protein composition by two-letter codons. A long R tract can, for instance, code only for Gly (GGG, GGA), Lys (AAA, AAG), Arg (AGG, AGA) and Glu (GAG, GAA). Thus the 20 nt R tract found within ORF R13c, G.GGGAAAGAG.AAAAGAAAAAA.A, codes for Gly.Lys.Glu.Lys.Arg.Lys.Lys, which may well be a nuclear localization signal. In other cases, coding for so many charged amino acids may be a burden; e.g. the 36 nt R.Y tract at 233616 is within ORF R67c and codes for ala.(ser)10.phe, an unlikely composition. The overrepresentation of R.Y tracts in coding regions therefore points to some general role for these tracts.
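The restriction imposed by two-letter codons can be enumerated directly from the standard genetic code. The sketch below is illustrative only; the names CODON_TABLE, codable_amino_acids and translate are chosen here, and the 21 nt reading frame used in the example is one reading of the dotted R13c sequence quoted above (dropping the leading G), which is an assumption.

```python
from itertools import product

# Standard genetic code, written in the usual TCAG order (one-letter amino
# acid codes, * = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def codable_amino_acids(letters):
    """Amino acids encoded by codons built only from the given bases."""
    return sorted({CODON_TABLE["".join(c)] for c in product(letters, repeat=3)})


def translate(seq):
    """Translate seq in frame 1, ignoring an incomplete trailing codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print("R (A,G) codons:", codable_amino_acids("AG"))
    print("Y (C,T) codons:", codable_amino_acids("CT"))
    # Assumed reading frame of the R13c purine tract quoted in the text.
    print(translate("GGGAAAGAGAAAAGAAAAAAA"))
```

Each two-letter alphabet yields only eight codons, so a long tract read in frame can draw on a small set of amino acids at most.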

An example of a long K.M tract within a coding region is the 39 nt M tract, with one 2 nt interruption, within ORF R93w, identified as coding for the regulatory protein CDC39 (Collart and Struhl, 1993). This M tract codes for glutamine (CAA). The 11 glutamines coded for are part of one of two glutamine-rich regions of the CDC39 protein. The region contains 21 glutamines, 18 of them coded by CAA (incl. the above 11) and only 3 by CAG. The average codon usage ratio for glutamine in yeast is 10/29. The unusually high proportion of CAA raises the question whether the high concentration of K is due to a requirement for gln, or whether the presence of many gln may instead be due to a requirement for a K.M tract (a second gln-rich coding region nevertheless has only 11/18 CAA).

Two-base tracts may serve as a diagnostic for untranslated regions. For instance, ORF L7w contains within its 393 nt one 38 (-1) nt Y tract, one 17 nt Y tract, followed by a 22 nt K tract, another 12 nt K tract, a 10 nt W tract, and several shorter ones. These would produce a rather restricted protein sequence, so that L7w is not likely to be translated. L10c, L58w and R16w are further examples.

The high concentration of the two-base tracts in the noncoding regions, particularly within 200 bases of translation initiation or termination sites (Table 4), is strongly indicative of a special function for these tracts. The extent to which these proximal regions are actually transcribed is so far only partly established (Yoshikawa and Isono, 1990), so that a role at the RNA level, possibly by triplex formation (Maher et al., 1992), cannot be excluded. It seems, however, that a reasonable role to consider is that two-base tracts can serve as DNA unwinding elements in preparation for DNA replication, transcription and other template directed processes (Kowalski and Eddy, 1990; Palecek, 1991; Yagil 1991). Umek, Kowalski and coworkers have shown that several important yeast regulatory gene regions which are rich in A,T are susceptible to single strand DNA cleaving enzymes, like mung bean nuclease and P1 (Umek and Kowalski, 1988; Natale et al., 1993). In order to be cleaved, these sequences must be partly unwound, and they have been termed DNA unwinding elements (DUE). In recent experiments (Yagil, Tal and Shimron, 1993) we were able to show that the A,T rich elements of the centromere of chromosome IV (CEN4) are highly susceptible to the single strand specific nuclease P1 and to the conformation specific reagent KMnO4, in support of an unwinding function for the W tracts involved. As to R.Y tracts, evidence from numerous laboratories has shown that DNA in many eukaryotic genes, in intact nuclei as well as in supercoiled plasmids, is also susceptible to cleavage by single strand specific nucleases, like S1 and P1 nuclease (Larsen and Weintraub, 1982; reviewed in Wells et al., 1988; Yagil 1991). This suggests that R.Y tracts, and possibly also K.M tracts, can also act as unwinding elements. This is a bit counterintuitive, because numerous studies show that A,T (W) rich sequences are the most readily melting ones and therefore the most likely to be in a strand separated state. In chromosome III, R.Y and K.M tracts are nevertheless as excessive as W tracts. It is therefore possible that in vivo conditions, incl. negative supercoiling, may confer early strand separation on these sequences as well. Preliminary experiments indeed reveal a P1 nuclease sensitive site which maps within a highly purine rich tract of the yeast CYC1 promoter, in support of an unwinding role for that region (Yagil, Tal and Shimron 1993). These experiments encourage the further exploration of an unwinding role for all excessive two-base tracts.

References

Table 1. Frequency of R.Y tracts in yeast chromosome III

            SEQUENCES                       BASES
Length  Pyr     Pur     Found        Expected   Difference   Ratio
l       Y       R       f = (Y+R)l   e          f - e        f/e

1       35978   36129   72107        78839      -6732        0.91
2       18384   18369   73506        78839      -5333        0.93
3       8575    8330    50715        59129      -8414        0.86
4       4682    4762    37776        39419      -1643        0.96
5       2787    2759    27730        24637      3093         1.13
6       1446    1504    17700        14782      2918         1.20
7       802     839     11487        8623       2863         1.33
8       516     501     8136         4927       3208         1.65
9       278     257     4815         2771       2043         1.74
10      169     169     3380         1539       1840         2.20
11      90      97      2057         846        1210         2.43
12      66      65      1572         461        1110         3.40
13      38      37      975          250        724          3.90
14      23      30      742          134        607          5.51
15      20      14      510          72.2       437.8        7.07
16      15      14      464          38.5       425.5        12.1
17      11      13      408          20.4       387.5        20
18      8       4       216          10.8       205.2        20
19      5       3       152          5.7        146.3        27
20      2       7       180          3.0        177.0        60
21      2       2       84           1.6        82.4         53
22      5       1       132          0.83       131.2        159
23      1       2       69           0.43       68.6         159
24      2       0       48           0.23       47.8         213
25      3       1       100          0.12       99.9         851
26      3       0       78           0.06       78           1276
29      0       2       58           0.01       58           6808
30      2       0       60           ≤0.01      60           13615
31      0       1       31           ≤0.01      31           13615
33      1       0       33           ≤0.01      33           54458
36      0       1       36           ≤0.01      36           435632
SUM:    147827 tracts, 315357 bases
%A,G = 0.500
The numbers listed are the output of FORTRAN program TRACTS. The input was EMBL entry NUC:X59720, S. cerevisiae chromosome III, complete DNA sequence (Oliver et al., 1992). TRACTS is described as an earlier version, PUR, in Bucher and Yagil, 1992. VMS and VM/CMS versions are available from the author.
a. For the calculation of e, the number expected in randomized DNA, see text.

Table 2. The frequency of K.M tracts in yeast chromosome III

            Sequences                       Bases
Length  G,T     A,C     Found        Expected   Difference   Ratio
l       K       M       f = (K+M)l   e          f - e        f/e

1       34902   35704   70606        78815      -8209        0.90
2       18042   18425   72934        78792      -5858        0.93
3       9460    9328    56364        59094      -2730        0.95
4       5222    4930    40608        39407      1200         1.03
5       2907    2654    27805        24644      3160         1.13
6       1547    1340    17322        14800      2521         1.17
7       821     723     10808        8643       2164         1.25
8       463     362     6600         4946       1653         1.33
9       250     221     4239         2787       1451         1.52
10      147     105     2520         1551       968          1.62
11      83      76      1749         855        893          2.04
12      56      46      1224         467        756          2.62
13      24      20      572          254        317          2.25
14      19      11      420          137        282          3.06
15      10      13      345          73         271          4.67
16      3       4       112          39         72           2.83
17      9       6       255          21         234          12.09
18      3       3       108          11.2       96.8         9.63
19      3       1       76           5.9        70.1         12.78
20      2       5       140          3.1        137.9        44.53
21      1       0       21           1.6        19.4         12.66
23      1       0       23           0.4        22.4         50
24      1       1       48           0.2        48.8         199
26      0       1       26           0.07       25.93        393
32      1       0       32           ≤0.01      32           24061
37      1       0       37           ≤0.01      37           737307
SUM:    147957 tracts, 314994 bases
%G,T = 0.491

Table 3. The frequency of S;W tracts in yeast chromosome III

            Sequences                       Bases
Length  A,T     G,C     Found        Expected   Difference   Ratio
l       W       S       f = (S+W)l   e          f - e        f/e

1       28422   45905   74327        74700      -373         1.00
2       19630   19001   77262        70779      6483         1.09
3       10507   6496    51009        53083      -2074        0.96
4       6585    2477    36248        37247      -999         0.97
5       4068    943     25055        25602      -547         0.98
6       2241    320     15366        17488      -2122        0.88
7       1439    119     10906        11912      -1006        0.92
8       889     61      7600         8091       -491         0.94
9       588     15      5427         5474       -47          0.99
10      348     11      3590         3687       -97          0.97
11      218     3       2431         2472       -41          0.98
12      124     1       1500         1647       -147         0.91
13      95      0       1235         1093       142          1.13
14      51      0       714          721        -7           0.99
15      43      0       645          475        170          1.36
16      21      0       336          311        25           1.08
17      16      0       272          203        69           1.34
18      9       0       162          130        30           1.23
19      18      0       342          86         256          4.00
20      8       0       160          56         104          2.89
21      12      0       252          36         216          7.05
22      4       0       88           23         65           3.83
23      4       0       92           15         77           6.23
24      3       0       72           9.5        62.5         7.60
26      1       0       26           3.9        22.1         6.71
29      5       0       145          1.0        144.0        144.5
31      1       0       31           0.4        30.6         76.5
32      2       0       64           0.26       63.74        249
SUM:    150704 tracts, 315357 bases
%G,C = 0.385

Table 4. Two-base tract frequencies in coding and noncoding regions. Tracts of 15 bases and longer are counted.
Region                  Total bases    Bases in tracts 15 nt and longer
                                       Found     Expected   Ratio

R.Y
Entire chromosome       315,357        2639      154        x 17
Coding                  210,695        971       103        x 9.7
Noncoding               104,662        1668      51         x 32
Noncoding, ±200 (a)     48,942         1100      24         x 46

K.M
Entire chromosome       315,357        1213      158        x 7.0
Coding                  210,695        442       105        x 4.2
Noncoding               104,662        781       53         x 14.7
Noncoding, ±200 (a)     48,942         468       25         x 18.7
Telomere                363

S;W
Entire chromosome       315,357        2697      1987       x 1.97
Coding                  210,695        350       1325       x 0.26
Noncoding               104,662        2287      662        x 3.45
Noncoding, ±200 (a)     48,942         1380      309        x 4.46
Other (intron, joint)   60

a. Within ±200 bases from a translation initiation or termination site.

Table 5. R.Y tracts longer than 19 nt
length Position Sequence Subsequence
33 6072 TTCTCCTTTTTTTTCTTTCTTTCTTTCTTTTTC rand + (CT3)4
26 8340 TTCCTTTTTTTTTTTTTTCTCTTTCC rand + T14
20 8566 AAAAAAAAAAAAAAAAAAAA A20
23 20496 TTTTTCCCTTCTCTTCTCTTTTT rand
20 27348 CTTTCTTTTCTTCTTCCTTT rand
31 27643 AAAAAAAAAAAAAAAAAAAAAAAGAAGGAAG (GAAG)2A23
21 32950 AGAAAAGAAAGAAGAGGAAGG rand
22 39823 TCTCTTCCTCTTCCTCTTCCTC (TC)2(TTCCTC)3
21 41364 TCTTTTCTTCCTCTTTTCTTT rand
23 41981 AAAAAAAAAAAAAAAAAAAAAAA A23
29 44513 AGGGAAAAAAAAAAAAAAAAAAAAGAAAG rand + A20
21 50409 GAGAGAAAAGGGAAAAAGAGG rand
23 58933 AAGAAGGGAAGAAGGAAAGGAGG rand
22 72958 AAAGAAAAGGAAAAAAAGAAGA rand
20 92165 AAGAAGGAGAAAAAGGAGGA rand
26 101481 TCCCTTTTTTTTTTTTTTTTTTCTTC TTCTC3.T18C
25 106666 CTTCCCTTTTCTTCCTTCTTCTTCT rand
26 119992 CCCCTCTTTTCCTTTTTCCTCTTCTT rand
24 127398 TTTTTCTTTTCTTTTTTTTTTTTT (T4C)2.T13
22 136533 TTTTTTTTTCTTTCCTCTTTTT rand + T9
22 138628 TTTTTTTCTTTTCTCTTTCCCC rand
22 139101 TTTCTTCTTCTTCTTCTTCTTC T(TTC)7
25 142259 GGGAAAAAGAAAAAAAAAAAAAAAA rand + A16
20 143494 AAAGAAAGAGAAAAAAAGAA rand
30 203143 CCTTTCCTCTTCCCCTTCCTCTTCCTCTTC rand +(CTCTTC)4
30 222963 CTTTTCTTTCCTCTCTCTTTTTTTTTTCTT rand
24 226194 CTTTTCTTTTTTCTTTTTTTTTTC rand + T10
25 226522 TTCTTCCTTCCTTTTTTCTTTTTTT rand
22 229728 CTTTTTTTTTTTTTTTTTTTTC C.T20C
36 233616 AGAAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGAAG AGA(AAG)11
29 235072 AAAAGGAGAGAAAAAAAAGGAAGAGAGGA rand
25 252854 TTTTTTTTTTCCTCTTCCTCTTTTT rand + T10
20 258867 AGGAAAAGGAGGGGAAGGGA rand
20 262667 AAGAAAAGAGAGAAAAAAGA rand
20 264897 AGAAAAAGAGGAGGAAGAAA rand
20 271760 GAAAAAAAAAAAAAAAAAAA G.A19
20 272966 CTTTCCTTTTTCCCTCTTTC rand
21 293660 TTTCTCTCTTCTTTTCCTTT rand
Total: 38 tracts, 21 random, 9 simple and 8 mixed.


Extensive evidence in favor of W rich sequences serving as unwinding elements, in yeast in particular, comes from the study of ARS elements and other yeast regions by Umek, Kowalski and coworkers (1988; 1990). The study of these regions led Umek and Kowalski to propose A,T rich regions as DNA unwinding elements (DUEs) in yeast.