Proteomic Signatures: Amino Acid and Oligopeptide Compositions Differentiate Among Phyla

Supplementary material for

Proteomic Signatures: Amino Acid and Oligopeptide Compositions Differentiate Among Phyla

Itsik Pe'er, Clifford E. Felder, Orna Man, Israel Silman, Joel L. Sussman, and Jacques S. Beckmann

Proteins: Structure, Function and Genetics 54, 20-40 (2004)

All the data text files are best viewed when opened with MS Excel. Page numbers refer to the journal page numbers.

Repeating figures of clustering by other training sets, referred to on page 23 (tif images):
- Separation between eukaryotes & eubacteria (repeating Figure 2a), by training sets 1 2
- Separation between eukaryotes & archaea (repeating Figure 2c), by training sets 1 2
- Separation between eukaryotes, eubacteria, and archaea with training on eukaryotes and archaea only (repeating Figure 2c), by training sets 1 2

Z-scores of XY vs YX heterodipeptide bias, referred to on page 24. A text file with a symmetric table of 20x20 space delimited real numbers. The number on row X, column Y denotes the Z-score (number of standard deviations) by which XY occurs more than YX.
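A minimal sketch of how such a table could be loaded and queried, written in Python. The residue order used for rows and columns (`RESIDUES`) is an assumption, since the description above does not state it; the function names are hypothetical as well.

```python
# Assumption: rows and columns follow this one-letter residue order.
# The description does not state the order, so check it against the file.
RESIDUES = "ACDEFGHIKLMNPQRSTVWY"


def parse_zscore_table(text, residues=RESIDUES):
    """Parse a whitespace-delimited square table into {(X, Y): z}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n = len(residues)
    if len(rows) != n or any(len(row) != n for row in rows):
        raise ValueError(f"expected a {n}x{n} table")
    return {
        (x, y): float(rows[i][j])
        for i, x in enumerate(residues)
        for j, y in enumerate(residues)
    }


def strongest_biases(table, top=5):
    """Return the `top` ordered pairs XY with the largest Z-scores."""
    hetero = [(pair, z) for pair, z in table.items() if pair[0] != pair[1]]
    return sorted(hetero, key=lambda item: item[1], reverse=True)[:top]
```

With the table loaded, `table[("K", "E")]` would give the Z-score by which KE occurs more than EK, provided the assumed residue order matches the file.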

Oligopeptide frequency deviations from expectation, referred to on page 24. 2-column tab-delimited text files, the first column listing the oligopeptide, the second listing its Z-score (number of standard deviations).
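The two-column layout described above can be read with Python's standard `csv` module. The sketch below is illustrative only: the function names and the cutoff of 3 standard deviations are assumptions, not values taken from the paper, and the code assumes that any row with a non-numeric second column is a header.

```python
import csv


def read_oligopeptide_zscores(handle):
    """Read (oligopeptide, Z-score) rows from a 2-column tab-delimited file."""
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            scores[row[0].strip()] = float(row[1])
        except ValueError:
            continue  # assumption: a non-numeric second column is a header
    return scores


def most_deviant(scores, threshold=3.0):
    """Oligopeptides with |Z| >= threshold, largest deviation first."""
    hits = [(pep, z) for pep, z in scores.items() if abs(z) >= threshold]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)
```

A file would typically be passed in as `open(path, newline="")`, where `path` is whichever of the supplementary files is being inspected.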

Homotripeptides in eukaryotes, referred to on page 24. Tab-delimited lists of the expected values, the observed values, and their normalized difference (Z-score) for the homotripeptide content of each eukaryotic genome examined. The homotripeptides are listed in alphabetical order.

Rank order of residues and oligopeptides by phyla, referred to on pages 29, 34, and 36. Three tab-delimited text files (4 columns for single residues, 5 columns for oligopeptides), with rows corresponding to single residues, dipeptides, or tripeptides, respectively, which are listed in the first column. The next column, present in the oligopeptide files only, indicates whether the current oligopeptide is a homopeptide, a heteropeptide or, for tripeptides, a palindrome. The remaining columns in all files give the rank order of the current residue or oligopeptide in eukaryota, eubacteria, and archaea, respectively. Each of the three files appears in 6 extract versions: top/bottom extracts according to each of the three superkingdoms.

Standardized Euclidean distances between species' compositions, referred to on page 34. A text file with a symmetric table of 72x72 space delimited real numbers, with preceding row and column that detail species' 3-letter nickname (see Table I). The number on row X, column Y denotes the standardized Euclidean distance between normalized composition vectors for species X and Y. Each such vector is computed by dividing the frequency of each of the 20 residues in the current species' proteome by the variance of the frequency of this residue across all the proteomes examined.
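The computation described above can be sketched as follows. This follows the wording literally (division by the variance); the use of the population variance rather than the sample variance is an assumption, and the function names are hypothetical. Note that the more common textbook form of the standardized Euclidean distance scales each coordinate by its standard deviation instead, so the published table should be treated as authoritative.

```python
import math
from statistics import pvariance


def normalize(compositions):
    """Divide each residue frequency by that residue's variance across proteomes.

    `compositions` maps a species nickname to a list of residue frequencies
    (20 per species in the setting described above). Population variance is
    an assumption; the description does not say which estimator was used.
    """
    species = list(compositions)
    n_res = len(compositions[species[0]])
    variances = [
        pvariance([compositions[s][k] for s in species]) for k in range(n_res)
    ]
    if any(v == 0 for v in variances):
        raise ValueError("a residue has zero variance across proteomes")
    return {
        s: [compositions[s][k] / variances[k] for k in range(n_res)]
        for s in species
    }


def distance_table(compositions):
    """Pairwise Euclidean distances between the normalized vectors."""
    vectors = normalize(compositions)
    return {
        (a, b): math.dist(vectors[a], vectors[b])
        for a in vectors
        for b in vectors
    }
```

The result is keyed by pairs of nicknames, so `distance_table(compositions)[(x, y)]` corresponds to the entry on row X, column Y of the table; it is symmetric with zeros on the diagonal by construction.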

Similarity trees for dipeptides and tripeptides, referred to on page 34 (tif images).

Species coordinates in all figures. 3-column tab-delimited text files, one per principal component analysis plot, the first column naming the species and the other two providing the first and second component coordinates for this species in the current plot. Figures 2a 2b 2c 2d 2d' 2e 3a 3b 4 7a 7c 9a 9c

Exact oligopeptide score, referred to in the Appendix. A PDF document that describes the exact details of the score used, which properly distinguishes the left and right ends of the sequence and handles them as special cases.