Using Bioinformatics Resources in Molecular Biology Research
Exercise: Analyzing Expression Array Data
(Gaddy Getz)
- Cluster yeast cell cycle data using Average Linkage.
- Compare clustering results of Average Linkage, K-means and SPC
Software
The programs Cluster and TreeView that are used in this exercise were
written by Michael Q. Eisen (eisen@genome.stanford.edu) and can be
downloaded upon request from http://rana.stanford.edu/software/
The software package for SPC (Super-Paramagnetic Clustering) can be
obtained by e-mailing Prof. Eytan Domany (fedomany@wicc.weizmann.ac.il).
A. Cluster yeast cell cycle data using Average Linkage
Goal: Find groups of genes with similar temporal behavior.
- Run Cluster
- press Start (bottom left corner), select Programs, select Clustering,
select Cluster.
- NOTE: you can read the manual by pressing "Read Manual".
- Load the yeast cell cycle data of an alpha factor arrest and release
experiment. The data come from a cDNA chip experiment on the yeast cell
cycle (taken from Eisen et al. (1998) PNAS 95:14863).
For each gene, the file gives log2(expression at time i / expression in
a mixture of asynchronous cells), so a positive value means the gene is
expressed above the asynchronous reference and a negative value below it.
A missing value means that the signal in the spot was not strong enough.
Measurements were taken over a period of 119 minutes, which is nearly two
cell cycles.
- Press "Load File"
- Load the file from C:\clustering\exercise\alpha.txt. (A copy of
the file can also be found here).
- Question: How many genes and how many time points were measured?
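To check your answer outside Cluster, the counting can be sketched in a few
lines of Python. This is only a sketch under assumptions: it assumes a
tab-delimited file with one header row, a gene ID in the first column, a
description in the second column and one column per time point. The sample
text, the gene IDs and all values below are made up; for the real file you
may need to change n_annotation_cols (for example if a weight column is
present).

```python
import csv
import io

# Hypothetical miniature stand-in for alpha.txt (assumed layout, made-up
# values). An empty cell is a missing value.
SAMPLE = (
    "YORF\tNAME\talpha0\talpha7\talpha14\n"
    "GENE_A\tgene A\t-0.15\t0.42\t\n"
    "GENE_B\tgene B\t0.30\t\t-0.61\n"
)

def load_expression(handle, n_annotation_cols=2):
    """Return (time_labels, {gene_id: [float or None, ...]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    time_labels = header[n_annotation_cols:]
    genes = {}
    for row in reader:
        if not row or not row[0]:
            continue  # skip blank lines
        cells = row[n_annotation_cols:]
        # Pad short rows so every gene has one entry per time point.
        cells += [""] * (len(time_labels) - len(cells))
        genes[row[0]] = [float(c) if c.strip() else None for c in cells]
    return time_labels, genes

time_labels, genes = load_expression(io.StringIO(SAMPLE))
missing = sum(v is None for values in genes.values() for v in values)
print(len(genes), "genes,", len(time_labels), "time points,",
      missing, "missing values")
```

To use it on the real data, replace io.StringIO(SAMPLE) with an open file
handle, for example open(r"C:\clustering\exercise\alpha.txt", newline="").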
- Preprocess the data (center and normalize genes)
- There is no need to filter the data.
- Choose the "Adjust Data" tab and check the normalize genes and mean
center genes options.
- Press Apply
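What the two adjustments do to one gene (one row of the data) can be
sketched as follows. This is a minimal illustration, not Cluster's own
code: it assumes that centering subtracts the row mean, that normalizing
rescales the row so its squared values sum to 1, that centering is applied
first, and that missing values (None) are skipped. The example row is made
up.

```python
import math

def mean_center(values):
    """Subtract the mean of the present values from each present value."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [None if v is None else v - mean for v in values]

def normalize(values):
    """Scale the row so that the sum of squares of present values is 1."""
    norm = math.sqrt(sum(v * v for v in values if v is not None))
    if norm == 0:
        return list(values)  # a flat row cannot be rescaled
    return [None if v is None else v / norm for v in values]

# Hypothetical log2 ratios for one gene; None marks a missing spot.
row = [1.0, None, 3.0, 2.0]
adjusted = normalize(mean_center(row))
print(adjusted)
```

After both steps every gene has mean 0 and the same overall magnitude, so
the comparison between genes depends on the shape of the time course and
not on its amplitude.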
- Cluster the genes using Average Linkage
- Choose the "Hierarchical Clustering" tab
- Since we want to cluster only the genes and not the arrays (time
points), uncheck the Arrays cluster option.
- Choose "Correlation (centered)" as the similarity measure.
- Question: What will be the difference if you use "Absolute Correlation
(centered)" as the similarity measure?
- Press Average Linkage Clustering. This creates two files: alpha.gtr
(dendrogram of genes) and alpha.cdt (reordered data matrix).
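The idea behind this step, and the question about absolute correlation,
can be explored with a small sketch. It implements textbook average
linkage (the similarity between two clusters is the mean similarity over
all cross pairs) with the centered Pearson correlation; Cluster's internal
procedure may differ in its details. The four profiles and their names are
made up, and missing values are not handled.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Centered (Pearson) correlation of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def abs_pearson(x, y):
    """Absolute correlation: anti-correlated profiles count as similar."""
    return abs(pearson(x, y))

def average_linkage(profiles, similarity=pearson):
    """Merge the two most similar clusters until one is left.

    Returns the merges in order as (left_members, right_members, similarity).
    """
    names = list(profiles)
    sim = {frozenset(p): similarity(profiles[p[0]], profiles[p[1]])
           for p in combinations(names, 2)}
    clusters = [(n,) for n in names]
    merges = []

    def cluster_sim(a, b):
        total = sum(sim[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        merges.append((a, b, cluster_sim(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical expression profiles over four time points.
profiles = {
    "up":     [-1.0, 0.0, 1.0, 2.0],
    "up_too": [-2.0, 0.1, 2.0, 4.1],
    "down":   [2.0, 1.0, 0.0, -1.0],
    "wavy":   [0.0, 1.0, 0.0, 1.0],
}
merges = average_linkage(profiles)
merges_abs = average_linkage(profiles, similarity=abs_pearson)
print("centered:", merges[0][:2])
print("absolute:", merges_abs[0][:2])
```

With the centered correlation the first merge joins the two rising
profiles; with the absolute correlation the rising and the falling profile
are joined first, because their correlation is -1.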
- View the clustering results in TreeView
- Run TreeView by pressing Start (bottom left corner), select Programs,
select Clustering, select TreeView.
- Load the file C:\clustering\exercise\alpha.cdt. Use the menu
option File and then Load.
- Scroll along the tree. Note: green color means negative normalized log
ratio, red color means positive.
- Search for genes which are known to be cell cycle regulated: HHT1 (S
phase), CLN1 (late G1), CLB1 (G2/M).
- Choose Find in the menu, then Gene, and type HHT1. Press Find.
- Select the cluster of genes that contains HHT1 by clicking on a vertex
on the dendrogram or by using the up/left/right buttons to go up the
tree/down to the left descendant/down to the right descendant.
- Question: How many genes did you choose?
- Question: Are the genes in the cluster from the same biological process?
- Save the data of this cluster by selecting File/Save Data. Change the
name to hht1.txt.
- Open Excel (Start/Program/Microsoft Excel).
- Load hht1.txt. (File/Open C:\clustering\exercise\hht1.txt). Press Finish
in the Text Import Wizard.
- Create a graph of the normalized gene expression over time.
- Select the region C2:T9 by dragging the mouse from C2 to T9 while
pressing the left mouse button.
- Select Insert/Chart, choose the Line chart type and its first sub-type,
and press Next. Make sure the rows button is chosen. Then, press Finish.
- Question: At what times does the expression of these genes peak?
- Save the Excel workbook by selecting File/Save and changing the Save
as Type to Microsoft Excel Workbook. The filename should be hht1.xls
(result C:\clustering\exercise\hht1.xls).
- Repeat this analysis with genes CLN1 and CLB1 (C:\clustering\exercise\cln1.xls,
C:\clustering\exercise\clb1.xls).
- Question: Order the phases in time using the peak times you recorded.
- You can look at additional interesting clusters. Look for the genes
UBP6 and SPA2.
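The peak-time questions of this part can also be answered numerically
instead of by reading the Excel charts. The sketch below averages the
profiles of the genes in a cluster and reports the time of the largest
mean value. The time points, cluster names and values are hypothetical;
substitute the times and rows from your saved cluster files. Because the
experiment spans nearly two cell cycles, a real profile may show two
peaks, and this sketch reports only the highest one.

```python
def mean_profile(rows):
    """Column-wise mean over genes, ignoring missing (None) values."""
    profile = []
    for column in zip(*rows):
        present = [v for v in column if v is not None]
        profile.append(sum(present) / len(present) if present else None)
    return profile

def peak_time(times, profile):
    """Time at which the profile is largest (the first one on ties)."""
    pairs = [(t, v) for t, v in zip(times, profile) if v is not None]
    return max(pairs, key=lambda pair: pair[1])[0]

# Hypothetical sampling times (minutes) and normalized expression rows.
times = [0, 10, 20, 30, 40]
clusters = {
    "cluster_1": [[-0.5, 0.6, 0.1, -0.3, None],
                  [-0.4, 0.8, 0.0, -0.2, -0.2]],
    "cluster_2": [[0.2, -0.3, -0.1, 0.7, 0.1]],
    "cluster_3": [[-0.6, -0.1, 0.5, 0.2, 0.0]],
}

peaks = {name: peak_time(times, mean_profile(rows))
         for name, rows in clusters.items()}
order = sorted(peaks, key=peaks.get)
print(peaks)
print("order by peak time:", order)
```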
B. Compare clustering results of Average Linkage, K-means and SPC
Goal: See the differences between the methods.
K-Means
- Cluster alpha.txt as before using K-means with K=150.
- Return to Cluster. If you closed it, rerun it and normalize the
data as in part A.
- Choose the K-Means clustering.
- Check only "Organize Genes".
- Set K to 150. This gives an average cluster size of ~16 genes
(2467/150).
- Leave max cycles at 100.
- Press Execute. It could take a while. Follow the iterations at the
bottom of the window.
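For reference, the iterations you follow at the bottom of the window can
be pictured with a minimal K-means sketch. It is not Cluster's
implementation: it assumes squared Euclidean distance and, to keep the
result reproducible, starts from the first K points as centroids (a real
run would normally start from a random assignment). The points are made
up.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, max_cycles=100):
    """Return (assignment, centroids) after at most max_cycles iterations."""
    # Assumption for this sketch: deterministic start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_cycles):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the solution is stable
        assignment = new_assignment
        # Recompute each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids

# Hypothetical two-dimensional points forming two obvious groups.
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]
assignment, centroids = k_means(points, k=2)
print(assignment)

# Average cluster size for the exercise: 2467 genes in 150 clusters.
print(round(2467 / 150, 1))
```

Unlike the dendrogram of part A, the result is a flat partition: every
gene ends up in exactly one of the K clusters, and K must be chosen in
advance.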
- Load the resulting file in Excel
(C:\clustering\exercise\alpha_K_G150.txt).
- Find the cluster that contains CLN1. The clusters in this file are
separated by empty lines that have NONE in column A.
- Plot the normalized gene expression of the genes in the cluster.
- Compare your results to part A.
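If you prefer not to scroll through the K-means output in Excel, finding
the cluster that contains a given gene can be sketched as below. The
sketch relies only on what is stated above, namely that clusters are
separated by rows with NONE in column A. The rest is assumed: rows are
already split into cells, the header row has been removed, and the gene
name appears as a separate word in one of the first two columns. The IDs
and values are made up.

```python
def split_clusters(rows, separator="NONE"):
    """Group consecutive rows into clusters, cutting at separator rows."""
    clusters, current = [], []
    for row in rows:
        if row and row[0] == separator:
            if current:
                clusters.append(current)
            current = []
        elif row and row[0]:
            current.append(row)
    if current:
        clusters.append(current)
    return clusters

def find_cluster(clusters, gene):
    """Return the index of the first cluster mentioning gene, else None."""
    for index, cluster in enumerate(clusters):
        for row in cluster:
            if any(gene in cell.split() for cell in row[:2]):
                return index
    return None

# Hypothetical rows: ID, name, then made-up expression values.
rows = [
    ["ID_1", "CLN1", "0.4", "-0.2"],
    ["ID_2", "XYZ1", "0.5", "-0.1"],
    ["NONE", "", "", ""],
    ["ID_3", "ABC2", "-0.3", "0.6"],
]
clusters = split_clusters(rows)
index = find_cluster(clusters, "CLN1")
print(len(clusters), "clusters; CLN1 is in cluster", index,
      "with", len(clusters[index]), "genes")
```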
Super-Paramagnetic Clustering (SPC)
- Load the SPC results in TreeView (C:\clustering\exercise\alpha_SPC.cdt;
the files are also available as alpha_spc.cdt and alpha_spc.gtr).
- Find the cluster that contains CLN1.
- Question: Is it easier to decide which genes belong to the cluster?
- Save the data for this cluster in C:\clustering\exercise\spc_cln1.txt.
- Load the file in Excel.
- Plot the normalized gene expression of the genes in the cluster.
- Question: Compare your results with those of K-means and Average
Linkage.
- Question: Draw a Venn diagram that counts the number of common genes in
all methods.
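The counts for the Venn diagram follow directly from set operations on the
three gene lists. A short sketch, with placeholder gene names standing in
for the CLN1 clusters you saved from each method:

```python
def venn_counts(a, b, c):
    """Sizes of the seven regions of a three-set Venn diagram."""
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }

# Hypothetical gene lists; replace them with the genes from your clusters.
average_linkage_genes = {"g1", "g2", "g3", "g4"}
k_means_genes = {"g2", "g3", "g5"}
spc_genes = {"g3", "g4", "g6"}

counts = venn_counts(average_linkage_genes, k_means_genes, spc_genes)
for region, size in counts.items():
    print(region, size)
```

Here a, b and c correspond to Average Linkage, K-means and SPC. The seven
counts always add up to the number of distinct genes in the three lists,
which is a quick check on the diagram you draw.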