Pathifier GUI
Information about the algorithm can be found
here: http://www.weizmann.ac.il/complex/compphys/software/yotam/pathifier/
Change R Folder: this button, located in the top-left corner of
the GUI layout, should be used only when the R folder was not configured
correctly at the initial Pathifier launch.
In this case, you should indicate whether the R installation
on your computer is on your search path, i.e. whether R can be launched
from any folder on your computer.
If you answer No, you will have to supply its location in
the next step. This setting is saved for subsequent
sessions.
Input:
Working path: The folder where the
output files are written.
Data File: This file can be in either txt or mat (MATLAB) format.
- txt: tab-delimited text file.
The first row contains the names of the samples.
The second row indicates whether each sample is normal (1) or not (0).
The first column contains the names of the genes.
The corresponding matrix starts from the third row and second column and
includes the expression value for each gene and sample.
- mat: a MATLAB file
containing the following 4 variables:
samples - a cell vector containing
the names of the samples.
normals - a logical vector
indicating whether each sample is normal (1) or not (0).
genes - a cell vector containing the names of the
genes.
data - a matrix of doubles (one row per gene, one column per sample)
containing the expression value of each gene in each sample.
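The txt layout can be illustrated with a short Python sketch. This is not part of Pathifier (which runs in MATLAB and R); the function name parse_pathifier_txt is hypothetical, and the sketch assumes that the first cell of rows 1 and 2 is a placeholder and that every expression value is numeric. It splits the file into the same four variables the mat format uses.

```python
def parse_pathifier_txt(text):
    """Split a tab-delimited data file into samples, normals, genes, data.

    Hypothetical helper for illustration only. Assumes the first cell of
    rows 1 and 2 is a placeholder and that all expression values are numeric.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    samples = rows[0][1:]                                     # row 1: sample names
    normals = [cell.strip() == "1" for cell in rows[1][1:]]   # row 2: 1 = normal
    genes = [row[0] for row in rows[2:]]                      # column 1: gene names
    data = [[float(v) for v in row[1:]] for row in rows[2:]]  # genes x samples
    if len(normals) != len(samples) or any(len(r) != len(samples) for r in data):
        raise ValueError("every row must have one value per sample")
    return samples, normals, genes, data


example = (
    "NAME\tS1\tS2\tS3\n"
    "NORMALS\t1\t0\t0\n"
    "GENE_A\t5.1\t7.3\t6.8\n"
    "GENE_B\t4.0\t4.2\t9.5\n"
)
samples, normals, genes, data = parse_pathifier_txt(example)
```

Here data has one row per gene and one column per sample, matching the orientation described for the mat variable of the same name.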
Pathway data: Choose one of the following:
Pre-compiled sets: the available databases are KEGG, BioCarta and NCI.
Custom file: a gmt file
specifying the pathways. Each row describes one pathway. The first column
holds the pathway name. The second column must exist but is not used by Pathifier.
For each pathway (row), the corresponding genes appear in columns 3,
4, 5, etc. Note: every row must have an identical
number of tabs, so shorter rows should be padded with tabs. One way to do
this is to save your file from Microsoft Excel as a tab-delimited file.
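If Excel is not at hand, the padding can also be done with a few lines of code. The Python sketch below is only an illustration and is not shipped with Pathifier; the function names pad_gmt_rows and read_gmt are hypothetical. It pads every row to the same number of tabs and shows how the columns are interpreted.

```python
def pad_gmt_rows(text):
    """Pad every row of a gmt-style file with trailing tabs to equal width."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    return "\n".join("\t".join(row) for row in padded) + "\n"


def read_gmt(text):
    """Return {pathway name: [genes]}; column 2 is skipped, blank cells dropped."""
    pathways = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        pathways[fields[0]] = [g for g in fields[2:] if g.strip()]
    return pathways


raw = "PATH_1\tna\tTP53\tMDM2\tCDKN1A\nPATH_2\tna\tKRAS\n"
fixed = pad_gmt_rows(raw)
pathways = read_gmt(fixed)
```

After padding, both rows of the example contain four tabs, and the empty trailing cells are ignored when the gene lists are read back.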
Attempts: Number of runs used to determine stability (default is 100).
Maximize stability: If checked (default), components that lead to low
stability under sampling noise are discarded.
Minimal expression: The minimal expression value considered a real signal.
Any value below it is set to min_exp (default is 4).
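As a minimal sketch of this thresholding, assuming it is a simple floor applied to every value (the function name apply_min_exp is hypothetical and the code is Python for illustration, not Pathifier's own implementation):

```python
def apply_min_exp(data, min_exp=4.0):
    """Raise every expression value below min_exp up to min_exp."""
    return [[max(value, min_exp) for value in row] for row in data]


floored = apply_min_exp([[2.5, 6.0], [4.0, 3.9]])
```

Values at or above min_exp are left unchanged; only the low, noise-dominated values are raised to the floor.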
Number of CPUs to use (available only on Unix): Enables parallel
execution in order to accelerate the analysis (default is 1, i.e. serial
execution).
Output files:
The following output files are written to the working path, as specified in the
GUI:
input_for_R.mat - the input variables for the R code.
PDS.RData - the output PDS variable in R format.
results.mat - output in MATLAB format.
scores.txt - a text file specifying the deregulation scores for each pathway
and sample.
logfile.txt - a text log file.
pathways.txt - a text file listing the pathway names in the first column and
the serial names used for them in the R code in the second column.
Output plot:
Load previous analysis: enables the user to load a previous analysis
from the working path.
Load labels (optional): enables
the user to load various sample labels (e.g. tumor stage, smoking, gender,
etc.) in a tab-delimited text file.
The output plot uses these labels to color the samples.
The first row contains the names of the samples.
The first column contains the names of the labels.
Each row (from the second) contains the corresponding label symbol for each
sample.
Plot: this button opens a figure with the principal curve learned for
the chosen pathway. The principal curve is projected onto the three leading
principal components.
Questions & Answers
Q1: How did you set min_std and min_exp?
A: The minimal standard deviation (min_std) was set as the first quartile of the per-gene standard
deviation across the normal samples. MATLAB code: x=data(:,normals); min_std=quantile(std(x'),.25);
The min_exp
was set based on a method for estimating noise levels that can be found at http://www.ncbi.nlm.nih.gov/pubmed/20663218.
After the noise is estimated, we set min_exp to the minimal expression at which the noise
levels are acceptable. In this case we defined acceptable as the 90th percentile
of the standard deviation of the data. To make sure the minimal expression is
not unreasonably low, we do not let it fall below known noise levels (for Affymetrix arrays we used a minimal level of 4 in log2
space). To estimate min_exp easily, you
can draw a scatter plot of a pair of normals and find
the noise level.
Pathifier
is a robust algorithm, and any reasonable choice of min_exp
and min_std should work reasonably well.
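For readers without MATLAB, the min_std line above can be approximated in Python. This is a sketch, not the authors' code: the function name estimate_min_std is hypothetical, and Python's statistics.quantiles interpolates differently from MATLAB's quantile, so the result may differ slightly from the MATLAB value on the same data.

```python
import statistics


def estimate_min_std(data, normals):
    """First quartile of the per-gene standard deviation across normal samples.

    data is a list of rows (one per gene); normals is a list of booleans
    (one per sample). Illustrative only: the quartile interpolation is
    Python's "inclusive" method, not MATLAB's.
    """
    per_gene_std = []
    for row in data:
        normal_values = [v for v, is_normal in zip(row, normals) if is_normal]
        # stdev uses the n-1 denominator, as MATLAB's std does by default.
        per_gene_std.append(statistics.stdev(normal_values))
    # quantiles(n=4) returns the three quartile cut points; [0] is the first.
    return statistics.quantiles(per_gene_std, n=4, method="inclusive")[0]


data = [
    [1.0, 2.0, 3.0, 10.0],
    [2.0, 2.0, 2.0, 5.0],
    [0.0, 4.0, 8.0, 1.0],
    [1.0, 1.0, 2.0, 7.0],
]
normals = [True, True, True, False]
min_std = estimate_min_std(data, normals)
```

The last sample is excluded from every standard deviation because it is not marked as normal.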
Q2: I have trouble converting the output
of quantify_pathways_deregulation into a figure like
Fig. 4A in the paper. Could you help me with it? What
software should I use for visualization?
A1: Pathifier, and quantify_pathways_deregulation in particular,
is only meant to infer pathway scores. Once the scores are computed you can use
almost any software for gene-level analysis and visualization (such as
SPIN: http://www.ncbi.nlm.nih.gov/pubmed/15722375),
only this time, instead of feeding it gene expression, use the pathway
scores computed by quantify_pathways_deregulation.
A2: Figure 4A in the paper
was generated with MATLAB. You can use, for example,
the MATLAB function "clustergram" http://www.mathworks.com/help/bioinfo/ref/clustergram.html .
Q3: When I look at two different pathways, the rot
and z matrices for one of them start with PC3, PC5, PC6, etc., but for the other one they start with PC1, PC2,
skip PC3 and PC4, and resume at PC5.
A: As specified in the paper (under "Finding a
stable gene set"), we omit some of the PCs that we find noisy, based on
sampling noise.
Q4: The following error was thrown while running
Pathifier's "quantify_pathways_deregulation"
function: Error smooth.spline(lambda, xj, ..., df = df, keep.data
= FALSE). What would cause this error? I did not use the default minexp and minstd - would these
influence this error?
A: This error is thrown by the smooth.spline
function that the princurve package calls. If
you have fewer than 50 samples or fewer than 3 normals, this
may cause smoothing issues that produce such errors. You might also want to
omit pathways that are too small (say, fewer than 5 genes) or too big (say, with
more genes than the number of samples you have).
You should set min_exp to the point where the noise drops (in the low
value regime). The estimation is easier when you have replicates. For example, with Affymetrix chips
we usually work with min_exp=4 (in log2 space).
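The pathway size rule of thumb above can be applied before running the analysis. The Python sketch below is an illustration only; the function name filter_pathways_by_size is hypothetical, and the thresholds are the "say" values from the answer, not fixed requirements of Pathifier.

```python
def filter_pathways_by_size(pathways, n_samples, min_genes=5):
    """Keep pathways with at least min_genes genes and no more genes than samples."""
    return {
        name: genes
        for name, genes in pathways.items()
        if min_genes <= len(genes) <= n_samples
    }


pathways = {
    "TINY": ["A", "B"],
    "OK": ["A", "B", "C", "D", "E"],
    "HUGE": [f"G{i}" for i in range(40)],
}
kept = filter_pathways_by_size(pathways, n_samples=30)
```

With 30 samples, only the five-gene pathway survives: the two-gene pathway is too small and the 40-gene pathway has more genes than samples.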
Q5: What is the meaning of
the "sig" values that are printed while the "quantify_pathways_deregulation"
function is running?
A: It is a measure of score stability under sampling noise. Pathifier simply repeats the
algorithm several times on random subsets of the samples, and then calculates
the standard deviation of the scores each sample got. "sig" (short
for sigma) is just the average standard deviation. This is not officially an
output of the algorithm, and in regular use it is not supposed to be taken
into consideration. It was mainly used for debugging purposes, but it has been
left in the log because it might be useful to someone who wants to look deeply into one pathway,
especially one where a few metagenes were removed to
reduce noise.
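The idea behind "sig" can be sketched in Python as follows. This is not Pathifier's implementation: the function name estimate_sig, the score_fn callback, the subset fraction of 0.8 and the use of the population standard deviation are all assumptions made for illustration. Only the overall scheme (repeat on random subsets, take each sample's standard deviation, then average) comes from the answer above.

```python
import random
import statistics


def estimate_sig(n_samples, score_fn, attempts=100, fraction=0.8, seed=0):
    """Average per-sample standard deviation of scores over random subsets.

    score_fn(subset) is a hypothetical stand-in for the pathway scoring and
    must return one score per sample index in subset. fraction is an
    assumed subset size, not a documented Pathifier setting.
    """
    rng = random.Random(seed)
    subset_size = min(n_samples, max(2, round(fraction * n_samples)))
    scores = {i: [] for i in range(n_samples)}
    for _ in range(attempts):
        subset = rng.sample(range(n_samples), subset_size)
        for index, score in zip(subset, score_fn(subset)):
            scores[index].append(score)
    per_sample_std = [
        statistics.pstdev(values) for values in scores.values() if len(values) > 1
    ]
    return statistics.mean(per_sample_std)


values = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6]

# A score that ignores which other samples are present is perfectly stable.
sig_stable = estimate_sig(len(values), lambda s: [values[i] for i in s])

# A score rescaled by the subset maximum changes with the subset.
sig_unstable = estimate_sig(
    len(values), lambda s: [values[i] / max(values[j] for j in s) for i in s]
)
```

A score that does not depend on which other samples were drawn gives a sig of zero, while a score that shifts with the subset gives a positive sig.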
Q6: Does Pathifier
perform normalization?
A: Normalization is already done within Pathifier. It is based on the standard
deviation across the normal samples only. If you have very few normal samples
(say, 3 or fewer) you might need to increase their number, or define more
samples as 'normals'. However, note that Pathifier does not perform gene filtering; this should be
done by the user (e.g. by gene variance).
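One possible way to filter genes by variance before running the analysis is sketched below in Python. The function name filter_genes_by_variance and the choice to keep a fixed number of top-variance genes are assumptions for illustration; the source only suggests variance as an example criterion.

```python
import statistics


def filter_genes_by_variance(genes, data, keep):
    """Keep the `keep` genes with the largest variance across all samples."""
    ranked = sorted(
        range(len(genes)),
        key=lambda i: statistics.variance(data[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # restore the original gene order
    return [genes[i] for i in kept], [data[i] for i in kept]


genes = ["A", "B", "C"]
data = [[1.0, 1.0, 1.0], [1.0, 5.0, 9.0], [2.0, 3.0, 4.0]]
kept_genes, kept_data = filter_genes_by_variance(genes, data, keep=2)
```

In this toy example the constant gene A is dropped, and B and C are kept in their original order.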
Q7: Are the colorectal cancer
data used in this study available?
A: The expression data are published in the GEO
repository, accession number GSE41258.
Q8: Your paper says that a PDS can be
negative, which means that a pathway is deregulated in a different direction
than one with a positive PDS. However, the PDS output from Pathifier seems to be normalized between 0 and 1. So how can
one know which pathways have a negative PDS, that is, are deregulated in a
different direction?
A: In the recent version of Pathifier,
the PDS is normalized between 0 and 1. In this version, the
PDS of the normal samples can fall anywhere between 0 and 1. In general, the closer to 1,
the more deregulated the pathway is. However, if the normal samples have a PDS
larger than 0, then values close to 0 are also deregulated, but in the
opposite direction. For example, if the PDS of the normal samples is about 0
(the beginning of the curve), then the higher the PDS, the more deregulated
the pathway (in this case PDS = 1 is the maximal deregulation, at the end of the
curve). If instead the mean PDS of the normal samples lies between 0
and 1 (for example 0.5, that is, the normal
samples are located at about the middle of the curve), then PDS values close to
either 0 or 1 are deregulated, but in different directions (0 and 1 stand for the two
opposite ends of the curve).
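One way to read the direction off the normalized scores is to compare each PDS with the mean PDS of the normal samples. The Python sketch below is an interpretation aid under that assumption, not an output of Pathifier; the function name deregulation_relative_to_normals is hypothetical.

```python
import statistics


def deregulation_relative_to_normals(pds, normals):
    """Signed offset of each sample's PDS from the mean PDS of the normals.

    Negative offsets lie toward the start of the curve (PDS near 0) and
    positive offsets toward its end (PDS near 1).
    """
    normal_mean = statistics.mean([p for p, n in zip(pds, normals) if n])
    return [p - normal_mean for p in pds]


pds = [0.45, 0.55, 0.05, 0.95]
normals = [True, True, False, False]
offsets = deregulation_relative_to_normals(pds, normals)
```

In this example the normals sit around 0.5, so the third and fourth samples are about equally deregulated but in opposite directions.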
Q9: What are the main benefits of Pathifier?
A: 1) It provides a pathway deregulation score per
sample, so in the common case where samples are tumors, you can know which
pathways are deregulated in each tumor, and not just which pathways change overall.
2) It provides a non-linear, context-specific dysregulation score - that is, it
learns from the data how pathways are dysregulated in the specific dataset, and
scores accordingly.