Pathifier GUI
Information about the algorithm can be found here: http://www.weizmann.ac.il/complex/compphys/software/yotam/pathifier/
Change R Folder: this button, located in the top-left corner of the GUI layout,
should be used only when the R folder was not configured correctly at the
initial Pathifier launch.
In this case, you should indicate whether the R installation on your computer
is included in your search path, i.e. whether R can be launched from any folder
on your computer.
If you answer No, you will have to supply its location, as will be explained in
the next step. This setting is saved for subsequent sessions.
Input:
Working path: The folder where the output files are written.
Data File: This file can be in either txt or mat (Matlab) format.
- txt: tab-delimited text file.
The first row contains the names of the samples.
The second row indicates whether each sample is normal (1) or not (0).
The first column contains the names of the genes.
The corresponding matrix starts from the third row and second column and
includes the expression value for each gene and sample.
- mat: This Matlab file contains the following 4 variables:
samples - a cell vector containing the names of the samples.
normals - a logical vector indicating whether each sample is normal (1) or not (0).
genes - a cell vector containing the names of the genes.
data - a matrix of doubles (one row per gene, one column per sample) containing
the expression value of each gene in each sample.
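The txt layout described above can be illustrated with a short Python sketch. It is not part of Pathifier; the function name read_expression_table and the labels in the first column of rows 1 and 2 ("NAME", "NORMALS") are placeholders chosen for this example, since the text does not say what those two cells should contain.

```python
import csv
import io

def read_expression_table(handle):
    """Parse a tab-delimited table laid out as described for the txt format."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    samples = rows[0][1:]                             # row 1: sample names
    normals = [flag == "1" for flag in rows[1][1:]]   # row 2: 1 = normal, 0 = not
    genes = [row[0] for row in rows[2:]]              # column 1: gene names
    # The matrix starts at the third row and second column.
    data = [[float(value) for value in row[1:]] for row in rows[2:]]
    return samples, normals, genes, data

# Made-up example: 2 genes x 3 samples (first-column labels are placeholders).
example = (
    "NAME\tS1\tS2\tS3\n"
    "NORMALS\t1\t0\t0\n"
    "GENE_A\t5.1\t7.3\t6.8\n"
    "GENE_B\t4.0\t4.2\t9.5\n"
)
samples, normals, genes, data = read_expression_table(io.StringIO(example))
```

The four returned values mirror the four variables expected in the mat format (samples, normals, genes, data).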
Pathway data: You can choose between:
Pre-compiled sets: the available databases are KEGG, BioCarta and NCI.
Custom file: a gmt file specifying the pathways. The first column holds the
pathway names. The second column must exist but is not used by Pathifier.
For each pathway (row), the corresponding genes appear in columns 3, 4, 5, etc.
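A minimal Python sketch of reading such a custom file, assuming the usual gmt layout in which each line is one pathway and its tab-separated fields are the pathway name, an unused second field, and then the genes. The function name parse_gmt and the pathway and gene names are made up for illustration.

```python
def parse_gmt(text):
    """Map each pathway name to its gene list, skipping the unused second field."""
    pathways = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        name = fields[0]
        genes = [gene for gene in fields[2:] if gene]  # genes start at field 3
        pathways[name] = genes
    return pathways

# Made-up example with two pathways; "na" fills the unused second field.
example_gmt = (
    "PATHWAY_ONE\tna\tGENE_A\tGENE_B\tGENE_C\n"
    "PATHWAY_TWO\tna\tGENE_B\tGENE_D\n"
)
pathways = parse_gmt(example_gmt)
```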
Attempts: Number of runs to determine stability (default is 100).
Maximize stability: If checked (default), discard components that lead to low
stability under sampling noise.
Minimal expression: The minimal expression considered to be a real signal.
Any value below it is set to min_exp (default is 4).
Number of CPUs to use (available only on Unix): Enables parallel
running in order to accelerate the analysis (Default is 1, i.e. serial
running).
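The effect of the Minimal expression option above can be sketched in Python as a simple floor on the expression matrix. This only illustrates the described thresholding and is not Pathifier's own code; the function name apply_min_exp is hypothetical.

```python
def apply_min_exp(data, min_exp=4.0):
    """Raise every expression value below min_exp up to min_exp."""
    return [[max(value, min_exp) for value in row] for row in data]

# Made-up log2 expression values for two genes in three samples.
thresholded = apply_min_exp([[2.5, 4.0, 7.1], [3.9, 9.2, 1.0]])
```

Values at or above min_exp are left unchanged; only the low, presumably noise-dominated values are affected.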
Output files:
The following output files are written to the working path, as specified in the
GUI:
input_for_R.mat - the input variables for the R code.
PDS.RData - the output PDS variable in R format.
results.mat - output in MATLAB format.
scores.txt - a text file specifying the deregulation scores for each pathway
and sample.
logfile.txt - a text log file.
pathways.txt - a text file listing the pathway names in the first column and
the corresponding serial names used in the R code in the second column.
Output plot:
Load previous analysis: enables the user to load a previous analysis from the
working path.
Load labels (optional): enables the user to load various sample labels (e.g.
tumor stage, smoking, gender, etc.) from a tab-delimited text file.
These labels are used to color the samples in the output plot.
The first row contains the names of the samples.
The first column contains the names of the labels.
Each row (from the second) contains the corresponding label symbol for each
sample.
Plot: this button opens a figure with the principal curve learned for
the chosen pathway. The principal curve is projected
onto the three leading principal components.
Questions & Answers
Q1: How did you set min_std and min_exp?
A: The minimal standard deviation (min_std) was set to the first quartile of the
standard deviation in the data. The min_exp was set based on a method for
estimating noise levels that can be found at
http://www.ncbi.nlm.nih.gov/pubmed/20663218.
After the noise is estimated, we set min_exp to the minimal expression at which
the noise levels are acceptable. In this case we defined acceptable as the 90%
quantile of the standard deviation of the data. To make sure the minimal
expression is not unreasonably low, we do not let it fall below known noise
levels (for Affymetrix arrays we used a minimal level of 4 in log2 space).
Pathifier is a robust algorithm, and any reasonable choice of min_exp and
min_std should work well.
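One way to read the min_std rule above is sketched here in Python. It assumes the standard deviation is taken per gene across all samples, and it uses the sample standard deviation with the "inclusive" quartile definition from the standard library. Those choices, and the name estimate_min_std, are assumptions made for this example and may differ from what the authors used.

```python
import statistics

def estimate_min_std(data):
    """First quartile of the per-gene standard deviations (rows are genes)."""
    gene_stds = [statistics.stdev(row) for row in data]
    return statistics.quantiles(gene_stds, n=4, method="inclusive")[0]

# Made-up expression matrix: 5 genes x 3 samples.
example_data = [
    [1.0, 2.0, 3.0],
    [0.0, 2.0, 4.0],
    [0.0, 3.0, 6.0],
    [0.0, 4.0, 8.0],
    [0.0, 5.0, 10.0],
]
min_std = estimate_min_std(example_data)
```

The noise-based estimate of min_exp depends on the method in the cited reference and is not reproduced here.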
Q2: I have trouble converting the output of quantify_pathways_deregulation into
a figure like Fig. 4A in the paper. Could you help me with it? What software
should I use for visualization?
A1: Pathifier, and quantify_pathways_deregulation in particular, is only meant
to infer pathway scores. Once the scores are computed, you can use almost any
software for gene-level analysis and visualization (such as SPIN:
http://www.ncbi.nlm.nih.gov/pubmed/15722375), only this time, instead of
feeding it gene expression, use the pathway scores computed by
quantify_pathways_deregulation.
A2: Figure 4A in the paper was generated with Matlab. You can use, for example,
the Matlab function "clustergram":
http://www.mathworks.com/help/bioinfo/ref/clustergram.html
Q3: When I look at two different pathways, the rot and z matrices for one of
them start with PC3, PC5, PC6, etc., but for the other one they start with PC1
and PC2, skip PC3 and PC4, and resume at PC5.
A: As specified in the paper (under "Finding a stable gene set"), we omit the
PCs that we find noisy, based on sampling noise.
Q4: The following error was thrown while running Pathifier's
"quantify_pathways_deregulation" function:
Error smooth.spline(lambda, xj, ..., df = df, keep.data = FALSE).
What would cause this error? I did not use the default minexp and minstd; would
these influence this error?
A: This error is thrown by the smooth.spline function, which is called by the
princurve package. If you have fewer than 50 samples or fewer than 3 normals,
this may cause smoothing issues that lead to such errors. You might also want to
omit pathways that are too small (say, fewer than 5 genes) or too big (say, with
more genes than you have samples).
You should set min_exp to the point where the noise drops (in the low-value
regime). The estimation is easier when you have replicates. For example, with
Affymetrix chips we usually work with min_exp=4 (in log2 space).
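The pathway-size advice in the answer above could be applied before running the analysis, for instance with a small Python filter like the sketch below. The function name filter_pathways and the example data are hypothetical. The limits follow the suggested rule of thumb (at least 5 genes, and no more genes than samples) and are not hard requirements of Pathifier.

```python
def filter_pathways(pathways, n_samples, min_genes=5):
    """Keep pathways with at least min_genes genes and at most n_samples genes."""
    return {
        name: genes
        for name, genes in pathways.items()
        if min_genes <= len(genes) <= n_samples
    }

# Made-up pathways, checked against a dataset with 8 samples.
candidate_pathways = {
    "TOO_SMALL": ["G1", "G2", "G3"],
    "JUST_RIGHT": ["G1", "G2", "G3", "G4", "G5", "G6"],
    "TOO_BIG": ["G%d" % i for i in range(1, 21)],
}
kept = filter_pathways(candidate_pathways, n_samples=8)
```

In practice you may prefer to count only the genes that actually appear in your expression data when measuring pathway size.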
Q5: What is the meaning of the "sig" values that are printed while the
"quantify_pathways_deregulation" function is running?
A: It is a measure of score stability under sampling noise. Pathifier simply
repeats the algorithm several times on random subsets of the samples, and then
calculates the standard deviation of the scores each sample got. "sig" (short
for sigma) is just the average standard deviation. This is not officially an
output of the algorithm, and in regular use it is not supposed to be taken into
consideration. It was mainly used for debugging purposes, but it has been left
in the log because it might be useful to someone who wants to look deeply into
one pathway, especially one where a few metagenes were removed to reduce noise.
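The description of "sig" above amounts to the following arithmetic, sketched in Python. This is an illustration only: the resampling itself is not reproduced, the name sig_from_runs is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator Pathifier uses.

```python
import statistics

def sig_from_runs(scores_by_sample):
    """Average, over samples, of the standard deviation of each sample's scores."""
    per_sample_std = [statistics.pstdev(scores) for scores in scores_by_sample.values()]
    return statistics.mean(per_sample_std)

# Made-up scores: each sample was scored in the repeated runs that included it.
runs = {
    "S1": [0.2, 0.4],
    "S2": [0.5, 0.5],
}
sig = sig_from_runs(runs)
```

A sample whose score barely changes between runs (S2 here) contributes a standard deviation near zero, so a small "sig" indicates scores that are stable under resampling.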
Q6: Does Pathifier perform normalization?
A: Normalization is already done within Pathifier. It is based on the standard
deviation across the normal samples only. If you have very few normal samples
(say, 3 or fewer), you might need to increase their number, or define more
samples as 'normals'. Note, however, that Pathifier does not perform gene
filtering; this should be done by the user (e.g. by gene variance).
Q7: Are the colorectal cancer data used in this study available?
A: The expression data are published in the GEO repository, accession number
GSE41258.
Q8: In your paper, it is said that a PDS can be negative, which means that a
pathway is deregulated in a different direction than one with a positive PDS.
However, the PDS output from Pathifier seems to be normalized between 0 and 1.
So how can one know which pathways have a negative PDS, that is, are deregulated
in a different direction?
A: In the recent version of Pathifier, the PDS are normalized between 0 and 1,
and the PDS of the normal samples can lie anywhere in that range. If the PDS of
the normal samples is about 0 (the beginning of the curve), then the higher the
PDS, the more deregulated the pathway (in this case PDS = 1 is the maximal
deregulation, at the end of the curve). However, if the mean PDS of the normal
samples is between 0 and 1 (for example 0.5, that is, the normal samples are
located at about the middle of the curve), then PDS values close to either 0 or
1 are deregulated, but in different directions (0 and 1 stand for the two
opposite ends of the curve).
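To make the interpretation above concrete, one option is to express each PDS relative to the mean PDS of the normal samples, as in the Python sketch below. This signed offset is not an output of Pathifier. It is a hypothetical convenience for reading the scores, and the function name deviation_from_normals is made up.

```python
import statistics

def deviation_from_normals(pds, normals):
    """Offset of each PDS from the mean PDS of the normal samples."""
    normal_scores = [score for score, is_normal in zip(pds, normals) if is_normal]
    baseline = statistics.mean(normal_scores)
    return [score - baseline for score in pds]

# Made-up scores for one pathway: two normals near the middle of the curve.
pds = [0.5, 0.5, 0.0, 1.0, 0.75]
normals = [True, True, False, False, False]
offsets = deviation_from_normals(pds, normals)
```

Under this reading, the sign of the offset indicates toward which end of the curve a sample lies relative to the normals, and its magnitude indicates how far the sample is from them.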
Q9: What are the main benefits of Pathifier?
A: 1) It provides a pathway deregulation score per sample, so in the common case
where samples are tumors, you can know which pathway is deregulated in each
tumor, and not just which pathways change overall. 2) It provides a non-linear,
context-specific deregulation score: it learns from the data how pathways are
deregulated in the specific dataset, and scores accordingly.