Information about the algorithm can be found here: http://www.weizmann.ac.il/complex/compphys/software/yotam/pathifier/
Change R Folder ... This button, located in the top-left corner of the GUI, should be used only when the R folder was not configured correctly at the initial Pathifier launch.
In this case, you should indicate whether the R installation on your computer is on your search path, i.e. whether R can be launched from any folder on your computer.
If you answer No, you will have to supply its location, as explained in the next step. This setting is saved for subsequent sessions.
Working path: The folder where the output files are written.
Data File: This file can be in either txt or mat (Matlab) format.
- txt: tab-delimited text file.
The first row contains the names of the samples.
The second row indicates whether each sample is normal (1) or not (0).
The first column contains the names of the genes.
The corresponding matrix starts from the third row and second column and includes the expression value for each gene and sample.
- mat: This Matlab file contains the following 4 variables:
samples - cell vector containing the names of the samples.
normals - a logical vector indicating whether each sample is normal (1) or not (0).
genes - cell vector containing the names of the genes.
data - a matrix of doubles (one row per gene, one column per sample) containing the expression value of each gene in each sample.
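The tab-delimited layout described above can be read with a few lines of Python. This is a minimal sketch, not part of Pathifier: the function name read_expression_table and the example values are hypothetical, and it assumes that the first cell of the first two rows is a placeholder that can be ignored.

```python
import csv
import io

def read_expression_table(handle):
    """Parse the tab-delimited data file layout (hypothetical helper)."""
    rows = list(csv.reader(handle, delimiter="\t"))
    samples = rows[0][1:]                             # row 1: sample names
    normals = [flag == "1" for flag in rows[1][1:]]   # row 2: 1 = normal, 0 = not
    genes = [row[0] for row in rows[2:]]              # column 1: gene names
    # Expression matrix: one row per gene, one column per sample.
    data = [[float(v) for v in row[1:]] for row in rows[2:]]
    return samples, normals, genes, data

# Made-up example with 2 genes and 3 samples.
example = (
    "\tS1\tS2\tS3\n"
    "\t1\t0\t0\n"
    "GENE_A\t5.1\t7.3\t6.8\n"
    "GENE_B\t4.0\t4.2\t9.5\n"
)
samples, normals, genes, data = read_expression_table(io.StringIO(example))
```

The four returned values mirror the four variables expected in the mat format (samples, normals, genes, data).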
Pathway data: Choose one of the following:
Pre-compiled sets: the available databases are KEGG, BioCarta and NCI.
Custom file: a gmt file specifying the pathways. The first column contains the pathway names. The second column must exist but is not used by the Pathifier.
For each pathway (row), the corresponding genes appear in columns 3, 4, 5, etc.
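As an illustration of the gmt layout, the following sketch reads such a file into a dictionary. The helper name read_gmt and the pathway and gene names are hypothetical; the code assumes tab-separated fields with one pathway per line, the second field ignored, and the genes in the third field onward.

```python
def read_gmt(lines):
    """Map each pathway name to its list of genes (hypothetical helper)."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete rows
        name = fields[0]
        # fields[1] must exist but is not used; genes start at the third field.
        pathways[name] = [gene for gene in fields[2:] if gene]
    return pathways

# Made-up example with two pathways.
gmt_example = [
    "PATHWAY_X\tna\tGENE_A\tGENE_B\tGENE_C\n",
    "PATHWAY_Y\tna\tGENE_B\tGENE_D\n",
]
pathways = read_gmt(gmt_example)
```

With a real file, the same function can be called on an open file handle, for example read_gmt(open(path)) where path is your own gmt file.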
Attempts: Number of runs to determine stability (default is 100).
Maximize stability: If checked (default), components that reduce the stability of the scores under sampling noise are discarded.
Minimal expression: The minimal expression value considered a real signal. Any value below it is raised to min_exp (default is 4).
Number of CPUs to use (available only on Unix): Enables parallel running to accelerate the analysis (default is 1, i.e. serial running).
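The effect of the Minimal expression setting can be pictured as a simple floor on the expression matrix. The sketch below only illustrates that thresholding step under this reading; apply_min_exp is a hypothetical name and this is not the code Pathifier runs.

```python
def apply_min_exp(data, min_exp=4.0):
    """Raise every value below min_exp to min_exp (illustrative only)."""
    return [[max(value, min_exp) for value in row] for row in data]

# Made-up log2 expression values: 2 genes, 2 samples.
floored = apply_min_exp([[2.5, 6.0], [4.0, 3.9]])
```

Values at or above the threshold are left unchanged; only the low-value regime is flattened.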
The following output files are written to the working path, as specified in the GUI:
input_for_R.mat - the input variables for the R code.
PDS.RData - the output PDS variable in R format.
results.mat - output in MATLAB format.
scores.txt - a text file specifying the deregulation scores for each pathway and sample.
logfile.txt - a text log file.
pathways.txt - a text file listing the pathway names on the first column, translated to serial names used in the R code on the second column.
Load previous analysis: enables the user to load a previous analysis from the working path.
Load labels (optional): enables the user to load various sample labels (e.g. tumor stage, smoking, gender, etc.) in a tab-delimited text file.
These labels are used in the output plot in order to color samples according to these labels.
The first row contains the names of the samples.
The first column contains the names of the labels.
Each row (from the second) contains the corresponding label symbol for each sample.
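A labels file in this layout could be checked before loading it with a short script such as the one below. It is a sketch under the layout just described; read_labels and the label names are hypothetical, and the first cell of the first row is assumed to be a placeholder.

```python
import csv
import io

def read_labels(handle):
    """Return {label name: {sample name: label symbol}} (hypothetical helper)."""
    rows = list(csv.reader(handle, delimiter="\t"))
    samples = rows[0][1:]  # row 1: sample names
    # Each later row: label name in column 1, then one symbol per sample.
    return {row[0]: dict(zip(samples, row[1:])) for row in rows[1:] if row}

# Made-up example with two labels and three samples.
labels_example = (
    "\tS1\tS2\tS3\n"
    "stage\tI\tII\tIII\n"
    "gender\tF\tM\tF\n"
)
labels = read_labels(io.StringIO(labels_example))
```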
Plot: this button opens a figure with the principal curve learned for the chosen pathway. The principal curve is projected onto the three leading principal components.
Questions & Answers
Q1: How did you set min_std and min_exp ?
A: The minimal standard deviation (min_std) was set as the first quartile of the standard deviation in the data. The min_exp was set based on a method to estimate noise levels that can be found at http://www.ncbi.nlm.nih.gov/pubmed/20663218. After the noise is estimated, we set min_exp to the minimal expression at which the noise levels are acceptable. In this case we defined acceptable as the 90% quantile of the standard deviation of the data. To make sure the minimal expression is not unreasonably low, we do not let it fall below known noise levels (for Affymetrix arrays we used a minimal level of 4 in log2 space). Pathifier is a robust algorithm, and any reasonable choice of min_exp and min_std should work well.
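One possible reading of the min_std rule is sketched below: compute the standard deviation of each gene across samples and take the first quartile of those values. The function name, the per-gene interpretation, and the choice of quantile method are assumptions, not the authors' code; the noise-based estimate of min_exp is not reproduced here.

```python
import statistics

def first_quartile_std(data):
    """First quartile of the per-gene standard deviations (assumed reading)."""
    stds = [statistics.stdev(row) for row in data]  # one value per gene
    # method="inclusive" interpolates within the observed range.
    return statistics.quantiles(stds, n=4, method="inclusive")[0]

# Made-up matrix: 4 genes x 3 samples, with standard deviations 0, 2, 4 and 6.
toy_data = [
    [1, 1, 1],
    [0, 2, 4],
    [0, 4, 8],
    [0, 6, 12],
]
min_std = first_quartile_std(toy_data)  # 1.5 for this toy matrix
```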
Q2: I have trouble converting the output of quantify_pathways_deregulation into a figure like Fig. 4A in the paper. Could you help me with it? What software can be used for the visualization?
A1: Pathifier, and quantify_pathways_deregulation in particular, is only meant to infer pathway scores. Once the scores are computed, you can use almost any software for gene-level analysis and visualization; only this time, instead of feeding it gene expression, use the pathway scores computed by quantify_pathways_deregulation.
A2: Figure 4A in the paper was generated with Matlab. You can use, for example, the Matlab function "clustergram" http://www.mathworks.com/help/bioinfo/ref/clustergram.html .
Q3: When I look at two different pathways, the rot and z matrices for one of them start with PC3, PC5, PC6, etc., but for the other one they start with PC1 and PC2, skip PC3 and PC4, and pick back up with PC5.
A: As specified in the paper (under "Finding a stable gene set"), we omit some of the PCs that we find noisy, based on sampling noise.
Q4: The following error was thrown while running Pathifier's "quantify_pathways_deregulation" function: Error smooth.spline(lambda, xj, ..., df = df, keep.data = FALSE). What would cause this error? I did not use the default minexp and minstd; would these influence this error?
A: This error is thrown by the smooth.spline function that the princurve package calls. If you have fewer than 50 samples or fewer than 3 normals, this may cause smoothing issues that throw such errors. You might also want to omit pathways that are too small (say, fewer than 5 genes) or too big (say, with more genes than you have samples).
You should set min_exp to the point where the noise drops (in the low-value regime). The estimation is easier when you have replicates. For example, with Affymetrix chips we usually work with min_exp=4 (in log2 space).
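The pathway size filter suggested in this answer can be applied before running the analysis. The sketch below assumes the pathways are held as a dictionary from pathway name to gene list; the function name and the example data are hypothetical, and the threshold of 5 genes is the rough value mentioned above rather than a fixed rule.

```python
def filter_pathways_by_size(pathways, n_samples, min_genes=5):
    """Keep pathways with at least min_genes genes and at most n_samples genes."""
    return {
        name: genes
        for name, genes in pathways.items()
        if min_genes <= len(genes) <= n_samples
    }

# Made-up pathways of 3, 6 and 40 genes, for a dataset of 30 samples.
toy_pathways = {
    "TOO_SMALL": ["G%d" % i for i in range(3)],
    "OK": ["G%d" % i for i in range(6)],
    "TOO_BIG": ["G%d" % i for i in range(40)],
}
kept = filter_pathways_by_size(toy_pathways, n_samples=30)
```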
Q5: What is the meaning of the "sig" values that are printed while the "quantify_pathways_deregulation" function is running?
A: It is a measure of score stability under sampling noise. Pathifier simply repeats the algorithm several times on random subsets of the samples, and then calculates the standard deviation of the scores each sample got. "sig" (short for sigma) is just the average standard deviation. This is not officially an output of the algorithm, and in regular use it is not supposed to be taken into consideration. It was mainly used for debugging purposes, but it has been left in the log because it might be useful if someone wants to look deeply into one pathway, especially one where a few metagenes were removed to reduce noise.
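The idea behind "sig" can be sketched as follows. This is not Pathifier's implementation: estimate_sig, the subset fraction of 0.8, and the toy scoring function are all assumptions made for illustration. Only the overall recipe (score random subsets, take the standard deviation of each sample's scores, then average) follows the description above.

```python
import random
import statistics

def estimate_sig(sample_ids, score_subset, attempts=100, fraction=0.8, seed=0):
    """Average, over samples, of the standard deviation of each sample's score."""
    rng = random.Random(seed)
    collected = {s: [] for s in sample_ids}
    k = max(2, int(len(sample_ids) * fraction))  # subset size (assumed fraction)
    for _ in range(attempts):
        subset = rng.sample(sample_ids, k)
        for s, score in score_subset(subset).items():
            collected[s].append(score)
    # Standard deviation per sample, over the runs in which it was scored.
    spreads = [statistics.pstdev(v) for v in collected.values() if len(v) > 1]
    return statistics.mean(spreads)

# Stand-in scorer, NOT Pathifier: a sample's score is its value minus the subset mean.
values = {"S1": 1.0, "S2": 2.0, "S3": 4.0, "S4": 8.0, "S5": 16.0}

def toy_scores(subset):
    centre = statistics.mean(values[s] for s in subset)
    return {s: values[s] - centre for s in subset}

sig = estimate_sig(list(values), toy_scores)
```

A scorer whose output does not depend on which samples are in the subset gives a sig of 0; the more the scores shift between subsets, the larger sig becomes.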
Q6: Does the Pathifier perform normalization?
A: Normalization is already done within Pathifier. It is based on the standard deviation across the normal samples only. If you have very few normal samples (say, 3 or fewer) you might need to increase their number, or define more samples as 'normals'. However, note that Pathifier does not perform gene filtering; this should be done by the user (e.g. by gene variance).
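Gene filtering by variance, mentioned above as the user's responsibility, might look like the following sketch. The function name and the number of genes to keep are hypothetical choices; a suitable cutoff depends on your dataset.

```python
import statistics

def top_variance_genes(genes, data, keep):
    """Keep the `keep` genes with the largest variance across samples."""
    ranked = sorted(
        zip(genes, data),
        key=lambda pair: statistics.pvariance(pair[1]),
        reverse=True,
    )
    kept = ranked[:keep]
    return [gene for gene, _ in kept], [row for _, row in kept]

# Made-up matrix: 3 genes x 3 samples.
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_matrix = [
    [5.0, 5.1, 4.9],   # nearly flat
    [4.0, 8.0, 12.0],  # most variable
    [6.0, 7.0, 8.0],
]
kept_genes, kept_rows = top_variance_genes(toy_genes, toy_matrix, keep=2)
```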
Q7: Are the colorectal cancer data used in this study available?
A: The expression data is published at the GEO repository, accession number GSE41258.
Q8: In your paper, it is said that a PDS can be negative, which means that a pathway is deregulated in a different direction than a PDS with a positive value. However, the PDS output from the Pathifier seems to be normalized between 0 and 1. So how can one know which pathways have negative PDS, that is, are deregulated in a different direction?
A: In the recent version of Pathifier, the PDS is normalized between 0 and 1, and the PDS of the normal samples can fall anywhere in that range. If the PDS of the normal samples is about 0 (the beginning of the curve), then the higher the PDS, the more deregulated the pathway; in this case PDS = 1 is the maximal deregulation, at the end of the curve. However, if the mean PDS of the normal samples lies between 0 and 1 (for example 0.5, meaning the normal samples are located at about the middle of the curve), then PDS values close to 0 and values close to 1 are both deregulated, but in different directions (0 and 1 stand for the two opposite ends of the curve).
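One way to make the direction explicit is to express each sample's PDS relative to the mean PDS of the normal samples. This is a post-processing sketch and not something Pathifier outputs; the function name and the example scores are hypothetical.

```python
import statistics

def deviation_from_normals(pds, normals):
    """Signed distance of each sample's PDS from the mean PDS of the normals."""
    baseline = statistics.mean(
        [score for score, is_normal in zip(pds, normals) if is_normal]
    )
    return [score - baseline for score in pds]

# Made-up scores for one pathway: two normals in the middle of the curve,
# and two tumors at opposite ends.
toy_pds = [0.5, 0.5, 0.0, 1.0]
toy_normals = [True, True, False, False]
signed = deviation_from_normals(toy_pds, toy_normals)  # [0.0, 0.0, -0.5, 0.5]
```

The sign then separates the two directions along the curve, and the absolute value indicates how far a sample is from the normals.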
Q9: What are the main benefits of Pathifier?
A: 1) It provides a pathway deregulation score per sample, so in the common case where samples are tumors, you can know which pathways are deregulated in each tumor, and not just which pathways change overall. 2) It provides a non-linear, context-specific deregulation score; that is, it learns from the data how pathways are deregulated in the specific dataset, and scores accordingly.