# Publications

### 2018

*SIAM Journal on Matrix Analysis and Applications.*39:(3)1071-1094.

*Applied and Computational Harmonic Analysis.*45:463-485.

*Proceedings of the 35th International Conference on Machine Learning.*80:2196-2205.

*International Conference on Learning Representations.*

### 2017

*Statistics & Probability Letters.*123:177-182. Abstract

The boundary crossing probability of a Poisson process with n jumps is a fundamental quantity with numerous applications. We present a fast O(n(2) log n) algorithm to calculate this probability for arbitrary upper and lower boundaries. (C) 2016 Elsevier B.V. All rights reserved.

For code see here: github.com/mosco/crossing-probability

*Biometrika.*104:(1)181-193. Abstract

Roy's largest root is a common test statistic in multivariate analysis, statistical signal processing and allied fields. Despite its ubiquity, provision of accurate and tractable approximations to its distribution under the alternative has been a longstanding open problem. Assuming Gaussian observations and a rank-one alternative, or concentrated noncentrality, we derive simple yet accurate approximations for the most common low-dimensional settings. These include signal detection in noise, multiple response regression, multivariate analysis of variance and canonical correlation analysis. A small-noise perturbation approach, perhaps underused in statistics, leads to simple combinations of standard univariate distributions, such as central and noncentral chi(2) and F. Our results allow approximate power and sample size calculations for Roy's test for rank-one effects, which is precisely where it is most powerful.

*Artificial Intelligence and Statistics.*20.

*Journal of Mathematical Imaging and Vision.*59:(3)373-393. Abstract

Detecting edges in noisy images is a fundamental task in image processing. Motivated, in part, by various real-time applications that involve large and noisy images, in this paper we consider the problem of detecting long curved edges under extreme computational constraints, that allow processing of only a fraction of all image pixels. We present a sublinear algorithm for this task, which runs in two stages: (1) a multiscale scheme to detect curved edges inside a few image strips; and (2) a tracking procedure to estimate their extent beyond these strips. We theoretically analyze the runtime and detection performance of our algorithm and empirically illustrate its competitive results on both simulated and real images.

### 2016

*Annals of Statistics.*44:(6)2366-2371.

*Nature Communications.*7. Abstract

The non-crystallographic phase problem arises in numerous scientific and technological fields. An important application is coherent diffractive imaging. Recent advances in X-ray free-electron lasers allow capturing of the diffraction pattern from a single nanoparticle before it disintegrates, in so-called 'diffraction before destruction' experiments. Presently, the phase is reconstructed by iterative algorithms, imposing a non-convex computational challenge, or by Fourier holography, requiring a well-characterized reference field. Here we present a convex scheme for single-shot phase retrieval for two (or more) sufficiently separated objects, demonstrated in two dimensions. In our approach, the objects serve as unknown references to one another, reducing the phase problem to a solvable set of linear equations. We establish our method numerically and experimentally in the optical domain and demonstrate a proof-of-principle single-shot coherent diffractive imaging using X-ray free-electron lasers pulses. Our scheme alleviates several limitations of current methods, offering a new pathway towards direct reconstruction of complex objects.

*Information and Inference-a Journal of the Ima.*5:(4)379-404. Abstract

A common approach to statistical learning with Big-data is to randomly split it among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters p is fixed, and the number of samples per machine n ->infinity. Second, a high-dimensional regime where both p,n -> infinity with p/n -> kappa is an element of(0, 1). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second-order error terms, and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.

*Information and Inference-a Journal of the Ima.*5:(3)304-330. Abstract

The covariance matrix of a p-dimensional random variable is a fundamental quantity in data analysis. Given n i.i.d. observations, it is typically estimated by the sample covariance matrix, at a computational cost of O(np(2)) operations. When n, p are large, this computation may be prohibitively slow. Moreover, in several contemporary applications, the population matrix is approximately sparse, and only its few large entries are of interest. This raises the following question: Assuming approximate sparsity of the covariance matrix, can its large entries be detected much faster, say in sub-quadratic time, without explicitly computing all its p(2) entries? In this paper, we present and theoretically analyze two randomized algorithms that detect the large entries of an approximately sparse sample covariance matrix using only O(np poly log p) operations. Furthermore, assuming sparsity of the population matrix, we derive sufficient conditions on the underlying random variable and on the number of samples n, for the sample covariance matrix to satisfy our approximate sparsity requirements. Finally, we illustrate the performance of our algorithms via several simulations.

*19th international Conference on Artificial Intelligence and Statistics (AISTATS), 2016.*

*2016 Ieee Conference on Computer Vision and Pattern Recognition (Cpvr).*213-221. Abstract

Detecting edges is a fundamental problem in computer vision with many applications, some involving very noisy images. While most edge detection methods are fast, they perform well only on relatively clean images. Unfortunately, sophisticated methods that are robust to high levels of noise are quite slow. In this paper we develop a novel multiscale method to detect curved edges in noisy images. Even though our algorithm searches for edges over an exponentially large set of candidate curves, its runtime is nearly linear in the total number of image pixels. As we demonstrate experimentally, our algorithm is orders of magnitude faster than previous methods designed to deal with high noise levels. At the same time it obtains comparable and often superior results to existing methods on a variety of challenging noisy images.

*Quantitative Phase Imaging II.*9718.

*Electronic Journal of Statistics.*10:(2)2329-2354. Abstract

Continuous goodness-of-fit testing is a classical problem in statistics. Despite having low power for detecting deviations at the tail of a distribution, the most popular test is based on the Kolmogorov-Smirnov statistic. While similar variance-weighted statistics such as Anderson-Darling and the Higher Criticism statistic give more weight to tail deviations, as shown in various works, they still mishandle the extreme tails. As a viable alternative, in this paper we study some of the statistical properties of the exact M-n statistics of Berk and Jones. In particular we show that they are consistent and asymptotically optimal for detecting a wide range of rare-weak mixture models. Additionally, we present a new computationally efficient method to calculate p-values for any supremum-based one-sided statistic, including the one-sided M-n(+), M-n(-) and R-n(+), R-n(-) statistics of Berk and Jones and the Higher Criticism statistic. Finally, we show that M-n compares favorably to related statistics in several finite-sample simulations.

For code see here

### 2015

*SIAM Journal on Imaging Sciences.*8:(1)458-483. Abstract

Edge detection is a challenging, important task in image analysis. Various applications require real-time detection of long edges in large and noisy images, possibly under limited computational resources. While standard edge detection methods are computationally fast, they perform well only at low levels of noise. Modern sophisticated methods, in contrast, are robust to noise, but may be too slow for real-time processing of large images. This raises the following question, which is the focus of our paper: How well can one detect long edges in noisy images under severe computational constraints that allow only a fraction of all image pixels to be processed? We make several theoretical and practical contributions regarding this problem. We develop possibly the first sublinear algorithm to detect long straight edges in noisy images. In addition, we theoretically analyze the inevitable tradeoff between its detection performance and the allowed computational budget. Finally, we demonstrate its competitive performance on both simulated and real images.

*Annals of Statistics.*43:(3)1300-1322. Abstract

Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were developed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions can such algorithms recover the sparse principal components? We study this question for a single-spike model with an l(0)-sparse eigenvector, in the asymptotic regime as dimension p and sample size n both tend to infinity. Amini and Wainwright [Ann. Statist. 37 (2009) 2877-2921] proved that for sparsity levels k >= Omega (n/ log p), no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for k = Omega(root n), the proposed SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of l(0)-sparsity k >= Omega (root n). Finally, we present empirical results suggesting that up to sparsity levels k = O (root n), recovery is possible by a simple covariance thresholding algorithm.

### 2014

*Optics Express.*22:(21)24935-24950. Abstract

Phase measurement is a long-standing challenge in a wide range of applications, from X-ray imaging to astrophysics and spectroscopy. While in some scenarios the phase is resolved by an interferometric measurement, in others it is reconstructed via numerical optimization, based on some a-priori knowledge about the signal. The latter commonly use iterative algorithms, and thus have to deal with their convergence, stagnation, and robustness to noise. Here we combine these two approaches and present a new scheme, termed double blind Fourier holography, providing an efficient solution to the phase problem in two dimensions, by solving a system of linear equations. We present and experimentally demonstrate our approach for the case of lens-less imaging. (C) 2014 Optical Society of America

*Proceedings of the National Academy of Sciences of the United States of America.*111:(4)1253-1258. Abstract

In a broad range of classification and decision-making problems, one is given the advice or predictions of several classifiers, of unknown reliability, over multiple questions or queries. This scenario is different from the standard supervised setting, where each classifier's accuracy can be assessed using available labeled data, and raises two questions: Given only the predictions of several classifiers over a large set of unlabeled test data, is it possible to (i) reliably rank them and (ii) construct a metaclassifier more accurate than most classifiers in the ensemble? Here we present a spectral approach to address these questions. First, assuming conditional independence between classifiers, we show that the off-diagonal entries of their covariance matrix correspond to a rank-one matrix. Moreover, the classifiers can be ranked using the leading eigenvector of this covariance matrix, because its entries are proportional to their balanced accuracies. Second, via a linear approximation to the maximum likelihood estimator, we derive the Spectral Meta-Learner (SML), an unsupervised ensemble classifier whose weights are equal to these eigenvector entries. On both simulated and real data, SML typically achieves a higher accuracy than most classifiers in the ensemble and can provide a better starting point than majority voting for estimating the maximum likelihood solution. Furthermore, SML is robust to the presence of small malicious groups of classifiers designed to veer the ensemble prediction away from the (unknown) ground truth.

### 2013

*IEEE Transactions on Signal Processing.*61:(7)1632-1643. Abstract

Reconstruction of signals from measurements of their spectral intensities, also known as the phase retrieval problem, is of fundamental importance in many scientific fields. In this paper we present a novel framework, denoted as vectorial phase retrieval, for reconstruction of pairs of signals from spectral intensity measurements of the two signals and of their interference. We show that this new framework can alleviate some of the theoretical and computational challenges associated with classical phase retrieval from a single signal. First, we prove that for compactly supported signals, in the absence of measurement noise, this new setup admits a unique solution. Next, we present a statistical analysis of vectorial phase retrieval and derive a computationally efficient algorithm to solve it. Finally, we illustrate via simulations, that our algorithm can accurately reconstruct signals even at considerable noise levels.

*SIAM Journal on Matrix Analysis and Applications.*34:(2)757-772. Abstract

Normalized cut is a widely used measure of separation between clusters in a graph. In this paper we provide a novel probabilistic perspective on this measure. We show that for a partition of a graph into two weakly connected sets, V = A (sic) B, the multiway normalized cut is approximately MNcut approximate to 1/tau(A -> B) + 1/tau(B -> A), where tau(A -> B) is the unidirectional characteristic exit time of a random walk from subset A to subset B. Using matrix perturbation theory, we show that this approximation is exact to first order in the connection strength between the two subsets A and B, and we derive an explicit bound for the approximation error. Our result implies that for a Markov chain composed of a reversible subset A that is weakly connected to an absorbing subset B, the easy-to-compute normalized cut measure is a reliable proxy for the chain's spectral gap.

*Annals of Statistics.*41:(3)1055-1084. Abstract

We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the l(2) loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.

### 2012

*European Conference Computer Vision (ECCV), 2012.*

*Conference on Uncertainty in Artificial Intelligence (UAI), 2012.*

### 2011

*2011 Ieee Conference On Computer Vision And Pattern Recognition (Cvpr).*40. Abstract

The goal of natural image denoising is to estimate a clean version of a given noisy image, utilizing prior knowledge on the statistics of natural images. The problem has been studied intensively with considerable progress made in recent years. However, it seems that image denoising algorithms are starting to converge and recent algorithms improve over previous ones by only fractional dB values. It is thus important to understand how much more can we still improve natural image denoising algorithms and what are the inherent limits imposed by the actual statistics of the data. The challenge in evaluating such limits is that constructing proper models of natural image statistics is a long standing and yet unsolved problem. To overcome the absence of accurate image priors, this paper takes a non parametric approach and represents the distribution of natural images using a huge set of 1010 patches. We then derive a simple statistical measure which provides a lower bound on the optimal Bayesian minimum mean square error (MMSE). This imposes a limit on the best possible results of denoising algorithms which utilize a fixed support around a denoised pixel and a generic natural image prior. Our findings suggest that for small windows, state of the art denoising algorithms are approaching optimality and cannot be further improved beyond similar to 0.1dB values.

*Journal of Multivariate Analysis.*102:(2)363-371. Abstract

The ratio of the largest eigenvalue divided by the trace of a p x p random Wishart matrix with n degrees of freedom and an identity covariance matrix plays an important role in various hypothesis testing problems both in statistics and in signal processing In this paper we derive an approximate explicit expression for the distribution of this ratio by considering the joint limit as both p, n -> infinity with p/n -> c Our analysis reveals that even though asymptotically in this limit the ratio follows a Tracy-Widom (TW) distribution one of the leading error terms depends on the second derivative of the TW distribution and is non-negligible for practical values of p in particular for determining tail probabilities We thus propose to explicitly include this term in the approximate distribution for the ratio We illustrate empirically using simulations that adding this term to the TW distribution yields a quite accurate expression to the empirical distribution of the ratio even for small values of p, n (C) 2010 Elsevier Inc All rights reserved

*Physical Review Letters.*107:(13) Abstract

The waveforms of attosecond pulses produced by high-harmonic generation carry information on the electronic structure and dynamics in atomic and molecular systems. Current methods for the temporal characterization of such pulses have limited sensitivity and impose significant experimental complexity. We propose a new linear and all-optical method inspired by widely used multidimensional phase retrieval algorithms. Our new scheme is based on the spectral measurement of two attosecond sources and their interference. As an example, we focus on the case of spectral polarization measurements of attosecond pulses, relying on their most fundamental property-being well confined in time. We demonstrate this method numerically by reconstructing the temporal profiles of attosecond pulses generated from aligned CO(2) molecules.

*2011 Ieee Statistical Signal Processing Workshop (Ssp).*681-684. Abstract

Detecting the presence of a signal embedded in noise from a multi-sensor system is a fundamental problem in signal and array processing. In this paper we consider the case where the noise covariance matrix is arbitrary and unknown but we are given both signal bearing and noise-only samples. Using a matrix perturbation approach, combined with known results on the eigenvalues of inverse Wishart matrices, we study the behavior of the largest eigenvalue of the relevant covariance matrix, and derive an approximate expression for the detection probability of Roy's largest root test. The accuracy of our expressions is confirmed by simulations.

*Technical Report, Department of Statistics, Stanford University, 2011.*

*IEEE Transactions on Signal Processing.*59:(4)1333-1345. Abstract

Detection of the number of sinusoids embedded in noise is a fundamental problem in statistical signal processing. Most parametric methods minimize the sum of a data fit (likelihood) term and a complexity penalty term. The latter is often derived via information theoretic criteria, such as minimum description length (MDL), or via Bayesian approaches including Bayesian information criterion (BIC) or maximum a posteriori (MAP). While the resulting estimators are asymptotically consistent, empirically their finite sample performance is strongly dependent on the specific penalty term chosen. In this paper we elucidate the source of this behavior, by relating the detection performance to the extreme value distribution of the maximum of the periodogram and of related random fields. Based on this relation, we propose a combined detection-estimation algorithm with a new penalty term. Our proposed penalty term is sharp in the sense that the resulting estimator achieves a nearly constant false alarm rate. A series of simulations support our theoretical analysis and show the superior detection performance of the suggested estimator.

*2011 Ieee International Conference On Communications (Icc).*Abstract

In this paper we consider signal detection in cognitive radio networks, under a non-parametric, multi-sensor detection scenario, and compare the cases of known and unknown noise level. The analysis is focused on two eigenvalue-based methods, namely Roy's largest root test, which requires knowledge of the noise variance, and the generalized likelihood ratio test, which can be interpreted as a test of the largest eigenvalue vs. a maximum-likelihood estimate of the noise variance. The detection performance of the two considered methods is expressed by closed-form analytical formulas, shown to be accurate even for small number of sensors and samples. We then derive an expression of the gap between the two detectors in terms of the signal-to-noise ratio of the signal to be detected, and we identify critical settings where this gap is significant (e. g., small number of sensors and signal strength). Our results thus provide a measure of the impact of noise level knowledge and highlight the importance of accurate noise estimation.

### 2010

*International Conference on Machine Learning, 2010.*

*Sixth IEEE Sensor Array and Multichannel Signal Processing Workshop, 2010.*

*Computer Vision-Eccv 2010, Pt Iv.*6314:750-763. Abstract

A fundamental question for edge detection is how faint an edge can be and still be detected. In this paper we offer a formalism to study this question and subsequently introduce a hierarchical edge detection algorithm designed to detect faint curved edges in noisy images. In our formalism we view edge detection as a search in a space of feasible curves, and derive expressions to characterize the behavior of the optimal detection threshold as a function of curve length and the combinatorics of the search space. We then present an algorithm that efficiently searches for edges through a very large set of curves by hierarchically constructing difference filters that match the curves traced by the sought edges. We demonstrate the utility of our algorithm in simulations and in applications to challenging real images.

*Journal of Neuroscience.*30:(27)9017-9026. Abstract

Odor identity is coded in spatiotemporal patterns of neural activity in the olfactory bulb. Here we asked whether meaningful olfactory information could also be read from the global olfactory neural population response. We applied standard statistical methods of dimensionality-reduction to neural activity from 12 previously published studies using seven different species. Four studies reported olfactory receptor activity, seven reported glomerulus activity, and one reported the activity of projection-neurons. We found two linear axes of neural population activity that accounted for more than half of the variance in neural response across species. The first axis was correlated with the total sum of odor-induced neural activity, and reflected the behavior of approach or withdrawal in animals, and odorant pleasantness in humans. The second and orthogonal axis reflected odorant toxicity across species. We conclude that in parallel with spatiotemporal pattern coding, the olfactory system can use simple global computations to read vital olfactory information from the neural population response.

*Artificial Intelligence in Medicine.*48:(3-Feb)91-98. Abstract

Objective: The importance of gene expression data in cancer diagnosis and treatment has become widely known by cancer researchers in recent years. However, one of the major challenges in the computational analysis of such data is the curse of dimensionality because of the overwhelming number of variables measured (genes) versus the small number of samples. Here, we use a two-step method to reduce the dimension of gene expression data and aim to address the problem of high dimensionality. Methods: First, we extract a subset of genes based on statistical characteristics of their corresponding gene expression levels. Then, for further dimensionality reduction, we apply diffusion maps, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set, in order to obtain efficient representation of data geometric descriptions. Finally, a neural network clustering theory, fuzzy ART, is applied to the resulting data to generate clusters of cancer samples. Results: Experimental results on the small round blue-cell tumor data set, compared with other widely used clustering algorithms, such as the hierarchical clustering algorithm and K-means, show that our proposed method can effectively identify different cancer types and generate high-quality cancer sample clusters. Conclusion: The proposed feature selection methods and diffusion maps can achieve useful information from the multidimensional gene expression data and prove effective at addressing the problem of high dimensionality inherent in gene expression data analysis. (C) 2009 Elsevier BM. All rights reserved.

*IEEE Transactions on Signal Processing.*58:(5)2746-2756. Abstract

Determining the number of sources from observed data is a fundamental problem in many scientific fields. In this paper we consider the nonparametric setting, and focus on the detection performance of two popular estimators based on information theoretic criteria, the Akaike information criterion (AIC) and minimum description length (MDL). We present three contributions on this subject. First, we derive a new expression for the detection performance of the MDL estimator, which exhibits a much closer fit to simulations in comparison to previous formulas. Second, we present a random matrix theory viewpoint of the performance of the AIC estimator, including approximate analytical formulas for its overestimation probability. Finally, we show that a small increase in the penalty term of AIC leads to an estimator with a very good detection performance and a negligible overestimation probability.

### 2009

*Cancer.*115:(8)1638-1650. Abstract

BACKGROUND: A more accurate taxonomy of small intestinal (SI) neuroendocrine tumors (NETs) is necessary to accurately predict tumor behavior and prognosis and to define therapeutic strategy. In this study, the authors identified a panel of such markers that have been implicated in tumorigenicity, metastasis, and hormone production and hypothesized that transcript levels of the genes melanoma antigen family D2 (MAGE-D2), metastasis-associated 1 (MTA1), nucleosome assembly protein I-like (NAP1L1), Ki-67 (a marker of proliferation), survivin, frizzled homolog 7 (FZD7), the Kiss1 metastasis suppressor (Kiss1), neuropilin 2 (NPP2), and chromogranin A (CgA) could be used to define primary SI NETs and to predict the development of metastases. METHODS: Seventy-three clinically and World Health Organization pathologically classified NET samples (primary tumor, n = 44 samples; liver metastases, n = 29 samples) and 30 normal human enterochromafffin (EC) cell preparations were analyzed using real-time polymerase chain reaction. Transcript levels were normalized to 3 NET housekeeping genes (asparagine-linked glycosylation 9 or ALG9, transcription factor CP2 or TFCP2, and zinc finger protein 410 or ZNF410) using geNorm analysis. A predictive gene-based model was constructed using supervised learning algorithms from the transcript expression levels. RESULTS: Primary SI NETS could be differentiated from normal human EC cell preparations with 100% specificity and 92% sensitivity. Well differentiated NETs (WDNETs), well differentiated neuroendocrine carcinomas, and poorly differentiated NETs (PDNETs) were classified with a specificity of 78%, 78%, and 71%, respectively; whereas poorly differentiated neuroendocrine carcinomas were misclassified as either WDNETs or PDNETs. Metastases were predicted in all cases with 100% sensitivity and specificity. CONCLUSIONS: The current results indicated that gene expression profiling and supervised machine learning can be used to classify SI NE

*Journal of the American Statistical Association.*104:(486)694-697.

*Siam Journal On Matrix Analysis And Applications.*31:40-53. Abstract

We present eigenvalue bounds for perturbations of Hermitian matrices and express the change in eigenvalues in terms of a projection of the perturbation onto a particular eigenspace, rather than in terms of the full perturbation. The perturbations we consider are Hermitian of rank one, and Hermitian or non-Hermitian with norm smaller than the spectral gap of a specific eigenvalue. Applications include principal component analysis under a spiked covariance model, and pseudo-arclength continuation methods for the solution of nonlinear systems.

*Neural Information Processing Systems (NIPS), Vol. 22, 2009.*

*IEEE Transactions on Signal Processing.*57:(10)3930-3941. Abstract

Detection of the number of signals embedded in noise is a fundamental problem in signal and array processing. This paper focuses on the non-parametric setting where no knowledge of the array manifold is assumed. First, we present a detailed statistical analysis of this problem, including an analysis of the signal strength required for detection with high probability, and the form of the optimal detection test under certain conditions where such a test exists. Second, combining this analysis with recent results from random matrix theory, we present a new algorithm for detection of the number of sources via a sequence of hypothesis tests. We theoretically analyze the consistency and detection performance of the proposed algorithm, showing its superiority compared to the standard minimum description length (MDL)-based estimator. A series of simulations confirm our theoretical analysis.

*Journal of Machine Learning Research.*10:1095-1129. Abstract

We propose a novel framework for supervised learning of discrete concepts. Since the 1970's, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning.

*SIAM Journal on Imaging Sciences.*2:(1)118-139. Abstract

Nonlocal neighborhood filters are modern and powerful techniques for image and signal denoising. In this paper, we give a probabilistic interpretation and analysis of the method viewed as a random walk on the patch space. We show that the method is intimately connected to the characteristics of diffusion processes, their escape times over potential barriers, and their spectral decomposition. In particular, the eigenstructure of the diffusion operator leads to novel insights on the performance and limitations of the denoising method, as well as a proposal for an improved filtering algorithm.

*Annals of Surgical Oncology.*16:(2)487-498. Abstract

Incidence of neuroendocrine tumors (NETs) is increasing (approximately 6%/year), but clinical presentation is nonspecific, resulting in delays in diagnosis (5-7 years; approximately 70% have metastases). This reflects absence of a sensitive plasma marker. The aim of this study is to investigate whether detection of circulating messenger RNA (mRNA) alone or in combination with circulating NET-related hormones and growth factors can detect gastrointestinal NET disease. The small intestinal (SI) NET cell line KRJ-I was used to define the sensitivity of real-time polymerase chain reaction (PCR) for mRNA detection in blood. NSE, Tph-1, and VMAT (2) transcripts were identified from one KRJ-I cell/ml blood. mRNA from the tissue and plasma of SI-NETs (n = 12) and gastric NETs (n = 7), and plasma from healthy controls (n = 9) was isolated and real-time PCR performed. Tph-1 was a specific marker of SI-NETs (58%, p <0.03) whereas CgA transcripts did not differentiate tumors from controls. Patients with metastatic disease expressed more marker transcripts than localized tumors (75% versus 18%, p <0.02). Plasma 5-hydroxytryptamine (5-HT), chromogranin A (CgA), ghrelin, and connective tissue growth factor (CTGF) fragments were measured, combined with mRNA levels, and a predictive mathematical model for NET diagnosis developed using decision trees. The sensitivity and specificity to diagnose SI-NETs and gastric NETs were 81.2% and 100%, and 71.4% and 55.6%, respectively. We conclude that mRNA from one NET cell/ml blood can be detected. Circulating plasma Tph-1 is a promising marker gene for SI-NET disease (specificity 100%) while an increased number of marker transcripts (> 2) correlated with disease spread. Including NET-related circulating hormones and growth factors in the algorithm increased the sensitivity of detection of SI-NETs from 58 to 82%.

### 2008

*Chemometrics and Intelligent Laboratory Systems.*94:(1)19-32. Abstract

Determining the number of components in a linear mixture model is a fundamental problem in many scientific fields, including chemometrics and signal processing. In this paper we present a new method to automatically determine the number of components from a limited number of (possibly) high dimensional noisy samples. The proposed method, based on the eigenvalues of the sample covariance matrix, combines a matrix perturbation approach for the interaction of signal and noise eigenvalues, with recent results from random matrix theory regarding the behavior of noise eigenvalues. We present the theoretical derivation of the algorithm and an analysis of its consistency and limit of detection. Results on simulated data show that under a wide range of conditions our method compares favorably with other common algorithms. (C) 2008 Elsevier B.V. All rights reserved.

*2008 Ieee Conference On Computer Vision And Pattern Recognition, Vols 1-12.*3323-3330. Abstract

Two-tone ("Mooney") images seem to arouse vivid 3D percept of faces, both familiar and unfamiliar, despite their seemingly poor content. Recent psychological and fMRI studies suggest that this percept is guided primarily by top-down procedures in which recognition precedes reconstruction. In this paper we investigate this hypothesis from a mathematical standpoint. We show that indeed, under standard shape from shading assumptions, a Mooney image can give rise to multiple different 3D reconstructions even if reconstruction is restricted to the Mooney transition curve (the boundary curve between black and white) alone. We then use top-down reconstruction methods to recover the shape of novel faces from single Mooney images exploiting prior knowledge of the structure of at least one face of a different individual. We apply these methods to thresholded images of real faces and compare the reconstruction quality relative to reconstruction from gray level images.

*Annals of Applied Statistics.*2:(2)494-500.

*Annals of Applied Statistics.*2:(2)435-471. Abstract

In many modern applications, including analysis of gene expression and texts documents, the data are noisy, high-dimensional, and unordered-with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets-a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.

*Annals of Statistics.*36:(6)2791-2817. Abstract

Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, and those of the limiting population PCA as n -> infinity. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n -> infinity, with p/n = c. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size n, the eigenvector of sample PICA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.

*2008 42Nd Asilomar Conference On Signals, Systems And Computers, Vols 1-4.*1680-1683. Abstract

Nonparametric detection of the number of sources is a fundamental and extensively studied problem in signal processing. The common approach to solving this problem is via information theoretic criteria, in particular minimum description length (MDL). In this paper we propose a different approach and present a new algorithm based on a sequence of hypothesis tests. Our algorithm combines recent results from random matrix theory with a matrix perturbation approach. The new method has superior performance over the standard MDL estimator, as shown by simulations.

*Multiscale Modeling & Simulation.*7:(2)842-864. Abstract

The concise representation of complex high dimensional stochastic systems via a few reduced coordinates is an important problem in computational physics, chemistry, and biology. In this paper we use the first few eigenfunctions of the backward Fokker-Planck diffusion operator as a coarse-grained low dimensional representation for the long-term evolution of a stochastic system and show that they are optimal under a certain mean squared error criterion. We denote the mapping from physical space to these eigenfunctions as the diffusion map. While in high dimensional systems these eigenfunctions are difficult to compute numerically by conventional methods such as finite differences or finite elements, we describe a simple computational data-driven method to approximate them from a large set of simulated data. Our method is based on de. ning an appropriately weighted graph on the set of simulated data and computing the first few eigenvectors and eigenvalues of the corresponding random walk matrix on this graph. Thus, our algorithm incorporates the local geometry and density at each point into a global picture that merges data from different simulation runs in a natural way. Furthermore, we describe lifting and restriction operators between the diffusion map space and the original space. These operators facilitate the description of the coarse-grained dynamics, possibly in the form of a low dimensional effective free energy surface parameterized by the diffusion map reduction coordinates. They also enable a systematic exploration of such effective free energy surfaces through the design of additional "intelligently biased" computational experiments. We conclude by demonstrating our method in a few examples.

### 2007

*Journal of Chemical Physics.*126:(15) Abstract

Finding coarse-grained, low-dimensional descriptions is an important task in the analysis of complex, stochastic models of gene regulatory networks. This task involves (a) identifying observables that best describe the state of these complex systems and (b) characterizing the dynamics of the observables. In a previous paper [R. Erban , J. Chem. Phys. 124, 084106 (2006)] the authors assumed that good observables were known a priori, and presented an equation-free approach to approximate coarse-grained quantities (i.e., effective drift and diffusion coefficients) that characterize the long-time behavior of the observables. Here we use diffusion maps [R. Coifman , Proc. Natl. Acad. Sci. U.S.A. 102, 7426 (2005)] to extract appropriate observables ("reduction coordinates") in an automated fashion; these involve the leading eigenvectors of a weighted Laplacian on a graph constructed from network simulation data. We present lifting and restriction procedures for translating between physical variables and these data-based observables. These procedures allow us to perform equation-free, coarse-grained computations characterizing the long-term dynamics through the design and processing of short bursts of stochastic simulation initialized at appropriate values of the data-based observables. (c) 2007 American Institute of Physics.

*Physiological Genomics.*30:(3)363-370. Abstract

Accurate quantitation of target genes depends on correct normalization. Use of genes with variable tissue transcription ( GAPDH) is problematic, particularly in clinical samples, which are derived from different tissue sources. Using a large- scale gene database ( Affymetrix U133A) data set of 36 gastrointestinal ( GI) tumors and normal tissues, we identified 8 candidate reference genes and established expression levels by real-time RT- PCR in an independent data set ( n = 42). A geometric averaging method ( geNorm) identified ALG9, TFCP2, and ZNF410 as the most robustly expressed control genes. Examination of raw CT values demonstrated that these genes were tightly correlated between themselves ( R(2) > 0.86, P <0.0001), with low variability [ coefficient of variation ( CV) <12.7%] and high interassay reproducibility ( r = 0.93, P = 0.001). In comparison, the alternative control gene, GAPDH, exhibited the highest variability ( CV = 18.1%), was significantly differently expressed between tissue types ( P = 0.05), was poorly correlated with the three reference genes ( R2 <0.4), and was considered the least stable gene. To illustrate the importance of correct normalization, the target gene, MTA1, was significantly overexpressed ( P = 0.0006) in primary GI neuroendocrine tumor ( NET) samples ( vs. normal GI samples) when normalized by geNorm(ATZ) but not when normalized using GAPDH. The geNormATZ approach was, in addition, applicable to adenocarcinomas; MTA1 was overexpressed ( P <0.04) in malignant colon, pancreas, and breast tumors compared with normal tissues. We provide a robust basis for the establishment of a reference gene set using GeneChip data and provide evidence for the utility of normalizing a malignancy- associated gene ( MTA1) using novel reference genes and the geNorm approach in GI NETs as well as in adenocarcinomas and breast tumors.

*in Principal Manifolds for Data Visualization and Dimension Reduction, A.N. Gorban, B. Kegl, D.C. Wunsch and A. Zinovyev (Eds), Springer, 2007. .*

### 2006

*Applied and Computational Harmonic Analysis.*21:(1)113-127. Abstract

A central problem in data analysis is the low dimensional representation of high dimensional data and the concise description of its underlying geometry and density. In the analysis of large scale simulations of complex dynamical systems, where the notion of time evolution comes into play, important problems are the identification of slow variables and dynamically meaningful reaction coordinates that capture the long time evolution of the system. In this paper we provide a unifying view of these apparently different tasks, by considering a family of diffusion maps, defined as the embedding of complex (high dimensional) data onto a low dimensional Euclidean space, via the eigenvectors of suitably defined random walks defined on the given datasets. Assuming that the data is randomly sampled from an underlying general probability distribution p(x) = e(-U(x)), we show that as the number of samples goes to infinity, the eigenvectors of each diffusion map converge to the eigenfunctions of a corresponding differential operator defined on the support of the probability distribution. Different normalizations of the Markov chain on the graph lead to different limiting differential operators. Specifically, the normalized graph Laplacian leads to a backward Fokker-Planck operator with an underlying potential of 2U(x), best suited for spectral clustering. A different anisotropic normalization of the random walk leads to the backward Fokker-Planck operator with the potential U(x), best suited for the analysis of the long time asymptotics of high dimensional stochastic systems governed by a stochastic differential equation with the same potential U(x). Finally, yet another normalization leads to the eigenfunctions of the Laplace-Beltrami (heat) operator on the manifold in which the data resides, best suited for the analysis of the geometry of the dataset regardless of its possibly non-uniform density. (C) 2006 Published by Elsevier Inc.

*Neural Information Processing systems, Vol. 19, 2006.*

### 2005

*preprint of an article published in Journal of Chemometrics, 19(2):107-118 (2005).*

*preprint of an article published in Journal of Chemometrics, 19(1):45-54 (2005).*

The final version can be accessed at the publisher's website by clicking here

_{This paper was awarded the Kowalski prize for best theoretical paper in J. Chemometrics in 2004-5.}

*Physical Review Letters, 94 (2005) 218101.*

*Neural Information Processing Systems (NIPS), Vol 18, 2005.*

*Proceedings of the National Academy of Sciences, 102(21):7426-31 (2005).*

*Proceedings of the National Academy of Sciences, 102(21):7432-37 (2005).*

### 2004

*Physical Review E, 70 (2004) 051912.*

*Physical Review E, 70 (2004) 061106.*

*Journal of Physics: Condensed Matter, 16(22) (2004) S2153-S2165.*

### 2003

*Physical Review E, 68 (2003) 021905.*

*SIAM Journal on Applied Mathematics, 63, Number 3 (2003), pp. 850-873.*

### 2001

*Physical Review E, 64 (2001) 036116.*

*SIAM Journal on Applied Mathematics, 62(2) (2001), pp. 433-447.*

### 1999

*Computers and Graphics, 23(5):655-666 (1999).*