Publications

2024

Chen Amiraz, Robert Krauthgamer, Boaz Nadler (2024) Proceedings of The 27th International Conference on Artificial Intelligence and Statistics. 238, p. 802-810,

2023

Tyler’s and Maronna’s M-estimators: Non-asymptotic concentration results

Elad Romanov, Gil Kur, Boaz Nadler (2023) Journal of Multivariate Analysis. 196, p. 105184

Imbalanced Mixed Linear Regression

Pini Zilber, Boaz Nadler (2023) Advances in Neural Information Processing Systems. 36, p. 54540--54553

A Statistical Approach to Estimate Seismic Monitoring Stations’ Biases and Error Levels

Yael Radzyner, Meirav Galun, Boaz Nadler (2023) Bulletin of the Seismological Society of America.

Spectral top-down recovery of latent tree models

Yariv Aizenbud, Ariel Jaffe, Meng Wang, Amber Hu, Noah Amsel, Boaz Nadler, Joseph T. Chang, Yuval Kluger (2023) Information and Inference. 12, p. 2300-2350

2022

Distributed sparse normal means estimation with sublinear communication

Chen Amiraz, Robert Krauthgamer, Boaz Nadler (2022) Information and Inference.

GNMR: A Provable One-Line Algorithm for Low Rank Matrix Recovery

Pini Zilber, Boaz Nadler (2022) SIAM Journal on Mathematics of Data Science. 4, 2, p. 909–934

Inductive Matrix Completion: No Bad Local Minima and a Fast Algorithm

Pini Zilber, Boaz Nadler (2022) Icml. p. 27671-27692

Crowdsourcing regression: a spectral approach

Yaniv Tenzer, Omer Dror, Boaz Nadler, Erhan Bilal, Yuval Kluger (2022) Artificial Intelligence and Statistics.

"Self-Wiener" Filtering: Data-Driven Deconvolution of Deterministic Signals

Amir Weiss, Boaz Nadler (2022) IEEE Transactions on Signal Processing. 70, p. 468-481

Zero-preserving imputation of single-cell RNA-seq data

George C Linderman, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, Yuval Kluger (2022) Nature Communications. 13, p. 192

2021

Spectral Neighbor Joining for Reconstruction of Latent Tree Models

Ariel Jaffe, Noah Amsel, Yariv Aizenbud, Boaz Nadler, Joseph T Chang, Yuval Kluger (2021) SIAM Journal on Mathematics of Data Science. 3, 1, p. 113-141

Rank 2r Iterative Least Squares: Efficient Recovery of Ill-Conditioned Low Rank Matrices from Few Entries

Jonathan Bauch, Boaz Nadler, Pini Zilber (2021) SIAM Journal on Mathematics of Data Science. 3, 1, p. 439-465

Improved convergence guarantees for learning Gaussian mixture models by EM and gradient EM

Nimrod Segol, Boaz Nadler (2021) Electronic Journal of Statistics. 15, 2, p. 4510 - 4544

The Trimmed Lasso: Sparse Recovery Guarantees and Practical Optimization by the Generalized Soft-Min Penalty

Tal Amir, Ronen Basri, Boaz Nadler (2021) SIAM Journal on Mathematics of Data Science. 3, 3, p. 909-929

2020

Tight recovery guarantees for orthogonal matching pursuit under Gaussian noise

Chen Amiraz, Robert Krauthgamer, Boaz Nadler (2020) Information and Inference. iaaa021,

Asymptotically Optimal Blind Calibration of Acoustic Vector Sensor Uniform Linear Arrays

Amir Weiss, Boaz Nadler, Arie Yeredor (2020) International Conference on Acoustics, Speech and Signal Processing.

Robust sparse covariance estimation by thresholding Tyler’s M-estimator

John Goes, Gilad Lerman, Boaz Nadler (2020) Annals of Statistics. 48, 1, p. 86-110

Beyond Trees: Classification with Sparse Pairwise Dependencies

Yaniv Tenzer, Amit Moscovich, Mary Frances Dorn, Boaz Nadler, Clifford Spiegelman (2020) Journal of Machine Learning Research. 21, 189, p. 1-33

2019

Double-blind holography of attosecond pulses

O. Pedatzur, A. Trabattoni, B. Leshem, H. Shalmoni, M. C. Castrovilli, M. Galli, M. Lucchini, E. MåNsson, F. Frassetto, L. Poletto, B. Nadler, O. Raz, M. Nisoli, F. Calegari, D. Oron And N. Dudovich (2019) Nature Photonics. 13, p. 91-95

Roy’s largest root under rank-one perturbations: The complex valued case and applications

Prathapasinghe Dharmawansa, Boaz Nadler, Ofer Shwartz (2019) Journal of Multivariate Analysis. 174,

2018

The discrete sign problem: Uniqueness, recovery algorithms and phase retrieval applications

Ben Leshem, Oren Raz, Ariel Jaffe, Boaz Nadler (2018) Applied and Computational Harmonic Analysis. 45, p. 463-485

Projection pursuit in high dimensions

Peter J. Bickel, Gil Kur, Boaz Nadler (2018) Proceedings of the National Academy of Sciences.

Newton Correction Methods for Computing Real Eigenpairs of Symmetric Tensors

Ariel Jaffe, Roi Weiss, Boaz Nadler (2018) SIAM Journal on Matrix Analysis and Applications. 39, 3, p. 1071-1094

Learning Binary Latent Variable Models: A Tensor Eigenpair Approach

Ariel Jaffe, Roi Weiss, Boaz Nadler, Shai Carmi, Yuval Kluger (2018) Proceedings of the 35th International Conference on Machine Learning. 80, p. 2196-2205

SpectralNet: Spectral Clustering using Deep Neural Networks

Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, Yuval Kluger (2018) International Conference on Learning Representations.

2017

Cost efficient gradient boosting

Sven Peter Ferran Diego Fred A. Hamprecht Boaz Nadler (2017) Neural Information Processing Systems. 30,

Roy's largest root test under rank-one alternatives

I. M. Johnstone; Boaz Nadler (2017) Biometrika. 104, 1, p. 181-193 Abstract

Roy's largest root is a common test statistic in multivariate analysis, statistical signal processing and allied fields. Despite its ubiquity, provision of accurate and tractable approximations to its distribution under the alternative has been a longstanding open problem. Assuming Gaussian observations and a rank-one alternative, or concentrated noncentrality, we derive simple yet accurate approximations for the most common low-dimensional settings. These include signal detection in noise, multiple response regression, multivariate analysis of variance and canonical correlation analysis. A small-noise perturbation approach, perhaps underused in statistics, leads to simple combinations of standard univariate distributions, such as central and noncentral chi(2) and F. Our results allow approximate power and sample size calculations for Roy's test for rank-one effects, which is precisely where it is most powerful.

Minimax-optimal semi-supervised regression on unknown manifolds

Amit Moscovich, Ariel Jaffe, Boaz Nadler (2017) Artificial Intelligence and Statistics. 20,

Supplementary material

Detecting Curved Edges in Noisy Images in Sublinear Time

Yi-Qing Wang; Alain Trouve; Yali Amit; Boaz Nadler (2017) Journal of Mathematical Imaging and Vision. 59, 3, p. 373-393 Abstract

Detecting edges in noisy images is a fundamental task in image processing. Motivated, in part, by various real-time applications that involve large and noisy images, in this paper we consider the problem of detecting long curved edges under extreme computational constraints, that allow processing of only a fraction of all image pixels. We present a sublinear algorithm for this task, which runs in two stages: (1) a multiscale scheme to detect curved edges inside a few image strips; and (2) a tracking procedure to estimate their extent beyond these strips. We theoretically analyze the runtime and detection performance of our algorithm and empirically illustrate its competitive results on both simulated and real images.

Fast calculation of boundary crossing probabilities for Poisson processes

Amit Moscovich; Boaz Nadler (2017) Statistics & Probability Letters. 123, p. 177-182 Abstract

For code see here: github.com/mosco/crossing-probability

2016

Direct Single-Shot Phase Retrieval From The Diffraction Pattern Of Separated Objects

Ben Leshem; Rui Xu; Yehonatan Dallal; Jianwei Miao; Boaz Nadler; Dan Oron; Nirit Dudovich; Oren Raz (2016) Nature Communications. 7, Abstract

The non-crystallographic phase problem arises in numerous scientific and technological fields. An important application is coherent diffractive imaging. Recent advances in X-ray free-electron lasers allow capturing of the diffraction pattern from a single nanoparticle before it disintegrates, in so-called 'diffraction before destruction' experiments. Presently, the phase is reconstructed by iterative algorithms, imposing a non-convex computational challenge, or by Fourier holography, requiring a well-characterized reference field. Here we present a convex scheme for single-shot phase retrieval for two (or more) sufficiently separated objects, demonstrated in two dimensions. In our approach, the objects serve as unknown references to one another, reducing the phase problem to a solvable set of linear equations. We establish our method numerically and experimentally in the optical domain and demonstrate a proof-of-principle single-shot coherent diffractive imaging using X-ray free-electron lasers pulses. Our scheme alleviates several limitations of current methods, offering a new pathway towards direct reconstruction of complex objects.

A deep learning approach to unsupervised ensemble learning

U. Shaham, X. Cheng, O. Dror, A. Jaffe, B. Nadler, J. Chang And Y. Kluger (2016) International Conference on Machine Learning.

DISCUSSION OF "INFLUENTIAL FEATURES PCA FOR HIGH DIMENSIONAL CLUSTERING"

Boaz Nadler (2016) Annals of Statistics. 44, 6, p. 2366-2371

On the optimality of averaging in distributed statistical learning

Jd Rosenblatt; Boaz Nadler (2016) Information and Inference-a Journal of the Ima. 5, 4, p. 379-404 Abstract

A common approach to statistical learning with Big-data is to randomly split it among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters p is fixed, and the number of samples per machine n ->infinity. Second, a high-dimensional regime where both p,n -> infinity with p/n -> kappa is an element of(0, 1). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second-order error terms, and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.

Detecting the large entries of a sparse covariance matrix in sub-quadratic time

Ofer Shwartz; Boaz Nadler (2016) Information and Inference-a Journal of the Ima. 5, 3, p. 304-330 Abstract

The covariance matrix of a p-dimensional random variable is a fundamental quantity in data analysis. Given n i.i.d. observations, it is typically estimated by the sample covariance matrix, at a computational cost of O(np(2)) operations. When n, p are large, this computation may be prohibitively slow. Moreover, in several contemporary applications, the population matrix is approximately sparse, and only its few large entries are of interest. This raises the following question: Assuming approximate sparsity of the covariance matrix, can its large entries be detected much faster, say in sub-quadratic time, without explicitly computing all its p(2) entries? In this paper, we present and theoretically analyze two randomized algorithms that detect the large entries of an approximately sparse sample covariance matrix using only O(np poly log p) operations. Furthermore, assuming sparsity of the population matrix, we derive sufficient conditions on the underlying random variable and on the number of samples n, for the sample covariance matrix to satisfy our approximate sparsity requirements. Finally, we illustrate the performance of our algorithms via several simulations.

Fast Detection of Curved Edges at Low SNR

Nati Ofir; Meirav Galun; Boaz Nadler; Ronen Basri (2016) 2016 Ieee Conference on Computer Vision and Pattern Recognition (Cpvr). p. 213-221 Abstract

Detecting edges is a fundamental problem in computer vision with many applications, some involving very noisy images. While most edge detection methods are fast, they perform well only on relatively clean images. Unfortunately, sophisticated methods that are robust to high levels of noise are quite slow. In this paper we develop a novel multiscale method to detect curved edges in noisy images. Even though our algorithm searches for edges over an exponentially large set of candidate curves, its runtime is nearly linear in the total number of image pixels. As we demonstrate experimentally, our algorithm is orders of magnitude faster than previous methods designed to deal with high noise levels. At the same time it obtains comparable and often superior results to existing methods on a variety of challenging noisy images.

Direct single-shot phase retrieval for separated objects (Conference Presentation)

Ben Leshem; Rui Xu; Jianwei Miao; Boaz Nadler; Dan Oron; Nirit Dudovich; Oren Raz (2016) Quantitative Phase Imaging II. 9718,

On the exact Berk-Jones statistics and their p-value calculation

Amit Moscovich; Boaz Nadler; C Spiegelman (2016) Electronic Journal of Statistics. 10, 2, p. 2329-2354 Abstract

Continuous goodness-of-fit testing is a classical problem in statistics. Despite having low power for detecting deviations at the tail of a distribution, the most popular test is based on the Kolmogorov-Smirnov statistic. While similar variance-weighted statistics such as Anderson-Darling and the Higher Criticism statistic give more weight to tail deviations, as shown in various works, they still mishandle the extreme tails. As a viable alternative, in this paper we study some of the statistical properties of the exact M-n statistics of Berk and Jones. In particular we show that they are consistent and asymptotically optimal for detecting a wide range of rare-weak mixture models. Additionally, we present a new computationally efficient method to calculate p-values for any supremum-based one-sided statistic, including the one-sided M-n(+), M-n(-) and R-n(+), R-n(-) statistics of Berk and Jones and the Higher Criticism statistic. Finally, we show that M-n compares favorably to related statistics in several finite-sample simulations.

For code see here

Unsupervised ensemble learning with dependent classifiers

A. Jaffe, E. Fetaya, B. Nadler, T. Jiang, Y. Kluger (2016) 19th international Conference on Artificial Intelligence and Statistics (AISTATS), 2016..

2015

Detection of Long Edges on a Computational Budget: A Sublinear Approach

Inbal Horev; Boaz Nadler; Ery Arias-Castro; Meirav Galun; Ronen Basri (2015) SIAM Journal on Imaging Sciences. 8, 1, p. 458-483 Abstract

Edge detection is a challenging, important task in image analysis. Various applications require real-time detection of long edges in large and noisy images, possibly under limited computational resources. While standard edge detection methods are computationally fast, they perform well only at low levels of noise. Modern sophisticated methods, in contrast, are robust to noise, but may be too slow for real-time processing of large images. This raises the following question, which is the focus of our paper: How well can one detect long edges in noisy images under severe computational constraints that allow only a fraction of all image pixels to be processed? We make several theoretical and practical contributions regarding this problem. We develop possibly the first sublinear algorithm to detect long straight edges in noisy images. In addition, we theoretically analyze the inevitable tradeoff between its detection performance and the allowed computational budget. Finally, we demonstrate its competitive performance on both simulated and real images.

DO SEMIDEFINITE RELAXATIONS SOLVE SPARSE PCA UP TO THE INFORMATION LIMIT?

Robert Krauthgamer; Boaz Nadler; Dan Vilenchik (2015) Annals of Statistics. 43, 3, p. 1300-1322 Abstract

Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms were developed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions can such algorithms recover the sparse principal components? We study this question for a single-spike model with an l(0)-sparse eigenvector, in the asymptotic regime as dimension p and sample size n both tend to infinity. Amini and Wainwright [Ann. Statist. 37 (2009) 2877-2921] proved that for sparsity levels k >= Omega (n/ log p), no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for k = Omega(root n), the proposed SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally-efficient algorithm can recover a spike of l(0)-sparsity k >= Omega (root n). Finally, we present empirical results suggesting that up to sparsity levels k = O (root n), recovery is possible by a simple covariance thresholding algorithm.

Estimating the accuracies of multiple classifiers without labeled data

A. Jaffe, B. Nadler, Y. Kluger (2015) Aistats-2015.

For more information and code, click here.

Learning Parametric Output HMMs with two aliased states

R. Weiss And B. Nadler (2015) International Conference on Machine Learning (ICML), 2015..

2014

Direct phase retrieval in double blind Fourier holography

Oren Raz; Ben Leshem; Jianwei Miao; Boaz Nadler; Dan Oron; Nirit Dudovich (2014) Optics Express. 22, 21, p. 24935-24950 Abstract

Phase measurement is a long-standing challenge in a wide range of applications, from X-ray imaging to astrophysics and spectroscopy. While in some scenarios the phase is resolved by an interferometric measurement, in others it is reconstructed via numerical optimization, based on some a-priori knowledge about the signal. The latter commonly use iterative algorithms, and thus have to deal with their convergence, stagnation, and robustness to noise. Here we combine these two approaches and present a new scheme, termed double blind Fourier holography, providing an efficient solution to the phase problem in two dimensions, by solving a system of linear equations. We present and experimentally demonstrate our approach for the case of lens-less imaging. (C) 2014 Optical Society of America

Ranking and combining multiple predictors without labeled data

Fabio Parisi; Francesco Strino; Boaz Nadler; Yuval Kluger (2014) Proceedings of the National Academy of Sciences of the United States of America. 111, 4, p. 1253-1258 Abstract

In a broad range of classification and decision-making problems, one is given the advice or predictions of several classifiers, of unknown reliability, over multiple questions or queries. This scenario is different from the standard supervised setting, where each classifier's accuracy can be assessed using available labeled data, and raises two questions: Given only the predictions of several classifiers over a large set of unlabeled test data, is it possible to (i) reliably rank them and (ii) construct a metaclassifier more accurate than most classifiers in the ensemble? Here we present a spectral approach to address these questions. First, assuming conditional independence between classifiers, we show that the off-diagonal entries of their covariance matrix correspond to a rank-one matrix. Moreover, the classifiers can be ranked using the leading eigenvector of this covariance matrix, because its entries are proportional to their balanced accuracies. Second, via a linear approximation to the maximum likelihood estimator, we derive the Spectral Meta-Learner (SML), an unsupervised ensemble classifier whose weights are equal to these eigenvector entries. On both simulated and real data, SML typically achieves a higher accuracy than most classifiers in the ensemble and can provide a better starting point than majority voting for estimating the maximum likelihood solution. Furthermore, SML is robust to the presence of small malicious groups of classifiers designed to veer the ensemble prediction away from the (unknown) ground truth.

For more information and code, click here.

2013

Vectorial Phase Retrieval of 1-D Signals

Oren Raz; Nirit Dudovich; Boaz Nadler (2013) IEEE Transactions on Signal Processing. 61, 7, p. 1632-1643 Abstract

Reconstruction of signals from measurements of their spectral intensities, also known as the phase retrieval problem, is of fundamental importance in many scientific fields. In this paper we present a novel framework, denoted as vectorial phase retrieval, for reconstruction of pairs of signals from spectral intensity measurements of the two signals and of their interference. We show that this new framework can alleviate some of the theoretical and computational challenges associated with classical phase retrieval from a single signal. First, we prove that for compactly supported signals, in the absence of measurement noise, this new setup admits a unique solution. Next, we present a statistical analysis of vectorial phase retrieval and derive a computationally efficient algorithm to solve it. Finally, we illustrate via simulations, that our algorithm can accurately reconstruct signals even at considerable noise levels.

Matlab code can be found here.

Accurate Blur Models vs. Image Priors in Single Image Super-Resolution

N. Efrat, D. Glasner, A. Apartsin, B. Nadler, A. Levin (2013) Iccv, 2013..

On learning parametric-output HMMs

A. Kontorovich, B. Nadler And R. Weiss (2013) Icml, 2013..

MINIMAX BOUNDS FOR SPARSE PCA WITH NOISY HIGH-DIMENSIONAL DATA

Aharon Birnbaum; Iain M. Johnstone; Boaz Nadler; Debashis Paul (2013) Annals of Statistics. 41, 3, p. 1055-1084 Abstract

We study the problem of estimating the leading eigenvectors of a high-dimensional population covariance matrix based on independent Gaussian observations. We establish a lower bound on the minimax risk of estimators under the l(2) loss, in the joint limit as dimension and sample size increase to infinity, under various models of sparsity for the population eigenvectors. The lower bound on the risk points to the existence of different regimes of sparsity of the eigenvectors. We also propose a new method for estimating the eigenvectors by a two-stage coordinate selection scheme.

NORMALIZED CUTS ARE APPROXIMATELY INVERSE EXIT TIMES

Matan Gavish; Boaz Nadler (2013) SIAM Journal on Matrix Analysis and Applications. 34, 2, p. 757-772 Abstract

Normalized cut is a widely used measure of separation between clusters in a graph. In this paper we provide a novel probabilistic perspective on this measure. We show that for a partition of a graph into two weakly connected sets, V = A (sic) B, the multiway normalized cut is approximately MNcut approximate to 1/tau(A -> B) + 1/tau(B -> A), where tau(A -> B) is the unidirectional characteristic exit time of a random walk from subset A to subset B. Using matrix perturbation theory, we show that this approximation is exact to first order in the connection strength between the two subsets A and B, and we derive an explicit bound for the approximation error. Our result implies that for a Markov chain composed of a reversible subset A that is weakly connected to an absorbing subset B, the easy-to-compute normalized cut measure is a reliable proxy for the chain's spectral gap.

2012

Patch Complexity, Finite Pixel Correlations and Optimal Denoising

A. Levin, B. Nadler, F. Durand, W. Freeman, (2012) European Conference Computer Vision (ECCV), 2012..

Active Learning with Distributional Estimates

J. Roeder, B. Nadler, K. Kunzmann And F.A. Hamprecht (2012) Conference on Uncertainty in Artificial Intelligence (UAI), 2012..

2011

Performance of Eigenvalue-based Signal Detectors with Known and Unknown Noise Level

Boaz Nadler; Federico Penna; Roberto Garello (2011) 2011 Ieee International Conference On Communications (Icc). Abstract

In this paper we consider signal detection in cognitive radio networks, under a non-parametric, multi-sensor detection scenario, and compare the cases of known and unknown noise level. The analysis is focused on two eigenvalue-based methods, namely Roy's largest root test, which requires knowledge of the noise variance, and the generalized likelihood ratio test, which can be interpreted as a test of the largest eigenvalue vs. a maximum-likelihood estimate of the noise variance. The detection performance of the two considered methods is expressed by closed-form analytical formulas, shown to be accurate even for small number of sensors and samples. We then derive an expression of the gap between the two detectors in terms of the signal-to-noise ratio of the signal to be detected, and we identify critical settings where this gap is significant (e. g., small number of sensors and signal strength). Our results thus provide a measure of the impact of noise level knowledge and highlight the importance of accurate noise estimation.

Model Selection for Sinusoids in Noise: Statistical Analysis and a New Penalty Term

Boaz Nadler; Aryeh (Leonid) Kontorovich (2011) IEEE Transactions on Signal Processing. 59, 4, p. 1333-1345 Abstract

Detection of the number of sinusoids embedded in noise is a fundamental problem in statistical signal processing. Most parametric methods minimize the sum of a data fit (likelihood) term and a complexity penalty term. The latter is often derived via information theoretic criteria, such as minimum description length (MDL), or via Bayesian approaches including Bayesian information criterion (BIC) or maximum a posteriori (MAP). While the resulting estimators are asymptotically consistent, empirically their finite sample performance is strongly dependent on the specific penalty term chosen. In this paper we elucidate the source of this behavior, by relating the detection performance to the extreme value distribution of the maximum of the periodogram and of related random fields. Based on this relation, we propose a combined detection-estimation algorithm with a new penalty term. Our proposed penalty term is sharp in the sense that the resulting estimator achieves a nearly constant false alarm rate. A series of simulations support our theoretical analysis and show the superior detection performance of the suggested estimator.

DETECTION PERFORMANCE OF ROY'S LARGEST ROOT TEST WHEN THE NOISE COVARIANCE MATRIX IS ARBITRARY

Boaz Nadler; I. M. Johnstone (2011) 2011 Ieee Statistical Signal Processing Workshop (Ssp). p. 681-684 Abstract

Detecting the presence of a signal embedded in noise from a multi-sensor system is a fundamental problem in signal and array processing. In this paper we consider the case where the noise covariance matrix is arbitrary and unknown but we are given both signal bearing and noise-only samples. Using a matrix perturbation approach, combined with known results on the eigenvalues of inverse Wishart matrices, we study the behavior of the largest eigenvalue of the relevant covariance matrix, and derive an approximate expression for the detection probability of Roy's largest root test. The accuracy of our expressions is confirmed by simulations.

On the distribution of the ratio of the largest eigenvalue to the trace of a Wishart matrix

Boaz Nadler (2011) Journal of Multivariate Analysis. 102, 2, p. 363-371 Abstract

The ratio of the largest eigenvalue divided by the trace of a p x p random Wishart matrix with n degrees of freedom and an identity covariance matrix plays an important role in various hypothesis testing problems both in statistics and in signal processing In this paper we derive an approximate explicit expression for the distribution of this ratio by considering the joint limit as both p, n -> infinity with p/n -> c Our analysis reveals that even though asymptotically in this limit the ratio follows a Tracy-Widom (TW) distribution one of the leading error terms depends on the second derivative of the TW distribution and is non-negligible for practical values of p in particular for determining tail probabilities We thus propose to explicitly include this term in the approximate distribution for the ratio We illustrate empirically using simulations that adding this term to the TW distribution yields a quite accurate expression to the empirical distribution of the ratio even for small values of p, n (C) 2010 Elsevier Inc All rights reserved

Matlab Code available here

On the distribution of Roy's Largest Root Test in MANOVA and in Signal Detection in Noise

B. Nadler And I. M. Johnstone (2011) Technical Report, Department of Statistics, Stanford University, 2011..

Vectorial Phase Retrieval for Linear Characterization of Attosecond Pulses

Oren Raz; Oded Schwartz; D. Austin et al. (2011) Physical Review Letters. 107, 13, Abstract

The waveforms of attosecond pulses produced by high-harmonic generation carry information on the electronic structure and dynamics in atomic and molecular systems. Current methods for the temporal characterization of such pulses have limited sensitivity and impose significant experimental complexity. We propose a new linear and all-optical method inspired by widely used multidimensional phase retrieval algorithms. Our new scheme is based on the spectral measurement of two attosecond sources and their interference. As an example, we focus on the case of spectral polarization measurements of attosecond pulses, relying on their most fundamental property-being well confined in time. We demonstrate this method numerically by reconstructing the temporal profiles of attosecond pulses generated from aligned CO(2) molecules.

[All authors]

Natural Image Denoising: Optimality and Inherent Bounds

Anat Levin; Boaz Nadler (2011) 2011 Ieee Conference On Computer Vision And Pattern Recognition (Cvpr). p. 40 Abstract

The goal of natural image denoising is to estimate a clean version of a given noisy image, utilizing prior knowledge on the statistics of natural images. The problem has been studied intensively with considerable progress made in recent years. However, it seems that image denoising algorithms are starting to converge and recent algorithms improve over previous ones by only fractional dB values. It is thus important to understand how much more can we still improve natural image denoising algorithms and what are the inherent limits imposed by the actual statistics of the data. The challenge in evaluating such limits is that constructing proper models of natural image statistics is a long standing and yet unsolved problem. To overcome the absence of accurate image priors, this paper takes a non parametric approach and represents the distribution of natural images using a huge set of 1010 patches. We then derive a simple statistical measure which provides a lower bound on the optimal Bayesian minimum mean square error (MMSE). This imposes a limit on the best possible results of denoising algorithms which utilize a fixed support around a denoised pixel and a generic natural image prior. Our findings suggest that for small windows, state of the art denoising algorithms are approaching optimality and cannot be further improved beyond similar to 0.1dB values.

2010

Nonparametric Detection of Signals by Information Theoretic Criteria: Performance Analysis and an Improved Estimator

Boaz Nadler (2010) IEEE Transactions on Signal Processing. 58, 5, p. 2746-2756 Abstract

Determining the number of sources from observed data is a fundamental problem in many scientific fields. In this paper we consider the nonparametric setting, and focus on the detection performance of two popular estimators based on information theoretic criteria, the Akaike information criterion (AIC) and minimum description length (MDL). We present three contributions on this subject. First, we derive a new expression for the detection performance of the MDL estimator, which exhibits a much closer fit to simulations in comparison to previous formulas. Second, we present a random matrix theory viewpoint of the performance of the AIC estimator, including approximate analytical formulas for its overestimation probability. Finally, we show that a small increase in the penalty term of AIC leads to an estimator with a very good detection performance and a negligible overestimation probability.

Detecting Faint Curved Edges in Noisy Images

Sharon Alpert; Meirav Galun; Boaz Nadler; Ronen Basri (2010) Computer Vision-Eccv 2010, Pt Iv. 6314, p. 750-763 Abstract

A fundamental question for edge detection is how faint an edge can be and still be detected. In this paper we offer a formalism to study this question and subsequently introduce a hierarchical edge detection algorithm designed to detect faint curved edges in noisy images. In our formalism we view edge detection as a search in a space of feasible curves, and derive expressions to characterize the behavior of the optimal detection threshold as a function of curve length and the combinatorics of the search space. We then present an algorithm that efficiently searches for edges through a very large set of curves by hierarchically constructing difference filters that match the curves traced by the sought edges. We demonstrate the utility of our algorithm in simulations and in applications to challenging real images.

Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps

Rui Xu; Steven Damelin; Boaz Nadler; Donald C., Ii Wunsch (2010) Artificial Intelligence in Medicine. 48, 3-Feb, p. 91-98 Abstract

Objective: The importance of gene expression data in cancer diagnosis and treatment has become widely known by cancer researchers in recent years. However, one of the major challenges in the computational analysis of such data is the curse of dimensionality because of the overwhelming number of variables measured (genes) versus the small number of samples. Here, we use a two-step method to reduce the dimension of gene expression data and aim to address the problem of high dimensionality. Methods: First, we extract a subset of genes based on statistical characteristics of their corresponding gene expression levels. Then, for further dimensionality reduction, we apply diffusion maps, which interpret the eigenfunctions of Markov matrices as a system of coordinates on the original data set, in order to obtain efficient representation of data geometric descriptions. Finally, a neural network clustering theory, fuzzy ART, is applied to the resulting data to generate clusters of cancer samples. Results: Experimental results on the small round blue-cell tumor data set, compared with other widely used clustering algorithms, such as the hierarchical clustering algorithm and K-means, show that our proposed method can effectively identify different cancer types and generate high-quality cancer sample clusters. Conclusion: The proposed feature selection methods and diffusion maps can achieve useful information from the multidimensional gene expression data and prove effective at addressing the problem of high dimensionality inherent in gene expression data analysis. (C) 2009 Elsevier BM. All rights reserved.

Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi Supervised Learning

M. Gavish, B. Nadler, R.R. Coifman (2010) International Conference on Machine Learning, 2010..

Global Features of Neural Activity in the Olfactory System Form a Parallel Code That Predicts Olfactory Behavior and Perception

Rafi Haddad; Tali Weiss; Rehan Khan; Boaz Nadler; Nathalie Mandairon; Moustafa Bensafi; Elad Schneidman; Noam Sobel (2010) Journal of Neuroscience. 30, 27, p. 9017-9026 Abstract

Odor identity is coded in spatiotemporal patterns of neural activity in the olfactory bulb. Here we asked whether meaningful olfactory information could also be read from the global olfactory neural population response. We applied standard statistical methods of dimensionality-reduction to neural activity from 12 previously published studies using seven different species. Four studies reported olfactory receptor activity, seven reported glomerulus activity, and one reported the activity of projection-neurons. We found two linear axes of neural population activity that accounted for more than half of the variance in neural response across species. The first axis was correlated with the total sum of odor-induced neural activity, and reflected the behavior of approach or withdrawal in animals, and odorant pleasantness in humans. The second and orthogonal axis reflected odorant toxicity across species. We conclude that in parallel with spatiotemporal pattern coding, the olfactory system can use simple global computations to read vital olfactory information from the neural population response.

Parametric Joint Detection-Estimation of the Number of Sources in Array Processing

N. Arkind, B. Nadler (2010) Sixth IEEE Sensor Array and Multichannel Signal Processing Workshop, 2010..

2009

Diffusion Interpretation of Nonlocal Neighborhood Filters for Signal Denoising

Amit Singer; Yoel Shkolnisky; Boaz Nadler (2009) SIAM Journal on Imaging Sciences. 2, 1, p. 118-139 Abstract

Nonlocal neighborhood filters are modern and powerful techniques for image and signal denoising. In this paper, we give a probabilistic interpretation and analysis of the method viewed as a random walk on the patch space. We show that the method is intimately connected to the characteristics of diffusion processes, their escape times over potential barriers, and their spectral decomposition. In particular, the eigenstructure of the diffusion operator leads to novel insights on the performance and limitations of the denoising method, as well as a proposal for an improved filtering algorithm.

Predicting Neuroendocrine Tumor (Carcinoid) Neoplasia Using Gene Expression Profiling and Supervised Machine Learning

Ignat Drozdov; Mark Kidd; Boaz Nadler; Robert L. Camp; Shrikant M. Mane; Oyvind Hauso; Bjorn I. Gustafsson; Irvin M. Modlin (2009) Cancer. 115, 8, p. 1638-1650 Abstract

BACKGROUND: A more accurate taxonomy of small intestinal (SI) neuroendocrine tumors (NETs) is necessary to accurately predict tumor behavior and prognosis and to define therapeutic strategy. In this study, the authors identified a panel of such markers that have been implicated in tumorigenicity, metastasis, and hormone production and hypothesized that transcript levels of the genes melanoma antigen family D2 (MAGE-D2), metastasis-associated 1 (MTA1), nucleosome assembly protein I-like (NAP1L1), Ki-67 (a marker of proliferation), survivin, frizzled homolog 7 (FZD7), the Kiss1 metastasis suppressor (Kiss1), neuropilin 2 (NPP2), and chromogranin A (CgA) could be used to define primary SI NETs and to predict the development of metastases. METHODS: Seventy-three clinically and World Health Organization pathologically classified NET samples (primary tumor, n = 44 samples; liver metastases, n = 29 samples) and 30 normal human enterochromafffin (EC) cell preparations were analyzed using real-time polymerase chain reaction. Transcript levels were normalized to 3 NET housekeeping genes (asparagine-linked glycosylation 9 or ALG9, transcription factor CP2 or TFCP2, and zinc finger protein 410 or ZNF410) using geNorm analysis. A predictive gene-based model was constructed using supervised learning algorithms from the transcript expression levels. RESULTS: Primary SI NETS could be differentiated from normal human EC cell preparations with 100% specificity and 92% sensitivity. Well differentiated NETs (WDNETs), well differentiated neuroendocrine carcinomas, and poorly differentiated NETs (PDNETs) were classified with a specificity of 78%, 78%, and 71%, respectively; whereas poorly differentiated neuroendocrine carcinomas were misclassified as either WDNETs or PDNETs. Metastases were predicted in all cases with 100% sensitivity and specificity. CONCLUSIONS: The current results indicated that gene expression profiling and supervised machine learning can be used to classify SI NE

REFINED PERTURBATION BOUNDS FOR EIGENVALUES OF HERMITIAN AND NON-HERMITIAN MATRICES

I. C. F. Ipsen; Boaz Nadler (2009) Siam Journal On Matrix Analysis And Applications. 31, p. 40-53 Abstract

We present eigenvalue bounds for perturbations of Hermitian matrices and express the change in eigenvalues in terms of a projection of the perturbation onto a particular eigenspace, rather than in terms of the full perturbation. The perturbations we consider are Hermitian of rank one, and Hermitian or non-Hermitian with norm smaller than the spectral gap of a specific eigenvalue. Applications include principal component analysis under a spiked covariance model, and pseudo-arclength continuation methods for the solution of nonlinear systems.

Universal Kernel-Based Learning with Applications to Regular Languages

Leonid (Aryeh) Kontorovich; Boaz Nadler (2009) Journal of Machine Learning Research. 10, p. 1095-1129 Abstract

We propose a novel framework for supervised learning of discrete concepts. Since the 1970's, the standard computational primitive has been to find the most consistent hypothesis in a given complexity class. In contrast, in this paper we propose a new basic operation: for each pair of input instances, count how many concepts of bounded complexity contain both of them. Our approach maps instances to a Hilbert space, whose metric is induced by a universal kernel coinciding with our computational primitive, and identifies concepts with half-spaces. We prove that all concepts are linearly separable under this mapping. Hence, given a labeled sample and an oracle for evaluating the universal kernel, we can efficiently compute a linear classifier (via SVM, for example) and use margin bounds to control its generalization error. Even though exact evaluation of the universal kernel may be infeasible, in various natural situations it is efficiently approximable. Though our approach is general, our main application is to regular languages. Our approach presents a substantial departure from current learning paradigms and in particular yields a novel method for learning this fundamental concept class. Unlike existing techniques, we make no structural assumptions on the corresponding unknown automata, the string distribution or the completeness of the training set. Instead, given a labeled sample our algorithm outputs a classifier with guaranteed distribution-free generalization bounds; to our knowledge, the proposed framework is the only one capable of achieving the latter. Along the way, we touch upon several fundamental questions in complexity, automata, and machine learning.

Semi-Supervised Learning with the Graph-Laplacian: The limit of infinite unlabelled data

B. Nadler, N. Srebro And X. Zhou (2009) Neural Information Processing Systems (NIPS), Vol. 22, 2009..

On Consistency and Sparsity for Principal Components Analysis in High Dimensions Discussion

Boaz Nadler (2009) Journal of the American Statistical Association. 104, 486, p. 694-697

Principal Component Analysis, Hierarchical Clustering, and Decision Tree Assessment of Plasma mRNA and Hormone Levels as an Early Detection Strategy for Small Intestinal Neuroendocrine (Carcinoid) Tumors

Irvin M. Modlin; Bjorn I. Gustafsson; Ignat Drozdov; Boaz Nadler; Roswitha Pfragner; Mark Kidd (2009) Annals of Surgical Oncology. 16, 2, p. 487-498 Abstract

Incidence of neuroendocrine tumors (NETs) is increasing (approximately 6%/year), but clinical presentation is nonspecific, resulting in delays in diagnosis (5-7 years; approximately 70% have metastases). This reflects absence of a sensitive plasma marker. The aim of this study is to investigate whether detection of circulating messenger RNA (mRNA) alone or in combination with circulating NET-related hormones and growth factors can detect gastrointestinal NET disease. The small intestinal (SI) NET cell line KRJ-I was used to define the sensitivity of real-time polymerase chain reaction (PCR) for mRNA detection in blood. NSE, Tph-1, and VMAT (2) transcripts were identified from one KRJ-I cell/ml blood. mRNA from the tissue and plasma of SI-NETs (n = 12) and gastric NETs (n = 7), and plasma from healthy controls (n = 9) was isolated and real-time PCR performed. Tph-1 was a specific marker of SI-NETs (58%, p <0.03) whereas CgA transcripts did not differentiate tumors from controls. Patients with metastatic disease expressed more marker transcripts than localized tumors (75% versus 18%, p <0.02). Plasma 5-hydroxytryptamine (5-HT), chromogranin A (CgA), ghrelin, and connective tissue growth factor (CTGF) fragments were measured, combined with mRNA levels, and a predictive mathematical model for NET diagnosis developed using decision trees. The sensitivity and specificity to diagnose SI-NETs and gastric NETs were 81.2% and 100%, and 71.4% and 55.6%, respectively. We conclude that mRNA from one NET cell/ml blood can be detected. Circulating plasma Tph-1 is a promising marker gene for SI-NET disease (specificity 100%) while an increased number of marker transcripts (> 2) correlated with disease spread. Including NET-related circulating hormones and growth factors in the algorithm increased the sensitivity of detection of SI-NETs from 58 to 82%.

Non-Parametric Detection of the Number of Signals: Hypothesis Testing and Random Matrix Theory

Shira Kritchman; Boaz Nadler (2009) IEEE Transactions on Signal Processing. 57, 10, p. 3930-3941 Abstract

Detection of the number of signals embedded in noise is a fundamental problem in signal and array processing. This paper focuses on the non-parametric setting where no knowledge of the array manifold is assumed. First, we present a detailed statistical analysis of this problem, including an analysis of the signal strength required for detection with high probability, and the form of the optimal detection test under certain conditions where such a test exists. Second, combining this analysis with recent results from random matrix theory, we present a new algorithm for detection of the number of sources via a sequence of hypothesis tests. We theoretically analyze the consistency and detection performance of the proposed algorithm, showing its superiority compared to the standard minimum description length (MDL)-based estimator. A series of simulations confirm our theoretical analysis.

2008

Determining the number of components in a factor model from limited noisy data

Shira Kritchman; Boaz Nadler (2008) Chemometrics and Intelligent Laboratory Systems. 94, 1, p. 19-32 Abstract

Determining the number of components in a linear mixture model is a fundamental problem in many scientific fields, including chemometrics and signal processing. In this paper we present a new method to automatically determine the number of components from a limited number of (possibly) high dimensional noisy samples. The proposed method, based on the eigenvalues of the sample covariance matrix, combines a matrix perturbation approach for the interaction of signal and noise eigenvalues, with recent results from random matrix theory regarding the behavior of noise eigenvalues. We present the theoretical derivation of the algorithm and an analysis of its consistency and limit of detection. Results on simulated data show that under a wide range of conditions our method compares favorably with other common algorithms. (C) 2008 Elsevier B.V. All rights reserved.

Matlab Code available here

REJOINDER OF: TREELETS - AN ADAPTIVE MULTI-SCALE BASIS FOR SPARE UNORDERED DATA

Ann B. Lee; Boaz Nadler; Larry Wasserman (2008) Annals of Applied Statistics. 2, 2, p. 494-500

TREELETS - AN ADAPTIVE MULTI-SCALE BASIS FOR SPARSE UNORDERED DATA

Ann B. Lee; Boaz Nadler; Larry Wasserman (2008) Annals of Applied Statistics. 2, 2, p. 435-471 Abstract

In many modern applications, including analysis of gene expression and texts documents, the data are noisy, high-dimensional, and unordered-with no particular meaning to the given order of the variables. Yet, successful learning is often possible due to sparsity: the fact that the data are typically redundant with underlying structures that can be represented by only a few features. In this paper we present treelets-a novel construction of multi-scale bases that extends wavelets to nonsmooth signals. The method is fully adaptive, as it returns a hierarchical tree and an orthonormal basis which both reflect the internal structure of the data. Treelets are especially well-suited as dimensionality reduction and feature selection tool prior to regression and classification, in situations where sample sizes are small and the data are sparse with unknown groupings of correlated or collinear variables. The method is also simple to implement and analyze theoretically. Here we describe a variety of situations where treelets perform better than principal component analysis, as well as some common variable selection and cluster averaging schemes. We illustrate treelets on a blocked covariance model and on several data sets (hyperspectral image data, DNA microarray data, and internet advertisements) with highly complex dependencies between variables.

Matlab Code available here

FINITE SAMPLE APPROXIMATION RESULTS FOR PRINCIPAL COMPONENT ANALYSIS: A MATRIX PERTURBATION APPROACH

Boaz Nadler (2008) Annals of Statistics. 36, 6, p. 2791-2817 Abstract

Principal component analysis (PCA) is a standard tool for dimensional reduction of a set of n observations (samples), each with p variables. In this paper, using a matrix perturbation approach, we study the nonasymptotic relation between the eigenvalues and eigenvectors of PCA computed on a finite sample of size n, and those of the limiting population PCA as n -> infinity. As in machine learning, we present a finite sample theorem which holds with high probability for the closeness between the leading eigenvalue and eigenvector of sample PCA and population PCA under a spiked covariance model. In addition, we also consider the relation between finite sample PCA and the asymptotic results in the joint limit p, n -> infinity, with p/n = c. We present a matrix perturbation view of the "phase transition phenomenon," and a simple linear-algebra based derivation of the eigenvalue and eigenvector overlap in this asymptotic limit. Moreover, our analysis also applies for finite p, n where we show that although there is no sharp phase transition as in the infinite case, either as a function of noise level or as a function of sample size n, the eigenvector of sample PICA may exhibit a sharp "loss of tracking," suddenly losing its relation to the (true) eigenvector of the population PCA matrix. This occurs due to a crossover between the eigenvalue due to the signal and the largest eigenvalue due to noise, whose eigenvector points in a random direction.

Nonparametric detection of the number of signals and random matrix theory

Shira Kritchman; Boaz Nadler (2008) 2008 42Nd Asilomar Conference On Signals, Systems And Computers, Vols 1-4. p. 1680-1683 Abstract

Nonparametric detection of the number of sources is a fundamental and extensively studied problem in signal processing. The common approach to solving this problem is via information theoretic criteria, in particular minimum description length (MDL). In this paper we propose a different approach and present a new algorithm based on a sequence of hypothesis tests. Our algorithm combines recent results from random matrix theory with a matrix perturbation approach. The new method has superior performance over the standard MDL estimator, as shown by simulations.

3D shape reconstruction of Mooney faces

Ira Kemelmacher-Shlizerman; Ronen Basri; Boaz Nadler (2008) 2008 Ieee Conference On Computer Vision And Pattern Recognition, Vols 1-12. p. 3323-3330 Abstract

Two-tone ("Mooney") images seem to arouse vivid 3D percept of faces, both familiar and unfamiliar, despite their seemingly poor content. Recent psychological and fMRI studies suggest that this percept is guided primarily by top-down procedures in which recognition precedes reconstruction. In this paper we investigate this hypothesis from a mathematical standpoint. We show that indeed, under standard shape from shading assumptions, a Mooney image can give rise to multiple different 3D reconstructions even if reconstruction is restricted to the Mooney transition curve (the boundary curve between black and white) alone. We then use top-down reconstruction methods to recover the shape of novel faces from single Mooney images exploiting prior knowledge of the structure of at least one face of a different individual. We apply these methods to thresholded images of real faces and compare the reconstruction quality relative to reconstruction from gray level images.

DIFFUSION MAPS, REDUCTION COORDINATES, AND LOW DIMENSIONAL REPRESENTATION OF STOCHASTIC SYSTEMS

R. R. Coifman; I. G. Kevrekidis; S. Lafon; M. Maggioni; Boaz Nadler (2008) Multiscale Modeling & Simulation. 7, 2, p. 842-864 Abstract

The concise representation of complex high dimensional stochastic systems via a few reduced coordinates is an important problem in computational physics, chemistry, and biology. In this paper we use the first few eigenfunctions of the backward Fokker-Planck diffusion operator as a coarse-grained low dimensional representation for the long-term evolution of a stochastic system and show that they are optimal under a certain mean squared error criterion. We denote the mapping from physical space to these eigenfunctions as the diffusion map. While in high dimensional systems these eigenfunctions are difficult to compute numerically by conventional methods such as finite differences or finite elements, we describe a simple computational data-driven method to approximate them from a large set of simulated data. Our method is based on de. ning an appropriately weighted graph on the set of simulated data and computing the first few eigenvectors and eigenvalues of the corresponding random walk matrix on this graph. Thus, our algorithm incorporates the local geometry and density at each point into a global picture that merges data from different simulation runs in a natural way. Furthermore, we describe lifting and restriction operators between the diffusion map space and the original space. These operators facilitate the description of the coarse-grained dynamics, possibly in the form of a low dimensional effective free energy surface parameterized by the diffusion map reduction coordinates. They also enable a systematic exploration of such effective free energy surfaces through the design of additional "intelligently biased" computational experiments. We conclude by demonstrating our method in a few examples.

2007

Variable-free exploration of stochastic models: A gene regulatory network example

Radek Erban; Thomas A. Frewen; Xiao Wang; Timothy C. Elston; Ronald Coifman; Boaz Nadler; Ioannis G. Kevrekidis (2007) Journal of Chemical Physics. 126, 15, Abstract

Finding coarse-grained, low-dimensional descriptions is an important task in the analysis of complex, stochastic models of gene regulatory networks. This task involves (a) identifying observables that best describe the state of these complex systems and (b) characterizing the dynamics of the observables. In a previous paper [R. Erban , J. Chem. Phys. 124, 084106 (2006)] the authors assumed that good observables were known a priori, and presented an equation-free approach to approximate coarse-grained quantities (i.e., effective drift and diffusion coefficients) that characterize the long-time behavior of the observables. Here we use diffusion maps [R. Coifman , Proc. Natl. Acad. Sci. U.S.A. 102, 7426 (2005)] to extract appropriate observables ("reduction coordinates") in an automated fashion; these involve the leading eigenvectors of a weighted Laplacian on a graph constructed from network simulation data. We present lifting and restriction procedures for translating between physical variables and these data-based observables. These procedures allow us to perform equation-free, coarse-grained computations characterizing the long-term dynamics through the design and processing of short bursts of stochastic simulation initialized at appropriate values of the data-based observables. (c) 2007 American Institute of Physics.

GeneChip, geNorm, and gastrointestinal tumors: novel reference genes for real-time PCR

Mark Kidd; Boaz Nadler; Shrikant Mane; Geeta Eick; Maximillian Malfertheiner; Manish Champaneria; Roswitha Pfragner; Irvin Modlin (2007) Physiological Genomics. 30, 3, p. 363-370 Abstract

Accurate quantitation of target genes depends on correct normalization. Use of genes with variable tissue transcription ( GAPDH) is problematic, particularly in clinical samples, which are derived from different tissue sources. Using a large- scale gene database ( Affymetrix U133A) data set of 36 gastrointestinal ( GI) tumors and normal tissues, we identified 8 candidate reference genes and established expression levels by real-time RT- PCR in an independent data set ( n = 42). A geometric averaging method ( geNorm) identified ALG9, TFCP2, and ZNF410 as the most robustly expressed control genes. Examination of raw CT values demonstrated that these genes were tightly correlated between themselves ( R(2) > 0.86, P <0.0001), with low variability [ coefficient of variation ( CV) <12.7%] and high interassay reproducibility ( r = 0.93, P = 0.001). In comparison, the alternative control gene, GAPDH, exhibited the highest variability ( CV = 18.1%), was significantly differently expressed between tissue types ( P = 0.05), was poorly correlated with the three reference genes ( R2 <0.4), and was considered the least stable gene. To illustrate the importance of correct normalization, the target gene, MTA1, was significantly overexpressed ( P = 0.0006) in primary GI neuroendocrine tumor ( NET) samples ( vs. normal GI samples) when normalized by geNorm(ATZ) but not when normalized using GAPDH. The geNormATZ approach was, in addition, applicable to adenocarcinomas; MTA1 was overexpressed ( P <0.04) in malignant colon, pancreas, and breast tumors compared with normal tissues. We provide a robust basis for the establishment of a reference gene set using GeneChip data and provide evidence for the utility of normalizing a malignancy- associated gene ( MTA1) using novel reference genes and the geNorm approach in GI NETs as well as in adenocarcinomas and breast tumors.

Diffusion Maps - a probabilistic interpretation for spectral embedding and clustering algorithms

B. Nadler, S. Lafon, R. R. Coifman, I. G. Kevrekidis (2007) in Principal Manifolds for Data Visualization and Dimension Reduction, A.N. Gorban, B. Kegl, D.C. Wunsch and A. Zinovyev (Eds), Springer, 2007. .

2006

Diffusion maps, spectral clustering and reaction coordinates of dynamical systems

Boaz Nadler; Stephane Lafon; Ronald R. Coifman; Ioannis G. Kevrekidis (2006) Applied and Computational Harmonic Analysis. 21, 1, p. 113-127 Abstract

A central problem in data analysis is the low dimensional representation of high dimensional data and the concise description of its underlying geometry and density. In the analysis of large scale simulations of complex dynamical systems, where the notion of time evolution comes into play, important problems are the identification of slow variables and dynamically meaningful reaction coordinates that capture the long time evolution of the system. In this paper we provide a unifying view of these apparently different tasks, by considering a family of diffusion maps, defined as the embedding of complex (high dimensional) data onto a low dimensional Euclidean space, via the eigenvectors of suitably defined random walks defined on the given datasets. Assuming that the data is randomly sampled from an underlying general probability distribution p(x) = e(-U(x)), we show that as the number of samples goes to infinity, the eigenvectors of each diffusion map converge to the eigenfunctions of a corresponding differential operator defined on the support of the probability distribution. Different normalizations of the Markov chain on the graph lead to different limiting differential operators. Specifically, the normalized graph Laplacian leads to a backward Fokker-Planck operator with an underlying potential of 2U(x), best suited for spectral clustering. A different anisotropic normalization of the random walk leads to the backward Fokker-Planck operator with the potential U(x), best suited for the analysis of the long time asymptotics of high dimensional stochastic systems governed by a stochastic differential equation with the same potential U(x). Finally, yet another normalization leads to the eigenfunctions of the Laplace-Beltrami (heat) operator on the manifold in which the data resides, best suited for the analysis of the geometry of the dataset regardless of its possibly non-uniform density. (C) 2006 Published by Elsevier Inc.

Fundamental Limitations of Spectral Clustering

B. Nadler, M. Galun (2006) Neural Information Processing systems, Vol. 19, 2006..

2005

Geometric Diffusion as a tool for harmonic analysis and structure definition of data, part I: Diffusion Maps

R. R. Coifman, S. Lafon, A.B. Lee, M. Maggioni,B. Nadler, F. Warner, S. Zucker (2005) Proceedings of the National Academy of Sciences, 102(21):7426-31 (2005)..

Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

B. Nadler, S. Lafon, R. R. Coifman, I. G. Kevrekidis (2005) Neural Information Processing Systems (NIPS), Vol 18, 2005..

Geometric Diffusion as a tool for harmonic analysis and structure definition of data, part II: Multiscale methods

R. R. Coifman, S. Lafon, A.B. Lee, M. Maggioni,B. Nadler, F. Warner, S. Zucker (2005) Proceedings of the National Academy of Sciences, 102(21):7432-37 (2005).

The prediction error in CLS and PLS: the importance of feature selection prior to multivariate calibration

B. Nadler, R. R. Coifman (2005) preprint of an article published in Journal of Chemometrics, 19(2):107-118 (2005).

The final version can be accessed at the publisher's website by clicking here

Partial least squares, Beer's law and the net analyte signal: statistical modeling and analysis

B. Nadler, R. R. Coifman (2005) preprint of an article published in Journal of Chemometrics, 19(1):45-54 (2005).

The final version can be accessed at the publisher's website by clicking here

_{This paper was awarded the Kowalski prize for best theoretical paper in J. Chemometrics in 2004-5.}

Langevin Trajectories between fixed concentrations

B. Nadler, Z. Schuss, A. Singer (2005) Physical Review Letters, 94 (2005) 218101..

2004

Saturation of conductance in single ion channels: The blocking effect of the near reaction field

B. Nadler, Z. Schuss, U. Hollerbach, R.S. Eisenberg, (2004) Physical Review E, 70 (2004) 051912.

Memoryless control of boundary concentrations of diffusing particles

A. Singer, Z. Schuss, B. Nadler, R.S. Eisenberg (2004) Physical Review E, 70 (2004) 061106..

Ionic diffusion through confined geometries: from Langevin equations to partial differential equations

B. Nadler, Z. Schuss, A. Singer, R.S. Eisenberg (2004) Journal of Physics: Condensed Matter, 16(22) (2004) S2153-S2165..

2003

Dielectric boundary force and its crucial role in gramicidin

B. Nadler, U. Hollerbach And R.S. Eisenberg (2003) Physical Review E, 68 (2003) 021905..

Connecting a Discrete Ionic Simulation to a Continuum

B. Nadler, T. Naeh, Z. Schuss (2003) SIAM Journal on Applied Mathematics, 63, Number 3 (2003), pp. 850-873..

2001

Derivation of Poisson and Nernst-Planck equations in a bath and channel from a molecular model

Z. Schuss, B. Nadler, R.S. Eisenberg (2001) Physical Review E, 64 (2001) 036116..

The Stationary Arrival Process of Independent Diffusers from a Continuum to an Absorbing Boundary Is Poissonian

B. Nadler, T. Naeh, Z. Schuss (2001) SIAM Journal on Applied Mathematics, 62(2) (2001), pp. 433-447..

1999

A qualitative and quantitative visibility analysis in urban scenes

B. Nadler, G. Fibich, S. Lev-Yehudi, Daniel Cohen-Or (1999) Computers and Graphics, 23(5):655-666 (1999)..