# Machine Learning and Statistics Seminar

This talk presents new computational and statistical barriers in machine learning, along with the algorithmic developments that they inspire.

The computational barriers arise in nonconvex optimization: we prove lower bounds on the (oracle) complexity of finding stationary points using (stochastic) gradient methods, showing that gradient descent is unimprovable for a natural class of problems. We bypass this barrier by designing an algorithm that outperforms gradient descent for a large subclass of problems with high-order smoothness. Our algorithm leverages classical momentum techniques from convex optimization using a "convex until proven guilty" principle that we develop.

The statistical barrier is the large amount of data required for adversarially robust learning. In a Gaussian model, we prove that unlabeled data allows us to circumvent an information theoretic gap between robust and standard classification. Our analysis directly leads to a general robust self-training procedure; we use it to significantly improve state-of-the-ar

In recent years there has been an increasing gap between the success of machine learning algorithms and our ability to explain their success theoretically.

Namely, many of the problems that are solved to a satisfactory degree of precision are computationally hard in the worst case. Fortunately, there are often reasonable assumptions which help us to get around these worst-case impediments and allow us to rigorously analyze heuristics that are used in practice.

In this talk I will advocate a complementary approach, where instead of explicitly characterizing some desired "niceness" properties of the data, we assume access to an optimization oracle that solves a relatively simpler task. This allows us to identify the sources of hardness and extend our theoretical understanding to new domains. Furthermore, we will show that seemingly innocent (and arguably justifiable) modifications to the oracle can lead to tractable reductions and even bypass hardness results.

We demonstrate these ideas using the following results: i) An efficient algorithm for non-convex online learning using an optimization oracle. ii) A faster boosting algorithm using a "simple" weak learner. iii) An efficient reduction from online to private learning.

Joint works with Naman Agarwal, Noga Alon, Elad Hazan, and Shay Moran.

It is becoming increasingly clear that the implicit regularization afforded by optimization algorithms plays a central role in machine learning, and especially so when using large, deep neural networks. We have a good understanding of the implicit regularization afforded by stochastic approximation algorithms, such as SGD, for convex problems: we can characterize the implicit bias of different algorithms and can design algorithms with specific biases. In this talk, however, I will focus on the implicit biases of local search algorithms for non-convex underdetermined problems, such as deep networks. In an effort to uncover the implicit biases of gradient-based optimization of neural networks, which hold the key to their empirical success, I will discuss recent work on implicit regularization for matrix factorization, linear convolutional networks, and two-layer ReLU networks, as well as a general bottom-up understanding of implicit regularization in terms of optimization geometry.
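As a small illustration of implicit bias in the simplest convex case (not taken from the talk): for an underdetermined least-squares problem, gradient descent started from zero converges to the interpolating solution of minimum Euclidean norm, even though the objective never asks for a small norm. The sketch below checks this numerically. The problem sizes, the step size, and the function name `gd_least_squares` are illustrative assumptions.

```python
import numpy as np

def gd_least_squares(A, b, steps=5000):
    """Plain gradient descent on 0.5 * ||A w - b||^2, started from w = 0."""
    # Step size 1 / L, where L is the squared spectral norm of A.
    lr = 1.0 / np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        w -= lr * A.T @ (A @ w - b)
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 20))   # 5 equations, 20 unknowns: infinitely many solutions
b = rng.standard_normal(5)

w_gd = gd_least_squares(A, b)
w_min_norm = np.linalg.pinv(A) @ b  # minimum-norm interpolating solution

print(np.linalg.norm(A @ w_gd - b))       # residual of the gradient descent solution
print(np.linalg.norm(w_gd - w_min_norm))  # distance to the minimum-norm solution
```

The reason is that every gradient lies in the row space of `A`, so iterates that start at zero never leave it, and the only interpolating point in that subspace is the minimum-norm one. A different initialization or a different algorithm would select a different solution.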

The linear-quadrati

We examine several empirical and theoretical results on the training of deep networks. For example: Why are common "over-fitting" indicators (e.g., very low training error, high validation loss) misleading? Why does the early-stopping time sometimes never arrive? Why can adaptive rate methods (e.g., Adam) degrade generalization? Why do some commonly used loss functions exhibit better generalization than others? Why use weight decay before batch-norm? When can we use low numerical precision, and how low can we go? We also discuss the practical implications of these results.

Bio: Since October 2017, Daniel Soudry has been an assistant professor (Taub Fellow) in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. Before that, he did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics, the Center for Theoretical Neuroscience, and the Grossman Center for Statistics of the Mind at Columbia University. He did his Ph.D. in the Department of Electrical Engineering at the Technion, Israel Institute of Technology, under the guidance of Prof. Ron Meir. He received his B.Sc. degree in Electrical Engineering and Physics from the Technion.

Universal learning is considered from an information theoretic point of view following the universal prediction approach pursued in the 90's by F&Merhav. Interestingly, the extension to learning is not straight-forwar

Complex classifiers may exhibit "embarrassing" failures even in "easy" cases where humans can provide a simple justified explanation. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class that captures this notion and determine its sample complexity. We also give evidence that efficient algorithms cannot achieve this sample complexity. We then derive a simple and efficient algorithm and show that its sample complexity is close to optimal, among efficient algorithms. Experiments on synthetic and sentiment analysis data demonstrate the efficacy of the method, both in terms of accuracy and interpretabilit

Deep learning has been among the fastest-growing fields in computer science and engineering. In recent years it has been shown that deep networks are vulnerable to attacks by adversarial examples. I will introduce a novel, flexible approach named Houdini for generating adversarial examples for complex and structured tasks, and demonstrate successful attacks on different applications such as speech recognition, pose estimation, semantic image segmentation, speaker verification, and malware detection. Then I will discuss how this weakness can be turned into two secure applications. The first is a new technique for watermarking deep network models in a black-box way, that is, concealing information within the model that the owner can use to claim ownership. The second is a novel method for speech steganography, namely hiding a secret spoken message within an ordinary public spoken message. I will conclude the talk with a brief discussion of our attempts to detect such adversarial attacks, based on multiple semantic label representations

Generative Adversarial Networks (GANs) are a recent algorithmic framework that has won considerable attention. In a nutshell, GANs receive as input an IID sample and output synthetic data that should resemble data from the true underlying distribution. For example, consider an algorithm that receives as input some tunes from a specific music genre (e.g. jazz, rock, pop) and then outputs a new, original tune from that genre.

From a theoretical perspective, the distinction between algorithms that genuinely generate original new examples and algorithms that perform naive manipulations (or even mere memorization) of the input sample is elusive. This makes the theoretical analysis of GAN algorithms challenging.

In this work we introduce two mathematical frameworks for the task of generating synthetic data. The first model we consider is inspired by GANs, and the learning algorithm has only indirect access to the target distribution via a discriminator. The second model, called DP-Foolability, exploits the notion of differential privacy as a criterion for "non-memorization".

We characterize learnability in each of these models and discuss the relations between them. As an application we prove that privately PAC learnable classes are DP-foolable. As we will discuss, this can be seen as an analogue of the equivalence between uniform convergence and learnability in classical PAC learning.

Joint work with Olivier Bousquet and Shay Moran.

https://arxiv.org/pdf/1902.03468.pdf

Recently, the Riemannian geometry of the space of symmetric positive-defini

* Joint work with Or Yair, Ori Katz, and Miri Ben-Chen

The Blind Source Separation (BSS) problem consists of retrieving signals of interest, termed the sources, from a dataset consisting of their mixtures. One of the most popular and common paradigms for solving the BSS problem is Independent Component Analysis (ICA), where the sources are assumed to be (only) mutually statistically independent random processes, and the mixtures are assumed to be linear combinations thereof, where the linear mixing operator is unknown. In this talk, we shall start with the Gaussian Maximum Likelihood (GML) approach for the semi-blind problem, in which the sources are assumed to be temporally-dive

We show that when data are endogenously truncated the widely-used IV fails to render the relationship causal as well as introduces bias into the exogenous covariates. We offer a newly-introduce

Deep learning has become pervasive in everyday life, powering language applications like Apple's Siri, Amazon's Alexa, and Google Translate. The inherent limitation of these deep learning systems, however, is that they often function as a "black box," preventing researchers and users from discerning the roles of different components and what they learn during the training process. In this talk, I will describe my research on interpreting deep learning models for language along three lines. First, I will present a methodological framework for investigating how these models capture various language properties. The experimental evaluation will reveal a learned hierarchy of internal representations in deep models for machine translation and speech recognition. Second, I will demonstrate that despite their success, deep models of language fail to deal even with simple kinds of noise, of the type that humans are naturally robust to. I will then propose simple methods for improving their robustness to noise. Finally, I will turn to an intriguing problem in language understanding, where dataset biases enable trivial solutions to complex language tasks. I will show how to design models that are more robust to such biases, and learn less biased latent representations

Understanding deep learning calls for addressing three fundamental questions: expressiveness, optimization and generalization. Expressiveness refers to the ability of compactly sized deep neural networks to represent functions capable of solving real-world problems. Optimization concerns the effectiveness of simple gradient-based algorithms in solving non-convex neural network training programs. Generalization treats the phenomenon of deep learning models not overfitting despite having many more parameters than examples to learn from. This talk will describe a series of works aimed at unraveling some of the mysteries behind optimization and expressiveness. I will begin by discussing recent analyses of optimization for deep linear neural networks. By studying the trajectories of gradient descent, we will derive the most general guarantee to date for efficient convergence to a global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that, sometimes, gradient descent can train a deep linear network faster than a classic linear model. In other words, depth can accelerate optimization, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem. In the second (shorter) part of the talk, I will present an equivalence between convolutional and recurrent networks (the most successful deep learning architectures to date) and hierarchical tensor decompositions. The equivalence brings forth answers to various questions concerning expressiveness, resulting in new theoretically-b

Optimization works covered in this talk were in collaboration with Sanjeev Arora, Elad Hazan, Noah Golowich and Wei Hu. Expressiveness works were with Amnon Shashua, Or Sharir, Yoav Levine, Ronen Tamari and David Yakira.

Despite their superb empirical performance, deep learning models for natural language processing (NLP) are often considered black boxes, as relatively little is known as to what accounts for their success. This lack of understanding turns model development into a slow and expensive trial-and-error process, which limits many researchers from developing state-of-the-ar

This is joint work with Hao Peng, Sam Thomson and Noah A. Smith

As algorithmic prediction systems have become more widespread, so too have concerns that these systems may be discriminatory against groups of people protected by laws and ethics. We present a recent line of work that takes a complexity theoretic perspective towards combating discrimination in prediction systems. We'll focus on fair classification within the versatile framework of Dwork et al. [ITCS'12], which assumes the existence of a metric that measures similarity between pairs of individuals. Unlike earlier work, we do not assume that the entire metric is known to the learning algorithm; instead, the learner can query this metric a bounded number of times. We propose a new notion of fairness called *metric multifairness* and show how to achieve this notion in our setting. Metric multifairness is parameterized by a similarity metric d on pairs of individuals to classify and a rich collection C of (possibly overlapping) "comparison sets" over pairs of individuals. At a high level, metric multifairness guarantees that *similar subpopulations are treated similarly*, as long as these subpopulations are identified within the class C.

The tremendous success of the Machine Learning paradigm heavily relies on the development of powerful optimization methods. The canonical algorithm for training learning models is SGD (Stochastic Gradient Descent), yet this method has its limitations. It is often unable to exploit useful statistical/geometric structure, it might degrade upon encountering prevalent non-convex phenomena, and it is hard to parallelize. In this talk I will discuss an ongoing line of research where we develop alternative methods that resolve some of SGD's limitations. The methods that I describe are as efficient as SGD, and implicitly adapt to the underlying structure of the problem in a data-dependent manner.

In the first part of the talk, I will discuss a method that is able to take advantage of hard/easy training samples. In the second part, I will discuss a method that enables an efficient parallelization of SGD. Finally, I will briefly describe a method that implicitly adapts to the smoothness and noise properties of the learning objective.

Single-particle cryo-electron microscopy (cryo-EM) is an innovative technology for elucidating structures of biological molecules at atomic-scale resolution. In a cryo-EM experiment, tomographic projections of a molecule, taken at unknown viewing directions, are embedded in highly noisy images at unknown locations. The cryo-EM problem is to estimate the 3-D structure of a molecule from these noisy images.

Inspired by cryo-EM, the talk will focus on two estimation problems: *multi-reference alignment* and *blind deconvolution*. These problems abstract away much of the intricacies of cryo-EM, while retaining some of its essential features. In multi-reference alignment, we aim to estimate a signal from its noisy, rotated observations. While the rotations and the signal are unknown, the goal is only to estimate the signal. In the blind deconvolution problem, the goal is to estimate a signal from its convolution with an unknown, sparse signal in the presence of noise. Focusing on the low SNR regime, I will propose the *method of moments* as a computationally efficient estimation framework for both problems and will introduce its properties. In particular, I will show that the method of moments allows estimating the sought signal accurately in any noise level, provided sufficiently many observations are collected, with only one pass over the data. I will then argue that the same principles carry through to cryo-EM, show examples, and draw potential implications.

There is a movement in the design of experiments away from the classic randomization put forward by Fisher, Cochran and others toward designs based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance, and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs relative to classic randomization in the context of different performance criteria, such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal design in different scenarios in practice. Until then, it is best to randomize.

A classical problem in causal inference is that of matching treatment units to control units in an observational dataset. This problem is distinct from simple estimation of treatment effects as it provides additional practical interpretability of the underlying causal mechanisms that is not available without matching. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of average treatment effect in each group, (iii) computing the matched pairs efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. We propose the Fast Large-scale Almost Matching Exactly (FLAME) framework to tackle these problems for categorical covariates. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. We demonstrate that this framework is able to construct good matched groups on relevant covariates and further extend the methodology to incorporate continuous and other complex covariates.

Related papers: https://arxiv.org/abs/1707.06315, https://arxiv.org/abs/1806.06802

Cryo-electron microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology, enabling reconstruction of molecules at near-atomic resolution.

Cryo-EM produces a large number of noisy two-dimensional tomographic projection images of a molecule, taken at unknown viewing directions.

The extreme levels of noise make classical tasks in statistics and signal processing, such as alignment, detection and clustering, very challenging.

I will start the talk by studying the multi-reference alignment problem, which can be interpreted as a simplified model for cryo-EM. In multi-reference alignment, we aim to estimate multiple signals from circularly-translated, unlabeled, noisy copies.

In high noise regimes, the measurements cannot be aligned or clustered. Nonetheless, accurate and efficient estimation can be achieved via group-invariant representations (invariant polynomials). Furthermore, such estimators achieve the optimal estimation rate.
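To make the idea of group-invariant features concrete, here is a minimal sketch for the single-signal case of multi-reference alignment (not the speaker's code). It estimates two shift-invariant quantities, the signal mean and the power spectrum, by averaging over all observations without ever aligning them. The signal length, sample count, and the assumption that the noise level `sigma` is known are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, sigma = 16, 20000, 1.0

x = rng.standard_normal(L)                        # unknown signal (ground truth for the demo)
shifts = rng.integers(0, L, size=n)               # unknown circular shifts
idx = (np.arange(L)[None, :] - shifts[:, None]) % L
Y = x[idx] + sigma * rng.standard_normal((n, L))  # row i: x rolled by shifts[i], plus noise

# First-order invariant: the signal mean is unchanged by circular shifts.
mean_est = Y.mean()

# Second-order invariant: the power spectrum |FFT|^2 is unchanged by circular shifts.
# White noise adds L * sigma^2 to every frequency (unnormalized FFT), so subtract it.
power_est = (np.abs(np.fft.fft(Y, axis=1)) ** 2).mean(axis=0) - L * sigma**2

power_true = np.abs(np.fft.fft(x)) ** 2
rel_err = np.linalg.norm(power_est - power_true) / np.linalg.norm(power_true)
print(rel_err)  # relative error of the power spectrum estimate
```

Both estimates are simple averages, so they can be accumulated in one pass over the data. The power spectrum alone does not determine the Fourier phases of the signal; a third-order invariant such as the bispectrum is commonly used for that step and is omitted here.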

Then, I will show how this framework can be applied to the problem of 2-D classification in cryo-EM. In the last part of the talk, I will introduce the analog invariants of the cryo-EM problem and discuss how they can be used for ab initio modeling.

The use of latent variables in probabilistic modeling is a standard approach in numerous data analysis applications. In recent years, there has been a surge of interest in spectral methods for latent variable models, where inference is done by analyzing the lower order moments of the observed data. In contrast to iterative approaches such as the EM algorithm, under appropriate conditions spectral methods are guaranteed to converge to the true model parameters given enough data samples.

The focus of the seminar is the development of novel spectral based methods for two problems in statistical machine learning. In the first part, we address unsupervised ensemble learning, where one obtains predictions from different sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to directly assess it. We develop algorithms to estimate the reliability of the classifiers based on a common assumption that different classifiers make statistically independent errors. In addition, we show how one can detect subsets of classifiers that strongly violate the model of independent errors, in a fully unsupervised manner.

In the second part of the seminar we show how one can use spectral methods to learn the parameters of binary latent variable models. This model has many applications such as overlapping clustering and Gaussian-Bernoulli restricted Boltzmann machines. Our methods are based on computing the eigenvectors of both the second and third moments of the observed variables.

For both problems, we show that spectral based methods can be applied effectively, achieving results that are state of the art in various problems in computational biology and population genetics.

This talk will include an introduction to the topic of V(D)J rearrangements of particular subsets of T cells and B cells of the adaptive human immune system, in particular of IgG heavy chains. There are many statistical problems that arise in trying to understand these cells. They involve estimating aspects of functionals of discrete probabilities on (random) finite sets. Topics include but are not limited to exchangeability, estimating non-centrality parameters, and estimating covariance matrices from what are called "replicates" that have been amplified by the PCR process and (partially) sequenced.

I have received considerable assistance from Lu Tian, and also Yi Liu; as well, I have been helped considerably by Andrew Fire and Scott Boyd, and also Jorg Goronzy

Interacting systems are prevalent in nature, from dynamical systems in physics to complex societal dynamics. In this talk I will introduce our neural relational inference model: an unsupervised model that learns to infer interactions while simultaneously learning the dynamics purely from observational data. Our model takes the form of a variational auto-encoder, in which the latent code represents the underlying interaction graph and the reconstruction is based on graph neural networks.

Language technology has become pervasive in everyday life, powering applications like Apple's Siri or Google's Assistant. Neural networks are a key component in these systems thanks to their ability to model large amounts of data. Contrary to traditional systems, models based on deep neural networks (a.k.a. deep learning) can be trained in an end-to-end fashion on input-output pairs, such as a sentence in one language and its translation in another language, or a speech utterance and its transcription. The end-to-end training paradigm simplifies the engineering process while giving the model flexibility to optimize for the desired task. This, however, often comes at the expense of model interpretability: understanding the role of different parts of the deep neural network is difficult, and such models are often perceived as "black-box". In this work, we study deep learning models for two core language technology tasks: machine translation and speech recognition. We advocate an approach that attempts to decode the information encoded in such models while they are being trained. We perform a range of experiments comparing different modules, layers, and representations in the end-to-end models. Our analyses illuminate the inner workings of end-to-end machine translation and speech recognition systems, explain how they capture different language properties, and suggest potential directions for improving them. The methodology is also applicable to other tasks in the language domain and beyond.

Machine learning has recently been revolutionized by the introduction of Deep Neural Networks. However, from a theoretical viewpoint these methods are still poorly understood. Indeed, the key challenge in Machine Learning today is to derive rigorous results for optimization and generalization in deep learning. In this talk I will present several tractable approaches to training neural networks. In the second part I will discuss a new sequential algorithm for decision making that can take into account the structure of the action space and is better attuned to realistic decision-making scenarios.

I will present our work that provides some of the first positive results and yields new, provably efficient, and practical algorithms for training certain types of neural networks. In a second work I will present a new online algorithm that learns by sequentially sampling random networks and asymptotically converges, in performance, to the optimal network. Our approach improves on previous random-features-based learning in terms of sample/computational complexity and expressiveness. In a more recent work we take a different perspective on this problem. I will provide sufficient conditions that guarantee tractable learning, using the notion of refutation complexity. I will then discuss how this new idea can lead to new, interesting generalization bounds that can potentially explain generalization in settings that are not always captured by classical theory.

In the setting of reinforcement learning I will present a recently developed algorithm for decision making in a metric action space. As an application, we consider a dynamic pricing problem in which a seller is faced with a stream of patient buyers. Each buyer buys at the lowest price in a certain time window. We use our algorithm to achieve optimal regret, improving on previously known regret bounds.

Cryo-EM is an imaging technology that is revolutionizing structural biology; the Nobel Prize in Chemistry 2017 was recently awarded to Jacques Dubochet, Joachim Frank and Richard Henderson "for developing cryo-electron microscopy for the high-resolution structure determination of biomolecules in solution".

Cryo-electron microscopes produce a large number of very noisy two-dimensional projection images of individual frozen molecules. Unlike related methods, such as computed tomography (CT), the viewing direction of each image is unknown. The unknown directions, together with extreme levels of noise and additional technical factors, make the determination of the structure of molecules challenging.

While other methods for structure determination, such as x-ray crystallography and nuclear magnetic resonance (NMR), measure ensembles of molecules together, cryo-EM produces measurements of individual molecules. Therefore, cryo-EM could potentially be used to study mixtures of different conformations of molecules. Indeed, current algorithms have been very successful at analyzing homogeneous samples, and can recover some distinct conformations mixed in solutions, but the determination of multiple conformations, and in particular continuums of similar conformations (continuous heterogeneity), remains one of the open problems in cryo-EM.

I will discuss a one-dimensional discrete model problem, Heterogeneous Multireference Alignment, which captures many of the group properties and other mathematical properties of the cryo-EM problem. I will then discuss different components which we are introducing in order to address the problem of continuous heterogeneity in cryo-EM: (1) "hyper-molecules", the first mathematical formulation of truly continuously heterogeneous molecules; (2) the optimal representation of objects that are highly concentrated in both the spatial domain and the frequency domain using high-dimensional prolate spheroidal functions; and (3) Bayesian algorithms for inverse problems with an unsupervised-learning component for recovering such hyper-molecules in cryo-EM.

Covariance matrix estimation is essential in many areas of modern Statistics and Machine Learning, including Graphical Models, Classification/Discriminant Analysis, Principal Component Analysis, and many others. Classical statistics suggests using the Sample Covariance Matrix (SCM), which is the Maximum Likelihood Estimator (MLE) for Gaussian populations. Real-world data, however, usually exhibits heavy-tailed behavior and/or contains outliers, making the SCM inefficient or even useless. This problem and many similar ones gave rise to the field of Robust Statistics in the early 1960s, where the main goal was to develop estimators that are stable under reasonable deviations from the basic Gaussian assumptions. One of the most prominent robust covariance matrix estimators was introduced and thoroughly studied by D. Tyler in the mid-1980s. This important representative of the family of M-estimators can be defined as an MLE of a certain population. The problem of robust covariance estimation becomes even more involved in the high-dimensional scenario, where the number of samples n is of the order of the dimension p, or even less. In such cases, prior knowledge, often referred to as structure, is utilized to decrease the number of degrees of freedom and make the estimation possible. Unlike the Gaussian setting, in Tyler's case even the imposition of linear structure becomes challenging due to the non-convexity of the negative log-likelihood. Recently, Tyler's target function was shown to become convex under a certain change of metric (geodesic convexity), which stimulated further investigation of the estimator.
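For readers unfamiliar with Tyler's estimator, the following is a minimal sketch of the standard fixed-point iteration for the unstructured case, assuming zero-mean samples and n > p. The function name, the trace normalization, the stopping rule, and the synthetic data are illustrative assumptions rather than details from the talk.

```python
import numpy as np

def tyler_estimator(X, iters=200, tol=1e-10):
    """Fixed-point iteration for Tyler's scatter estimator, normalized so trace = p.

    X has shape (n, p); rows are zero-mean samples, with n > p assumed.
    """
    n, p = X.shape
    sigma = np.eye(p)
    for _ in range(iters):
        # Quadratic forms q_i = x_i^T sigma^{-1} x_i for every sample.
        q = np.einsum("ij,ij->i", X @ np.linalg.inv(sigma), X)
        new = (p / n) * (X / q[:, None]).T @ X
        new *= p / np.trace(new)          # fix the arbitrary overall scale
        if np.linalg.norm(new - sigma) < tol:
            return new
        sigma = new
    return sigma

rng = np.random.default_rng(0)
p, n = 4, 500
true_shape = np.diag([4.0, 2.0, 1.0, 1.0])
Z = rng.standard_normal((n, p)) @ np.sqrt(true_shape)
heavy = Z * np.abs(rng.standard_cauchy(n))[:, None]   # wildly varying per-sample scales

est_clean = tyler_estimator(Z)
est_heavy = tyler_estimator(heavy)
print(np.abs(est_clean - est_heavy).max())   # the two estimates agree up to rounding
```

Each term in the update is unchanged when a sample is multiplied by a scalar, so the estimator depends only on the sample directions. This is why rescaling the Gaussian samples by heavy-tailed factors leaves the estimate unchanged, whereas the sample covariance of `heavy` would be dominated by a few extreme rows.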

In this work, we focus on the so-called group symmetry structure, which essentially means that the true covariance matrix commutes with a group of unitary matrices. In engineering applications such structures appear due to the natural symmetries of the physical processes; examples include circulant, perHermitian, proper quaternion matrices, etc. Group symmetric constraints are linear, and thus convex in the regular Euclidean metric. We show that they are also convex in the geodesic metric. These properties allow us to develop symmetric versions of the SCM and Tyler's estimator and build a general framework for their performance analysis. The classical results claim that at least n = p and n = p+1 samples in general position are necessary to ensure the existence and uniqueness of the SCM and Tyler's estimator, respectively. We significantly improve the sample complexity requirements for both estimators under the symmetry structure and show that in some cases even 1 or 2 samples are enough to guarantee the existence and uniqueness regardless of the ambient dimension.
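As a rough illustration of why symmetry lowers the number of samples needed, consider the circulant case, where the covariance commutes with the group of cyclic shifts. One natural way to impose this structure on the SCM, assumed here for illustration and not necessarily the construction used in the talk, is to average it over the group. With a single sample the plain SCM has rank one, while its symmetrized version is generically full rank.

```python
import numpy as np

def circulant_symmetrize(S):
    """Average S over conjugation by all cyclic shifts, giving a circulant matrix."""
    p = S.shape[0]
    out = np.zeros_like(S)
    for k in range(p):
        # Rolling rows and columns by k is conjugation by the k-th cyclic shift.
        out += np.roll(np.roll(S, k, axis=0), k, axis=1)
    return out / p

rng = np.random.default_rng(0)
p = 8
x = rng.standard_normal(p)

S = np.outer(x, x)                 # SCM built from a single sample: rank one
S_sym = circulant_symmetrize(S)    # symmetrized SCM

print(np.linalg.matrix_rank(S), np.linalg.matrix_rank(S_sym))
print(np.linalg.eigvalsh(S_sym).min())   # smallest eigenvalue of the symmetrized SCM
```

The eigenvalues of `S_sym` are the squared magnitudes of the discrete Fourier coefficients of `x` divided by `p`, which are nonzero for a generic sample. That is the mechanism that lets one observation suffice in this particular example.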

We consider the problem of hidden common manifold extraction from multiple data sets, which have observation-specific distortions and artifacts. A new manifold learning method is presented based on alternating products of diffusion operators and local kernels. We provide theoretical analysis showing that our method is able to build a variant of the Laplacian of the hidden common manifold, while suppressing the observation-specific artifacts. The generality of this method is demonstrated in data analysis applications, where different types of devices are used to measure the same activity. In particular, we present applications to problems in biomedicine, neuroscience, and audio analysis.

This is joint work with Roy Lederman and Hau-tieng Wu.

The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. In this talk we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory and discuss two recent developments.

The first work studies the generalization error of deep neural networks. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees margin preservation, but it leads to a loose bound on the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix.

The second work focuses on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations.

Joint work with Guillermo Sapiro, Miguel Rodrigues, Jure Sokolic, Alex Bronstein and Yonina Eldar.
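For readers unfamiliar with ISTA, which the abstract above refers to, here is a minimal pure-Python sketch for the problem of minimizing 0.5 * ||Ax - b||^2 + lam * ||x||_1. The function names, the list-of-lists matrix format and the fixed step size are illustrative assumptions, not the speakers' implementation. Each pass through the loop is a linear map followed by a shrinkage nonlinearity, which is the sense in which unrolled iterations resemble network layers.

```python
def soft_threshold(v, t):
    """Elementwise shrinkage: the proximal operator of t * ||.||_1."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def ista(A, b, lam, step, n_iter=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by iterative shrinkage.

    `step` is assumed to be at most 1 / L, where L is the largest
    eigenvalue of A^T A (assumption: the caller supplies a valid step).
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iter):
        # Residual r = A x - b.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Gradient of the smooth part: g = A^T r.
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by shrinkage.
        x = soft_threshold([x[j] - step * g[j] for j in range(n)], step * lam)
    return x


# With A equal to the identity, the minimizer is the soft-thresholded b.
x_hat = ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0)
```

With an identity matrix the iteration reaches its fixed point after one step, so the example is easy to check by hand.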

We propose a procedure (the first of its kind) for computing a fully data-dependent interval that traps the mixing time $t_{\mathrm{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge.

The interval is constructed around the relaxation time $t_{\mathrm{relax}}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $\Omega(t_{\mathrm{relax}})$ times on average. Future directions of research are identified. Time permitting, we will mention some recent further developments by D. Levin and Y. Peres.

Joint work with Daniel Hsu and Csaba Szepesvari.
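To make the quantity in the abstract above concrete, the following sketch computes a naive plug-in point estimate of the relaxation time from a single sample path of a two-state chain. It is only an illustration of what is being estimated: it is not the confidence-interval procedure from the talk, and the function names and the two-state restriction are assumptions made for this note. For a two-state chain with flip probabilities a and b, the eigenvalues of the transition matrix are 1 and 1 - a - b, so the relaxation time is 1 / (1 - |1 - a - b|).

```python
import random


def simulate_two_state(a, b, n, seed=0):
    """Sample a path of length n; a = P(0 -> 1), b = P(1 -> 0)."""
    rng = random.Random(seed)
    state, path = 0, [0]
    for _ in range(n - 1):
        flip = a if state == 0 else b
        if rng.random() < flip:
            state = 1 - state
        path.append(state)
    return path


def plugin_relaxation_time(path):
    """Plug-in estimate of t_relax for a two-state chain.

    Assumes both states are left at least once and that the estimated
    spectral gap is nonzero; otherwise a division by zero occurs.
    """
    counts = [[0, 0], [0, 0]]
    for s, t in zip(path, path[1:]):
        counts[s][t] += 1
    # Empirical transition probabilities 0 -> 1 and 1 -> 0.
    a_hat = counts[0][1] / (counts[0][0] + counts[0][1])
    b_hat = counts[1][0] / (counts[1][0] + counts[1][1])
    # Second eigenvalue of the estimated transition matrix.
    lambda_2 = 1.0 - a_hat - b_hat
    gap = 1.0 - abs(lambda_2)  # absolute spectral gap
    return 1.0 / gap


path = simulate_two_state(0.1, 0.2, 100_000)
estimate = plugin_relaxation_time(path)  # true value is 1 / 0.3
```

A point estimate like this carries no guarantee on its own, which is the gap the data-dependent interval is meant to close.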

By analytical and numerical studies of Deep Neural Networks (using standard TensorFlow) in the "Information Plane" - the Mutual Information the network layers preserve on the input and the output variables - we obtain the following new insights.

- The training epochs, for each layer, are divided into two phases: (1) fitting the training data - increasing the mutual information on the labels; (2) compressing the representation - reducing the mutual information on the inputs. The layers are learnt hierarchically, from the bottom to the top layer, with some overlaps.
- Most (~80%) of the training time - optimization with SGD - is spent on compressing the representation (the second phase) - NOT on fitting the training data labels, even when the training has no regularization or terms that directly aim at such compression.
- The convergence point, FOR EVERY HIDDEN LAYER, lies on or very close to the Information Bottleneck (IB) theoretical bound. Thus, the mappings from the input to the hidden layer and from the hidden layer to the output obey the IB self-consistent equations for some value of the compression-prediction tradeoff.
- The main benefit of adding more hidden layers is in the optimization/training time, as the compression phase for each layer amounts to relaxation to a Maximum conditional Entropy state, subject to the proper constraints on the error/information on the labels. As such relaxation takes super-linear time in the compressed entropy, adding more hidden layers dramatically reduces the training time. There is also benefit in sample complexity to adding hidden layers, but this is a smaller effect.

I will explain these new observations and the benefits of exploring Deep Learning in the "Information Plane", and discuss some of the exciting theoretical and practical consequences of our analysis.

Joint work with Ravid Ziv and Noga Zaslavsky.

We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is *never fully revealed* to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves $\widetilde \Theta(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

Projected gradient descent (PGD), and its close variants, are often considered the methods of choice for solving a large variety of machine learning optimization problems, including empirical risk minimization, statistical learning, and online convex optimization. This is not surprising, since PGD is often optimal in a very appealing information-theoretic sense. However, for many problems PGD is infeasible both in theory and in practice, since each step requires computing an orthogonal projection onto the feasible set. In many important cases, such as when the feasible set is a non-trivial polytope, or a convex surrogate for a low-rank structure, computing the projection is computationally inefficient in high-dimensional settings. An alternative is the conditional gradient method (CG), also known as the Frank-Wolfe algorithm, which replaces the expensive projection step with a linear optimization step over the feasible set. Indeed, in many problems of interest, the linear optimization step admits much more efficient algorithms than the projection step, which is the reason for the substantial renewed interest in this method in the past decade. On the downside, the convergence rates of the CG method often fall behind those of PGD and its variants.

In this talk I will survey an ongoing effort to design CG variants that on one hand enjoy the cheap iteration complexity of the original method, and on the other hand converge provably faster, and are applicable to a wider variety of machine learning settings. In particular I will focus on the cases in which the feasible set is either a polytope or a convex surrogate for low-rank matrices. Results will be demonstrated on applications including: LASSO, video co-localization, optical character recognition, matrix completion, and multi-class classification.
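The basic conditional gradient step described above can be sketched in a few lines. The example below uses the probability simplex as the feasible set, where the linear optimization step reduces to picking the coordinate with the smallest gradient entry. The function name, the quadratic objective and the classical 2 / (t + 2) step size are choices made for illustration; this is the textbook method, not one of the faster variants surveyed in the talk.

```python
def frank_wolfe_simplex(grad, dim, n_iter=2000):
    """Minimize a smooth convex function over the probability simplex.

    `grad` maps a point (list of floats) to its gradient (list of floats).
    """
    x = [1.0 / dim] * dim  # start at the barycenter
    for t in range(n_iter):
        g = grad(x)
        # Linear optimization step: <g, s> is minimized over the simplex
        # at the vertex with the smallest gradient coordinate.
        i_min = min(range(dim), key=lambda i: g[i])
        gamma = 2.0 / (t + 2.0)
        # Convex combination of the iterate and the vertex; no projection.
        x = [(1.0 - gamma) * xi for xi in x]
        x[i_min] += gamma
    return x


# Example: find the point of the simplex closest to `target`.
target = [0.2, 0.3, 0.5]
x_fw = frank_wolfe_simplex(
    lambda x: [2.0 * (xi - ci) for xi, ci in zip(x, target)], dim=3
)
```

Every iterate is a convex combination of simplex vertices, so feasibility holds by construction. The slow O(1/t) convergence of this basic scheme is the downside mentioned in the abstract.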

There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for essentially similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which fits training data without error but is nevertheless somewhat smooth. We show that AdaBoost has the same property. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.

It is common practice in multivariate and matrix-valued data analysis to reduce dimensionality by performing a Singular Value Decomposition or Principal Component Analysis, and keeping only $r$ singular values or principal components, the rest being presumably associated with noise. However, the literature does not propose a disciplined criterion to determine $r$; most practitioners still look for the "elbow in the Scree Plot", a 50-year-old heuristic performed by eye. I'll review a line of work which develops a systematic approach to eigenvalue and singular value thresholding. This approach assumes that the signal is low-rank and that the noise is rotationally invariant. Recent results derive optimal thresholds in the presence of quite general noise distributions.

Joint work with David Donoho, Iain Johnstone and Edgar Dobriban (Stanford).

Many sequence prediction tasks, such as automatic speech recognition and video analysis, benefit from long-range temporal features. One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields. Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations. We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding. In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video. I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.

This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel.

Parameter estimation is performed by fitting data measurements to a model using Bayesian statistics, assuming additional prior information. The estimation requires the numerical solution of a large-scale optimization problem, whose objective traditionally includes data fidelity and regularization terms. In this talk I will present numerical solution methods for two such estimation problems.

In the first part of the talk I will concentrate on parameter estimation of physical models, obtained by solving optimization problems that are constrained by partial differential equations (PDEs). I will focus on my recent work on 3D Full Waveform Inversion, which arises in seismic exploration of oil and gas reservoirs, earth sub-surface mapping, ultrasound imaging and more. I will demonstrate how to computationally treat this inverse problem, and improve its solution by using travel time tomography in a joint inversion framework. This includes efficient algorithms for the solution of the Helmholtz and eikonal equations (the two associated PDEs), and a parallel software framework for applying these algorithms for the joint inversion using a Gauss Newton algorithm.

In the second part of the talk, I will consider the estimation of large-scale sparse inverse covariance matrices of multivariate Gaussian distributions. Such matrices are often used to characterize and analyze data measurements in fields that range from machine learning and signal processing to computational biology. To estimate these matrices, an l1-regularized log-determinant optimization problem needs to be solved. I will present a block-coordinate descent algorithm that can efficiently solve this problem at large scales with a low memory footprint, and a multilevel acceleration framework that is suitable for general sparse optimization problems. These algorithms can be used as a tool for enriching inverse problems by "learning" appropriate prior information, adopting an empirical Bayesian framework.
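The l1-regularized log-determinant problem mentioned above is commonly stated as follows. This is the standard form from the sparse inverse covariance literature, given here as background; the exact variant used in the talk (for example, whether the diagonal is penalized) is not specified in the abstract. The symbols $S$ (the sample covariance matrix), $\Theta$ (the inverse covariance estimate) and $\lambda > 0$ (the regularization weight) are notation assumed for this note.

$$
\min_{\Theta \succ 0} \; -\log\det\Theta \;+\; \operatorname{tr}(S\Theta) \;+\; \lambda \sum_{i,j} \lvert \Theta_{ij} \rvert
$$

The first two terms are the Gaussian negative log-likelihood up to constants, and the l1 term promotes sparsity in $\Theta$.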

In many practical parameter estimation problems, such as medical experiments and cognitive radio communications, parameter selection is performed prior to estimation. The selection process has a major impact on subsequent estimation by introducing a selection bias and creating coupling between decoupled parameters. As a result, classical estimation theory may be inappropriate and inaccurate, and a new methodology is needed. In this study, the problem of estimating a preselected unknown deterministic parameter, chosen from a parameter set based on a predetermined data-based selection rule, $\Psi$, is considered. In this talk, I present a general non-Bayesian estimation theory for estimation after parameter selection, including estimation methods, performance analysis, and adaptive sampling strategies. First, I use the post-selection mean-square-error (PSMSE) criterion as a performance measure instead of the commonly used mean-square-error (MSE). The corresponding Cramér-Rao-type bound on the PSMSE of any $\Psi$-unbiased estimator is derived, where $\Psi$-unbiasedness is in the Lehmann-unbiasedness sense. The post-selection maximum-likelihood (PSML) estimator is presented and its $\Psi$-efficiency properties are demonstrated. Practical implementations of the PSML estimator are proposed as well. Finally, I discuss the concept of adaptive sampling in a two-stage sampling scheme of selection and estimation.

We consider the fundamental problem of prediction with expert advice where the experts are "optimizable": there is a black-box optimization oracle that can be used to compute, in constant time, the leading expert in retrospect at any point in time. In this setting, we give a novel online algorithm that attains vanishing regret with respect to $N$ experts in total $\sqrt{N}$ computation time. We also give a lower bound showing that this running time cannot be improved (up to log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the standard, oracle-free setting where the required time for vanishing regret is linear in $N$. These results demonstrate an exponential gap between the power of optimization in online learning and its power in statistical learning: in the latter, an optimization oracle (i.e., an efficient empirical risk minimizer) allows learning a finite hypothesis class of size $N$ in time $\log{N}$.

We also study the implications of our results to learning in repeated zero-sum games, in a setting where the players have access to oracles that compute, in constant time, their best-response to any mixed strategy of their opponent. We show that the runtime required for approximating the minimax value of the game in this setting is $\sqrt{N}$, yielding again a quadratic improvement upon the oracle-free setting, where linear time in $N$ is known to be tight.
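As a point of comparison for the oracle-free setting mentioned above, here is a sketch of the textbook exponential weights (Hedge) algorithm for prediction with expert advice. It is not the algorithm from the talk; it is shown because it touches all $N$ experts in every round, which illustrates where the linear dependence on $N$ comes from. The function name, the assumption that losses lie in [0, 1], and the assumption that the horizon is known in advance are all choices made for this note.

```python
import math


def hedge(loss_rounds, n_experts):
    """Run exponential weights on a list of per-round loss vectors.

    Returns (expected cumulative loss of the learner,
             cumulative loss of the best expert in hindsight).
    Losses are assumed to lie in [0, 1].
    """
    horizon = len(loss_rounds)
    eta = math.sqrt(8.0 * math.log(n_experts) / horizon)  # standard tuning
    cum_loss = [0.0] * n_experts
    learner_loss = 0.0
    for losses in loss_rounds:
        # O(N) work per round: every expert's weight is recomputed.
        weights = [math.exp(-eta * c) for c in cum_loss]
        z = sum(weights)
        learner_loss += sum(w / z * l for w, l in zip(weights, losses))
        cum_loss = [c + l for c, l in zip(cum_loss, losses)]
    return learner_loss, min(cum_loss)


# Expert 0 is always right; the other three are always wrong.
rounds = [[0.0, 1.0, 1.0, 1.0] for _ in range(100)]
learner, best = hedge(rounds, n_experts=4)
regret = learner - best
```

With this tuning the regret is at most sqrt(T ln(N) / 2) over T rounds, so the average regret vanishes as T grows.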

"Circular inference" is a pejorative coined for methods in which a hypothesis is selected after looking at the data, but the inferential procedures treat it as if it was determined in advance. Unfortunately, many throughput screening experiments in genomics or neuroimaging seek to do exactly this: identify regions (bumps) of high signal in the data **and** evaluate these found regions using the same data. Simple estimators that ignore the selection will be biased; when the data is non-stationary, this bias can vary dramatically between different regions. Nevertheless, methods for evaluating and comparing selected regions are crucial, because typically only a handful of regions can be further explored in tailored follow up studies.

In this talk I describe a new conditional inference approach for characterizing these found regions by estimating their population parameters. Our method explicitly models the selection procedure, and simulates from the conditional distribution to estimate the underlying parameters. Efficient strategies for providing p-values, estimators and intervals will be discussed, as well as power versus accuracy tradeoffs. I will demonstrate the new method for estimating bumps in a comparison of DNA-methylation patterns across tissue types.

This is joint work with Jonathan Taylor and Rafael Irizarry.
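The selection bias described in the abstract above is easy to reproduce in a small simulation. The sketch below illustrates the problem only, not the conditional inference method from the talk: every region has a true effect of zero, yet the naive estimate for the region selected as the largest is systematically positive. The function name and all parameter values are hypothetical.

```python
import random


def naive_selected_estimate(n_regions, n_obs, rng):
    """Select the region with the largest sample mean and report that mean.

    Every region is pure noise (true effect 0), so any systematic
    deviation from 0 is caused by selecting and estimating on the same data.
    """
    means = []
    for _ in range(n_regions):
        obs = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(sum(obs) / n_obs)
    return max(means)


rng = random.Random(0)
estimates = [naive_selected_estimate(10, 25, rng) for _ in range(2000)]
average_naive_estimate = sum(estimates) / len(estimates)  # clearly above 0
```

An unbiased procedure would average to zero here; the gap is the bias that an estimator ignoring selection incurs.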

The subject of this talk is the problem of estimating the service time distribution of the $M/G/\infty$ queue from incomplete data on the queue. The goal is to estimate $G$ from observations of the queue-length process at the points of a regular grid on a fixed time interval. We propose an estimator and analyze its accuracy over a family of target service time distributions. The original $M/G/\infty$ problem is closely related to the problem of estimating derivatives of the covariance function of a stationary Gaussian process. We consider the latter problem and derive lower bounds on the minimax risk. The obtained results strongly suggest that the proposed estimator of the service time distribution is rate optimal.

A common approach to statistical learning on big data is to randomly split it among m machines and calculate the parameter of interest by averaging their m individual estimates.

Focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy.

We consider two asymptotic settings: one where the number of samples per machine $n \to \infty$ but the number of parameters $p$ is fixed, and a second high-dimensional regime where both $p, n \to \infty$ with $p/n \to \kappa$.

Most previous works provided only moment bounds on the error incurred by splitting the data in the fixed p setting. In contrast, we present for both regimes asymptotically exact distributions for this estimation error. In the fixed-p setting, under suitable assumptions, we thus prove that to leading order, averaging is as accurate as the centralized solution. In the high-dimensional setting, we show a qualitatively different behavior: data splitting does incur a first order accuracy loss, which we quantify precisely. In addition, our asymptotic distributions allow the construction of confidence intervals and hypothesis testing on the estimated parameters.

Our main conclusion is that in both regimes, averaging parallelized estimates is an attractive way to speed up computations and save on memory, while incurring a quantifiable and typically moderate excess error.
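The split-and-average strategy in the fixed-$p$ regime can be illustrated with a one-parameter toy problem. The sketch below estimates the rate of an exponential distribution by maximum likelihood, once on each of m shards (then averaged) and once on the pooled data. The choice of model, the function names and the parameter values are assumptions made for illustration; the talk's analysis covers general M-estimators and the high-dimensional regime, which this toy example does not.

```python
import random


def exp_rate_mle(sample):
    """Maximum likelihood estimate of an exponential rate: 1 / sample mean."""
    return len(sample) / sum(sample)


def compare_split_vs_central(rate=2.0, m=10, n=1000, seed=0):
    """Return (average of per-machine MLEs, MLE on the pooled data)."""
    rng = random.Random(seed)
    # Each of the m machines holds n independent samples.
    shards = [[rng.expovariate(rate) for _ in range(n)] for _ in range(m)]
    averaged = sum(exp_rate_mle(s) for s in shards) / m
    pooled = [x for shard in shards for x in shard]
    centralized = exp_rate_mle(pooled)
    return averaged, centralized


averaged, centralized = compare_split_vs_central()
```

Because the estimator is nonlinear in the data, the two answers differ, but with n large and p = 1 the difference is small compared with the statistical error of either estimate.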