You are here

Machine Learning and Statistics Seminar

WednesdayJan 27, 202114:00
Machine Learning and Statistics Seminar
Speaker:Haggai Maron Title:Deep Learning on Structured and Geometric DataAbstract:opens in new windowin html    pdfopens in new window

Deep Learning of structured and geometric data, such as sets, graphs, and surfaces, is a prominent research direction that has received considerable attention in the last few years. Given a learning task that involves structured data, the main challenge is identifying suitable neural network architectures and understanding their theoretical and practical tradeoffs.

This talk will focus on a popular learning setup where the learning task is invariant to a group of transformations of the input data. This setup is relevant to many popular learning tasks and data types. In the first part of the talk, I will present a general framework for designing neural network architectures based on layers that respect these transformations. In particular, I will show that these layers can be implemented using parameter-sharing schemes induced by the group. In the second part of the talk, I will demonstrate the framework’s applicability by presenting novel neural network architectures for two widely used data types: graphs and sets. I will also show that these architectures have desirable theoretical properties and that they perform well in practice.

ThursdayJan 21, 202118:00
Machine Learning and Statistics Seminar
Speaker:Or Litany Title:Learning on Pointclouds for 3D Scene UnderstandingAbstract:opens in new windowin html    pdfopens in new window

In this talk i'll be covering several works in the topic of 3D deep learning on pointclouds for scene understanding tasks.
First, I'll describe VoteNet (ICCV 2019, best paper nomination): a method for object detection from 3D pointclouds input, inspired by the classical generalized Hough voting technique. I'll then explain how we integrated image information into the voting scheme to further boost 3D detection (ImVoteNet, CVPR 2020). In the second part of my talk I'll describe recent studies focusing on reducing supervision for 3D scene understanding tasks, including PointContrast -- a self-supervised representation learning framework for 3D pointclods (ECCV 2020). Our findings in PointContrast are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains.

WednesdayJan 13, 202115:15
Machine Learning and Statistics Seminar
Speaker:Barak Sober Title:Estimation of Manifolds from Point Clouds: Building Models from Data Abstract:opens in new windowin html    pdfopens in new window

A common observation in data-driven applications is that high dimensional data has a low intrinsic dimension, at least locally. Thus, when one wishes to work with data that is not governed by a clear set of equations, but still wishes to perform statistical or other scientific analysis, an optional model is the assumption of an underlying manifold from which the data was sampled. This manifold, however, is not given explicitly but we can obtain samples of it (i.e., the individual data points). In this talk, we will consider the mathematical problem of estimating manifolds from a finite set of samples, possibly recorded with noise. Using a Manifold Moving Least-Squares approach we provide an approximant manifold with high approximation order in both the Hausdorff sense as well as in the Riemannian metric (i.e., a nearly isometry). In the case of bounded noise, we present an algorithm that is guaranteed to converge in probability when the number of samples tends to infinity. The motivation for this work is based on the analysis of the evolution of shapes through time (e.g., handwriting or primates' teeth) and we will show how this framework can be utilized to answer scientific questions in paleontology and archaeology.

WednesdayDec 23, 202011:15
Machine Learning and Statistics Seminar
Speaker:Yossi ArjevaniTitle:Computational Barriers in Continuous Optimization: Two Complexity Theories and One Tale of SymmetryAbstract:opens in new windowin html    pdfopens in new window

Since the modern formulation of mathematical optimization, researchers in the field have repeatedly expanded and re-defined the realm of tractable optimization problems. This endeavor has culminated in the well-known class of convex optimization problems with applications in a wide range of scientific fields. In this talk, I will present the traditional oracle-based complexity model, which has dominated (unstructured) continuous optimization for the past 40 years, and highlight some of its successes and failures in predicting the hardness of convex optimization problems. I will then introduce a novel structural-based model aimed at addressing major oracle-based complexity gaps. The new approach is intimately tied with approximation theory, and is proven to be particularly advantageous for characterizing the complexity of optimization methods in machine learning.

Studies in recent years have indicated that the realm of tractable optimization problems may be expanded once more -- this time into that of the nonconvex world. In the second part of the talk, I will present a novel symmetry-based approach to study nonconvex optimization landscapes, and use it to address optimization-related aspects of neural networks. The approach employs techniques, new to the field, from symmetry breaking and representation theory, and carries important implications for one’s ability to argue about statistical generalization through local curvature.



TuesdayDec 22, 202015:00
Machine Learning and Statistics Seminar
Speaker:Nadav DymTitle:On the injectivity and (in)stability of invariant encodingAbstract:opens in new windowin html    pdfopens in new window

Quotient spaces are a natural mathematical tool to describe a variety of algorithmic problems in computer vision and related fields, where different objects are to be compared while their natural symmetries are to be ignored. The computational complications arising from these symmetries can be alleviated by mapping the given object to features invariant to the object’s symmetries. But can this be done without losing information? In this talk we will discuss this question for two different problems:

(a) Neural networks for point clouds: The natural symmetries of 3D point clouds are permutations and rigid motions. We will describe our recent work which provides a general framework for constructing such invariant networks to be lossless, in the sense that they can approximate any continuous invariant function. As a result, invariant networks such as Tensor Field Networks are shown to have universal approximation power.

(b) Phase retrieval: Phase retrieval is the problem of retrieving a signal, up to global phase, from linear measurements whose phase is lost. The phase ambiguity here can be considered as the symmetries of the problem, and the phaseless linear measurements as the invariants. We will discuss results on the injectivity and stability of phase retrieval, and particularly our results connecting phase retrieval stability to measures of graph connectivity.

To conclude, we will point out some insights which can be obtained from viewing these two different problems in the same framework.

WednesdayJan 08, 202011:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yair CarmonTitle:Fundamental limits of modern machine learning and how to get around themAbstract:opens in new windowin html    pdfopens in new window

This talk presents new computational and statistical barriers in machine learning, along with the algorithmic developments that they inspire.

The computational barriers arise in nonconvex optimization: we prove lower bounds on the (oracle) complexity of finding stationary points using (stochastic) gradient methods, showing that gradient descent is unimprovable for a natural class of problems. We bypass this barrier by designing an algorithm that outperforms gradient descent for a large subclass of problems with high-order smoothness. Our algorithm leverages classical momentum techniques from convex optimization using a "convex until proven guilty" principle that we develop.

The statistical barrier is the large amount of data required for adversarially robust learning. In a Gaussian model, we prove that unlabeled data allows us to circumvent an information theoretic gap between robust and standard classification. Our analysis directly leads to a general robust self-training procedure; we use it to significantly improve state-of-the-art performance on the challenging and extensively studied CIFAR-10 adversarial robustness benchmark.

SundayJan 05, 202009:45
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon GonenTitle:Beyond Worst Case In Machine Learning: The Oracle ModelAbstract:opens in new windowin html    pdfopens in new window

In recent years there has been an increasing gap between the success of machine learning algorithms and our ability to explain their success theoretically. 

Namely, many of the problems that are solved to a satisfactory degree of precision are computationally hard in the worst case. Fortunately, there are often reasonable assumptions which help us to get around these worst-case impediments and allow us to rigorously analyze heuristics that are used in practice. 

In this talk I will advocate a complementary approach, where instead of explicitly characterizing some desired "niceness" properties of the data, we assume access to an optimization oracle that solves a relatively simpler task. This allows us to identify the sources of hardness and extend our theoretical understanding to new domains. Furthermore we will show that seemingly innocents (and arguably justifiable) modifications to the oracle can lead to tractable reductions and even to bypass hardness results. 

We demonstrate these ideas using the following results: i) An efficient algorithm for non-convex online learning using an optimization oracle. b) A faster boosting algorithm using a "simple" weak learner. iii) An efficient reduction from online to private learning.

Joint works with Naman Agarwal, Noga Alon, Elad Hazan, and Shay Moran.

SundayJan 05, 202009:45
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon GonenTitle:Beyond Worst Case In Machine Learning: The Oracle ModelAbstract:opens in new windowin html    pdfopens in new windowNOTE UNUSUAL TIME AND DAY

In recent years there has been an increasing gap between the success of machine learning algorithms and our ability to explain their success theoretically.

Namely, many of the problems that are solved to a satisfactory degree of precision are computationally hard in the worst case. Fortunately, there are often reasonable assumptions which help us to get around these worst-case impediments and allow us to rigorously analyze heuristics that are used in practice.

In this talk I will advocate a complementary approach, where instead of explicitly characterizing some desired "niceness" properties of the data, we assume access to an optimization oracle that solves a relatively simpler task. This allows us to identify the sources of hardness and extend our theoretical understanding to new domains. Furthermore we will show that seemingly innocents (and arguably justifiable) modifications to the oracle can lead to tractable reductions and even to bypass hardness results.

We demonstrate these ideas using the following results: i) An efficient algorithm for non-convex online learning using an optimization oracle. b) A faster boosting algorithm using a "simple" weak learner. iii) An efficient reduction from online to private learning.

Joint works with Naman Agarwal, Noga Alon, Elad Hazan, and Shay Moran.

WednesdayNov 20, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nathan Srebro Title:Optimization's Hidden Gift to Learning: Implicit RegularizationAbstract:opens in new windowin html    pdfopens in new window

It is becoming increasingly clear that implicit regularization afforded by the optimization algorithms play a central role in machine
learning, and especially so when using large, deep, neural networks. We have a good understanding of the implicit regularization afforded by stochastic approximation algorithms, such as SGD, for convex problem, and we understand and can characterize the implicit bias of different algorithms, and can design algorithms with specific biases. But in this talk I will focus on implicit biases of local search algorithms for non-convex underdetermined problem, such as deep networks. In an effort to uncover the implicit biases of gradient-based optimization of neural networks, which holds the key to their empirical success, I will discuss recent
work on implicit regularization for matrix factorization, linear convolutional networks, and two-layer ReLU networks, as well as a general bottom-up understanding on implicit regularization in terms of optimization geometry.

TuesdayAug 13, 201914:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Stefano SoattoTitle:The Information in the Weights of a Deep Network, and its consequences for transfer learning and critical learning periodsAbstract:opens in new windowin html    pdfopens in new windowNOTE THE UNUSUAL TIME
The information in the weights of deep networks plays a key role in understanding their behavior. When used as a regularizer during training (either explicitly or implicitly) it induces generalization through the PAC-Bayes bound. Rather than being computed and minimized explicitly, it can be directly controlled by injecting noise in the weights, a process known as Information Dropout. It can also be shown that stochastic gradient descent (SGD), when taken to the continuous limit and interpreted in the Wasserstein metric, adds the information in the weights as inductive bias, even if not explicitly present in the loss function. The Emergence Bound shows that, provided that a trained network has sufficient capacity, minimizing the information in the weights, which is a function of the training set consisting of data seen in the past, guarantees minimality, invariance, and independence of the components of the representation of the test (future) data. The trace of the information in the weights during training shows that, rather than increasingly monotonically through the learning process, as one would expect, first increases rapidly in the first few epochs, and then decreases to an asymptote that is a fraction of its peak. This unusual behavior qualitatively follows the sensitivity to critical learning periods observed in biological systems, from cats to humans, as well as recently in deep artificial networks. Together with the Emergence Bound and the PAC-Bayes bound, this shows that forgetting is a necessary part of the learning process, and happens while precision increases monotonically. The information in the weights can also be used to defined a topology in the space of tasks, and an asymmetric distance that can be used to predict the cost and performance of fine-tuning a network trained for a different task, without actually performing the experiment. These phenomena collectively point to the importance of the dynamics of learning, and suggests that studying the transient behavior can yield insight beyond those that can be gleaned from the asymptotics. Depending on the context, we use Shannon, Fisher, or Kolmogorov information to prove the results described.
WednesdayJul 03, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon CohenTitle:Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ RegretAbstract:opens in new windowin html    pdfopens in new window

The linear-quadratic regulator is a simple and ubiquitous model in optimal control. Classical results in control theory pertain to asymptotic convergence and stability of such systems. Recently, it has received renewed interest from a learning-theoretic perspective with a focus on computational tractability and finite-time convergence guarantees. Among the most challenging problems in LQ control is that of adaptive control: regulating a system with parameters that are initially unknown and have to be learned while incurring the associated costs. In its modern incarnation, this problem is approached through the lens of online learning with partial feedback (e.g., multi-armed bandits) and regret minimization. Still, recent results derive algorithms that are either computationally intractable or suffer from high regret. In this talk, I will present the first computationally-efficient algorithm with $\widetilde O(\sqrt{T})$ regret for learning in linear quadratic control systems with unknown dynamics. By that, this resolves an open question of Abbasi-Yadkori and Szepesvari (2011) and Dean, Mania, Matni, Recht, and Tu (2018). The key to the efficiency of our algorithm is in a novel reformulation of the LQ control problem as a convex semi-definite program. This is joint work with Tomer Koren and Yishay Mansour presented at ICML 2019.

WednesdayJun 19, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Daniel SoudryTitle:Theoretical and Empirical Investigation of Several Common Practices in Deep LearningAbstract:opens in new windowin html    pdfopens in new window

We examine several empirical and theoretical results on the training of deep networks. For example, Why are common "over-fitting" indicators (e.g., very low training error, high validation loss) misleading? Why, sometimes, early-stopping time never arrives? Why can adaptive rate methods (e.g., adam) degrade generalization? Why commonly used loss functions exhibit better generalization than others? Why use weight decay before batch-norm? When can we use low numerical precision, and how low can we get? and discuss the practical implications of these results.

Bio == Since October 2017, Daniel soudry is an assistant professor (Taub Fellow) in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. Before that, he did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics, the Center for Theoretical Neuroscience the Grossman Center for Statistics of the Mind at Columbia University. He did his Ph.D. in the Department of Electrical Engineering at the Technion, Israel Institute of technology, under the guidance of Prof. Ron Meir. He received his B.Sc. degree in Electrical Engineering and Physics from the Technion.

WednesdayMay 22, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Meir FederTitle:Universal Learning for Individual DataAbstract:opens in new windowin html    pdfopens in new window

Universal learning is considered from an information theoretic point of view following the universal prediction approach pursued in the 90's by F&Merhav. Interestingly, the extension to learning is not straight-forward. In previous works we considered on-line learning and supervised learning in a stochastic setting. Yet, the most challenging case is batch learning where prediction is done on a test sample once the entire training data is observed, in the individual setting where the features and labels, both training and test, are specific individual quantities. This work provides schemes that for any individual data compete with a "genie" (or a reference) that knows the true test label. It suggests design criteria and derive the corresponding universal learning schemes. The main proposed scheme is termed Predictive Normalized Maximum Likelihood (pNML). As demonstrated, pNML learning and its variations provide robust and "stable" learning solutions that outperform the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication for the learnability that measures the uncertainty in learning the specific test challenge with the given training examples; hence the learner knows when it does not know. The improved performance of the pNML, the induced learnability measure and its utilization are demonstrated in several learning problems including deep neural networks models. Joint work with Yaniv Fogel and Koby Bibas

WednesdayMay 15, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Gal ElidanTitle:Learning Rules-first ClassifiersAbstract:opens in new windowin html    pdfopens in new window

Complex classifiers may exhibit "embarrassing" failures even in "easy" cases where humans can provide a simple justified explanation. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class that captures this notion and determine its sample complexity. We also give evidence that efficient algorithms cannot achieve this sample complexity. We then derive a simple and efficient algorithm and show that its sample complexity is close to optimal, among efficient algorithms. Experiments on synthetic and sentiment analysis data demonstrate the efficacy of the method, both in terms of accuracy and interpretability. At the end of the talk, I will also give a teaser to demonstrate the non-intuitive nature of the more general problem of embarrassment.

WednesdayApr 10, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yossi KeshetTitle:Deep Learning: Vulnerabilities, Defenses and BeyondAbstract:opens in new windowin html    pdfopens in new window

Deep learning has been amongst the most emerging fields in computer science and engineering. In recent years it has been shown that deep networks are vulnerable to attacks by adversarial examples. I will introduce a novel flexible approach named Houdini for generating adversarial examples for complex and structured tasks and demonstrate successful attacks on different applications such as speech recognition, pose estimation, semantic image segmentation, speaker verification, and malware detection. Then I will discuss how this weakness can be turned into two secure applications. The first is a new technique for watermarking deep network models in a black-box way. That is, concealing information within the model that can be used by the owner of the model to claim ownership. The second application is a novel method to Speech Steganography, namely hiding a secret spoken message within an ordinary public spoken message. I will conclude the talk by a brief discussion of our attempts to detect such adversarial attacks, based on multiple semantic label representations.

WednesdayMar 27, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roi Livni Title:Passing Tests Without Memorizing Abstract:opens in new windowin html    pdfopens in new window

Generative Adversarial Networks (GANs) is a recent algorithmic framework that has won considerable attention.  In a nutshell, GANs receive as input an IID sample and outputs synthetic data that should resemble data from the true underlying distribution. For example, consider an algorithm that receives as input some tunes from a specific music genre (e.g. jazz, rock, pop) and then outputs a new, original, tune from that genre.  

From a theoretical perspective, the distinction between algorithms that genuinely generate original new examples vs. algorithms that perform naive manipulations (or even merely memorization) of the input sample is an elusive distinction. This makes the theoretical analysis of GANs algorithms challenging. 

In this work we introduce two mathematical frameworks for the task of generating synthetic data. The first model we consider is inspired by GANs, and the learning algorithm has only an indirect access to the target distribution via a discriminator. The second model, called DP-Foolability, exploits the notion of differential privacy as a criterion for "non-memorization".

We characterize learnability in each of these models as well as discuss the interrelations. As an application we prove that privately PAC learnable classes are DP-foolable.  As we will discuss, this can be seen as an analogue of the equivalence between uniform convergence and learnability in classical PAC learning.

Joint work with Olivier Bousquet and Shay Moran.

WednesdayJan 30, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ronen TalmonTitle: Data Analysis with the Riemannian Geometry of Symmetric Positive-Definite MatricesAbstract:opens in new windowin html    pdfopens in new window

Recently, the Riemannian geometry of the space of symmetric positive-definite matrices has become a central ingredient in a broad range of data analysis and learning tools. Broadly, it enables to observe complex high-dimensional data through the lens of objects with a known non-Euclidean geometry. In this talk, I will present a method for domain adaptation using parallel transport on the cone manifold of symmetric positive-definite matrices. Using Riemannian geometry, I will show the theoretical guarantees and discuss the benefits of the presented method. In addition, I will demonstrate these benefits on simulations as well as on real-measured data. While the majority of the talk will be focused on the particular symmetric positive-definite covariance matrices, I will present generalizations to kernels, diffusion operators, and graph-Laplacians and show intriguing properties using spectral analysis with application to source separation and multimodal data fusion.
* Joint work with Or Yair, Ori Katz, and Miri Ben-Chen

WednesdayJan 23, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amir WeissTitle:The Gaussian Maximum Likelihood Approach for Independent Component and Vector AnalysisAbstract:opens in new windowin html    pdfopens in new window

The Blind Source Separation (BSS) problem consists of retrieving signals of interest, termed the sources, from a dataset consisting of their mixtures. One of the most popular and common paradigms for solving the BSS problem is Independent Component Analysis (ICA), where the sources are assumed to be (only) mutually statistically independent random processes, and the mixtures are assumed to be linear combinations thereof, where the linear mixing operator is unknown. In this talk, we shall start with the Gaussian Maximum Likelihood (GML) approach for the semi-blind problem, in which the sources are assumed to be temporally-diverse Gaussian processes. Based on the principles of this approach, we shall then discuss two extensions. First, the noisy Gaussian ICA problem, for which two asymptotically optimal solutions, w.r.t. two different optimality criteria, will be presented. We shall see (both analytically and empirically) that these solutions possess attractive properties even for non-Gaussian mixtures. Then, we shall consider the Independent Vector Analysis (IVA) framework, which has emerged in recent years as an extension of ICA into multiple datasets of mixtures. In IVA, the sources in each set are independent, but may depend on sources in the other sets. We will show that in IVA, the GML approach leads to consistent separation regardless of the sources' distributions.

WednesdayJan 16, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nir BillfeldTitle: Semiparametric Wavelet-based JPEG IV Estimator for Endogenously Truncated DataAbstract:opens in new windowin html    pdfopens in new window

We show that when data are endogenously truncated the widely-used IV fails to render the relationship causal as well as introduces bias into the exogenous covariates. We offer a newly-introduced semiparametric biorthogonal wavelet-based JPEG IV estimator and its associated symmetry preserving kernel, which is closely related to object recognition methods in Artificial Intelligence. The newly-introduced enriched JPEG algorithm is a denoising tool amenable for identifying redundancies in a sequence of irregular noisy data points which also accommodates a reference-free criterion function for optimal denoising. This is suitable for situations where the original data distribution is unobservable such as in the case of endogenous truncation. This estimator corrects both biases, the one generated by endogenous truncation and the one generated by endogenous covariates, by means of denoising. We introduce a multifocal variant of the local GMM (MFGMM) estimator to establish jointly the entire parameter set asymptotic properties. Using Monte Carlo simulations attest to very high accuracy of our offered semiparametric JPEG IV estimator as well as high efficiency as reflected by root n consistency. These results have emerged from utilizing 2,000,000 different distribution functions, generating 100 million realizations to construct the various data sets.

WednesdayJan 09, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan BelinkovTitle:Deep Learning Models for Language: What they learn, where they fail, and how to make them more robustAbstract:opens in new windowin html    pdfopens in new window

Deep learning has become pervasive in everyday life, powering language applications like Apple's Siri, Amazon's Alexa, and Google Translate. The inherent limitation of these deep learning systems, however, is that they often function as a "black box," preventing researchers and users from discerning the roles of different components and what they learn during the training process. In this talk, I will describe my research on interpreting deep learning models for language along three lines. First, I will present a methodological framework for investigating how these models capture various language properties. The experimental evaluation will reveal a learned hierarchy of internal representations in deep models for machine translation and speech recognition. Second, I will demonstrate that despite their success, deep models of language fail to deal even with simple kinds of noise, of the type that humans are naturally robust to. I will then propose simple methods for improving their robustness to noise. Finally, I will turn to an intriguing problem in language understanding, where dataset biases enable trivial solutions to complex language tasks. I will show how to design models that are more robust to such biases, and learn less biased latent representations.

WednesdayDec 26, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nadav CohenTitle:On Optimization and Expressiveness in Deep LearningAbstract:opens in new windowin html    pdfopens in new window

Understanding deep learning calls for addressing three fundamental questions: expressiveness, optimization and generalization. Expressiveness refers to the ability of compactly sized deep neural networks to represent functions capable of solving real-world problems. Optimization concerns the effectiveness of simple gradient-based algorithms in solving non-convex neural network training programs. Generalization treats the phenomenon of deep learning models not overfitting despite having much more parameters than examples to learn from. This talk will describe a series of works aimed at unraveling some of the mysteries behind optimization and expressiveness. I will begin by discussing recent analyses of optimization for deep linear neural networks. By studying the trajectories of gradient descent, we will derive the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that, sometimes, gradient descent can train a deep linear network faster than a classic linear model. In other words, depth can accelerate optimization, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem. In the second (shorter) part of the talk, I will present an equivalence between convolutional and recurrent networks --- the most successful deep learning architectures to date --- and hierarchical tensor decompositions. The equivalence brings forth answers to various questions concerning expressiveness, resulting in new theoretically-backed tools for deep network design.
Optimization works covered in this talk were in collaboration with Sanjeev Arora, Elad Hazan, Noah Golowich and Wei Hu. Expressiveness works were with Amnon Shashua, Or Sharir, Yoav Levine, Ronen Tamari and David Yakira.

WednesdayDec 12, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roy SchwartzTitle:Towards Interpretable Deep Learning for Natural Language ProcessingAbstract:opens in new windowin html    pdfopens in new window

Despite their superb empirical performance, deep learning models for natural language processing (NLP) are often considered black boxes, as relatively little is known as to what accounts for their success. This lack of understanding turns model development into a slow and expensive trial-and-error process, which limits many researchers from developing state-of-the-art models. Customers of deep learning also suffer from this lack of understanding, because they are using tools that they cannot interpret. In this talk I will show that many deep learning models are much more understandable than originally thought. I will present links between several deep learning models and classical NLP models: weighted finite-state automata. As the theory behind the latter is well studied, these findings allow for the development of more interpretable and better-performing NLP models. As a case study, I will focus on convolutional neural networks (ConvNets), one of the most widely used deep models in NLP. I will show that ConvNets are mathematically equivalent to a simple, linear chain weighted finite-state automaton. By uncovering this link, I will present an extension of ConvNets that is both more robust and more interpretable than the original model. I will then present similar observations regarding six recently introduced recurrent neural network (RNN) models, demonstrating the empirical benefits of these findings to the performance of NLP systems.

This is joint work with Hao Peng, Sam Thomson and Noah A. Smith

WednesdayNov 28, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Michael KimTitle:Fairness through Computationally-Bounded AwarenessAbstract:opens in new windowin html    pdfopens in new window

As algorithmic prediction systems have become more widespread, so too have concerns that these systems may be discriminatory against groups of people protected by laws and ethics. We present a recent line of work that takes a complexity theoretic perspective towards combating discrimination in prediction systems. We'll focus on fair classification within the versatile framework of Dwork et al. [ITCS'12], which assumes the existence of a metric that measures similarity between pairs of individuals. Unlike earlier work, we do not assume that the entire metric is known to the learning algorithm; instead, the learner can query this metric a bounded number of times. We propose a new notion of fairness called *metric multifairness* and show how to achieve this notion in our setting. Metric multifairness is parameterized by a similarity metric d on pairs of individuals to classify and a rich collection C of (possibly overlapping) "comparison sets" over pairs of individuals. At a high level, metric multifairness guarantees that *similar subpopulations are treated similarly*, as long as these subpopulations are identified within the class C.

WednesdayNov 14, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Kfir LevyTitle:Beyond SGD: Data Adaptive Methods for Machine Learning Abstract:opens in new windowin html    pdfopens in new window

The tremendous success of the Machine Learning paradigm heavily relies on the development of powerful optimization methods. The canonical algorithm for training learning models is SGD (Stochastic Gradient Descent), yet this method has its limitations. It is often unable to exploit useful statistical/geometric structure, it might degrade upon encountering prevalent non-convex phenomena, and it is hard to parallelize. In this talk I will discuss an ongoing line of research where we develop alternative methods that resolve some of SGD"s limitations. The methods that I describe are as efficient as SGD, and implicitly adapt to the underlying structure of the problem in a data dependent manner.

In the first part of the talk, I will discuss a method that is able to take advantage of hard/easy training samples. In the second part, I will discuss a method that enables an efficient parallelization of SGD. Finally, I will briefly describe a method that implicitly adapts to the smoothness and noise properties of the learning objective.

WednesdayNov 07, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir Bendory Title:Estimation in extreme noise levels with application to cryo-electron microscopyAbstract:opens in new windowin html    pdfopens in new window

Single-particle cryo-electron microscopy (cryo-EM) is an innovative technology for elucidating structures of biological molecules at atomic-scale resolution. In a cryo-EM experiment, tomographic projections of a molecule, taken at unknown viewing directions, are embedded in highly noisy images at unknown locations. The cryo-EM problem is to estimate the 3-D structure of a molecule from these noisy images.

Inspired by cryo-EM, the talk will focus on two estimation problems: multi-reference alignment and blind deconvolution. These problems abstract away much of the intricacies of cryo-EM, while retaining some of its essential features. In multi-reference alignment, we aim to estimate a signal from its noisy, rotated observations. While the rotations and the signal are unknown, the goal is only to estimate the signal. In the blind deconvolution problem, the goal is to estimate a signal from its convolution with an unknown, sparse signal in the presence of noise. Focusing on the low SNR regime, I will propose the method of moments as a computationally efficient estimation framework for both problems and will introduce its properties. In particular, I will show that the method of moments allows estimating the sought signal accurately in any noise level, provided sufficiently many observations are collected, with only one pass over the data. I will then argue that the same principles carry through to cryo-EM, show examples, and draw potential implications.

WednesdayOct 17, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Adam KapelnerTitle:Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial ExperimentsAbstract:opens in new windowin html    pdfopens in new window

There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs over classic randomization in the context of different performance criterions such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal designs in different scenarios in practice. Until then, it is best to randomize.

WednesdayJul 04, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alexander Volfovsky Title:Causal inference from complex observational data Abstract:opens in new windowin html    pdfopens in new window

A classical problem in causal inference is that of matching treatment units to control units in an observational dataset. This problem is distinct from simple estimation of treatment effects as it provides additional practical interpretability of the underlying causal mechanisms that is not available without matching. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of average treatment effect in each group, (iii) computing the matched pairs efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. We propose the Fast Large-scale Almost Matching Exactly (FLAME) framework to tackle these problems for categorical covariates. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. We demonstrate that this framework is able to construct good matched groups on relevant covariates and further extend the methodology to incorporate continuous and other complex covariates. related papers:,

WednesdayJun 27, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir Bendory Title:Estimation in low SNR environment and under group action with application to cryo-EMAbstract:opens in new windowin html    pdfopens in new window

Cryo-electron microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology, enabling reconstruction of molecules at near-atomic resolution.  
Cryo-EM produces a large number of noisy two-dimensional tomographic projection images of a molecule, taken at unknown viewing directions. 
The extreme levels of noise make classical tasks in statistics and signal processing, such as alignment, detection and clustering, very challenging. 
I will start the talk by studying the multi-reference alignment problem, which can be interpreted as a simplified model for cryo-EM. In multi-reference alignment, we aim to estimate multiple signals from circularly-translated, unlabeled, noisy copies. 
In high noise regimes, the measurements cannot be aligned or clustered. Nonetheless, accurate and efficient estimation can be achieved via group-invariant representations (invariant polynomials). Furthermore, such estimators achieve the optimal estimation rate. 
Then, I will show how this framework can be applied to the problem of 2-D classification in cryo-EM. In the last part of the talk, I will introduce the analog invariants of the cryo-EM problem and discuss how they can be used for ab initio modeling.

MondayJun 25, 201816:00
Machine Learning and Statistics SeminarRoom 155
Speaker:Ariel Jaffe / Phd defenceTitle:Spectral methods for unsupervised ensemble learning and latent variable modelsAbstract:opens in new windowin html    pdfopens in new windowNOTE SPECIAL DATE, TIME AND ROOM

The use of latent variables in probabilistic modeling is a standard approach in numerous data analysis applications. In recent years, there has been a surge of interest in spectral methods for latent variable models, where inference is done by analyzing the lower order moments of the observed data. In contrast to iterative approaches such as the EM algorithm, under appropriate conditions spectral methods are guaranteed to converge to the true model parameters given enough data samples.
The focus of the seminar is the development of novel spectral based methods for two problems in statistical machine learning. In the first part, we address unsupervised ensemble learning, where one obtains predictions from different sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to directly assess it. We develop algorithms to estimate the reliability of the classifiers based on a common assumption that different classifiers make statistically independent errors. In addition, we show how one can detect subsets of classifiers that strongly violate the model of independent errors, in a fully unsupervised manner.
In the second part of the seminar we show how one can use spectral methods to learn the parameters of binary latent variable models. This model has many applications such as overlapping clustering and Gaussian-Bernoulli restricted Boltzmann machines. Our methods are based on computing the eigenvectors of both the second and third moments of the observed variables.
For both problems, we show that spectral based methods can be applied effectively, achieving results that are state of the art in various problems in computational biology and population genetics.

WednesdayMay 09, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Richard Olshen Title:V(D)J Diversity of Subsets of T and B Cells, Statistics and Probability Abstract:opens in new windowin html    pdfopens in new window

This talk will include an introduction to the topic of V(D)J rearrangements of particular subsets of T cells and B cells of the adaptive human immune system, in particular of IgG heavy chains. There are many statistical problems that arise in trying to understand these cells. They involve estimating aspects of functionals of discrete probabilities on (random) finite sets. Topics include but are not limited to exchangeability, estimating non-centrality parameters, and estimating covariance matrices from what are called "replicates" that have been amplified by the PCR process and (partially) sequenced.

I have received considerable assistance from Lu Tian, and also Yi Liu; as well, I have been helped considerably by Andrew Fire and Scott Boyd, and also Jorg Goronzy

WednesdayApr 11, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ethan Fetaya Title:Neural Relational Inference for Interacting Systems Abstract:opens in new windowin html    pdfopens in new window

Interacting systems are prevalent in nature, from dynamical systems in physics to complex societal dynamics. In this talk I will introduce our neural relational inference model: an unsupervised model that learns to infer interactions while simultaneously learning the dynamics purely from observational data. Our model takes the form of a variational auto-encoder, in which the latent code represents the underlying interaction graph and the reconstruction is based on graph neural networks.

WednesdayJan 10, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yonatan BelinkovTitle:Understanding Internal Representations in Deep Learning Models for Language and Speech ProcessingAbstract:opens in new windowin html    pdfopens in new window

Language technology has become pervasive in everyday life, powering applications like Apple's Siri or Google's Assistant. Neural networks are a key component in these systems thanks to their ability to model large amounts of data. Contrary to traditional systems, models based on deep neural networks (a.k.a. deep learning) can be trained in an end-to-end fashion on input-output pairs, such as a sentence in one language and its translation in another language, or a speech utterance and its transcription. The end-to-end training paradigm simplifies the engineering process while giving the model flexibility to optimize for the desired task. This, however, often comes at the expense of model interpretability: understanding the role of different parts of the deep neural network is difficult, and such models are often perceived as "black-box". In this work, we study deep learning models for two core language technology tasks: machine translation and speech recognition. We advocate an approach that attempts to decode the information encoded in such models while they are being trained. We perform a range of experiments comparing different modules, layers, and representations in the end-to-end models. Our analyses illuminate the inner workings of end-to-end machine translation and speech recognition systems, explain how they capture different language properties, and suggest potential directions for improving them. The methodology is also applicable to other tasks in the language domain and beyond.

WednesdayDec 13, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roi Livni Title:Overcoming Intractability in LearningAbstract:opens in new windowin html    pdfopens in new window

Machine learning has recently been revolutionized by the introduction of Deep Neural Networks. However, from a theoretical viewpoint these methods are still poorly understood. Indeed the key challenge in Machine Learning today is to derive rigorous results for optimization and generalization in deep learning. In this talk I will present several tractable approaches to training neural networks. At the second part I will discuss a new sequential algorithm for decision making that can take into account the structure in the action space and is more tuned with realistic decision making scenarios.

I will present our work that provides some of the first positive results and yield new, provably efficient, and practical algorithms for training certain types of neural networks. In a second work I will present a new online algorithm that learns by sequentially sampling random networks and asymptotically converges, in performance, to the optimal network. Our approach improves on previous random features based learning in terms of sample/computational complexity, and expressiveness. In a more recent work we take a different perspective on this problem. I will provide sufficient conditions that guarantee tractable learning, using the notion of refutation complexity. I will then discuss how this new idea can lead to new interesting generalization bounds that can potentially explain generalization in settings that are not always captured by classical theory.

In the setting of reinforcement learning I will present a recently developed new algorithm for decision making in a metrical action space. As an application, we consider a dynamic pricing problem in which a seller is faced with a stream of patient buyers. Each buyer buy at the lowest price in a certain time window. We use our algorithm to achieve an optimal regret, improving on previously known regret bound.

WednesdayNov 29, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roy Lederman Title:Inverse Problems and Unsupervised Learning with applications to Cryo-Electron Microscopy (cryo-EM)Abstract:opens in new windowin html    pdfopens in new window

Cryo-EM is an imaging technology that is revolutionizing structural biology; the Nobel Prize in Chemistry 2017 was recently awarded to Jacques Dubochet, Joachim Frank and Richard Henderson "for developing cryo-electron microscopy for the high-resolution structure determination of biomolecules in solution".
Cryo-electron microscopes produce a large number of very noisy two-dimensional projection images of individual frozen molecules. Unlike related methods, such as computed tomography (CT), the viewing direction of each image is unknown. The unknown directions, together with extreme levels of noise and additional technical factors, make the determination of the structure of molecules challenging.
While other methods for structure determination, such as x-ray crystallography and nuclear magnetic resonance (NMR), measure ensembles of molecules together, cryo-EM produces measurements of individual molecules. Therefore, cryo-EM could potentially be used to study mixtures of different conformations of molecules. Indeed, current algorithms have been very successful at analyzing homogeneous samples, and can recover some distinct conformations mixed in solutions, but, the determination of multiple conformations, and in particular, continuums of similar conformations (continuous heterogeneity), remains one of the open problems in cryo-EM.
I will discuss a one-dimensional discrete model problem, Heterogeneous Multireference Alignment, which captures many of the group properties and other mathematical properties of the cryo-EM problem. I will then discuss different components which we are introducing in order to address the problem of continuous heterogeneity in cryo-EM: 1. "hyper-molecules", the first mathematical formulation of truly continuously heterogeneous molecules, 2. The optimal representation of objects that are highly concentrated in both the spatial domain and the frequency domain using high-dimensional prolate spheroidal functions, and 3. Bayesian algorithms for inverse problems with an unsupervised-learning component for recovering such hyper-molecules in cryo-EM.

WednesdayNov 15, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ilya SoloveychikTitle:Group Symmetric Robust Covariance EstimationAbstract:opens in new windowin html    pdfopens in new window

Covariance matrix estimation is essential in many areas of modern Statistics and Machine Learning including Graphical Models, Classification/Discriminant Analysis, Principal Component Analysis, and many others. Classical statistics suggests using Sample Covariance Matrix (SCM) which is a Maximum Likelihood Estimator (MLE) in the Gaussian populations. Real world data, however, usually exhibits heavy-tailed behavior and/or contains outliers, making the SCM non-efficient or even useless. This problem and many similar ones gave rise to the Robust Statistics field in early 60s, where the main goal was to develop estimators stable under reasonable deviations from the basic Gaussian assumptions. One of the most prominent robust covariance matrix estimators was introduced and thoroughly studied by D. Tyler in the mid-80s. This important representative of the family of M-estimators can be defined as an MLE of a certain population. The problem of robust covariance estimation becomes even more involved in the high-dimensional scenario, where the number of samples n is of the order of the dimension p, or even less. In such cases, prior knowledge, often referred to as structure, is utilized to decrease the number of degrees of freedom and make the estimation possible. Unlike the Gaussian setting, in Tyler's case even imposition of linear structure becomes challenging due to the non-convexity of the negative log-likelihood. Recently, Tyler's target function was shown to become convex under a certain change of metric (geodesic convexity), which stimulated further investigation of the estimator.

In this work, we focus on the so-called group symmetry structure, which essentially means that the true covariance matrix commutes with a group of unitary matrices. In engineering applications such structures appear due to the natural symmetries of the physical processes; examples include circulant, perHermitian, proper quaternion matrices, etc. Group symmetric constraints are linear, and thus convex in the regular Euclidean metric. We show that they are also convex in the geodesic metric. These properties allow us to develop symmetric versions of the SCM and Tyler's estimator and build a general framework for their performance analysis. The classical results claim that at least n = p and n = p+1 samples in general position are necessary to ensure the existence and uniqueness of the SCM and Tyler's estimator, respectively. We significantly improve the sample complexity requirements for both estimators under the symmetry structure and show that in some cases even 1 or 2 samples are enough to guarantee the existence and uniqueness regardless of the ambient dimension.

WednesdayJul 05, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ronen TalmonTitle:Common Manifold Learning with Alternating DiffusionAbstract:opens in new windowin html    pdfopens in new window

We consider the problem of hidden common manifold extraction from multiple data sets, which have observation-specific distortions and artifacts. A new manifold learning method is presented based on alternating products of diffusion operators and local kernels. We provide theoretical analysis showing that our method is able to build a variant of the Laplacian of the hidden common manifold, while suppressing the observation-specific artifacts. The generality of this method is demonstrated in data analysis applications, where different types of devices are used to measure the same activity. In particular, we present applications to problems in biomedicine, neuroscience, and audio analysis. 
This is joint work with Roy Lederman and Hau-tieng Wu.

WednesdayJun 21, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Raja Giryes Title:On the relationship between structure in the data and what deep learning can learnAbstract:opens in new windowin html    pdfopens in new window

The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. In this talk we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory and discuss two recent developments. 
The first work studies the generalization error of deep neural network. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees the margin preservation, but it leads to a loose bound to the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix. 
The second work focuses on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations. 

Joint work with Guillermo Sapiro, Miguel Rodrigues, Jure Sokolic, Alex Bronstein and Yonina Eldar. 

WednesdayMay 17, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Aryeh KontorovichTitle:Mixing Time Estimation in Reversible Markov Chains from a Single Sample PathAbstract:opens in new windowin html    pdfopens in new window

We propose a procedure (the first of its kind) for computing a fully data-dependent interval that traps the mixing time t_mix of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge.

The interval is constructed around the relaxation time t_relax, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a sqrt{n} rate, where n is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least \Omega(t_relax) times on the average. Future directions of research are identified. Time permitting, we will mention some recent further developments by D. Levin and Y. Peres.

Joint work with Daniel Hsu and Csaba Szepesvari.

WednesdayApr 19, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Naftali TishbyTitle:A Deeper Understanding of Deep LearningAbstract:opens in new windowin html    pdfopens in new window

By analytical and numerical studies of Deep Neural Networks (using standard TensorFlow) in the "Information Plane" - the Mutual Information the network layers preserve on the input and the output variables - we obtain the following new insights.

  1. The training epochs, for each layer, are divided into two phases: (1) fitting the training data - increasing the mutual information on the labels; (2) compressing the representation - reducing the mutual information on the inputs. The layers are learnt hierarchically, from the bottom to the top layer, with some overlaps.
  2. Most (~80%) of the training time  - optimization with SGD -  is spent on compressing the representation (the second phase) - NOT on fitting the training data labels, even when the training has no regularization or terms that directly aim at such compression.  
  3. The convergence point, FOR EVERY HIDDEN LAYER, lies on or very close to the Information Bottleneck IB) theoretical bound. Thus, the mappings from the input to the hidden layer and from the hidden layer to the output obey the IB self-consistent equations for some value of the compression-prediction tradeoff.
  4. The main benefit of adding more hidden layers is in the optimization/training time, as the compression phase for each layer amounts to relaxation to a Maximum conditional Entropy state, subject to the proper constraints on the error/information on the labels. As such relaxation takes super-linear time in the compressed entropy, adding more hidden layers dramatically reduces the training time. There is also benefit in sample complexity to adding hidden layers, but this is a smaller effect.

I will explain these new observations and the benefits of exploring Deep Learning in the "Information Plane", and discuss some of the exciting theoretical and practical consequences of our analysis.

Joint work with Ravid Ziv and Noga Zaslavsky.

WednesdayFeb 08, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Haim AvronTitle:Large-scale and Non-approximate Kernel Methods Using Random FeaturesAbstract:opens in new windowin html    pdfopens in new window
Kernel methods constitute a mathematically elegant framework for general-purpose infinite-dimensional non-parametric statistical inference. By providing a principled framework to extend classical linear statistical techniques to non-parametric modeling, their applications span the entire spectrum of statistical learning. However, training procedures naturally derived via this framework scale poorly and with limited opportunities for parallelization. This poor scalability poses a significant barrier for the use of kernel methods in big data applications. As such, with the growth in data across a multitude of applications, scaling up kernel methods has acquired renewed and somewhat urgent significance. Random feature maps, such as random Fourier features, have recently emerged as a powerful technique for speeding up and scaling the training of kernel-based methods. However, random feature maps only provide crude approximations to the kernel function, so delivering state-of-the-art results requires huge amount of random features. Nevertheless, in some cases, even when the number of random features is driven to be as large as the training size, full recovery of the generalization performance of the exact kernel method is not attained. In the talk I will show how random feature maps can be used to efficiently perform non-approximate kernel ridge regression, and thus there is no need to compromise between quality and running time. The core idea is to use random feature maps to form preconditioners to be used in solving kernel ridge regression to high accuracy. I will describe theoretical conditions on when this yields an effective preconditioner, and empirically evaluate the method and show it is highly effective for datasets of up to one million training examples.
WednesdayFeb 01, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ran Gilad-Bachrach Title:CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and AccuracyAbstract:opens in new windowin html    pdfopens in new window
Applying machine learning to a problem which involves medical, financial, or other types of sensitive data, not only requires accurate predictions but also careful attention to maintaining data privacy and security. Legal and ethical requirements may prevent the use of cloud-based machine learning solutions for such tasks. In this work, we will present a method to convert learned neural networks to CryptoNets, neural networks that can be applied to encrypted data. This allows a data owner to send their data in an encrypted form to a cloud service that hosts the network. The encryption ensures that the data remains confidential since the cloud does not have access to the keys needed to decrypt it. Nevertheless, we will show that the cloud service is capable of applying the neural network to the encrypted data to make encrypted predictions, and also return them in encrypted form. These encrypted predictions can be sent back to the owner of the secret key who can decrypt them. Therefore, the cloud service does not gain any information about the raw data nor about the prediction it made. We demonstrate CryptoNets on the MNIST optical character recognition tasks. CryptoNets achieve 99% accuracy and can make around 59000 predictions per hour on a single PC. Therefore, they allow high throughput, accurate, and private predictions. This is a joint work with Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, John Wernsing.
WednesdayJan 25, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amir Globerson Title:Variational Conditional ProbabilitiesAbstract:opens in new windowin html    pdfopens in new window
Predicting the label Y of an object X is a core task in machine learning. From a probabilistic perspective, this involves reasoning about conditional probabilities p(y|x). However, it is hard to obtain reliable estimates for these probabilities. Here we show how to obtain lower and upper bounds on p(y|x) given statistical information, and show how it can be used within various learning setups. We also extend this formulation to the structured case, where y can be multivariate.
WednesdayJan 04, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Sivan SabatoTitle:Active Nearest-Neighbor Learning in Metric SpacesAbstract:opens in new windowin html    pdfopens in new window
We propose a pool-based non-parametric active learning algorithm for general metric spaces, which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of the new algorithm is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure. Based on joint work with Aryeh Kontorovich and Ruth Urner.
MondayDec 26, 201611:45
Machine Learning and Statistics SeminarRoom 141
Speaker:Elad Hazan Title:A Non-generative Framework and Convex Relaxations for Unsupervised LearningAbstract:opens in new windowin html    pdfopens in new windowNote unusual time and place
We will describe a novel theoretical framework for unsupervised learning which is not based on generative assumptions. It is comparative, and allows to avoid known computational hardness results and improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization. These includes dictionary learning and learning of algebraic manifolds. Joint work with Tengyu Ma. === Bio === Elad Hazan is a professor of computer science at Princeton university. His research focuses on the design and analysis of algorithms for basic problems in machine learning and optimization. Amongst his contributions are the co-development of the AdaGrad algorithm for training learning machines, and the first sublinear-time algorithms for convex optimization. He is the recipient of (twice) the IBM Goldberg best paper award in 2012 for contributions to sublinear time algorithms for machine learning, and in 2008 for decision making under uncertainty, a European Research Council grant, a Marie Curie fellowship and a Google Research Award (twice). He served on the steering committee of the Association for Computational Learning and has been program chair for COLT 2015.
WednesdayDec 14, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon CohenTitle:Online Learning with Feedback Graphs Without the GraphsAbstract:opens in new windowin html    pdfopens in new window

We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is \emph{never fully revealed} to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves $\widetilde \Theta(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$. completely unlearnable. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

WednesdayNov 16, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Dan Garber Title:Faster Projection-free Machine Learning and OptimizationAbstract:opens in new windowin html    pdfopens in new window

Projected gradient descent (PGD), and its close variants, are often considered the methods of choice for solving a large variety of machine learning optimization problems, including empirical risk minimization, statistical learning, and online convex optimization. This is not surprising, since PGD is often optimal in a very appealing information-theoretic sense. However, for many problems PGD is infeasible both in theory and practice since each step requires to compute an orthogonal projection onto the feasible set. In many important cases, such as when the feasible set is a non-trivial polytope, or a convex surrogate for a low-rank structure, computing the projection is computationally inefficient in high-dimensional settings. An alternative is the conditional gradient method (CG), aka Frank-Wolfe algorithm, that replaces the expensive projection step with a linear optimization step over the feasible set. Indeed in many problems of interest, the linear optimization step admits much more efficient algorithms than the projection step, which is the reason to the substantial regained interest in this method in the past decade. On the downside, the convergence rates of the CG method often fall behind that of PGD and its variants. 
In this talk I will survey an ongoing effort to design CG variants that on one hand enjoy the cheap iteration complexity of the original method, and on the other hand converge provably faster, and are applicable to a wider variety of machine learning settings. In particular I will focus on the cases in which the feasible set is either a polytope or a convex surrogate for low-rank matrices. Results will be demonstrated on applications including: LASSO, video co-localization, optical character recognition, matrix completion, and multi-class classification.

WednesdayJun 08, 201611:15
Machine Learning and Statistics SeminarRoom 290C
Speaker:Daniel SoudryTitle:No bad local minima: Data independent training error guarantees for multilayer neural networksAbstract:opens in new windowin html    pdfopens in new window
We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
WednesdayJun 01, 201611:15
Machine Learning and Statistics SeminarRoom 290C
Speaker:Shalom LappinTitle:Deep Learning and Semantic Interpretation of Natural LanguageAbstract:opens in new windowin html    pdfopens in new window
Classical approaches to formal and computational semantics assign values to the terminal elements of hierarchical syntactic structures and define combinatorial operations on the semantic representations of phrases to compute the values of sentences. While these approaches offer formally elegant models of interpretation, they have not produced wide coverage systems. They do not provide for semantic learning. They have also not succeeded in integrating lexical and compositional semantics in an interesting or computationally efficient way. Recent developments in image caption generation suggest an alternative approach, which can overcome these difficulties. This work formulates the problem of matching images with descriptions as a task in machine translation. Deep neural networks use an encoder to map regions of pixels in an image to vector representations of graphic features, and a decoder to align these features with the distributional vectors of lexical and phrasal items. This approach can be generalized to deep neural networks that identify correspondences between multi-modal data structures and sentences. To the extent that this research program is successful, it will satisfy the core objective of the classical formal semantic program. It will assign truth (fulfilment) conditions to the sentences of a language, where these conditions are specified in terms of multi-modal representations of situations (scenes) in the world. These correspondences are generated not by a recursive definition of a truth predicate in a formal semantic theory, but by an extended deep neural language model.
WednesdayMay 18, 201611:15
Machine Learning and Statistics Seminar
Speaker:Abraham WynerTitle:Explaining the Success of AdaBoost and Random Forests as Interpolating ClassifiersAbstract:opens in new windowin html    pdfopens in new windowroom 155

There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for essentially similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which fits training data without error but is nevertheless somewhat smooth. We show that AdaBoost has the same property. We conjecture that both AdaBoost and random forests  succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.

WednesdayMay 04, 201611:15
Machine Learning and Statistics Seminar
Speaker:Amit DanielyTitle:The Power of Initialization and a Dual View on ExpressivityAbstract:opens in new windowin html    pdfopens in new windowRoom 155
We develop a general duality between neural networks and compositional kernels. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective of neural networks and underscores their expressive power. Joint work with Roy Frostig and Yoram Singer
WednesdayApr 06, 201611:15
Machine Learning and Statistics Seminar
Speaker:Moshe KoppelTitle:Reconstructing Ancient Documents from Noisy ManuscriptsAbstract:opens in new windowin html    pdfopens in new windowRoom 155
Given multiple corrupted versions of the same text, as is common with ancient manuscripts, we wish to reconstruct the original text from which the extant corrupted versions were copied (possibly via latent intermediary versions). This is a challenge of cardinal importance in the humanities. We use a variant of EM to solve this problem and demonstrate the efficacy of the method on both synthetic and real-world data.
WednesdayMar 30, 201611:15
Machine Learning and Statistics SeminarRoom 261
Speaker:Matan GavishTitle:Optimal thresholding of singular values and eigenvaluesAbstract:opens in new windowin html    pdfopens in new windowPLEASE NOTE ROOM CHANGE

It is common practice in multivariate and matrix-valued data analysis to reduce dimensionality by performing a Singular Value Decomposition or Principal Component Analysis, and keeping only $r$ singular values or principal components, the rest being presumably associated with noise. However, the literature does not propose a disciplined criterion to determine $r$; most practitioners still look for the ``elbow in the Scree Plot'', a 50-years-old heuristic performed by eye. I'll review a line of work which develops a systematic approach to eigenvalue and singular value thresholding. This approach assumes that the signal is low-rank and that the noise is rotationally invariant. Recent results derive optimal thresholds in the presence of quite general noise distributions.

Joint work with David Donoho, Iain Johnstone and Edgar Dobriban (Stanford).

WednesdayMar 16, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amichai PainskyTitle:Generalized Independent Component Analysis Over Finite AlphabetsAbstract:opens in new windowin html    pdfopens in new window
Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem [Barlow, '89] and is considered an open problem. We propose several theorems and show that this hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. Moreover, we show that there exists a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution. We further show that while not every random vector can be efficiently decomposed into independent components, the vast majority of vectors do decompose very well (that is, with a small constant cost), as the dimension increases. The minimal redundancy representation (also known as factorial coding) has many applications, mainly in the fields of neural networks and deep learning. In this work we show that this formulation further applies to large alphabet source coding. Joint work with Prof. Saharon Rosset from the Statistics Department and Prof. Meir Feder from the EE department, Tel Aviv University
WednesdayMar 09, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Sebastien BubeckTitle:New Results at the Crossroads of Convexity, Learning and Information TheoryAbstract:opens in new windowin html    pdfopens in new window
I will present three new results: (i) the Cramer transform of the uniform measure on a convex body is a universal self-concordant barrier; (ii) projected gradient descent with Gaussian noise allows to sample from a log-concave measure in polynomial time; and (iii) Thompson sampling combined with a multi-scale exploration solves the Bayesian convex bandit problem. The unifying theme in these results is the interplay between concepts from convex geometry, learning and information theory. Joint work with Ronen Eldan, and for (ii) with Joseph Lehec.
WednesdayFeb 24, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nicolo Cesa-BianchiTitle:Real-time bidding and regret minimizationAbstract:opens in new windowin html    pdfopens in new window
In real-time bidding (RTB), ad exchanges run second-price auctions in a few milliseconds, allowing publishers to sell ad spaces to advertisers on a per-impression basis. The fact that RTB allows the accurate tailoring of impressions to the features of each individual user, has fueled the demand for algorithmic platforms that serve the needs of either the seller or the buyer. In this talk, we focus on the problem, faced by the seller, of dynamically optimizing the reserve price in each auction with the goal of maximizing overall revenue. We cast this problem in a regret minimization setting, and describe computationally efficient algorithms achieving regret of order T^{1/2} under various assumptions both on the information available to the seller and on the mechanism generating bids. Joint work with Claudio Gentile (Varese) and Yishay Mansour (Tel-Aviv).
WednesdayJan 27, 201611:30
Machine Learning and Statistics SeminarRoom 1
Speaker:Assaf HallakTitle:Off-policy Evaluation for MDPs with Unknown StructureAbstract:opens in new windowin html    pdfopens in new windowNEW DATE
In this talk I will present my work from ICML 2015. First, I will give a general introduction to Reinforcement Learning setup and define the off-policy evaluation problem and its core difficulties. I will present the model based solution for off-policy evaluation, and explain when structure can be exploited to improve performance of such solution. This will lead us to the core of our algorithm - the much more general problem of structure learning. The paper suggests solving this problem greedily and give conditions as to when such a solution works. Finally, I will present a few empirical results demonstrating our result.
WednesdayJan 06, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Karen LivescuTitle:Segmental Sequence Models in the Neural AgeAbstract:opens in new windowin html    pdfopens in new windowJOINT Vision and Machine Learning seminar

Many sequence prediction tasks---such as automatic speech recognition and video analysis---benefit from long-range temporal features.  One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields.  Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations.  We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding.  In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video.  I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.

This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel

WednesdayDec 30, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Eran Treister Title:Efficient algorithms for large scale parameter estimationAbstract:opens in new windowin html    pdfopens in new windowJoint Mathematical Analysis and Applications & Machine Learning and Statistics seminar

Parameter estimation is performed by fitting data measurements to a model using Bayesian statistics, assuming additional prior information. The estimation requires a numerical solution of large scale optimization problem, whose objective traditionally includes data fidelity and regularization terms. In this talk I will present numerical solution methods for two such estimation problems.

In the first part of the talk I will concentrate on parameter estimation of physical models, obtained by solving optimization problems that are constrained by partial differential equations (PDEs). I will focus on my recent work on 3D Full Waveform Inversion, which arises in seismic exploration of oil and gas reservoirs, earth sub-surface mapping, ultrasound imaging and more. I will demonstrate how to computationally treat this inverse problem, and improve its solution by using travel time tomography in a joint inversion framework. This includes efficient algorithms for the solution of the Helmholtz and eikonal equations (the two associated PDEs), and a parallel software framework for applying these algorithms for the joint inversion using a Gauss Newton algorithm.

In the second part of the talk, I will consider the estimation of large scale sparse inverse covariance matrices of multivariate Gaussian distribution. Such matrices are often used to characterize and analyze data measurements in fields that range from machine learning, signal processing and computational biology. To estimate these matrices, an l1 regularized log-determinant optimization problem needs to be solved. I will present a block-coordinate descent algorithm that can efficiently solve this problem at large scales with low memory footprint, and a multilevel acceleration framework that is suitable for general sparse optimization problems. These algorithms can be used as a tool for enriching inverse problems by "learning" appropriate prior information, adopting an empirical Bayesian framework.

WednesdayDec 23, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tirza RouttenbergTitle:Estimation after parameter selection: Estimation methods, performance analysis, and adaptive samplingAbstract:opens in new windowin html    pdfopens in new window

In many practical parameter estimation problems, such as medical experiments and cognitive radio communications, parameter selection is performed prior to estimation. The selection process has a major impact on subsequent estimation by introducing a selection bias and creating coupling between decoupled parameters. As a result,   classical estimation theory may be inappropriate and inaccurate and a new methodology is needed. In this study, the problem of estimating a preselected unknown deterministic parameter, chosen from a parameter set based on a predetermined data-based selection rule, \Psi, is considered.  In this talk, I present a general non-Bayesian estimation theory for estimation after parameter selection, includes estimation methods, performance analysis, and adaptive sampling strategies.  First, I use the post-selection mean-square-error (PSMSE) criterion   as a performance measure instead of the commonly used mean-square-error (MSE).  The corresponding Cramér-Rao-type bound on the PSMSE of any \Psi-unbiased estimator is derived, where the \Psi -unbiasedness is in the Lehmann-unbiasedness sense. The post-selection maximum-likelihood (PSML) estimator is presented and its \Psi–efficiency properties are demonstrated. Practical implementations of the PSML estimator are proposed as well. Finally, I discuss the concept of adaptive sampling in a two-sampling stages scheme of selection and estimation.

WednesdayDec 16, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tomer KorenTitle:The Computational Power of Optimization in Online LearningAbstract:opens in new windowin html    pdfopens in new window

We consider the fundamental problem of prediction with expert advice where the experts are "optimizable": there is a black-box optimization oracle that can be used to compute, in constant time, the leading expert in retrospect at any point in time. In this setting, we give a novel online algorithm that attains vanishing regret with respect to $N$ experts in total $\sqrt{N}$ computation time. We also give a lower bound showing that this running time cannot be improved (up to log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the standard, oracle-free setting where the required time for vanishing regret is linear in $N$. These results demonstrate an exponential gap between the power of optimization in online learning and its power in statistical learning: in the latter, an optimization oracle---i.e., an efficient empirical risk minimizer---allows to learn a finite hypothesis class of size $N$ in time $\log{N}$.

We also study the implications of our results to learning in repeated zero-sum games, in a setting where the players have access to oracles that compute, in constant time, their best-response to any mixed strategy of their opponent. We show that the runtime required for approximating the minimax value of the game in this setting is $\sqrt{N}$, yielding again a quadratic improvement upon the oracle-free setting, where linear time in $N$ is known to be tight.

WednesdayDec 02, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yuval Benjamini Title:Estimating bumps: selective inference for regions in non-stationary spatial dataAbstract:opens in new windowin html    pdfopens in new window

"Circular inference" is a pejorative coined for methods in which a hypothesis is selected after looking at the data, but the inferential procedures treat it as if it was determined in advance. Unfortunately, many throughput screening experiments in genomics or neuroimaging seek to do exactly this: identify regions (bumps) of high signal in the data and evaluate these found regions using the same data. Simple estimators that ignore the selection will be biased; when the data is non-stationary, this bias can vary dramatically between different regions. Nevertheless, methods for evaluating and comparing selected regions are crucial, because typically only a handful of regions can be further explored in tailored follow up studies. 

In this talk I describe a new conditional inference approach for characterizing these found regions by estimating their population parameters. Our method explicitly models the selection procedure, and simulates from the conditional distribution to estimate the underlying parameters. Efficient strategies for providing p-value, estimators and intervals will be discussed, as well as power versus accuracy tradeoffs. I will demonstrate the new method for estimating bumps in a comparison of DNA-methylation patterns across tissue type.

This is joint work with Jonathan Taylor and Rafael Irizarry.

MondayJun 15, 201514:30
Machine Learning and Statistics SeminarRoom 261
Speaker:Alex GoldenshlugerTitle:Nonparametric estimation of service time distribution in the $M/G/\infty$ queue and related estimation problemsAbstract:opens in new windowin html    pdfopens in new windowPLEASE NOTE UNUSUAL DAY/TIME/PLACE

The subject of this talk is the problem of estimating service time distribution of the $M/G/\infty$ queue from incomplete data on the queue. The goal is to estimate $G$ from   observations of the queue--length process at the points of  the regular grid on a fixed time interval. We propose an estimator and analyze its accuracy  over a family of target service time distributions. The original $M/G/\infty$ problem is closely related to the problem of estimating derivatives of the covariance function of a  stationary Gaussian process. We consider the latter problem and derive lower bounds on the minimax risk. The obtained results strongly suggest that the proposed estimator of the service time distribution is rate optimal.

WednesdayMay 27, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Niv BuchbinderTitle:A Primal-Dual Approach to Online Optimization ProblemsAbstract:opens in new windowin html    pdfopens in new window
The primal-dual method is one of the fundamental design methodologies in the areas of linear programming, combinatorial optimization, and approximation algorithms. In this talk I will provide an introduction to the area of online combinatorial optimization. I will then discuss the primal-dual method as a unified framework to the design and analysis of online algorithms. Examples for which this approach is applicable include central online problems such as paging, routing, scheduling, online set cover, and graph connectivity problems. Finally, I will show some connections between online combinatorial optimization and online learning.
WednesdayMay 13, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir HazanTitle:Machine Learning: Is it Artificial Intelligence, Statistics or Algorithms? Abstract:opens in new windowin html    pdfopens in new window
Machine learning originates from different fields: Artificial Intelligence (e.g, the Perceptron), Statistics (e.g., the Vapnik-Chervonenkis dimension) and Algorithms (e.g., Valiant's theory of the learnable). In this talk I will show how machine learning relates and differs from these fields while focusing on learning and inference of high-dimensional structures. Such structures arise in various AI applications, for example objects computer vision, parses in natural language processing and molecular structures in computational biology. I will present perturb-max models, which are high-dimensional probability models that measure the stability of prediction to random shifts of data measurements. These models replace the standard Gibbs distributions and allow efficient sampling, as well as generalization and regret bounds.
WednesdayApr 29, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan RosenblattTitle:On the Optimality of Averaging in Distributed Statistical LearningAbstract:opens in new windowin html    pdfopens in new window

A common approach to statistical learning on big data is to randomly split it among m machines and calculate the parameter of interest by averaging their m individual estimates.

Focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error  incurred by this strategy.
We consider two asymptotic settings: one where the number of samples per machine n->inf but the number of  parameters p is fixed, and a second high-dimensional regime where both p,n-> inf with p/n-> kappa.

Most previous works provided only moment bounds on the error incurred by splitting the data in the fixed p setting. In contrast, we present for both regimes asymptotically exact distributions for this estimation error. In the fixed-p setting, under suitable assumptions, we thus prove that to leading order, averaging is as accurate as the centralized solution. In the high-dimensional setting, we show a qualitatively different behavior: data splitting does incur a first order accuracy loss, which we quantify precisely. In addition, our asymptotic distributions allow the construction of confidence intervals and hypothesis testing on the estimated parameters. 

Our main conclusion is that in both regimes, averaging parallelized estimates is an attractive way to speedup computations and save on memory, while incurring a quantifiable and typically moderate excess error.

WednesdayMar 25, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Itai DattnerTitle:Statistical Inference for Systems of Differential EquationsAbstract:opens in new windowin html    pdfopens in new window
Many processes in biology, chemistry, physics, medicine, and engineering are modeled by systems of differential equations. These systems describe the interrelationships between the variables involved, and depend in a complicated way on unknown quantities (e.g., initial values, constants or time dependent parameters). Most often, the researcher would like to execute important tasks such as testing the validity of a model, analyzing its qualitative behavior or predicting future states of the system. In order to execute these tasks, one usually needs to estimate the unknown quantities of the system from real data. However, in the case of differential equations, the inverse problem of parameter estimation is considered as the bottleneck in modeling dynamical systems and estimating parameters based on observed noisy state variables has a relatively sparse statistical literature. In this talk we focus on the fairly general and often applied class of systems of ordinary differential equations linear in (functions of) the parameters. For such systems we first characterize a necessary and sufficient condition for identifiability of parameters. Then we present a novel estimation approach and support it by a general statistical theory that enables the development of estimators tailored for a variety of experimental scenarios. In particular, we present estimators corresponding to some common experimental setups and discuss their statistical properties. Simulation studies and application of the method to real data will be discussed.
WednesdayJan 21, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Omri AbendTitle:Machine-Learning the Universal Semantics of Natural LanguagesAbstract:opens in new windowin html    pdfopens in new window
The field of Natural Language Processing (NLP) has recently been pivotal in producing important language technologies such as machine translation and question answering. Such technologies are based on elaborate structural representations of text, detected by statistical methods. However, common approaches to structural representation are language-specific or even domain-specific, limiting the applicability of NLP tools and models. How to represent both the idiosyncrasies of specific domains and languages as well as their commonalities is still an open question. In my talk I will address these questions and propose an approach for learning a level of representation shared by all languages using latent variable models for structured prediction. Under this approach, learning starts from universally-applicable coarse-grained logical structures, which is used to bootstrap the learning of more fine-grained semantic distinctions, as well as the learning of the specifics of individual languages. I will discuss the value of universal semantic structures both to the computational modeling of child language acquisition and to leading NLP applications, focusing on machine translation. Joint work with Ari Rappoport, Shay Cohen and Mark Steedman.
WednesdayDec 24, 201411:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan BerantTitle:Scalable algorithms for translating natural language to logical formAbstract:opens in new windowin html    pdfopens in new window
Conversational interfaces and virtual assistants such as Apple's Siri, Google Now, and Facebook Graph Search, have led to a rising interest in systems that can translate natural language commands and questions to formal logical forms (like SQL queries) that can be executed against a knowledge base. A major challenge has been to scale these systems, known as semantic parsers, to large knowledge bases. In this talk, I will describe novel algorithms for large scale semantic parsing. A fundamental characteristic of semantic parsing against large knowledge bases is that the space of possible logical forms grows quickly with the length of the input sentence. Our first algorithm learns to efficiently search through this space by explicitly scoring partial logical forms, combining ideas from agenda-based parsing and reinforcement learning. Compared to previous methods, our parser is almost an order of magnitude faster, while maintaining state-of-the-art accuracy. The second algorithm addresses the problem of language variability, that is, the fact that the same logical form can be expressed in a myriad of ways in natural language. We learn to paraphrase an input question ("Where is Obama from?") to a canonical form ("What is the place of birth of Barack Obama?") that can be easily mapped to a logical form. This allows us to exploit the large amounts of free text that are available on the web, leading to a state-of-the-art semantic parser that scales to a knowledge base containing hundreds of millions of facts. This is joint work with Percy Liang.