
Machine Learning and Statistics Seminar

Wednesday, Jul 05, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Ronen Talmon
Title: Common Manifold Learning with Alternating Diffusion
Abstract:

We consider the problem of hidden common manifold extraction from multiple data sets, which have observation-specific distortions and artifacts. A new manifold learning method is presented based on alternating products of diffusion operators and local kernels. We provide theoretical analysis showing that our method is able to build a variant of the Laplacian of the hidden common manifold, while suppressing the observation-specific artifacts. The generality of this method is demonstrated in data analysis applications, where different types of devices are used to measure the same activity. In particular, we present applications to problems in biomedicine, neuroscience, and audio analysis. 
This is joint work with Roy Lederman and Hau-tieng Wu.
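
A minimal numpy sketch of the alternating-diffusion construction described above - two views of the same hidden variable, one row-stochastic diffusion operator per view, and an embedding from the eigenvectors of their product; the toy data, kernel bandwidths, and sizes are illustrative assumptions, not the talk's setup:

```python
import numpy as np

def diffusion_operator(X, eps):
    """Row-stochastic diffusion operator from a Gaussian kernel on data X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-d2 / eps)
    return K / K.sum(axis=1, keepdims=True)               # normalize rows

# Two simultaneous observations of the same n samples (two "sensors").
rng = np.random.default_rng(0)
n = 200
common = rng.uniform(0, 2 * np.pi, n)                     # hidden common variable
X1 = np.c_[np.cos(common), np.sin(common)] + 0.1 * rng.standard_normal((n, 2))
X2 = np.c_[common, rng.standard_normal(n)]                # second view + sensor-specific noise

# Alternating diffusion: the product of the two operators suppresses
# structure that only one of the sensors observes.
A = diffusion_operator(X1, eps=0.5) @ diffusion_operator(X2, eps=0.5)
eigvals, eigvecs = np.linalg.eig(A)
order = np.argsort(-eigvals.real)
embedding = eigvecs[:, order[1:3]].real                   # skip the trivial top eigenvector
```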

Wednesday, Jun 21, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Raja Giryes
Title: On the relationship between structure in the data and what deep learning can learn
Abstract:

The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. In this talk we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory and discuss two recent developments. 
The first work studies the generalization error of deep neural networks. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of such networks. For example, the popular weight decay regularization guarantees margin preservation, but it leads to a loose bound on the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix. 
The second work focuses on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations. 

Joint work with Guillermo Sapiro, Miguel Rodrigues, Jure Sokolic, Alex Bronstein and Yonina Eldar. 
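
For reference, the ISTA iteration mentioned above is simple enough to sketch in a few lines of numpy; a network whose layers mimic these iterations (a linear map followed by a fixed shrinkage nonlinearity) is what the theory speaks to. The problem instance and step size below are illustrative assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, b, lam, n_iter=200):
    """Iterative shrinkage-thresholding for min 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # One iteration = one "layer": linear map, then a fixed nonlinearity.
        x = soft_threshold(x - (A.T @ (A @ x - b)) / L, lam / L)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0   # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = ista(A, b, lam=0.1)
```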

Wednesday, May 17, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Aryeh Kontorovich
Title: Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path
Abstract:

We propose a procedure (the first of its kind) for computing a fully data-dependent interval that traps the mixing time $t_{\mathrm{mix}}$ of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge.

The interval is constructed around the relaxation time $t_{\mathrm{relax}}$, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a $\sqrt{n}$ rate, where $n$ is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least $\Omega(t_{\mathrm{relax}})$ times on the average. Future directions of research are identified. Time permitting, we will mention some recent further developments by D. Levin and Y. Peres.

Joint work with Daniel Hsu and Csaba Szepesvari.
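
A toy numpy sketch of the point estimate at the heart of this construction: build the empirical transition matrix from a single sample path and read off the relaxation time $t_{\mathrm{relax}} = 1/\mathrm{gap}$ from the spectral gap. The chain below is an illustrative assumption, and the confidence-interval construction itself is not reproduced:

```python
import numpy as np

def empirical_relaxation_time(path, n_states):
    """Estimate t_relax = 1/(absolute spectral gap) from one trajectory."""
    counts = np.zeros((n_states, n_states))
    for s, t in zip(path[:-1], path[1:]):
        counts[s, t] += 1
    P_hat = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    eigvals = np.sort(np.abs(np.linalg.eigvals(P_hat)))[::-1]
    gap = 1.0 - eigvals[1]               # 1 minus second-largest eigenvalue modulus
    return 1.0 / gap

# Simulate a small reversible (birth-death) chain and estimate t_relax.
rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])
path = [0]
for _ in range(100000):
    path.append(rng.choice(3, p=P[path[-1]]))
print(empirical_relaxation_time(np.array(path), 3))
```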

Wednesday, Apr 19, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Naftali Tishby
Title: A Deeper Understanding of Deep Learning
Abstract:

By analytical and numerical studies of Deep Neural Networks (using standard TensorFlow) in the "Information Plane" - the mutual information that the network layers preserve about the input and the output variables - we obtain the following new insights.

  1. The training epochs, for each layer, are divided into two phases: (1) fitting the training data - increasing the mutual information on the labels; (2) compressing the representation - reducing the mutual information on the inputs. The layers are learnt hierarchically, from the bottom to the top layer, with some overlaps.
  2. Most (~80%) of the training time - optimization with SGD - is spent on compressing the representation (the second phase), NOT on fitting the training data labels, even when the training has no regularization or terms that directly aim at such compression.
  3. The convergence point, FOR EVERY HIDDEN LAYER, lies on or very close to the Information Bottleneck (IB) theoretical bound. Thus, the mappings from the input to the hidden layer and from the hidden layer to the output obey the IB self-consistent equations for some value of the compression-prediction tradeoff.
  4. The main benefit of adding more hidden layers is in the optimization/training time, as the compression phase for each layer amounts to relaxation to a Maximum conditional Entropy state, subject to the proper constraints on the error/information on the labels. As such relaxation takes super-linear time in the compressed entropy, adding more hidden layers dramatically reduces the training time. There is also benefit in sample complexity to adding hidden layers, but this is a smaller effect.

I will explain these new observations and the benefits of exploring Deep Learning in the "Information Plane", and discuss some of the exciting theoretical and practical consequences of our analysis.

Joint work with Ravid Ziv and Noga Zaslavsky.
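
A sketch of the information-plane computation under a common binning approach: discretize a layer's activations $T$ and estimate $I(X;T)$ and $I(T;Y)$ from empirical joint frequencies. The binning scheme and toy data are illustrative assumptions, not the talk's TensorFlow setup:

```python
import numpy as np

def mutual_information(a, b):
    """Empirical mutual information (in bits) between two discrete arrays."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return (joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum()

def discretize(T, n_bins=30):
    """Bin continuous activations, then map each row to one discrete id."""
    bins = np.linspace(T.min(), T.max(), n_bins)
    codes = np.digitize(T, bins)
    _, ids = np.unique(codes, axis=0, return_inverse=True)
    return ids

# T: hidden-layer activations per input; x_ids, y: discrete input ids / labels.
rng = np.random.default_rng(0)
x_ids = rng.integers(0, 50, 1000)
y = x_ids % 2
T = np.tanh(np.c_[x_ids, x_ids ** 2] * 0.1 + 0.05 * rng.standard_normal((1000, 2)))
print(mutual_information(discretize(T), x_ids), mutual_information(discretize(T), y))
```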

Wednesday, Feb 08, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Haim Avron
Title: Large-scale and Non-approximate Kernel Methods Using Random Features
Abstract:
Kernel methods constitute a mathematically elegant framework for general-purpose infinite-dimensional non-parametric statistical inference. By providing a principled framework to extend classical linear statistical techniques to non-parametric modeling, their applications span the entire spectrum of statistical learning. However, training procedures naturally derived via this framework scale poorly and offer limited opportunities for parallelization. This poor scalability poses a significant barrier for the use of kernel methods in big data applications. As such, with the growth in data across a multitude of applications, scaling up kernel methods has acquired renewed and somewhat urgent significance. Random feature maps, such as random Fourier features, have recently emerged as a powerful technique for speeding up and scaling the training of kernel-based methods. However, random feature maps only provide crude approximations to the kernel function, so delivering state-of-the-art results requires a huge number of random features. Nevertheless, in some cases, even when the number of random features is driven to be as large as the training size, full recovery of the generalization performance of the exact kernel method is not attained. In the talk I will show how random feature maps can be used to efficiently perform non-approximate kernel ridge regression, so that there is no need to compromise between quality and running time. The core idea is to use random feature maps to form preconditioners to be used in solving kernel ridge regression to high accuracy. I will describe theoretical conditions on when this yields an effective preconditioner, and empirically evaluate the method and show it is highly effective for datasets of up to one million training examples.
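
A hedged sketch of the preconditioning idea: random Fourier features give $K \approx Z Z^T$, and $(Z Z^T + \lambda I)^{-1}$ can be applied cheaply through the Woodbury identity, making it a natural preconditioner for conjugate gradient on the exact system $(K + \lambda I)\alpha = y$. The kernel, sizes, and feature count below are illustrative:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n, d, s, lam = 500, 5, 200, 1e-2
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Exact Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / 2).
d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2)

# Random Fourier features: K is approximately Z @ Z.T with s features.
W = rng.standard_normal((d, s))
b = rng.uniform(0, 2 * np.pi, s)
Z = np.sqrt(2.0 / s) * np.cos(X @ W + b)

# Apply (Z Z^T + lam I)^{-1} via Woodbury: only an s x s factorization needed.
small = cho_factor(Z.T @ Z + lam * np.eye(s))
def precond(r):
    return (r - Z @ cho_solve(small, Z.T @ r)) / lam

# Preconditioned CG on the exact (non-approximate) kernel system.
alpha, info = cg(K + lam * np.eye(n), y,
                 M=LinearOperator((n, n), matvec=precond))
```
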
Wednesday, Feb 01, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Ran Gilad-Bachrach
Title: CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy
Abstract:
Applying machine learning to a problem which involves medical, financial, or other types of sensitive data requires not only accurate predictions but also careful attention to maintaining data privacy and security. Legal and ethical requirements may prevent the use of cloud-based machine learning solutions for such tasks. In this work, we will present a method to convert learned neural networks to CryptoNets, neural networks that can be applied to encrypted data. This allows a data owner to send their data in an encrypted form to a cloud service that hosts the network. The encryption ensures that the data remains confidential since the cloud does not have access to the keys needed to decrypt it. Nevertheless, we will show that the cloud service is capable of applying the neural network to the encrypted data to make encrypted predictions, and also return them in encrypted form. These encrypted predictions can be sent back to the owner of the secret key who can decrypt them. Therefore, the cloud service does not gain any information about the raw data nor about the prediction it made. We demonstrate CryptoNets on the MNIST optical character recognition task. CryptoNets achieve 99% accuracy and can make around 59000 predictions per hour on a single PC. Therefore, they allow high throughput, accurate, and private predictions. This is joint work with Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing.
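
Homomorphic encryption schemes natively evaluate only additions and multiplications, so networks in this style replace non-polynomial operations with polynomial ones. The sketch below (plain numpy, no encryption library) shows the kind of forward pass that fits this constraint, with square activations standing in for ReLU or sigmoid; the architecture and weights are illustrative, not the CryptoNets network itself:

```python
import numpy as np

def square(x):
    """Polynomial activation: expressible with one multiplication."""
    return x * x

def forward(x, W1, b1, W2, b2):
    """A forward pass built purely from additions and multiplications,
    so it could in principle be evaluated on encrypted inputs."""
    h = square(x @ W1 + b1)      # square replaces a non-polynomial nonlinearity
    return h @ W2 + b2           # final scores; argmax happens after decryption

rng = np.random.default_rng(0)
W1, b1 = 0.01 * rng.standard_normal((784, 100)), np.zeros(100)
W2, b2 = 0.01 * rng.standard_normal((100, 10)), np.zeros(10)
scores = forward(rng.standard_normal((1, 784)), W1, b1, W2, b2)
```
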
Wednesday, Jan 25, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Amir Globerson
Title: Variational Conditional Probabilities
Abstract:
Predicting the label Y of an object X is a core task in machine learning. From a probabilistic perspective, this involves reasoning about conditional probabilities p(y|x). However, it is hard to obtain reliable estimates for these probabilities. Here we show how to obtain lower and upper bounds on p(y|x) given statistical information, and show how these bounds can be used within various learning setups. We also extend this formulation to the structured case, where y can be multivariate.
Wednesday, Jan 04, 2017, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Sivan Sabato
Title: Active Nearest-Neighbor Learning in Metric Spaces
Abstract:
We propose a pool-based non-parametric active learning algorithm for general metric spaces, which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of the new algorithm is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure. Based on joint work with Aryeh Kontorovich and Ruth Urner.
Monday, Dec 26, 2016, 11:45
Machine Learning and Statistics Seminar, Room 141 (note unusual time and place)
Speaker: Elad Hazan
Title: A Non-generative Framework and Convex Relaxations for Unsupervised Learning
Abstract:
We will describe a novel theoretical framework for unsupervised learning which is not based on generative assumptions. It is comparative, and allows one to avoid known computational hardness results via improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization. These include dictionary learning and learning of algebraic manifolds. Joint work with Tengyu Ma.

Bio: Elad Hazan is a professor of computer science at Princeton University. His research focuses on the design and analysis of algorithms for basic problems in machine learning and optimization. Amongst his contributions are the co-development of the AdaGrad algorithm for training learning machines, and the first sublinear-time algorithms for convex optimization. He has twice received the IBM Goldberg best paper award: in 2012 for contributions to sublinear-time algorithms for machine learning, and in 2008 for decision making under uncertainty. He is also the recipient of a European Research Council grant, a Marie Curie fellowship and a Google Research Award (twice). He served on the steering committee of the Association for Computational Learning and has been program chair for COLT 2015.
Wednesday, Dec 14, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Alon Cohen
Title: Online Learning with Feedback Graphs Without the Graphs
Abstract:

We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is never fully revealed to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves $\widetilde \Theta(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

Wednesday, Nov 16, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Dan Garber
Title: Faster Projection-free Machine Learning and Optimization
Abstract:

Projected gradient descent (PGD), and its close variants, are often considered the methods of choice for solving a large variety of machine learning optimization problems, including empirical risk minimization, statistical learning, and online convex optimization. This is not surprising, since PGD is often optimal in a very appealing information-theoretic sense. However, for many problems PGD is infeasible both in theory and practice since each step requires computing an orthogonal projection onto the feasible set. In many important cases, such as when the feasible set is a non-trivial polytope, or a convex surrogate for a low-rank structure, computing the projection is computationally inefficient in high-dimensional settings. An alternative is the conditional gradient method (CG), aka the Frank-Wolfe algorithm, which replaces the expensive projection step with a linear optimization step over the feasible set. Indeed in many problems of interest, the linear optimization step admits much more efficient algorithms than the projection step, which is the reason for the substantial renewed interest in this method in the past decade. On the downside, the convergence rates of the CG method often fall behind those of PGD and its variants. 
In this talk I will survey an ongoing effort to design CG variants that on one hand enjoy the cheap iteration complexity of the original method, and on the other hand converge provably faster, and are applicable to a wider variety of machine learning settings. In particular I will focus on the cases in which the feasible set is either a polytope or a convex surrogate for low-rank matrices. Results will be demonstrated on applications including: LASSO, video co-localization, optical character recognition, matrix completion, and multi-class classification.
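
A minimal sketch of the conditional gradient step for the case where the feasible set is the $\ell_1$ ball, where linear optimization reduces to picking one signed coordinate - exactly the kind of step that is far cheaper than a projection. The objective and the step-size schedule below are the standard textbook choices, not the faster variants surveyed in the talk:

```python
import numpy as np

def frank_wolfe_l1(grad_f, x0, radius, n_iter=200):
    """Conditional gradient over the l1 ball: the linear step needs only
    the largest-magnitude gradient coordinate, never a projection."""
    x = x0.copy()
    for t in range(n_iter):
        g = grad_f(x)
        i = np.argmax(np.abs(g))
        s = np.zeros_like(x)
        s[i] = -radius * np.sign(g[i])      # argmin of <g, s> over the l1 ball
        x += 2.0 / (t + 2.0) * (s - x)      # standard step size 2/(t+2)
    return x

# LASSO-style least squares constrained to an l1 ball.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
b = A @ np.eye(50)[0] + 0.01 * rng.standard_normal(100)
x = frank_wolfe_l1(lambda x: A.T @ (A @ x - b), np.zeros(50), radius=1.5)
```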

Wednesday, Jun 08, 2016, 11:15
Machine Learning and Statistics Seminar, Room 155
Speaker: Daniel Soudry
Title: No bad local minima: Data independent training error guarantees for multilayer neural networks
Abstract:
We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for an MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
Wednesday, Jun 01, 2016, 11:15
Machine Learning and Statistics Seminar, Room 155
Speaker: Shalom Lappin
Title: Deep Learning and Semantic Interpretation of Natural Language
Abstract:
Classical approaches to formal and computational semantics assign values to the terminal elements of hierarchical syntactic structures and define combinatorial operations on the semantic representations of phrases to compute the values of sentences. While these approaches offer formally elegant models of interpretation, they have not produced wide coverage systems. They do not provide for semantic learning. They have also not succeeded in integrating lexical and compositional semantics in an interesting or computationally efficient way. Recent developments in image caption generation suggest an alternative approach, which can overcome these difficulties. This work formulates the problem of matching images with descriptions as a task in machine translation. Deep neural networks use an encoder to map regions of pixels in an image to vector representations of graphic features, and a decoder to align these features with the distributional vectors of lexical and phrasal items. This approach can be generalized to deep neural networks that identify correspondences between multi-modal data structures and sentences. To the extent that this research program is successful, it will satisfy the core objective of the classical formal semantic program. It will assign truth (fulfilment) conditions to the sentences of a language, where these conditions are specified in terms of multi-modal representations of situations (scenes) in the world. These correspondences are generated not by a recursive definition of a truth predicate in a formal semantic theory, but by an extended deep neural language model.
Wednesday, May 18, 2016, 11:15
Machine Learning and Statistics Seminar, Room 155
Speaker: Abraham Wyner
Title: Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers
Abstract:

There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for essentially similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which fits training data without error but is nevertheless somewhat smooth. We show that AdaBoost has the same property. We conjecture that both AdaBoost and random forests  succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.
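
The talk's practical recommendation - use boosting like random forests, with large trees and no direct regularization or early stopping - is easy to try; a hedged sklearn sketch, with the synthetic dataset and tree depths as illustrative placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Conventional wisdom: boost shallow stumps, few rounds.
stumps = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                            n_estimators=50).fit(X_tr, y_tr)

# The talk's prescription: boost large trees, many rounds, no early stopping.
deep = AdaBoostClassifier(DecisionTreeClassifier(max_depth=8),
                          n_estimators=500).fit(X_tr, y_tr)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)

for name, clf in [("stumps", stumps), ("deep trees", deep), ("forest", forest)]:
    print(name, clf.score(X_te, y_te))
```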

Wednesday, May 04, 2016, 11:15
Machine Learning and Statistics Seminar, Room 155
Speaker: Amit Daniely
Title: The Power of Initialization and a Dual View on Expressivity
Abstract:
We develop a general duality between neural networks and compositional kernels. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective of neural networks and underscores their expressive power. Joint work with Roy Frostig and Yoram Singer.
Wednesday, Apr 06, 2016, 11:15
Machine Learning and Statistics Seminar, Room 155
Speaker: Moshe Koppel
Title: Reconstructing Ancient Documents from Noisy Manuscripts
Abstract:
Given multiple corrupted versions of the same text, as is common with ancient manuscripts, we wish to reconstruct the original text from which the extant corrupted versions were copied (possibly via latent intermediary versions). This is a challenge of cardinal importance in the humanities. We use a variant of EM to solve this problem and demonstrate the efficacy of the method on both synthetic and real-world data.
Wednesday, Mar 30, 2016, 11:15
Machine Learning and Statistics Seminar, Room 261 (please note room change)
Speaker: Matan Gavish
Title: Optimal thresholding of singular values and eigenvalues
Abstract:

It is common practice in multivariate and matrix-valued data analysis to reduce dimensionality by performing a Singular Value Decomposition or Principal Component Analysis, and keeping only $r$ singular values or principal components, the rest being presumably associated with noise. However, the literature does not propose a disciplined criterion to determine $r$; most practitioners still look for the "elbow in the scree plot", a 50-year-old heuristic performed by eye. I'll review a line of work which develops a systematic approach to eigenvalue and singular value thresholding. This approach assumes that the signal is low-rank and that the noise is rotationally invariant. Recent results derive optimal thresholds in the presence of quite general noise distributions.

Joint work with David Donoho, Iain Johnstone and Edgar Dobriban (Stanford).
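
One concrete output of this line of work is a closed-form optimal hard threshold when the noise level is known; for square $n \times n$ matrices it equals $(4/\sqrt{3})\sqrt{n}\,\sigma$. A sketch, assuming white noise of known standard deviation $\sigma$ and an $m \times n$ matrix with $m \le n$:

```python
import numpy as np

def optimal_sv_threshold(m, n, sigma):
    """Gavish-Donoho optimal hard threshold for an m x n matrix (m <= n)
    with i.i.d. white noise of known standard deviation sigma."""
    beta = m / n
    lam = np.sqrt(2 * (beta + 1) + 8 * beta /
                  (beta + 1 + np.sqrt(beta ** 2 + 14 * beta + 1)))
    return lam * np.sqrt(n) * sigma     # for beta = 1, lam = 4/sqrt(3)

def denoise(Y, sigma):
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.where(s >= optimal_sv_threshold(*Y.shape, sigma), s, 0.0)
    return (U * s) @ Vt

rng = np.random.default_rng(0)
X = np.outer(rng.standard_normal(100), rng.standard_normal(100))  # rank-1 signal
Y = X + rng.standard_normal((100, 100))                           # sigma = 1
X_hat = denoise(Y, sigma=1.0)
```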

Wednesday, Mar 16, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Amichai Painsky
Title: Generalized Independent Component Analysis Over Finite Alphabets
Abstract:
Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem [Barlow, '89] and is considered an open problem. We present several theorems and show that this hard problem can be accurately solved with a branch-and-bound search tree algorithm, or tightly approximated with a series of linear problems. Moreover, we show that there exists a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution. We further show that while not every random vector can be efficiently decomposed into independent components, the vast majority of vectors do decompose very well (that is, with a small constant cost), as the dimension increases. The minimal redundancy representation (also known as factorial coding) has many applications, mainly in the fields of neural networks and deep learning. In this work we show that this formulation further applies to large alphabet source coding. Joint work with Prof. Saharon Rosset from the Statistics Department and Prof. Meir Feder from the EE Department, Tel Aviv University.
Wednesday, Mar 09, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Sebastien Bubeck
Title: New Results at the Crossroads of Convexity, Learning and Information Theory
Abstract:
I will present three new results: (i) the Cramer transform of the uniform measure on a convex body is a universal self-concordant barrier; (ii) projected gradient descent with Gaussian noise allows sampling from a log-concave measure in polynomial time; and (iii) Thompson sampling combined with a multi-scale exploration solves the Bayesian convex bandit problem. The unifying theme in these results is the interplay between concepts from convex geometry, learning and information theory. Joint work with Ronen Eldan, and for (ii) with Joseph Lehec.
Wednesday, Feb 24, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Nicolo Cesa-Bianchi
Title: Real-time bidding and regret minimization
Abstract:
In real-time bidding (RTB), ad exchanges run second-price auctions in a few milliseconds, allowing publishers to sell ad spaces to advertisers on a per-impression basis. The fact that RTB allows the accurate tailoring of impressions to the features of each individual user has fueled the demand for algorithmic platforms that serve the needs of either the seller or the buyer. In this talk, we focus on the problem, faced by the seller, of dynamically optimizing the reserve price in each auction with the goal of maximizing overall revenue. We cast this problem in a regret minimization setting, and describe computationally efficient algorithms achieving regret of order $T^{1/2}$ under various assumptions both on the information available to the seller and on the mechanism generating bids. Joint work with Claudio Gentile (Varese) and Yishay Mansour (Tel-Aviv).
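
A toy sketch of the seller's problem as a bandit over a discretized grid of reserve prices: in a second-price auction with reserve $r$, revenue is the second-highest bid if it clears $r$, $r$ if only the highest bid does, and zero otherwise. The UCB learner and the uniform bid model below are illustrative stand-ins for the algorithms and assumptions of the talk:

```python
import numpy as np

def revenue(bids, r):
    """Seller revenue of a second-price auction with reserve price r."""
    b1, b2 = np.sort(bids)[-2:][::-1]            # highest and second-highest bid
    if b1 < r:
        return 0.0                               # no sale: reserve not met
    return max(b2, r)                            # winner pays max(second bid, reserve)

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 21)                     # candidate reserve prices
counts, sums = np.zeros(len(grid)), np.zeros(len(grid))

for t in range(1, 10001):
    # UCB index per candidate reserve price.
    ucb = sums / np.maximum(counts, 1) + np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    arm = int(np.argmax(ucb))
    bids = rng.uniform(0, 1, size=5)             # stochastic bids (illustrative)
    counts[arm] += 1
    sums[arm] += revenue(bids, grid[arm])

print("learned reserve:", grid[int(np.argmax(sums / np.maximum(counts, 1)))])
```
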
Wednesday, Jan 27, 2016, 11:30
Machine Learning and Statistics Seminar, Room 1 (new date)
Speaker: Assaf Hallak
Title: Off-policy Evaluation for MDPs with Unknown Structure
Abstract:
In this talk I will present my work from ICML 2015. First, I will give a general introduction to the reinforcement learning setup and define the off-policy evaluation problem and its core difficulties. I will present the model-based solution for off-policy evaluation, and explain when structure can be exploited to improve the performance of such a solution. This will lead us to the core of our algorithm - the much more general problem of structure learning. The paper suggests solving this problem greedily and gives conditions for when such a solution works. Finally, I will present a few empirical results demonstrating our approach.
Wednesday, Jan 06, 2016, 11:15
Machine Learning and Statistics Seminar, Room 1 (joint Vision and Machine Learning seminar)
Speaker: Karen Livescu
Title: Segmental Sequence Models in the Neural Age
Abstract:

Many sequence prediction tasks---such as automatic speech recognition and video analysis---benefit from long-range temporal features.  One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields.  Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations.  We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding.  In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video.  I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.


This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel.

Wednesday, Dec 30, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1 (joint Mathematical Analysis and Applications & Machine Learning and Statistics seminar)
Speaker: Eran Treister
Title: Efficient algorithms for large scale parameter estimation
Abstract:

Parameter estimation is performed by fitting data measurements to a model using Bayesian statistics, assuming additional prior information. The estimation requires a numerical solution of large scale optimization problem, whose objective traditionally includes data fidelity and regularization terms. In this talk I will present numerical solution methods for two such estimation problems.

In the first part of the talk I will concentrate on parameter estimation of physical models, obtained by solving optimization problems that are constrained by partial differential equations (PDEs). I will focus on my recent work on 3D Full Waveform Inversion, which arises in seismic exploration of oil and gas reservoirs, earth sub-surface mapping, ultrasound imaging and more. I will demonstrate how to computationally treat this inverse problem, and improve its solution by using travel time tomography in a joint inversion framework. This includes efficient algorithms for the solution of the Helmholtz and eikonal equations (the two associated PDEs), and a parallel software framework for applying these algorithms to the joint inversion using a Gauss-Newton algorithm.

In the second part of the talk, I will consider the estimation of large scale sparse inverse covariance matrices of a multivariate Gaussian distribution. Such matrices are often used to characterize and analyze data measurements in fields ranging from machine learning and signal processing to computational biology. To estimate these matrices, an l1-regularized log-determinant optimization problem needs to be solved. I will present a block-coordinate descent algorithm that can efficiently solve this problem at large scales with low memory footprint, and a multilevel acceleration framework that is suitable for general sparse optimization problems. These algorithms can be used as a tool for enriching inverse problems by "learning" appropriate prior information, adopting an empirical Bayesian framework.
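
The l1-regularized log-determinant problem in the second part is the objective solved by the graphical lasso; the talk's contribution is scaling it up, but the problem itself can be illustrated with an off-the-shelf solver on a small synthetic example (sizes and penalty weight are placeholders):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Sparse ground-truth precision (inverse covariance) matrix.
rng = np.random.default_rng(0)
Theta = np.eye(5)
Theta[0, 1] = Theta[1, 0] = 0.4
Theta[2, 3] = Theta[3, 2] = -0.3

# Sample from the corresponding Gaussian and estimate Theta back.
X = rng.multivariate_normal(np.zeros(5), np.linalg.inv(Theta), size=2000)
model = GraphicalLasso(alpha=0.05).fit(X)       # alpha is the l1 penalty weight
print(np.round(model.precision_, 2))            # zeros encode conditional independence
```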

Wednesday, Dec 23, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Tirza Routtenberg
Title: Estimation after parameter selection: Estimation methods, performance analysis, and adaptive sampling
Abstract:

In many practical parameter estimation problems, such as medical experiments and cognitive radio communications, parameter selection is performed prior to estimation. The selection process has a major impact on subsequent estimation by introducing a selection bias and creating coupling between decoupled parameters. As a result, classical estimation theory may be inappropriate and inaccurate, and a new methodology is needed. In this study, the problem of estimating a preselected unknown deterministic parameter, chosen from a parameter set based on a predetermined data-based selection rule, $\Psi$, is considered. In this talk, I present a general non-Bayesian estimation theory for estimation after parameter selection, including estimation methods, performance analysis, and adaptive sampling strategies. First, I use the post-selection mean-square-error (PSMSE) criterion as a performance measure instead of the commonly used mean-square-error (MSE). The corresponding Cramér-Rao-type bound on the PSMSE of any $\Psi$-unbiased estimator is derived, where the $\Psi$-unbiasedness is in the Lehmann-unbiasedness sense. The post-selection maximum-likelihood (PSML) estimator is presented and its $\Psi$-efficiency properties are demonstrated. Practical implementations of the PSML estimator are proposed as well. Finally, I discuss the concept of adaptive sampling in a two-stage scheme of selection and estimation.
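
A small simulation of why estimation after selection needs its own theory: select the population with the largest sample mean, then reuse the same data to estimate that population's mean, and the naive estimator is biased upward. The selection rule here is a toy instance of a data-based rule $\Psi$; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.zeros(10)        # all populations actually identical
n, reps, errors = 20, 5000, []

for _ in range(reps):
    samples = rng.standard_normal((10, n)) + true_means[:, None]
    means = samples.mean(axis=1)
    k = int(np.argmax(means))    # selection rule Psi: pick the "best" population
    errors.append(means[k] - true_means[k])

print("naive post-selection bias:", np.mean(errors))   # clearly positive
```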

Wednesday, Dec 16, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Tomer Koren
Title: The Computational Power of Optimization in Online Learning
Abstract:

We consider the fundamental problem of prediction with expert advice where the experts are "optimizable": there is a black-box optimization oracle that can be used to compute, in constant time, the leading expert in retrospect at any point in time. In this setting, we give a novel online algorithm that attains vanishing regret with respect to $N$ experts in total $\sqrt{N}$ computation time. We also give a lower bound showing that this running time cannot be improved (up to log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the standard, oracle-free setting where the required time for vanishing regret is linear in $N$. These results demonstrate an exponential gap between the power of optimization in online learning and its power in statistical learning: in the latter, an optimization oracle---i.e., an efficient empirical risk minimizer---allows learning a finite hypothesis class of size $N$ in time $\log{N}$.

We also study the implications of our results to learning in repeated zero-sum games, in a setting where the players have access to oracles that compute, in constant time, their best-response to any mixed strategy of their opponent. We show that the runtime required for approximating the minimax value of the game in this setting is $\sqrt{N}$, yielding again a quadratic improvement upon the oracle-free setting, where linear time in $N$ is known to be tight.

Wednesday, Dec 02, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Yuval Benjamini
Title: Estimating bumps: selective inference for regions in non-stationary spatial data
Abstract:

"Circular inference" is a pejorative coined for methods in which a hypothesis is selected after looking at the data, but the inferential procedures treat it as if it was determined in advance. Unfortunately, many throughput screening experiments in genomics or neuroimaging seek to do exactly this: identify regions (bumps) of high signal in the data and evaluate these found regions using the same data. Simple estimators that ignore the selection will be biased; when the data is non-stationary, this bias can vary dramatically between different regions. Nevertheless, methods for evaluating and comparing selected regions are crucial, because typically only a handful of regions can be further explored in tailored follow up studies. 

In this talk I describe a new conditional inference approach for characterizing these found regions by estimating their population parameters. Our method explicitly models the selection procedure, and simulates from the conditional distribution to estimate the underlying parameters. Efficient strategies for providing p-values, estimators and intervals will be discussed, as well as power versus accuracy tradeoffs. I will demonstrate the new method for estimating bumps in a comparison of DNA-methylation patterns across tissue types.

This is joint work with Jonathan Taylor and Rafael Irizarry.

Monday, Jun 15, 2015, 14:30
Machine Learning and Statistics Seminar, Room 261 (please note unusual day/time/place)
Speaker: Alex Goldenshluger
Title: Nonparametric estimation of service time distribution in the $M/G/\infty$ queue and related estimation problems
Abstract:

The subject of this talk is the problem of estimating the service time distribution of the $M/G/\infty$ queue from incomplete data on the queue. The goal is to estimate $G$ from observations of the queue-length process at the points of a regular grid on a fixed time interval. We propose an estimator and analyze its accuracy over a family of target service time distributions. The original $M/G/\infty$ problem is closely related to the problem of estimating derivatives of the covariance function of a stationary Gaussian process. We consider the latter problem and derive lower bounds on the minimax risk. The obtained results strongly suggest that the proposed estimator of the service time distribution is rate optimal.

Wednesday, May 27, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Niv Buchbinder
Title: A Primal-Dual Approach to Online Optimization Problems
Abstract:
The primal-dual method is one of the fundamental design methodologies in the areas of linear programming, combinatorial optimization, and approximation algorithms. In this talk I will provide an introduction to the area of online combinatorial optimization. I will then discuss the primal-dual method as a unified framework to the design and analysis of online algorithms. Examples for which this approach is applicable include central online problems such as paging, routing, scheduling, online set cover, and graph connectivity problems. Finally, I will show some connections between online combinatorial optimization and online learning.
Wednesday, May 13, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Tamir Hazan
Title: Machine Learning: Is it Artificial Intelligence, Statistics or Algorithms?
Abstract:
Machine learning originates from different fields: Artificial Intelligence (e.g., the Perceptron), Statistics (e.g., the Vapnik-Chervonenkis dimension) and Algorithms (e.g., Valiant's theory of the learnable). In this talk I will show how machine learning relates to and differs from these fields while focusing on learning and inference of high-dimensional structures. Such structures arise in various AI applications, for example objects in computer vision, parses in natural language processing and molecular structures in computational biology. I will present perturb-max models, which are high-dimensional probability models that measure the stability of prediction to random shifts of data measurements. These models replace the standard Gibbs distributions and allow efficient sampling, as well as generalization and regret bounds.
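
The core of perturb-max models is the Gumbel-max trick: adding i.i.d. Gumbel noise to the potentials and taking an argmax yields exact samples from the Gibbs distribution. A numpy sketch over a toy four-configuration model; for high-dimensional structures the talk's models perturb the potentials and use efficient MAP solvers in place of this exhaustive argmax:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 0.5, -1.0])             # potentials of 4 configurations

# Gumbel-max: argmax(theta + Gumbel noise) is distributed as softmax(theta).
samples = [np.argmax(theta + rng.gumbel(size=4)) for _ in range(100000)]
empirical = np.bincount(samples) / len(samples)

gibbs = np.exp(theta) / np.exp(theta).sum()          # exact Gibbs distribution
print(np.round(empirical, 3), np.round(gibbs, 3))    # the two should match
```
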
Wednesday, Apr 29, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Jonathan Rosenblatt
Title: On the Optimality of Averaging in Distributed Statistical Learning
Abstract:

A common approach to statistical learning on big data is to randomly split it among m machines and calculate the parameter of interest by averaging their m individual estimates.

Focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy.
We consider two asymptotic settings: one where the number of samples per machine $n \to \infty$ but the number of parameters $p$ is fixed, and a second high-dimensional regime where both $p, n \to \infty$ with $p/n \to \kappa$.

Most previous works provided only moment bounds on the error incurred by splitting the data in the fixed p setting. In contrast, we present for both regimes asymptotically exact distributions for this estimation error. In the fixed-p setting, under suitable assumptions, we thus prove that to leading order, averaging is as accurate as the centralized solution. In the high-dimensional setting, we show a qualitatively different behavior: data splitting does incur a first order accuracy loss, which we quantify precisely. In addition, our asymptotic distributions allow the construction of confidence intervals and hypothesis testing on the estimated parameters. 

Our main conclusion is that in both regimes, averaging parallelized estimates is an attractive way to speedup computations and save on memory, while incurring a quantifiable and typically moderate excess error.
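
A sketch of the strategy under study: shard the data across $m$ machines, fit the same M-estimator on each shard, and average the coefficient vectors, comparing against the centralized fit. Logistic regression and the sizes below are illustrative placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, p, m = 20000, 10, 8
w_true = rng.standard_normal(p)
X = rng.standard_normal((N, p))
y = (X @ w_true + rng.logistic(size=N) > 0).astype(int)

# Averaged estimator: one (nearly unregularized) fit per machine, then average.
shards = np.array_split(np.arange(N), m)
w_avg = np.mean([LogisticRegression(C=1e6, max_iter=1000).fit(X[s], y[s]).coef_[0]
                 for s in shards], axis=0)

# Centralized estimator on all the data.
w_central = LogisticRegression(C=1e6, max_iter=1000).fit(X, y).coef_[0]

print("averaged error:   ", np.linalg.norm(w_avg - w_true))
print("centralized error:", np.linalg.norm(w_central - w_true))
```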

Wednesday, Mar 25, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Itai Dattner
Title: Statistical Inference for Systems of Differential Equations
Abstract:
Many processes in biology, chemistry, physics, medicine, and engineering are modeled by systems of differential equations. These systems describe the interrelationships between the variables involved, and depend in a complicated way on unknown quantities (e.g., initial values, constants or time dependent parameters). Most often, the researcher would like to execute important tasks such as testing the validity of a model, analyzing its qualitative behavior or predicting future states of the system. In order to execute these tasks, one usually needs to estimate the unknown quantities of the system from real data. However, in the case of differential equations, the inverse problem of parameter estimation is considered as the bottleneck in modeling dynamical systems and estimating parameters based on observed noisy state variables has a relatively sparse statistical literature. In this talk we focus on the fairly general and often applied class of systems of ordinary differential equations linear in (functions of) the parameters. For such systems we first characterize a necessary and sufficient condition for identifiability of parameters. Then we present a novel estimation approach and support it by a general statistical theory that enables the development of estimators tailored for a variety of experimental scenarios. In particular, we present estimators corresponding to some common experimental setups and discuss their statistical properties. Simulation studies and application of the method to real data will be discussed.
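
For systems linear in the parameters, one classical estimation route ("gradient matching") reduces to linear least squares: approximate the derivatives of the observed trajectories and regress them on the known functions of the states. A sketch on the Lotka-Volterra system, whose right-hand side is linear in $(a, b, c, d)$; the noise level and the differencing scheme are illustrative choices, not the talk's estimator:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lotka-Volterra: x' = a*x - b*x*y,  y' = c*x*y - d*y  (linear in a, b, c, d).
a, b, c, d = 1.0, 0.5, 0.3, 0.8
sol = solve_ivp(lambda t, z: [a*z[0] - b*z[0]*z[1], c*z[0]*z[1] - d*z[1]],
                (0, 10), [2.0, 1.0], t_eval=np.linspace(0, 10, 400))
t, (x, y) = sol.t, sol.y

# Observe noisy states, approximate derivatives by finite differences,
# then solve two small linear least-squares problems for the parameters.
rng = np.random.default_rng(0)
x_obs = x + 0.01 * rng.standard_normal(x.shape)
y_obs = y + 0.01 * rng.standard_normal(y.shape)
dx, dy = np.gradient(x_obs, t), np.gradient(y_obs, t)

a_hat, b_hat = np.linalg.lstsq(np.c_[x_obs, -x_obs * y_obs], dx, rcond=None)[0]
c_hat, d_hat = np.linalg.lstsq(np.c_[x_obs * y_obs, -y_obs], dy, rcond=None)[0]
print(a_hat, b_hat, c_hat, d_hat)
```
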
Wednesday, Jan 21, 2015, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Omri Abend
Title: Machine-Learning the Universal Semantics of Natural Languages
Abstract:
The field of Natural Language Processing (NLP) has recently been pivotal in producing important language technologies such as machine translation and question answering. Such technologies are based on elaborate structural representations of text, detected by statistical methods. However, common approaches to structural representation are language-specific or even domain-specific, limiting the applicability of NLP tools and models. How to represent both the idiosyncrasies of specific domains and languages as well as their commonalities is still an open question. In my talk I will address these questions and propose an approach for learning a level of representation shared by all languages using latent variable models for structured prediction. Under this approach, learning starts from universally-applicable coarse-grained logical structures, which are used to bootstrap the learning of more fine-grained semantic distinctions, as well as the learning of the specifics of individual languages. I will discuss the value of universal semantic structures both to the computational modeling of child language acquisition and to leading NLP applications, focusing on machine translation. Joint work with Ari Rappoport, Shay Cohen and Mark Steedman.
Wednesday, Dec 24, 2014, 11:15
Machine Learning and Statistics Seminar, Room 1
Speaker: Jonathan Berant
Title: Scalable algorithms for translating natural language to logical form
Abstract:
Conversational interfaces and virtual assistants such as Apple's Siri, Google Now, and Facebook Graph Search, have led to a rising interest in systems that can translate natural language commands and questions to formal logical forms (like SQL queries) that can be executed against a knowledge base. A major challenge has been to scale these systems, known as semantic parsers, to large knowledge bases. In this talk, I will describe novel algorithms for large scale semantic parsing. A fundamental characteristic of semantic parsing against large knowledge bases is that the space of possible logical forms grows quickly with the length of the input sentence. Our first algorithm learns to efficiently search through this space by explicitly scoring partial logical forms, combining ideas from agenda-based parsing and reinforcement learning. Compared to previous methods, our parser is almost an order of magnitude faster, while maintaining state-of-the-art accuracy. The second algorithm addresses the problem of language variability, that is, the fact that the same logical form can be expressed in a myriad of ways in natural language. We learn to paraphrase an input question ("Where is Obama from?") to a canonical form ("What is the place of birth of Barack Obama?") that can be easily mapped to a logical form. This allows us to exploit the large amounts of free text that are available on the web, leading to a state-of-the-art semantic parser that scales to a knowledge base containing hundreds of millions of facts. This is joint work with Percy Liang.