You are here

1

WednesdayJul 19, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Itay SafranTitle:How Many Neurons Does it Take to Approximate the Maximum?Abstract:opens in new windowin html    pdfopens in new window
Understanding the role of depth is a fundamental endeavor in explaining the practical success of deep learning. In this talk, we will focus on the problem of approximating the maximum function over $d$ inputs using deep ReLU networks and with respect to the uniform distribution over a hypercube. We will show that while approximating the maximum using depth 2 networks to arbitrary accuracy requires arbitrary width, this can be done to arbitrary accuracy using a depth 3 network with a fixed width of $d^2$. Additionally, we will also show that this upper bound is tight, namely that width $\Omega(d^2)$ is also necessary when approximating using depth 3. Moreover, using this efficient depth 3 construction, we will show that greater depths result in a lesser width requirement, where width $\mathcal{O}(d)$ suffices when we allow depth $\mathcal{O}(\log(\log(d)))$. Lastly, we will show a size lower bound of $d$ neurons for approximating the maximum using any depth. These results establish a partial depth hierarchy for approximating a naturally occuring function which helps explain the benefits of depth over width.
WednesdayJul 05, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Gal VardiTitle:Implicit Bias and Provable Generalization in Overparameterized Neural NetworksAbstract:opens in new windowin html    pdfopens in new window
When training large neural networks, there are typically many solutions that perfectly fit the training data. Nevertheless, gradient-based methods have a tendency to reach those which generalize well, and understanding this "implicit bias" has been a subject of extensive research. In this talk, I will discuss three works that show settings where the implicit bias provably implies generalization in two-layer neural networks: First, the implicit bias implies generalization in univariate ReLU networks. Second, in ReLU networks where the data consists of clusters and the correlations between cluster means are small, the implicit bias leads to solutions that generalize well, but are highly vulnerable to adversarial examples. Third, in Leaky-ReLU networks (as well as linear classifiers), under certain assumptions on the input distribution, the implicit bias leads to benign overfitting: the estimators interpolate noisy training data and simultaneously generalize well to test data. Based on joint works with Spencer Frei, Itay Safran, Peter L. Bartlett, Jason D. Lee, and Nati Srebro. Bio: Gal is a postdoc at TTI-Chicago and the Hebrew University, hosted by Nati Srebro and Amit Daniely as part of the NSF/Simons Collaboration on the Theoretical Foundations of Deep Learning. Prior to that, he was a postdoc at the Weizmann Institute, hosted by Ohad Shamir, and a PhD student at the Hebrew University, advised by Orna Kupferman. His research focuses on theoretical machine learning, with an emphasis on deep-learning theory.
WednesdayJun 28, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yaniv RomanoTitle:Conformal Prediction is Robust to Label NoiseAbstract:opens in new windowin html    pdfopens in new window
In real-world supervised learning problems, accurate and trustworthy labels are often elusive, with label noise being a pervasive challenge. In this talk, we will delve into the inherent robustness of conformal prediction---a powerful tool for quantifying predictive uncertainty---to label noise. We will address both regression and classification problems and characterize how and when we can generate uncertainty sets that include the true labels that are hidden from us. By navigating between theory and practice, we will showcase the conservative coverage of clean ground truth labels achieved by employing conformal prediction with noisy labels and commonly used score functions, except in adversarial cases.
WednesdayJun 28, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yaniv RomanoTitle:Conformal Prediction is Robust to Label NoiseAbstract:opens in new windowin html    pdfopens in new window
In real-world supervised learning problems, accurate and trustworthy labels are often elusive, with label noise being a pervasive challenge. In this talk, we will delve into the inherent robustness of conformal prediction---a powerful tool for quantifying predictive uncertainty---to label noise. We will address both regression and classification problems and characterize how and when we can generate uncertainty sets that include the true labels that are hidden from us. By navigating between theory and practice, we will showcase the conservative coverage of clean ground truth labels achieved by employing conformal prediction with noisy labels and commonly used score functions, except in adversarial cases.
SundayJun 25, 202312:15
Machine Learning and Statistics SeminarRoom 155
Speaker:Yftah ZiserTitle:Extending the Reach of NLP: Overcoming the Data BottleneckAbstract:opens in new windowin html    pdfopens in new window***Joint with Vision and AI Seminar*** ***Please note the unusual day and time***
Transformer-based models have revolutionized natural language processing (NLP) and significantly improved various NLP tasks. However, many researchers make implicit assumptions about their training setups, assuming that the train and test sets are drawn from the same distribution. This assumption can limit the applicability of these models across different languages and domains. The high cost of training state-of-the-art NLP models using various languages and domains has resulted in training them for only a subset of languages and domains, leading to a significant performance gap in excluded domains and languages. This performance gap marginalizes many individuals from accessing useful models. This talk will address the challenges, approaches, and opportunities for democratizing NLP across different languages and domains. Finally, we will explore future directions for making these models accessible to a broader audience.
WednesdayJun 14, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Mahmood SharifTitle:Assessing and enhancing ML systems' adversarial robustness in the real worldAbstract:opens in new windowin html    pdfopens in new window
Adversarial examples have emerged as a profound challenge for machine learning (ML), sparking interest in developing adversarially robust ML models and reliable techniques for assessing robustness. This talk will present our recent efforts on these fronts, offering attacks and defenses under practical constraints. First, I will introduce our work on black-box attacks composing and integrating data augmentations into evasion techniques to promote transferability to unseen models. Second, I will describe our work fooling face-recognition systems via physically realized adversarial accessories attackers can wear to dodge recognition or impersonate others. We evaluated face-recognition systems relying on different sensor types, including visible-light and near-infrared cameras, against evasion and found all were highly vulnerable, even when defended by state-of-the-art techniques. Third, I will present practical attacks against ML-based malware detection. Our attacks interweave binary-diversification techniques and optimization frameworks to mislead malware detection while preserving binaries' functionality. Unlike prior work, ours manipulate instructions that are a functional part of the binary, rendering them particularly challenging to defend against. I will conclude with our attempts to enhance the robustness of ML-based malware detection via adversarial training.
WednesdayMay 24, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Prof. Gilad Lerman Title:Cycle-edge message passing for group and non-group synchronizationAbstract:opens in new windowin html    pdfopens in new window
The general synchronization problem asks to recover states of objects from their corrupted relative measurements. When the states are represented by group elements (e.g. 3-D rotations or permutations) this problem is known as group synchronization. In several applications, the algebraic structure of the states is more complicated, for example, the states can be represented by partial permutations. The synchronization problem has many applications, in particular, to structure-from-motion (SfM), where one needs to estimate the 3D structure of a scene from a set of its projected 2D images. I will first describe a general framework for group synchronization, the Cycle-Edge Message Passing (CEMP), and then explain its generalization to non groups, by exemplifying the case of partial permutation synchronization. I will emphasize mathematical difficulties, review some mathematical guarantees for the proposed methods and also demonstrate an application. This is a joint work with Shaohan Li and Yunpeng Shi.
WednesdayMay 10, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Daniel NevoTitle:Causal inference with misspecified interference structureAbstract:opens in new windowin html    pdfopens in new window
The typical approach towards drawing causal conclusions from observed data starts by defining a causal estimand, for example in terms of potential outcomes or the so-called do operator, and continues by providing conditions for identification of this estimand from the data, followed by statistical estimation and inference. One of the main assumptions is the no-interference assumption, meaning that the treatment assigned to one unit does not affect other units in the sample. However, in many domains such as in the social sciences and infectious disease epidemiology, this assumption is implausible in practice due to social interactions. As an alternative to the no-interference assumption, an interference structure is often represented using a network. Ubiquitously, the network structure is assumed to be known and correctly specified. Nevertheless, correctly encoding the interference structure in a network can be challenging. For example, people may misreport their social connections, or report connections irrelevant to the specific combination of treatment and outcome. Building on the exposure mapping framework, we derive the bias arising from estimating causal effects under a misspecified interference structure. To address this problem, we propose a novel estimator that uses multiple networks simultaneously and is unbiased if one of the networks correctly represents the interference structure, thus providing robustness to the network specification. Additionally, we propose a sensitivity analysis that quantifies the impact of a postulated misspecification mechanism on the causal estimates. Through simulation studies, we illustrate the bias from assuming an incorrect network and show the bias-variance tradeoff of our proposed network-misspecification-robust estimator. We further demonstrate the utility of our methods in two real examples. Joint work with Bar Weinstein
WednesdayMay 03, 202311:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ofir LindenbaumTitle:Exploiting Randomness in Machine LearningAbstract:opens in new windowin html    pdfopens in new window
Noise plays a central role in many machine learning algorithms; one well-studied example is stochastic gradient descent (SGD). Despite its simplicity, i.e., being a noisy first-order optimization method, SGD empirically outperforms gradient descent (GD) and second-order methods. In the first part of the talk, I will present Gaussian Stochastic Gates (STGs), a differentiable non-convex relaxation of the L0 norm highly effective for feature selection. Under a linear sparse regression model, I will show that STGs can recover the informative features successfully. Then, I will show how a randomly aggregated least squares procedure can improve the probability of exact recovery. In the second part, I will revisit the noise model of SGD and present strong evidence that it is well characterized by a symmetric $S\alpha S$ Lévy distribution. This allows us to use Lévy-driven stochastic differential equation (SDE) to analyze different properties of deep nets near local minima. (based and join works with Yutaro Tamada, Yuval Kluger, Sahand Negahban, Stefan Steinerberger, and Barak Battash).
ThursdayFeb 09, 202312:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tomer Michaeli Title:The implicit bias of SGD: A Minima stability analysisAbstract:opens in new windowin html    pdfopens in new window*** Joint ML/Statistics & Vision/AI Seminar ***
One of the puzzling phenomena in deep learning, is that neural networks tend to generalize well even when they are highly overparameterized. Recent works linked this behavior with implicit biases of the algorithms used to train networks (like SGD). Here we analyze the implicit bias of SGD from the standpoint of minima stability, focusing on shallow ReLU networks trained with a quadratic loss. Specifically, it is known that SGD can stably converge only to minima that are flat enough w.r.t. its step size. Here we show that this property enforces the predictor function to become smoother as the step size increases, thus significantly regularizing the solution. Furthermore, we analyze the representation power of stable solutions. Particularly, we prove a depth-separation result: There exist functions that cannot be approximated by depth-2 networks corresponding to stable minima, no matter how small the step size is taken to be, but which can be implemented with depth-3 networks corresponding to stable minima. We show how our theoretical findings explain behaviors observed in practical settings. (Joint works with Rotem Mulayoff, Mor Shpigel Nacson, Greg Ongie, Daniel Soudry).
WednesdayDec 28, 202211:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yoav LevineTitle:Theoretical and practical principles for designing, training, and deploying huge language modelsAbstract:opens in new windowin html    pdfopens in new window
The field of natural language processing (NLP) has been advancing in giant strides over the past several years. The main drivers of this success are: (1) scaling the Transformer deep network architecture to unprecedented sizes and (2) "pretraining" the Transformer over massive amounts of unlabeled text. In this talk, I will describe efforts to provide principled guidance for the above main components and further thrusts in contemporary NLP, aimed to serve as timely constructive feedback for the strong empirical pull in this field. I will begin by describing our theoretical framework for analyzing Transformers, and present results on the depth to width tradeoffs in Transformers, identified bottlenecks within internal Transformer dimensions, and identified biases introduced during the Transformer self-supervised pretraining phase. This framework has guided the design and scale of several of the largest existing language models, including Chinchilla by Deepmind (70 billion learned parameters), Bloom by BigScience (176 billion learned parameters), and Jurassic-1 by AI21(178 billion learned parameters). Then, I will describe our works on leveraging linguistic biases such as word senses or frequent n-grams in order to increase efficiency of the self-supervised pretraining phase. Subsequently, I will describe novel principles for addressing a present-day problem stemming from the above success of scaling, namely, how to deploy a huge language model such that it specializes in many different use cases simultaneously. Finally, I will comment on future challenges in this field, and will relatedly present a recent theoretical result on the importance of intermediate supervision when solving composite NLP tasks. This talk is based on works published in NeurIPS 2020, ACL 2020, ICLR 2021 (spotlight), ICML 2021, ICLR 2022 (spotlight), ICML 2022 (workshop), as well as on several recent preprints.
WednesdayDec 21, 202211:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yonathan EfroniTitle:Reinforcement Learning in the presence of irrelevant informationAbstract:opens in new windowin html    pdfopens in new window
Reinforcement Learning (RL) is a field concerned with designing general purpose learning algorithms that solve sequential-decision tasks. In recent years, by using deep neural networks, RL algorithms were applied on high-dimensional and challenging domains, witnessing unprecedented success. Yet, despite recent advancements, the theoretical foundations of high-dimensional RL are not fully understood. A recurring theme in high-dimensional RL is the presence of irrelevant information in the observations. E.g., in a visual navigation task the observation might capture the movement of clouds, which is irrelevant for reaching the goal location. This calls for natural questions: Can such tasks be learned efficiently, depending only on the complexity of the relevant information? Can RL algorithms be robust to noise in observations? Surprisingly, contemporary RL algorithms may provably fail in the presence of irrelevant information. In this talk, I will elaborate on these failure cases and present our new provable approaches for high-dimensional RL with irrelevant information. Shared to these are techniques to filter the irrelevant information while guaranteeing near-optimal behavior. I will conclude with experimental results showcasing challenges and solutions in practice. Bio: Yonathan is a research scientist at Meta. Prior to that he completed his post-doctorate in Microsoft Research, New York. He obtained his PhD from the Technion, advised by Prof. Shie Mannor, and his Master from the Weizmann institute in Physics. His work won the outstanding paper award in AAAI19 and a best paper award in the OptRL workshop in NeurIPS19.
WednesdayJun 15, 202211:15
Machine Learning and Statistics SeminarRoom 155
Speaker:Eran MalachTitle:Deep Learning at the Computational LimitAbstract:opens in new windowin html    pdfopens in new window
What problems can be solved with Deep Learning? In recent years, theoretical works on Deep Learning have presented many positive results, showing cases where neural networks provably work, along with various negative results, emphasizing the limitations of Deep Learning. However, most of the positive results study problems that are seemingly very simple, while negative results are given for problems that are extraordinarily hard. In this talk, I will focus on studying "borderline" problems - tasks for which there are known computational limitations, but at the same can be solved "efficiently" using neural networks trained with gradient descent. Investigating the behavior of neural networks at the computational limit allows us to study the "optimality" of Deep Learning, and reveals some interesting emergent phenomena
WednesdayMay 11, 202211:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nir SharonTitle:Compactification of the Rigid Motions Group in Image ProcessingAbstract:opens in new windowin html    pdfopens in new window
Image processing problems in general, particularly in the field of single-particle cryo-electron microscopy, often require considering images up to their rotations and translations. Such issues were tackled successfully for the case of rotations only, using quantities that are invariant to the action of rotations on images. However, extending these methods to cases involving translations is more complicated. This talk presents a computationally feasible and theoretically sound method of approximate invariant to the action of rotations and translations on images. Our approach allows one to reduce image processing problems to similar problems but over the sphere -- a compact domain acted on by the group of 3D rotations. We show that our invariant is induced by a family of deformed mappings, thereby compactifying the group structure of rotations and translations of the plane, i.e., the group of rigid motions, into the compact group of 3D rotations. Furthermore, we demonstrate its viability in two image processing tasks: multi-reference alignment and classification.
TuesdayMay 03, 202214:00
Machine Learning and Statistics SeminarRoom 1
Speaker:Nati Srebro Title:Learning by Overfitting: A Statistical Learning view of Interpolation LearningAbstract:opens in new windowin html    pdfopens in new window** Please note that unusual day and time **
The classic view of statistical learning tells us that we should balance model fit with model complexity instead of insisting on training error that's much lower than what we can expect to generalize to, or even lower than the noise level or Bayes error. And that this balance, and control on model complexity ensures good generalization. But in recent years we have seen that in many situations we can learn and generalize without such a balance, and despite (over?) fitting the training set well below the noise level. This has caused us to rethink the basic principles underlying statistical learning theory. In this talk I will discuss how much of our theory we can salvage and how much of it needs to be revised, focusing on the role of uniform convergence in understanding interpolation learning. Based on joint work with Lijia Zhou, Fred Koehler, and Danica Sutherland
WednesdayApr 27, 202211:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Mor Geva Pipek Title:A Trip Down Memory Lane: How Transformer Feed-Forward Layers Build Predictions in Language ModelsAbstract:opens in new windowin html    pdfopens in new window** A joint seminar with Vision and AI**
The field of natural language processing is dominated by transformer-based language models (LMs). One of the core building blocks of these models is the feed-forward network (FFN) layers, which typically account for >2/3 of the network parameters. Yet, how these layers are being utilized by the model to build predictions is largely unknown. In this talk, I will share recent findings on the operation of FFN layers in LMs, and demonstrate their utility in real-world applications. First, I will show that FFN layers can be cast as human-interpretable key-value memories, and describe how the output from each layer can be viewed as a collection of updates to the model's output distribution. Then, I will demonstrate the utility of these findings in the context of (a) controlled language generation, where we reduce the toxicity of GPT2 by almost 50%, and (b) improving computation efficiency, through a simple rule for early exit, saving 20% of computation on average.
WednesdayApr 13, 202211:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Rotem DrorTitle:A Statistical Analysis of Automatic Evaluation for SummarizationAbstract:opens in new windowin html    pdfopens in new window

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is not clear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or if it is due to random chance. In this talk, I will address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations measured in summarization. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, I will analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. In this research, we find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings. This work is published at TACL 2021: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00417/107833/A-Statistical-Analysis-of-Summarization-Evaluation.
In the second part of this talk, I will present an ongoing study that identifies two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate summarization systems in practice and propose changes to rectify this disconnect. The results from these analyses point to the need for future research to focus on developing more consistent and reliable human evaluations of summaries. 
This research was done in collaboration with Daniel Deutsch, a Ph.D. student from the Cognitive Computation Group at the Department of Computer and Information Science, University of Pennsylvania.

WednesdayDec 08, 202111:15
Machine Learning and Statistics Seminar
Speaker:Yariv AizenbudTitle:Non-parametric estimation of manifolds from noisy dataAbstract:opens in new windowin html    pdfopens in new window
A common task in many data-driven applications is to find a low dimensional manifold that describes the data accurately. Estimating a manifold from noisy samples has proven to be a challenging task. Indeed, even after decades of research, there is no (computationally tractable) algorithm that accurately estimates a manifold from noisy samples with a constant level of noise. In this talk, we will present the first method that is guaranteed to estimate a manifold and its tangent in the ambient space. Moreover, we establish rigorous convergence rates, which are essentially as good as existing convergence rates for function estimation.
WednesdayDec 01, 202111:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ron LevieTitle: A Theoretical Analysis of Generalization in Graph Convolutional Neural NetworksAbstract:opens in new windowin html    pdfopens in new window
In recent years, the need to accommodate non-Euclidean structures in data science has brought a boom in deep learning methods on graphs, leading to many practical applications with commercial impact. In this talk we will review the mathematical foundations of the generalization capabilities of graph convolutional networks (GNNs). We will focus mainly on spectral GNNs, where convolution is defined as element-wise multiplication in the frequency domain of the graph. In machine learning settings where the dataset consists of signals defined on many different graphs, the trained GNN should generalize to graphs outside the training set. A GNN is called transferable if, whenever two graphs represent the same underlying phenomenon, the GNN has similar repercussions on both graphs. Transferability ensures that GNNs generalize if the graphs in the test set represent the same phenomena as the graphs in the training set. We will discuss the different approaches to mathematically model the notions of transferability, and derive corresponding transferability error bounds, proving that GNNs have good generalization capabilities.
WednesdayNov 24, 202111:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Boris Landa Title:When Sinkhorn-Knopp Meets Marchenko-Pastur: Diagonal Scaling Reveals the Rank of a Count MatrixAbstract:opens in new windowin html    pdfopens in new window
An important task in modern data analysis is to determine the rank of a corrupted data matrix. Random matrix theory provides useful insights into this task by assuming a “signal+noise” model, where the goal is to estimate the rank of the underlying signal matrix. If the noise is homoskedastic, i.e., the noise variances are identical across all entries, the spectrum of the noise admits the celebrated Marchenko-Pastur (MP) law, providing a simple approach for rank estimation. However, in many practical situations, such as in single-cell RNA sequencing (scRNA-seq), the noise is far from being homoskedastic. In this talk, focusing on a Poisson data model, I will present a simple procedure termed biwhitening, which enforces the MP law to hold by appropriate diagonal scaling. Aside from the Poisson distribution, this procedure is extended to families of distributions with quadratic variance functions. I will demonstrate this procedure on both simulated and experimental data, showcasing accurate rank estimation in simulations and excellent fits to the MP law for real scRNA-seq datasets.
WednesdayNov 24, 202111:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Boris Landa Title:When Sinkhorn-Knopp Meets Marchenko-Pastur: Diagonal Scaling Reveals the Rank of a Count MatrixAbstract:opens in new windowin html    pdfopens in new window
An important task in modern data analysis is to determine the rank of a corrupted data matrix. Random matrix theory provides useful insights into this task by assuming a “signal+noise” model, where the goal is to estimate the rank of the underlying signal matrix. If the noise is homoskedastic, i.e., the noise variances are identical across all entries, the spectrum of the noise admits the celebrated Marchenko-Pastur (MP) law, providing a simple approach for rank estimation. However, in many practical situations, such as in single-cell RNA sequencing (scRNA-seq), the noise is far from being homoskedastic. In this talk, focusing on a Poisson data model, I will present a simple procedure termed biwhitening, which enforces the MP law to hold by appropriate diagonal scaling. Aside from the Poisson distribution, this procedure is extended to families of distributions with quadratic variance functions. I will demonstrate this procedure on both simulated and experimental data, showcasing accurate rank estimation in simulations and excellent fits to the MP law for real scRNA-seq datasets.
WednesdayJan 27, 202114:00
Machine Learning and Statistics Seminar
Speaker:Haggai Maron Title:Deep Learning on Structured and Geometric DataAbstract:opens in new windowin html    pdfopens in new windowhttps://weizmann.zoom.us/j/98859805667?pwd=dCthazViNFM1Y2FFeCtIcitORUR3Zz09

Deep Learning of structured and geometric data, such as sets, graphs, and surfaces, is a prominent research direction that has received considerable attention in the last few years. Given a learning task that involves structured data, the main challenge is identifying suitable neural network architectures and understanding their theoretical and practical tradeoffs.

This talk will focus on a popular learning setup where the learning task is invariant to a group of transformations of the input data. This setup is relevant to many popular learning tasks and data types. In the first part of the talk, I will present a general framework for designing neural network architectures based on layers that respect these transformations. In particular, I will show that these layers can be implemented using parameter-sharing schemes induced by the group. In the second part of the talk, I will demonstrate the framework’s applicability by presenting novel neural network architectures for two widely used data types: graphs and sets. I will also show that these architectures have desirable theoretical properties and that they perform well in practice.

 

https://weizmann.zoom.us/j/98859805667?pwd=dCthazViNFM1Y2FFeCtIcitORUR3Zz09
 

ThursdayJan 21, 202118:00
Machine Learning and Statistics Seminar
Speaker:Or Litany Title:Learning on Pointclouds for 3D Scene UnderstandingAbstract:opens in new windowin html    pdfopens in new windowhttps://weizmann.zoom.us/j/93960308443?pwd=b1BWMkZURjZ0THMxUisrSFpMaHNXdz09

In this talk i'll be covering several works in the topic of 3D deep learning on pointclouds for scene understanding tasks.
First, I'll describe VoteNet (ICCV 2019, best paper nomination): a method for object detection from 3D pointclouds input, inspired by the classical generalized Hough voting technique. I'll then explain how we integrated image information into the voting scheme to further boost 3D detection (ImVoteNet, CVPR 2020). In the second part of my talk I'll describe recent studies focusing on reducing supervision for 3D scene understanding tasks, including PointContrast -- a self-supervised representation learning framework for 3D pointclods (ECCV 2020). Our findings in PointContrast are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains.

 

 

https://weizmann.zoom.us/j/93960308443?pwd=b1BWMkZURjZ0THMxUisrSFpMaHNXdz09

WednesdayJan 13, 202115:15
Machine Learning and Statistics Seminar
Speaker:Barak Sober Title:Estimation of Manifolds from Point Clouds: Building Models from Data Abstract:opens in new windowin html    pdfopens in new windowhttps://weizmann.zoom.us/j/95680998523?pwd=N2tRL1dydlV3VE9uOUVTWmU3UGN1QT09

A common observation in data-driven applications is that high dimensional data has a low intrinsic dimension, at least locally. Thus, when one wishes to work with data that is not governed by a clear set of equations, but still wishes to perform statistical or other scientific analysis, an optional model is the assumption of an underlying manifold from which the data was sampled. This manifold, however, is not given explicitly but we can obtain samples of it (i.e., the individual data points). In this talk, we will consider the mathematical problem of estimating manifolds from a finite set of samples, possibly recorded with noise. Using a Manifold Moving Least-Squares approach we provide an approximant manifold with high approximation order in both the Hausdorff sense as well as in the Riemannian metric (i.e., a nearly isometry). In the case of bounded noise, we present an algorithm that is guaranteed to converge in probability when the number of samples tends to infinity. The motivation for this work is based on the analysis of the evolution of shapes through time (e.g., handwriting or primates' teeth) and we will show how this framework can be utilized to answer scientific questions in paleontology and archaeology. 

 

https://weizmann.zoom.us/j/95680998523?pwd=N2tRL1dydlV3VE9uOUVTWmU3UGN1QT09

WednesdayDec 23, 202011:15
Machine Learning and Statistics Seminar
Speaker:Yossi ArjevaniTitle:Computational Barriers in Continuous Optimization: Two Complexity Theories and One Tale of SymmetryAbstract:opens in new windowin html    pdfopens in new windowhttps://weizmann.zoom.us/j/93958126972?pwd=STg0b0JMUFI5eS9rVEd0QXpRY2V2UT09

Since the modern formulation of mathematical optimization, researchers in the field have repeatedly expanded and re-defined the realm of tractable optimization problems. This endeavor has culminated in the well-known class of convex optimization problems with applications in a wide range of scientific fields. In this talk, I will present the traditional oracle-based complexity model, which has dominated (unstructured) continuous optimization for the past 40 years, and highlight some of its successes and failures in predicting the hardness of convex optimization problems. I will then introduce a novel structural-based model aimed at addressing major oracle-based complexity gaps. The new approach is intimately tied with approximation theory, and is proven to be particularly advantageous for characterizing the complexity of optimization methods in machine learning.

Studies in recent years have indicated that the realm of tractable optimization problems may be expanded once more -- this time into that of the nonconvex world. In the second part of the talk, I will present a novel symmetry-based approach to study nonconvex optimization landscapes, and use it to address optimization-related aspects of neural networks. The approach employs techniques, new to the field, from symmetry breaking and representation theory, and carries important implications for one’s ability to argue about statistical generalization through local curvature.

https://weizmann.zoom.us/j/93958126972?pwd=STg0b0JMUFI5eS9rVEd0QXpRY2V2UT09

 

 

TuesdayDec 22, 202015:00
Machine Learning and Statistics Seminar
Speaker:Nadav DymTitle:On the injectivity and (in)stability of invariant encodingAbstract:opens in new windowin html    pdfopens in new window https://weizmann.zoom.us/j/93648867660?pwd=dVpDeHAyRk5IbGFOTWVvS3pTZnF3UT09

Quotient spaces are a natural mathematical tool to describe a variety of algorithmic problems in computer vision and related fields, where different objects are to be compared while their natural symmetries are to be ignored. The computational complications arising from these symmetries can be alleviated by mapping the given object to features invariant to the object’s symmetries. But can this be done without losing information? In this talk we will discuss this question for two different problems:

(a) Neural networks for point clouds: The natural symmetries of 3D point clouds are permutations and rigid motions. We will describe our recent work which provides a general framework for constructing such invariant networks to be lossless, in the sense that they can approximate any continuous invariant function. As a result, invariant networks such as Tensor Field Networks are shown to have universal approximation power.

(b) Phase retrieval: Phase retrieval is the problem of retrieving a signal, up to global phase, from linear measurements whose phase is lost. The phase ambiguity here can be considered as the symmetries of the problem, and the phaseless linear measurements as the invariants. We will discuss results on the injectivity and stability of phase retrieval, and particularly our results connecting phase retrieval stability to measures of graph connectivity.

To conclude, we will point out some insights which can be obtained from viewing these two different problems in the same framework.

 

 https://weizmann.zoom.us/j/93648867660?pwd=dVpDeHAyRk5IbGFOTWVvS3pTZnF3UT09

WednesdayJan 08, 202011:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yair CarmonTitle:Fundamental limits of modern machine learning and how to get around themAbstract:opens in new windowin html    pdfopens in new window

This talk presents new computational and statistical barriers in machine learning, along with the algorithmic developments that they inspire.

The computational barriers arise in nonconvex optimization: we prove lower bounds on the (oracle) complexity of finding stationary points using (stochastic) gradient methods, showing that gradient descent is unimprovable for a natural class of problems. We bypass this barrier by designing an algorithm that outperforms gradient descent for a large subclass of problems with high-order smoothness. Our algorithm leverages classical momentum techniques from convex optimization using a "convex until proven guilty" principle that we develop.

The statistical barrier is the large amount of data required for adversarially robust learning. In a Gaussian model, we prove that unlabeled data allows us to circumvent an information theoretic gap between robust and standard classification. Our analysis directly leads to a general robust self-training procedure; we use it to significantly improve state-of-the-art performance on the challenging and extensively studied CIFAR-10 adversarial robustness benchmark.

SundayJan 05, 202009:45
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon GonenTitle:Beyond Worst Case In Machine Learning: The Oracle ModelAbstract:opens in new windowin html    pdfopens in new window

In recent years there has been an increasing gap between the success of machine learning algorithms and our ability to explain their success theoretically. 

Namely, many of the problems that are solved to a satisfactory degree of precision are computationally hard in the worst case. Fortunately, there are often reasonable assumptions which help us to get around these worst-case impediments and allow us to rigorously analyze heuristics that are used in practice. 

In this talk I will advocate a complementary approach, where instead of explicitly characterizing some desired "niceness" properties of the data, we assume access to an optimization oracle that solves a relatively simpler task. This allows us to identify the sources of hardness and extend our theoretical understanding to new domains. Furthermore we will show that seemingly innocents (and arguably justifiable) modifications to the oracle can lead to tractable reductions and even to bypass hardness results. 

We demonstrate these ideas using the following results: i) An efficient algorithm for non-convex online learning using an optimization oracle. b) A faster boosting algorithm using a "simple" weak learner. iii) An efficient reduction from online to private learning.

Joint works with Naman Agarwal, Noga Alon, Elad Hazan, and Shay Moran.

SundayJan 05, 202009:45
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon GonenTitle:Beyond Worst Case In Machine Learning: The Oracle ModelAbstract:opens in new windowin html    pdfopens in new windowNOTE UNUSUAL TIME AND DAY

In recent years there has been an increasing gap between the success of machine learning algorithms and our ability to explain their success theoretically.

Namely, many of the problems that are solved to a satisfactory degree of precision are computationally hard in the worst case. Fortunately, there are often reasonable assumptions which help us to get around these worst-case impediments and allow us to rigorously analyze heuristics that are used in practice.

In this talk I will advocate a complementary approach, where instead of explicitly characterizing some desired "niceness" properties of the data, we assume access to an optimization oracle that solves a relatively simpler task. This allows us to identify the sources of hardness and extend our theoretical understanding to new domains. Furthermore we will show that seemingly innocents (and arguably justifiable) modifications to the oracle can lead to tractable reductions and even to bypass hardness results.

We demonstrate these ideas using the following results: i) An efficient algorithm for non-convex online learning using an optimization oracle. b) A faster boosting algorithm using a "simple" weak learner. iii) An efficient reduction from online to private learning.

Joint works with Naman Agarwal, Noga Alon, Elad Hazan, and Shay Moran.

WednesdayNov 20, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nathan Srebro Title:Optimization's Hidden Gift to Learning: Implicit RegularizationAbstract:opens in new windowin html    pdfopens in new window

It is becoming increasingly clear that implicit regularization afforded by the optimization algorithms play a central role in machine
learning, and especially so when using large, deep, neural networks. We have a good understanding of the implicit regularization afforded by stochastic approximation algorithms, such as SGD, for convex problem, and we understand and can characterize the implicit bias of different algorithms, and can design algorithms with specific biases. But in this talk I will focus on implicit biases of local search algorithms for non-convex underdetermined problem, such as deep networks. In an effort to uncover the implicit biases of gradient-based optimization of neural networks, which holds the key to their empirical success, I will discuss recent
work on implicit regularization for matrix factorization, linear convolutional networks, and two-layer ReLU networks, as well as a general bottom-up understanding on implicit regularization in terms of optimization geometry.

TuesdayAug 13, 201914:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Stefano SoattoTitle:The Information in the Weights of a Deep Network, and its consequences for transfer learning and critical learning periodsAbstract:opens in new windowin html    pdfopens in new windowNOTE THE UNUSUAL TIME
The information in the weights of deep networks plays a key role in understanding their behavior. When used as a regularizer during training (either explicitly or implicitly) it induces generalization through the PAC-Bayes bound. Rather than being computed and minimized explicitly, it can be directly controlled by injecting noise in the weights, a process known as Information Dropout. It can also be shown that stochastic gradient descent (SGD), when taken to the continuous limit and interpreted in the Wasserstein metric, adds the information in the weights as inductive bias, even if not explicitly present in the loss function. The Emergence Bound shows that, provided that a trained network has sufficient capacity, minimizing the information in the weights, which is a function of the training set consisting of data seen in the past, guarantees minimality, invariance, and independence of the components of the representation of the test (future) data. The trace of the information in the weights during training shows that, rather than increasingly monotonically through the learning process, as one would expect, first increases rapidly in the first few epochs, and then decreases to an asymptote that is a fraction of its peak. This unusual behavior qualitatively follows the sensitivity to critical learning periods observed in biological systems, from cats to humans, as well as recently in deep artificial networks. Together with the Emergence Bound and the PAC-Bayes bound, this shows that forgetting is a necessary part of the learning process, and happens while precision increases monotonically. The information in the weights can also be used to defined a topology in the space of tasks, and an asymmetric distance that can be used to predict the cost and performance of fine-tuning a network trained for a different task, without actually performing the experiment. These phenomena collectively point to the importance of the dynamics of learning, and suggests that studying the transient behavior can yield insight beyond those that can be gleaned from the asymptotics. Depending on the context, we use Shannon, Fisher, or Kolmogorov information to prove the results described.
WednesdayJul 03, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon CohenTitle:Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ RegretAbstract:opens in new windowin html    pdfopens in new window

The linear-quadratic regulator is a simple and ubiquitous model in optimal control. Classical results in control theory pertain to asymptotic convergence and stability of such systems. Recently, it has received renewed interest from a learning-theoretic perspective with a focus on computational tractability and finite-time convergence guarantees. Among the most challenging problems in LQ control is that of adaptive control: regulating a system with parameters that are initially unknown and have to be learned while incurring the associated costs. In its modern incarnation, this problem is approached through the lens of online learning with partial feedback (e.g., multi-armed bandits) and regret minimization. Still, recent results derive algorithms that are either computationally intractable or suffer from high regret. In this talk, I will present the first computationally-efficient algorithm with $\widetilde O(\sqrt{T})$ regret for learning in linear quadratic control systems with unknown dynamics. By that, this resolves an open question of Abbasi-Yadkori and Szepesvari (2011) and Dean, Mania, Matni, Recht, and Tu (2018). The key to the efficiency of our algorithm is in a novel reformulation of the LQ control problem as a convex semi-definite program. This is joint work with Tomer Koren and Yishay Mansour presented at ICML 2019.

WednesdayJun 19, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Daniel SoudryTitle:Theoretical and Empirical Investigation of Several Common Practices in Deep LearningAbstract:opens in new windowin html    pdfopens in new window

We examine several empirical and theoretical results on the training of deep networks. For example, Why are common "over-fitting" indicators (e.g., very low training error, high validation loss) misleading? Why, sometimes, early-stopping time never arrives? Why can adaptive rate methods (e.g., adam) degrade generalization? Why commonly used loss functions exhibit better generalization than others? Why use weight decay before batch-norm? When can we use low numerical precision, and how low can we get? and discuss the practical implications of these results.

Bio == Since October 2017, Daniel soudry is an assistant professor (Taub Fellow) in the Department of Electrical Engineering at the Technion, working in the areas of machine learning and theoretical neuroscience. Before that, he did his post-doc (as a Gruss Lipper fellow) working with Prof. Liam Paninski in the Department of Statistics, the Center for Theoretical Neuroscience the Grossman Center for Statistics of the Mind at Columbia University. He did his Ph.D. in the Department of Electrical Engineering at the Technion, Israel Institute of technology, under the guidance of Prof. Ron Meir. He received his B.Sc. degree in Electrical Engineering and Physics from the Technion.

WednesdayMay 22, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Meir FederTitle:Universal Learning for Individual DataAbstract:opens in new windowin html    pdfopens in new window

Universal learning is considered from an information theoretic point of view following the universal prediction approach pursued in the 90's by F&Merhav. Interestingly, the extension to learning is not straight-forward. In previous works we considered on-line learning and supervised learning in a stochastic setting. Yet, the most challenging case is batch learning where prediction is done on a test sample once the entire training data is observed, in the individual setting where the features and labels, both training and test, are specific individual quantities. This work provides schemes that for any individual data compete with a "genie" (or a reference) that knows the true test label. It suggests design criteria and derive the corresponding universal learning schemes. The main proposed scheme is termed Predictive Normalized Maximum Likelihood (pNML). As demonstrated, pNML learning and its variations provide robust and "stable" learning solutions that outperform the current leading approach based on Empirical Risk Minimization (ERM). Furthermore, the pNML construction provides a pointwise indication for the learnability that measures the uncertainty in learning the specific test challenge with the given training examples; hence the learner knows when it does not know. The improved performance of the pNML, the induced learnability measure and its utilization are demonstrated in several learning problems including deep neural networks models. Joint work with Yaniv Fogel and Koby Bibas

WednesdayMay 15, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Gal ElidanTitle:Learning Rules-first ClassifiersAbstract:opens in new windowin html    pdfopens in new window

Complex classifiers may exhibit "embarrassing" failures even in "easy" cases where humans can provide a simple justified explanation. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is perfectly predictable if the input contains certain features, or rules, and otherwise it is predictable by a linear classifier. We define a hypothesis class that captures this notion and determine its sample complexity. We also give evidence that efficient algorithms cannot achieve this sample complexity. We then derive a simple and efficient algorithm and show that its sample complexity is close to optimal, among efficient algorithms. Experiments on synthetic and sentiment analysis data demonstrate the efficacy of the method, both in terms of accuracy and interpretability. At the end of the talk, I will also give a teaser to demonstrate the non-intuitive nature of the more general problem of embarrassment.

WednesdayApr 10, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yossi KeshetTitle:Deep Learning: Vulnerabilities, Defenses and BeyondAbstract:opens in new windowin html    pdfopens in new window

Deep learning has been amongst the most emerging fields in computer science and engineering. In recent years it has been shown that deep networks are vulnerable to attacks by adversarial examples. I will introduce a novel flexible approach named Houdini for generating adversarial examples for complex and structured tasks and demonstrate successful attacks on different applications such as speech recognition, pose estimation, semantic image segmentation, speaker verification, and malware detection. Then I will discuss how this weakness can be turned into two secure applications. The first is a new technique for watermarking deep network models in a black-box way. That is, concealing information within the model that can be used by the owner of the model to claim ownership. The second application is a novel method to Speech Steganography, namely hiding a secret spoken message within an ordinary public spoken message. I will conclude the talk by a brief discussion of our attempts to detect such adversarial attacks, based on multiple semantic label representations.

WednesdayMar 27, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roi Livni Title:Passing Tests Without Memorizing Abstract:opens in new windowin html    pdfopens in new window

Generative Adversarial Networks (GANs) is a recent algorithmic framework that has won considerable attention.  In a nutshell, GANs receive as input an IID sample and outputs synthetic data that should resemble data from the true underlying distribution. For example, consider an algorithm that receives as input some tunes from a specific music genre (e.g. jazz, rock, pop) and then outputs a new, original, tune from that genre.  

From a theoretical perspective, the distinction between algorithms that genuinely generate original new examples vs. algorithms that perform naive manipulations (or even merely memorization) of the input sample is an elusive distinction. This makes the theoretical analysis of GANs algorithms challenging. 

In this work we introduce two mathematical frameworks for the task of generating synthetic data. The first model we consider is inspired by GANs, and the learning algorithm has only an indirect access to the target distribution via a discriminator. The second model, called DP-Foolability, exploits the notion of differential privacy as a criterion for "non-memorization".

We characterize learnability in each of these models as well as discuss the interrelations. As an application we prove that privately PAC learnable classes are DP-foolable.  As we will discuss, this can be seen as an analogue of the equivalence between uniform convergence and learnability in classical PAC learning.

Joint work with Olivier Bousquet and Shay Moran.
https://arxiv.org/pdf/1902.03468.pdf

WednesdayJan 30, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ronen TalmonTitle: Data Analysis with the Riemannian Geometry of Symmetric Positive-Definite MatricesAbstract:opens in new windowin html    pdfopens in new window

Recently, the Riemannian geometry of the space of symmetric positive-definite matrices has become a central ingredient in a broad range of data analysis and learning tools. Broadly, it enables to observe complex high-dimensional data through the lens of objects with a known non-Euclidean geometry. In this talk, I will present a method for domain adaptation using parallel transport on the cone manifold of symmetric positive-definite matrices. Using Riemannian geometry, I will show the theoretical guarantees and discuss the benefits of the presented method. In addition, I will demonstrate these benefits on simulations as well as on real-measured data. While the majority of the talk will be focused on the particular symmetric positive-definite covariance matrices, I will present generalizations to kernels, diffusion operators, and graph-Laplacians and show intriguing properties using spectral analysis with application to source separation and multimodal data fusion.
* Joint work with Or Yair, Ori Katz, and Miri Ben-Chen

WednesdayJan 23, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amir WeissTitle:The Gaussian Maximum Likelihood Approach for Independent Component and Vector AnalysisAbstract:opens in new windowin html    pdfopens in new window

The Blind Source Separation (BSS) problem consists of retrieving signals of interest, termed the sources, from a dataset consisting of their mixtures. One of the most popular and common paradigms for solving the BSS problem is Independent Component Analysis (ICA), where the sources are assumed to be (only) mutually statistically independent random processes, and the mixtures are assumed to be linear combinations thereof, where the linear mixing operator is unknown. In this talk, we shall start with the Gaussian Maximum Likelihood (GML) approach for the semi-blind problem, in which the sources are assumed to be temporally-diverse Gaussian processes. Based on the principles of this approach, we shall then discuss two extensions. First, the noisy Gaussian ICA problem, for which two asymptotically optimal solutions, w.r.t. two different optimality criteria, will be presented. We shall see (both analytically and empirically) that these solutions possess attractive properties even for non-Gaussian mixtures. Then, we shall consider the Independent Vector Analysis (IVA) framework, which has emerged in recent years as an extension of ICA into multiple datasets of mixtures. In IVA, the sources in each set are independent, but may depend on sources in the other sets. We will show that in IVA, the GML approach leads to consistent separation regardless of the sources' distributions.

WednesdayJan 16, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nir BillfeldTitle: Semiparametric Wavelet-based JPEG IV Estimator for Endogenously Truncated DataAbstract:opens in new windowin html    pdfopens in new window

We show that when data are endogenously truncated the widely-used IV fails to render the relationship causal as well as introduces bias into the exogenous covariates. We offer a newly-introduced semiparametric biorthogonal wavelet-based JPEG IV estimator and its associated symmetry preserving kernel, which is closely related to object recognition methods in Artificial Intelligence. The newly-introduced enriched JPEG algorithm is a denoising tool amenable for identifying redundancies in a sequence of irregular noisy data points which also accommodates a reference-free criterion function for optimal denoising. This is suitable for situations where the original data distribution is unobservable such as in the case of endogenous truncation. This estimator corrects both biases, the one generated by endogenous truncation and the one generated by endogenous covariates, by means of denoising. We introduce a multifocal variant of the local GMM (MFGMM) estimator to establish jointly the entire parameter set asymptotic properties. Using Monte Carlo simulations attest to very high accuracy of our offered semiparametric JPEG IV estimator as well as high efficiency as reflected by root n consistency. These results have emerged from utilizing 2,000,000 different distribution functions, generating 100 million realizations to construct the various data sets.

WednesdayJan 09, 201911:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan BelinkovTitle:Deep Learning Models for Language: What they learn, where they fail, and how to make them more robustAbstract:opens in new windowin html    pdfopens in new window

Deep learning has become pervasive in everyday life, powering language applications like Apple's Siri, Amazon's Alexa, and Google Translate. The inherent limitation of these deep learning systems, however, is that they often function as a "black box," preventing researchers and users from discerning the roles of different components and what they learn during the training process. In this talk, I will describe my research on interpreting deep learning models for language along three lines. First, I will present a methodological framework for investigating how these models capture various language properties. The experimental evaluation will reveal a learned hierarchy of internal representations in deep models for machine translation and speech recognition. Second, I will demonstrate that despite their success, deep models of language fail to deal even with simple kinds of noise, of the type that humans are naturally robust to. I will then propose simple methods for improving their robustness to noise. Finally, I will turn to an intriguing problem in language understanding, where dataset biases enable trivial solutions to complex language tasks. I will show how to design models that are more robust to such biases, and learn less biased latent representations.

WednesdayDec 26, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nadav CohenTitle:On Optimization and Expressiveness in Deep LearningAbstract:opens in new windowin html    pdfopens in new window

Understanding deep learning calls for addressing three fundamental questions: expressiveness, optimization and generalization. Expressiveness refers to the ability of compactly sized deep neural networks to represent functions capable of solving real-world problems. Optimization concerns the effectiveness of simple gradient-based algorithms in solving non-convex neural network training programs. Generalization treats the phenomenon of deep learning models not overfitting despite having much more parameters than examples to learn from. This talk will describe a series of works aimed at unraveling some of the mysteries behind optimization and expressiveness. I will begin by discussing recent analyses of optimization for deep linear neural networks. By studying the trajectories of gradient descent, we will derive the most general guarantee to date for efficient convergence to global minimum of a gradient-based algorithm training a deep network. Moreover, in stark contrast to conventional wisdom, we will see that, sometimes, gradient descent can train a deep linear network faster than a classic linear model. In other words, depth can accelerate optimization, even without any gain in expressiveness, and despite introducing non-convexity to a formerly convex problem. In the second (shorter) part of the talk, I will present an equivalence between convolutional and recurrent networks --- the most successful deep learning architectures to date --- and hierarchical tensor decompositions. The equivalence brings forth answers to various questions concerning expressiveness, resulting in new theoretically-backed tools for deep network design.
Optimization works covered in this talk were in collaboration with Sanjeev Arora, Elad Hazan, Noah Golowich and Wei Hu. Expressiveness works were with Amnon Shashua, Or Sharir, Yoav Levine, Ronen Tamari and David Yakira.

WednesdayDec 12, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roy SchwartzTitle:Towards Interpretable Deep Learning for Natural Language ProcessingAbstract:opens in new windowin html    pdfopens in new window

Despite their superb empirical performance, deep learning models for natural language processing (NLP) are often considered black boxes, as relatively little is known as to what accounts for their success. This lack of understanding turns model development into a slow and expensive trial-and-error process, which limits many researchers from developing state-of-the-art models. Customers of deep learning also suffer from this lack of understanding, because they are using tools that they cannot interpret. In this talk I will show that many deep learning models are much more understandable than originally thought. I will present links between several deep learning models and classical NLP models: weighted finite-state automata. As the theory behind the latter is well studied, these findings allow for the development of more interpretable and better-performing NLP models. As a case study, I will focus on convolutional neural networks (ConvNets), one of the most widely used deep models in NLP. I will show that ConvNets are mathematically equivalent to a simple, linear chain weighted finite-state automaton. By uncovering this link, I will present an extension of ConvNets that is both more robust and more interpretable than the original model. I will then present similar observations regarding six recently introduced recurrent neural network (RNN) models, demonstrating the empirical benefits of these findings to the performance of NLP systems.

This is joint work with Hao Peng, Sam Thomson and Noah A. Smith

WednesdayNov 28, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Michael KimTitle:Fairness through Computationally-Bounded AwarenessAbstract:opens in new windowin html    pdfopens in new window

As algorithmic prediction systems have become more widespread, so too have concerns that these systems may be discriminatory against groups of people protected by laws and ethics. We present a recent line of work that takes a complexity theoretic perspective towards combating discrimination in prediction systems. We'll focus on fair classification within the versatile framework of Dwork et al. [ITCS'12], which assumes the existence of a metric that measures similarity between pairs of individuals. Unlike earlier work, we do not assume that the entire metric is known to the learning algorithm; instead, the learner can query this metric a bounded number of times. We propose a new notion of fairness called *metric multifairness* and show how to achieve this notion in our setting. Metric multifairness is parameterized by a similarity metric d on pairs of individuals to classify and a rich collection C of (possibly overlapping) "comparison sets" over pairs of individuals. At a high level, metric multifairness guarantees that *similar subpopulations are treated similarly*, as long as these subpopulations are identified within the class C.

WednesdayNov 14, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Kfir LevyTitle:Beyond SGD: Data Adaptive Methods for Machine Learning Abstract:opens in new windowin html    pdfopens in new window

The tremendous success of the Machine Learning paradigm heavily relies on the development of powerful optimization methods. The canonical algorithm for training learning models is SGD (Stochastic Gradient Descent), yet this method has its limitations. It is often unable to exploit useful statistical/geometric structure, it might degrade upon encountering prevalent non-convex phenomena, and it is hard to parallelize. In this talk I will discuss an ongoing line of research where we develop alternative methods that resolve some of SGD"s limitations. The methods that I describe are as efficient as SGD, and implicitly adapt to the underlying structure of the problem in a data dependent manner.

In the first part of the talk, I will discuss a method that is able to take advantage of hard/easy training samples. In the second part, I will discuss a method that enables an efficient parallelization of SGD. Finally, I will briefly describe a method that implicitly adapts to the smoothness and noise properties of the learning objective.

WednesdayNov 07, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir Bendory Title:Estimation in extreme noise levels with application to cryo-electron microscopyAbstract:opens in new windowin html    pdfopens in new window

Single-particle cryo-electron microscopy (cryo-EM) is an innovative technology for elucidating structures of biological molecules at atomic-scale resolution. In a cryo-EM experiment, tomographic projections of a molecule, taken at unknown viewing directions, are embedded in highly noisy images at unknown locations. The cryo-EM problem is to estimate the 3-D structure of a molecule from these noisy images.

Inspired by cryo-EM, the talk will focus on two estimation problems: multi-reference alignment and blind deconvolution. These problems abstract away much of the intricacies of cryo-EM, while retaining some of its essential features. In multi-reference alignment, we aim to estimate a signal from its noisy, rotated observations. While the rotations and the signal are unknown, the goal is only to estimate the signal. In the blind deconvolution problem, the goal is to estimate a signal from its convolution with an unknown, sparse signal in the presence of noise. Focusing on the low SNR regime, I will propose the method of moments as a computationally efficient estimation framework for both problems and will introduce its properties. In particular, I will show that the method of moments allows estimating the sought signal accurately in any noise level, provided sufficiently many observations are collected, with only one pass over the data. I will then argue that the same principles carry through to cryo-EM, show examples, and draw potential implications.

WednesdayOct 17, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Adam KapelnerTitle:Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial ExperimentsAbstract:opens in new windowin html    pdfopens in new window

There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This talk seeks to understand the benefits and the costs over classic randomization in the context of different performance criterions such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal designs in different scenarios in practice. Until then, it is best to randomize.

WednesdayJul 04, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alexander Volfovsky Title:Causal inference from complex observational data Abstract:opens in new windowin html    pdfopens in new window

A classical problem in causal inference is that of matching treatment units to control units in an observational dataset. This problem is distinct from simple estimation of treatment effects as it provides additional practical interpretability of the underlying causal mechanisms that is not available without matching. Some of the main challenges in developing matching methods arise from the tension among (i) inclusion of as many covariates as possible in defining the matched groups, (ii) having matched groups with enough treated and control units for a valid estimate of average treatment effect in each group, (iii) computing the matched pairs efficiently for large datasets, and (iv) dealing with complicating factors such as non-independence among units. We propose the Fast Large-scale Almost Matching Exactly (FLAME) framework to tackle these problems for categorical covariates. At its core this framework proposes an optimization objective for match quality that captures covariates that are integral for making causal statements while encouraging as many matches as possible. We demonstrate that this framework is able to construct good matched groups on relevant covariates and further extend the methodology to incorporate continuous and other complex covariates. related papers: https://arxiv.org/abs/1707.06315, https://arxiv.org/abs/1806.06802

WednesdayJun 27, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir Bendory Title:Estimation in low SNR environment and under group action with application to cryo-EMAbstract:opens in new windowin html    pdfopens in new window

Cryo-electron microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology, enabling reconstruction of molecules at near-atomic resolution.  
Cryo-EM produces a large number of noisy two-dimensional tomographic projection images of a molecule, taken at unknown viewing directions. 
The extreme levels of noise make classical tasks in statistics and signal processing, such as alignment, detection and clustering, very challenging. 
I will start the talk by studying the multi-reference alignment problem, which can be interpreted as a simplified model for cryo-EM. In multi-reference alignment, we aim to estimate multiple signals from circularly-translated, unlabeled, noisy copies. 
In high noise regimes, the measurements cannot be aligned or clustered. Nonetheless, accurate and efficient estimation can be achieved via group-invariant representations (invariant polynomials). Furthermore, such estimators achieve the optimal estimation rate. 
Then, I will show how this framework can be applied to the problem of 2-D classification in cryo-EM. In the last part of the talk, I will introduce the analog invariants of the cryo-EM problem and discuss how they can be used for ab initio modeling.
 

MondayJun 25, 201816:00
Machine Learning and Statistics SeminarRoom 155
Speaker:Ariel Jaffe / Phd defenceTitle:Spectral methods for unsupervised ensemble learning and latent variable modelsAbstract:opens in new windowin html    pdfopens in new windowNOTE SPECIAL DATE, TIME AND ROOM

The use of latent variables in probabilistic modeling is a standard approach in numerous data analysis applications. In recent years, there has been a surge of interest in spectral methods for latent variable models, where inference is done by analyzing the lower order moments of the observed data. In contrast to iterative approaches such as the EM algorithm, under appropriate conditions spectral methods are guaranteed to converge to the true model parameters given enough data samples.
The focus of the seminar is the development of novel spectral based methods for two problems in statistical machine learning. In the first part, we address unsupervised ensemble learning, where one obtains predictions from different sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to directly assess it. We develop algorithms to estimate the reliability of the classifiers based on a common assumption that different classifiers make statistically independent errors. In addition, we show how one can detect subsets of classifiers that strongly violate the model of independent errors, in a fully unsupervised manner.
In the second part of the seminar we show how one can use spectral methods to learn the parameters of binary latent variable models. This model has many applications such as overlapping clustering and Gaussian-Bernoulli restricted Boltzmann machines. Our methods are based on computing the eigenvectors of both the second and third moments of the observed variables.
For both problems, we show that spectral based methods can be applied effectively, achieving results that are state of the art in various problems in computational biology and population genetics.

WednesdayMay 09, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Richard Olshen Title:V(D)J Diversity of Subsets of T and B Cells, Statistics and Probability Abstract:opens in new windowin html    pdfopens in new window

This talk will include an introduction to the topic of V(D)J rearrangements of particular subsets of T cells and B cells of the adaptive human immune system, in particular of IgG heavy chains. There are many statistical problems that arise in trying to understand these cells. They involve estimating aspects of functionals of discrete probabilities on (random) finite sets. Topics include but are not limited to exchangeability, estimating non-centrality parameters, and estimating covariance matrices from what are called "replicates" that have been amplified by the PCR process and (partially) sequenced.

I have received considerable assistance from Lu Tian, and also Yi Liu; as well, I have been helped considerably by Andrew Fire and Scott Boyd, and also Jorg Goronzy

WednesdayApr 11, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ethan Fetaya Title:Neural Relational Inference for Interacting Systems Abstract:opens in new windowin html    pdfopens in new window

Interacting systems are prevalent in nature, from dynamical systems in physics to complex societal dynamics. In this talk I will introduce our neural relational inference model: an unsupervised model that learns to infer interactions while simultaneously learning the dynamics purely from observational data. Our model takes the form of a variational auto-encoder, in which the latent code represents the underlying interaction graph and the reconstruction is based on graph neural networks.

WednesdayJan 10, 201811:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yonatan BelinkovTitle:Understanding Internal Representations in Deep Learning Models for Language and Speech ProcessingAbstract:opens in new windowin html    pdfopens in new window

Language technology has become pervasive in everyday life, powering applications like Apple's Siri or Google's Assistant. Neural networks are a key component in these systems thanks to their ability to model large amounts of data. Contrary to traditional systems, models based on deep neural networks (a.k.a. deep learning) can be trained in an end-to-end fashion on input-output pairs, such as a sentence in one language and its translation in another language, or a speech utterance and its transcription. The end-to-end training paradigm simplifies the engineering process while giving the model flexibility to optimize for the desired task. This, however, often comes at the expense of model interpretability: understanding the role of different parts of the deep neural network is difficult, and such models are often perceived as "black-box". In this work, we study deep learning models for two core language technology tasks: machine translation and speech recognition. We advocate an approach that attempts to decode the information encoded in such models while they are being trained. We perform a range of experiments comparing different modules, layers, and representations in the end-to-end models. Our analyses illuminate the inner workings of end-to-end machine translation and speech recognition systems, explain how they capture different language properties, and suggest potential directions for improving them. The methodology is also applicable to other tasks in the language domain and beyond.

WednesdayDec 13, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roi Livni Title:Overcoming Intractability in LearningAbstract:opens in new windowin html    pdfopens in new window

Machine learning has recently been revolutionized by the introduction of Deep Neural Networks. However, from a theoretical viewpoint these methods are still poorly understood. Indeed the key challenge in Machine Learning today is to derive rigorous results for optimization and generalization in deep learning. In this talk I will present several tractable approaches to training neural networks. At the second part I will discuss a new sequential algorithm for decision making that can take into account the structure in the action space and is more tuned with realistic decision making scenarios.

I will present our work that provides some of the first positive results and yield new, provably efficient, and practical algorithms for training certain types of neural networks. In a second work I will present a new online algorithm that learns by sequentially sampling random networks and asymptotically converges, in performance, to the optimal network. Our approach improves on previous random features based learning in terms of sample/computational complexity, and expressiveness. In a more recent work we take a different perspective on this problem. I will provide sufficient conditions that guarantee tractable learning, using the notion of refutation complexity. I will then discuss how this new idea can lead to new interesting generalization bounds that can potentially explain generalization in settings that are not always captured by classical theory.

In the setting of reinforcement learning I will present a recently developed new algorithm for decision making in a metrical action space. As an application, we consider a dynamic pricing problem in which a seller is faced with a stream of patient buyers. Each buyer buy at the lowest price in a certain time window. We use our algorithm to achieve an optimal regret, improving on previously known regret bound.

WednesdayNov 29, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Roy Lederman Title:Inverse Problems and Unsupervised Learning with applications to Cryo-Electron Microscopy (cryo-EM)Abstract:opens in new windowin html    pdfopens in new window

Cryo-EM is an imaging technology that is revolutionizing structural biology; the Nobel Prize in Chemistry 2017 was recently awarded to Jacques Dubochet, Joachim Frank and Richard Henderson "for developing cryo-electron microscopy for the high-resolution structure determination of biomolecules in solution".
Cryo-electron microscopes produce a large number of very noisy two-dimensional projection images of individual frozen molecules. Unlike related methods, such as computed tomography (CT), the viewing direction of each image is unknown. The unknown directions, together with extreme levels of noise and additional technical factors, make the determination of the structure of molecules challenging.
While other methods for structure determination, such as x-ray crystallography and nuclear magnetic resonance (NMR), measure ensembles of molecules together, cryo-EM produces measurements of individual molecules. Therefore, cryo-EM could potentially be used to study mixtures of different conformations of molecules. Indeed, current algorithms have been very successful at analyzing homogeneous samples, and can recover some distinct conformations mixed in solutions, but, the determination of multiple conformations, and in particular, continuums of similar conformations (continuous heterogeneity), remains one of the open problems in cryo-EM.
I will discuss a one-dimensional discrete model problem, Heterogeneous Multireference Alignment, which captures many of the group properties and other mathematical properties of the cryo-EM problem. I will then discuss different components which we are introducing in order to address the problem of continuous heterogeneity in cryo-EM: 1. "hyper-molecules", the first mathematical formulation of truly continuously heterogeneous molecules, 2. The optimal representation of objects that are highly concentrated in both the spatial domain and the frequency domain using high-dimensional prolate spheroidal functions, and 3. Bayesian algorithms for inverse problems with an unsupervised-learning component for recovering such hyper-molecules in cryo-EM.

WednesdayNov 15, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ilya SoloveychikTitle:Group Symmetric Robust Covariance EstimationAbstract:opens in new windowin html    pdfopens in new window

Covariance matrix estimation is essential in many areas of modern Statistics and Machine Learning including Graphical Models, Classification/Discriminant Analysis, Principal Component Analysis, and many others. Classical statistics suggests using Sample Covariance Matrix (SCM) which is a Maximum Likelihood Estimator (MLE) in the Gaussian populations. Real world data, however, usually exhibits heavy-tailed behavior and/or contains outliers, making the SCM non-efficient or even useless. This problem and many similar ones gave rise to the Robust Statistics field in early 60s, where the main goal was to develop estimators stable under reasonable deviations from the basic Gaussian assumptions. One of the most prominent robust covariance matrix estimators was introduced and thoroughly studied by D. Tyler in the mid-80s. This important representative of the family of M-estimators can be defined as an MLE of a certain population. The problem of robust covariance estimation becomes even more involved in the high-dimensional scenario, where the number of samples n is of the order of the dimension p, or even less. In such cases, prior knowledge, often referred to as structure, is utilized to decrease the number of degrees of freedom and make the estimation possible. Unlike the Gaussian setting, in Tyler's case even imposition of linear structure becomes challenging due to the non-convexity of the negative log-likelihood. Recently, Tyler's target function was shown to become convex under a certain change of metric (geodesic convexity), which stimulated further investigation of the estimator.

In this work, we focus on the so-called group symmetry structure, which essentially means that the true covariance matrix commutes with a group of unitary matrices. In engineering applications such structures appear due to the natural symmetries of the physical processes; examples include circulant, perHermitian, proper quaternion matrices, etc. Group symmetric constraints are linear, and thus convex in the regular Euclidean metric. We show that they are also convex in the geodesic metric. These properties allow us to develop symmetric versions of the SCM and Tyler's estimator and build a general framework for their performance analysis. The classical results claim that at least n = p and n = p+1 samples in general position are necessary to ensure the existence and uniqueness of the SCM and Tyler's estimator, respectively. We significantly improve the sample complexity requirements for both estimators under the symmetry structure and show that in some cases even 1 or 2 samples are enough to guarantee the existence and uniqueness regardless of the ambient dimension.

WednesdayJul 05, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ronen TalmonTitle:Common Manifold Learning with Alternating DiffusionAbstract:opens in new windowin html    pdfopens in new window

We consider the problem of hidden common manifold extraction from multiple data sets, which have observation-specific distortions and artifacts. A new manifold learning method is presented based on alternating products of diffusion operators and local kernels. We provide theoretical analysis showing that our method is able to build a variant of the Laplacian of the hidden common manifold, while suppressing the observation-specific artifacts. The generality of this method is demonstrated in data analysis applications, where different types of devices are used to measure the same activity. In particular, we present applications to problems in biomedicine, neuroscience, and audio analysis. 
This is joint work with Roy Lederman and Hau-tieng Wu.

WednesdayJun 21, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Raja Giryes Title:On the relationship between structure in the data and what deep learning can learnAbstract:opens in new windowin html    pdfopens in new window

The past five years have seen a dramatic increase in the performance of recognition systems due to the introduction of deep architectures for feature learning and classification. However, the mathematical reasons for this success remain elusive. In this talk we will briefly survey some existing theory of deep learning. In particular, we will focus on data structure based theory and discuss two recent developments. 
The first work studies the generalization error of deep neural network. We will show how the generalization error of deep networks can be bounded via their classification margin. We will also discuss the implications of our results for the regularization of the networks. For example, the popular weight decay regularization guarantees the margin preservation, but it leads to a loose bound to the classification margin. We show that a better regularization strategy can be obtained by directly controlling the properties of the network's Jacobian matrix. 
The second work focuses on solving minimization problems with neural networks. Relying on recent recovery techniques developed for settings in which the desired signal belongs to some low-dimensional set, we show that using a coarse estimate of this set leads to faster convergence of certain iterative algorithms with an error related to the accuracy of the set approximation. Our theory ties to recent advances in sparse recovery, compressed sensing and deep learning. In particular, it provides an explanation for the successful approximation of the ISTA (iterative shrinkage and thresholding algorithm) solution by neural networks with layers representing iterations. 

Joint work with Guillermo Sapiro, Miguel Rodrigues, Jure Sokolic, Alex Bronstein and Yonina Eldar. 

WednesdayMay 17, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Aryeh KontorovichTitle:Mixing Time Estimation in Reversible Markov Chains from a Single Sample PathAbstract:opens in new windowin html    pdfopens in new window

We propose a procedure (the first of its kind) for computing a fully data-dependent interval that traps the mixing time t_mix of a finite reversible ergodic Markov chain at a prescribed confidence level. The interval is computed from a single finite-length sample path from the Markov chain, and does not require the knowledge of any parameters of the chain. This stands in contrast to previous approaches, which either only provide point estimates, or require a reset mechanism, or additional prior knowledge.

The interval is constructed around the relaxation time t_relax, which is strongly related to the mixing time, and the width of the interval converges to zero roughly at a sqrt{n} rate, where n is the length of the sample path. Upper and lower bounds are given on the number of samples required to achieve constant-factor multiplicative accuracy. The lower bounds indicate that, unless further restrictions are placed on the chain, no procedure can achieve this accuracy level before seeing each state at least \Omega(t_relax) times on the average. Future directions of research are identified. Time permitting, we will mention some recent further developments by D. Levin and Y. Peres.

Joint work with Daniel Hsu and Csaba Szepesvari.

WednesdayApr 19, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Naftali TishbyTitle:A Deeper Understanding of Deep LearningAbstract:opens in new windowin html    pdfopens in new window

By analytical and numerical studies of Deep Neural Networks (using standard TensorFlow) in the "Information Plane" - the Mutual Information the network layers preserve on the input and the output variables - we obtain the following new insights.

  1. The training epochs, for each layer, are divided into two phases: (1) fitting the training data - increasing the mutual information on the labels; (2) compressing the representation - reducing the mutual information on the inputs. The layers are learnt hierarchically, from the bottom to the top layer, with some overlaps.
  2. Most (~80%) of the training time  - optimization with SGD -  is spent on compressing the representation (the second phase) - NOT on fitting the training data labels, even when the training has no regularization or terms that directly aim at such compression.  
  3. The convergence point, FOR EVERY HIDDEN LAYER, lies on or very close to the Information Bottleneck IB) theoretical bound. Thus, the mappings from the input to the hidden layer and from the hidden layer to the output obey the IB self-consistent equations for some value of the compression-prediction tradeoff.
  4. The main benefit of adding more hidden layers is in the optimization/training time, as the compression phase for each layer amounts to relaxation to a Maximum conditional Entropy state, subject to the proper constraints on the error/information on the labels. As such relaxation takes super-linear time in the compressed entropy, adding more hidden layers dramatically reduces the training time. There is also benefit in sample complexity to adding hidden layers, but this is a smaller effect.

I will explain these new observations and the benefits of exploring Deep Learning in the "Information Plane", and discuss some of the exciting theoretical and practical consequences of our analysis.

Joint work with Ravid Ziv and Noga Zaslavsky.

WednesdayFeb 08, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Haim AvronTitle:Large-scale and Non-approximate Kernel Methods Using Random FeaturesAbstract:opens in new windowin html    pdfopens in new window
Kernel methods constitute a mathematically elegant framework for general-purpose infinite-dimensional non-parametric statistical inference. By providing a principled framework to extend classical linear statistical techniques to non-parametric modeling, their applications span the entire spectrum of statistical learning. However, training procedures naturally derived via this framework scale poorly and with limited opportunities for parallelization. This poor scalability poses a significant barrier for the use of kernel methods in big data applications. As such, with the growth in data across a multitude of applications, scaling up kernel methods has acquired renewed and somewhat urgent significance. Random feature maps, such as random Fourier features, have recently emerged as a powerful technique for speeding up and scaling the training of kernel-based methods. However, random feature maps only provide crude approximations to the kernel function, so delivering state-of-the-art results requires huge amount of random features. Nevertheless, in some cases, even when the number of random features is driven to be as large as the training size, full recovery of the generalization performance of the exact kernel method is not attained. In the talk I will show how random feature maps can be used to efficiently perform non-approximate kernel ridge regression, and thus there is no need to compromise between quality and running time. The core idea is to use random feature maps to form preconditioners to be used in solving kernel ridge regression to high accuracy. I will describe theoretical conditions on when this yields an effective preconditioner, and empirically evaluate the method and show it is highly effective for datasets of up to one million training examples.
WednesdayFeb 01, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Ran Gilad-Bachrach Title:CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and AccuracyAbstract:opens in new windowin html    pdfopens in new window
Applying machine learning to a problem which involves medical, financial, or other types of sensitive data, not only requires accurate predictions but also careful attention to maintaining data privacy and security. Legal and ethical requirements may prevent the use of cloud-based machine learning solutions for such tasks. In this work, we will present a method to convert learned neural networks to CryptoNets, neural networks that can be applied to encrypted data. This allows a data owner to send their data in an encrypted form to a cloud service that hosts the network. The encryption ensures that the data remains confidential since the cloud does not have access to the keys needed to decrypt it. Nevertheless, we will show that the cloud service is capable of applying the neural network to the encrypted data to make encrypted predictions, and also return them in encrypted form. These encrypted predictions can be sent back to the owner of the secret key who can decrypt them. Therefore, the cloud service does not gain any information about the raw data nor about the prediction it made. We demonstrate CryptoNets on the MNIST optical character recognition tasks. CryptoNets achieve 99% accuracy and can make around 59000 predictions per hour on a single PC. Therefore, they allow high throughput, accurate, and private predictions. This is a joint work with Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, John Wernsing.
WednesdayJan 25, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amir Globerson Title:Variational Conditional ProbabilitiesAbstract:opens in new windowin html    pdfopens in new window
Predicting the label Y of an object X is a core task in machine learning. From a probabilistic perspective, this involves reasoning about conditional probabilities p(y|x). However, it is hard to obtain reliable estimates for these probabilities. Here we show how to obtain lower and upper bounds on p(y|x) given statistical information, and show how it can be used within various learning setups. We also extend this formulation to the structured case, where y can be multivariate.
WednesdayJan 04, 201711:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Sivan SabatoTitle:Active Nearest-Neighbor Learning in Metric SpacesAbstract:opens in new windowin html    pdfopens in new window
We propose a pool-based non-parametric active learning algorithm for general metric spaces, which outputs a nearest-neighbor classifier. We give prediction error guarantees that depend on the noisy-margin properties of the input sample, and are competitive with those obtained by previously proposed passive learners. We prove that the label complexity of the new algorithm is significantly lower than that of any passive learner with similar error guarantees. Our algorithm is based on a generalized sample compression scheme and a new label-efficient active model-selection procedure. Based on joint work with Aryeh Kontorovich and Ruth Urner.
MondayDec 26, 201611:45
Machine Learning and Statistics SeminarRoom 141
Speaker:Elad Hazan Title:A Non-generative Framework and Convex Relaxations for Unsupervised LearningAbstract:opens in new windowin html    pdfopens in new windowNote unusual time and place
We will describe a novel theoretical framework for unsupervised learning which is not based on generative assumptions. It is comparative, and allows to avoid known computational hardness results and improper algorithms based on convex relaxations. We show how several families of unsupervised learning models, which were previously only analyzed under probabilistic assumptions and are otherwise provably intractable, can be efficiently learned in our framework by convex optimization. These includes dictionary learning and learning of algebraic manifolds. Joint work with Tengyu Ma. === Bio === Elad Hazan is a professor of computer science at Princeton university. His research focuses on the design and analysis of algorithms for basic problems in machine learning and optimization. Amongst his contributions are the co-development of the AdaGrad algorithm for training learning machines, and the first sublinear-time algorithms for convex optimization. He is the recipient of (twice) the IBM Goldberg best paper award in 2012 for contributions to sublinear time algorithms for machine learning, and in 2008 for decision making under uncertainty, a European Research Council grant, a Marie Curie fellowship and a Google Research Award (twice). He served on the steering committee of the Association for Computational Learning and has been program chair for COLT 2015.
WednesdayDec 14, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Alon CohenTitle:Online Learning with Feedback Graphs Without the GraphsAbstract:opens in new windowin html    pdfopens in new window

We study an online learning framework introduced by Mannor and Shamir (2011) in which the feedback is specified by a graph, in a setting where the graph may vary from round to round and is \emph{never fully revealed} to the learner. We show a large gap between the adversarial and the stochastic cases. In the adversarial case, we prove that even for dense feedback graphs, the learner cannot improve upon a trivial regret bound obtained by ignoring any additional feedback besides her own loss. In contrast, in the stochastic case we give an algorithm that achieves $\widetilde \Theta(\sqrt{\alpha T})$ regret over $T$ rounds, provided that the independence numbers of the hidden feedback graphs are at most $\alpha$. completely unlearnable. We also extend our results to a more general feedback model, in which the learner does not necessarily observe her own loss, and show that, even in simple cases, concealing the feedback graphs might render the problem unlearnable.

WednesdayNov 16, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Dan Garber Title:Faster Projection-free Machine Learning and OptimizationAbstract:opens in new windowin html    pdfopens in new window

Projected gradient descent (PGD), and its close variants, are often considered the methods of choice for solving a large variety of machine learning optimization problems, including empirical risk minimization, statistical learning, and online convex optimization. This is not surprising, since PGD is often optimal in a very appealing information-theoretic sense. However, for many problems PGD is infeasible both in theory and practice since each step requires to compute an orthogonal projection onto the feasible set. In many important cases, such as when the feasible set is a non-trivial polytope, or a convex surrogate for a low-rank structure, computing the projection is computationally inefficient in high-dimensional settings. An alternative is the conditional gradient method (CG), aka Frank-Wolfe algorithm, that replaces the expensive projection step with a linear optimization step over the feasible set. Indeed in many problems of interest, the linear optimization step admits much more efficient algorithms than the projection step, which is the reason to the substantial regained interest in this method in the past decade. On the downside, the convergence rates of the CG method often fall behind that of PGD and its variants. 
In this talk I will survey an ongoing effort to design CG variants that on one hand enjoy the cheap iteration complexity of the original method, and on the other hand converge provably faster, and are applicable to a wider variety of machine learning settings. In particular I will focus on the cases in which the feasible set is either a polytope or a convex surrogate for low-rank matrices. Results will be demonstrated on applications including: LASSO, video co-localization, optical character recognition, matrix completion, and multi-class classification.

WednesdayJun 08, 201611:15
Machine Learning and Statistics SeminarRoom 290C
Speaker:Daniel SoudryTitle:No bad local minima: Data independent training error guarantees for multilayer neural networksAbstract:opens in new windowin html    pdfopens in new window
We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
WednesdayJun 01, 201611:15
Machine Learning and Statistics SeminarRoom 290C
Speaker:Shalom LappinTitle:Deep Learning and Semantic Interpretation of Natural LanguageAbstract:opens in new windowin html    pdfopens in new window
Classical approaches to formal and computational semantics assign values to the terminal elements of hierarchical syntactic structures and define combinatorial operations on the semantic representations of phrases to compute the values of sentences. While these approaches offer formally elegant models of interpretation, they have not produced wide coverage systems. They do not provide for semantic learning. They have also not succeeded in integrating lexical and compositional semantics in an interesting or computationally efficient way. Recent developments in image caption generation suggest an alternative approach, which can overcome these difficulties. This work formulates the problem of matching images with descriptions as a task in machine translation. Deep neural networks use an encoder to map regions of pixels in an image to vector representations of graphic features, and a decoder to align these features with the distributional vectors of lexical and phrasal items. This approach can be generalized to deep neural networks that identify correspondences between multi-modal data structures and sentences. To the extent that this research program is successful, it will satisfy the core objective of the classical formal semantic program. It will assign truth (fulfilment) conditions to the sentences of a language, where these conditions are specified in terms of multi-modal representations of situations (scenes) in the world. These correspondences are generated not by a recursive definition of a truth predicate in a formal semantic theory, but by an extended deep neural language model.
WednesdayMay 18, 201611:15
Machine Learning and Statistics Seminar
Speaker:Abraham WynerTitle:Explaining the Success of AdaBoost and Random Forests as Interpolating ClassifiersAbstract:opens in new windowin html    pdfopens in new windowroom 155

There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for essentially similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which fits training data without error but is nevertheless somewhat smooth. We show that AdaBoost has the same property. We conjecture that both AdaBoost and random forests  succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.

WednesdayMay 04, 201611:15
Machine Learning and Statistics Seminar
Speaker:Amit DanielyTitle:The Power of Initialization and a Dual View on ExpressivityAbstract:opens in new windowin html    pdfopens in new windowRoom 155
We develop a general duality between neural networks and compositional kernels. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective of neural networks and underscores their expressive power. Joint work with Roy Frostig and Yoram Singer
WednesdayApr 06, 201611:15
Machine Learning and Statistics Seminar
Speaker:Moshe KoppelTitle:Reconstructing Ancient Documents from Noisy ManuscriptsAbstract:opens in new windowin html    pdfopens in new windowRoom 155
Given multiple corrupted versions of the same text, as is common with ancient manuscripts, we wish to reconstruct the original text from which the extant corrupted versions were copied (possibly via latent intermediary versions). This is a challenge of cardinal importance in the humanities. We use a variant of EM to solve this problem and demonstrate the efficacy of the method on both synthetic and real-world data.
WednesdayMar 30, 201611:15
Machine Learning and Statistics SeminarRoom 261
Speaker:Matan GavishTitle:Optimal thresholding of singular values and eigenvaluesAbstract:opens in new windowin html    pdfopens in new windowPLEASE NOTE ROOM CHANGE

It is common practice in multivariate and matrix-valued data analysis to reduce dimensionality by performing a Singular Value Decomposition or Principal Component Analysis, and keeping only $r$ singular values or principal components, the rest being presumably associated with noise. However, the literature does not propose a disciplined criterion to determine $r$; most practitioners still look for the ``elbow in the Scree Plot'', a 50-years-old heuristic performed by eye. I'll review a line of work which develops a systematic approach to eigenvalue and singular value thresholding. This approach assumes that the signal is low-rank and that the noise is rotationally invariant. Recent results derive optimal thresholds in the presence of quite general noise distributions.

Joint work with David Donoho, Iain Johnstone and Edgar Dobriban (Stanford).

WednesdayMar 16, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Amichai PainskyTitle:Generalized Independent Component Analysis Over Finite AlphabetsAbstract:opens in new windowin html    pdfopens in new window
Independent component analysis (ICA) is a statistical method for transforming an observable multidimensional random vector into components that are as statistically independent as possible from each other. Usually the ICA framework assumes a model according to which the observations are generated (such as a linear transformation with additive noise). ICA over finite fields is a special case of ICA in which both the observations and the independent components are over a finite alphabet. In this work we consider a generalization of this framework in which an observation vector is decomposed to its independent components (as much as possible) with no prior assumption on the way it was generated. This generalization is also known as Barlow's minimal redundancy representation problem [Barlow, '89] and is considered an open problem. We propose several theorems and show that this hard problem can be accurately solved with a branch and bound search tree algorithm, or tightly approximated with a series of linear problems. Moreover, we show that there exists a simple transformation (namely, order permutation) which provides a greedy yet very effective approximation of the optimal solution. We further show that while not every random vector can be efficiently decomposed into independent components, the vast majority of vectors do decompose very well (that is, with a small constant cost), as the dimension increases. The minimal redundancy representation (also known as factorial coding) has many applications, mainly in the fields of neural networks and deep learning. In this work we show that this formulation further applies to large alphabet source coding. Joint work with Prof. Saharon Rosset from the Statistics Department and Prof. Meir Feder from the EE department, Tel Aviv University
WednesdayMar 09, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Sebastien BubeckTitle:New Results at the Crossroads of Convexity, Learning and Information TheoryAbstract:opens in new windowin html    pdfopens in new window
I will present three new results: (i) the Cramer transform of the uniform measure on a convex body is a universal self-concordant barrier; (ii) projected gradient descent with Gaussian noise allows to sample from a log-concave measure in polynomial time; and (iii) Thompson sampling combined with a multi-scale exploration solves the Bayesian convex bandit problem. The unifying theme in these results is the interplay between concepts from convex geometry, learning and information theory. Joint work with Ronen Eldan, and for (ii) with Joseph Lehec.
WednesdayFeb 24, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Nicolo Cesa-BianchiTitle:Real-time bidding and regret minimizationAbstract:opens in new windowin html    pdfopens in new window
In real-time bidding (RTB), ad exchanges run second-price auctions in a few milliseconds, allowing publishers to sell ad spaces to advertisers on a per-impression basis. The fact that RTB allows the accurate tailoring of impressions to the features of each individual user, has fueled the demand for algorithmic platforms that serve the needs of either the seller or the buyer. In this talk, we focus on the problem, faced by the seller, of dynamically optimizing the reserve price in each auction with the goal of maximizing overall revenue. We cast this problem in a regret minimization setting, and describe computationally efficient algorithms achieving regret of order T^{1/2} under various assumptions both on the information available to the seller and on the mechanism generating bids. Joint work with Claudio Gentile (Varese) and Yishay Mansour (Tel-Aviv).
WednesdayJan 27, 201611:30
Machine Learning and Statistics SeminarRoom 1
Speaker:Assaf HallakTitle:Off-policy Evaluation for MDPs with Unknown StructureAbstract:opens in new windowin html    pdfopens in new windowNEW DATE
In this talk I will present my work from ICML 2015. First, I will give a general introduction to Reinforcement Learning setup and define the off-policy evaluation problem and its core difficulties. I will present the model based solution for off-policy evaluation, and explain when structure can be exploited to improve performance of such solution. This will lead us to the core of our algorithm - the much more general problem of structure learning. The paper suggests solving this problem greedily and give conditions as to when such a solution works. Finally, I will present a few empirical results demonstrating our result.
WednesdayJan 06, 201611:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Karen LivescuTitle:Segmental Sequence Models in the Neural AgeAbstract:opens in new windowin html    pdfopens in new windowJOINT Vision and Machine Learning seminar

Many sequence prediction tasks---such as automatic speech recognition and video analysis---benefit from long-range temporal features.  One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields.  Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations.  We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding.  In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video.  I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.


This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel

WednesdayDec 30, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Eran Treister Title:Efficient algorithms for large scale parameter estimationAbstract:opens in new windowin html    pdfopens in new windowJoint Mathematical Analysis and Applications & Machine Learning and Statistics seminar

Parameter estimation is performed by fitting data measurements to a model using Bayesian statistics, assuming additional prior information. The estimation requires a numerical solution of large scale optimization problem, whose objective traditionally includes data fidelity and regularization terms. In this talk I will present numerical solution methods for two such estimation problems.

In the first part of the talk I will concentrate on parameter estimation of physical models, obtained by solving optimization problems that are constrained by partial differential equations (PDEs). I will focus on my recent work on 3D Full Waveform Inversion, which arises in seismic exploration of oil and gas reservoirs, earth sub-surface mapping, ultrasound imaging and more. I will demonstrate how to computationally treat this inverse problem, and improve its solution by using travel time tomography in a joint inversion framework. This includes efficient algorithms for the solution of the Helmholtz and eikonal equations (the two associated PDEs), and a parallel software framework for applying these algorithms for the joint inversion using a Gauss Newton algorithm.

In the second part of the talk, I will consider the estimation of large scale sparse inverse covariance matrices of multivariate Gaussian distribution. Such matrices are often used to characterize and analyze data measurements in fields that range from machine learning, signal processing and computational biology. To estimate these matrices, an l1 regularized log-determinant optimization problem needs to be solved. I will present a block-coordinate descent algorithm that can efficiently solve this problem at large scales with low memory footprint, and a multilevel acceleration framework that is suitable for general sparse optimization problems. These algorithms can be used as a tool for enriching inverse problems by "learning" appropriate prior information, adopting an empirical Bayesian framework.

WednesdayDec 23, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tirza RouttenbergTitle:Estimation after parameter selection: Estimation methods, performance analysis, and adaptive samplingAbstract:opens in new windowin html    pdfopens in new window

In many practical parameter estimation problems, such as medical experiments and cognitive radio communications, parameter selection is performed prior to estimation. The selection process has a major impact on subsequent estimation by introducing a selection bias and creating coupling between decoupled parameters. As a result,   classical estimation theory may be inappropriate and inaccurate and a new methodology is needed. In this study, the problem of estimating a preselected unknown deterministic parameter, chosen from a parameter set based on a predetermined data-based selection rule, \Psi, is considered.  In this talk, I present a general non-Bayesian estimation theory for estimation after parameter selection, includes estimation methods, performance analysis, and adaptive sampling strategies.  First, I use the post-selection mean-square-error (PSMSE) criterion   as a performance measure instead of the commonly used mean-square-error (MSE).  The corresponding Cramér-Rao-type bound on the PSMSE of any \Psi-unbiased estimator is derived, where the \Psi -unbiasedness is in the Lehmann-unbiasedness sense. The post-selection maximum-likelihood (PSML) estimator is presented and its \Psi–efficiency properties are demonstrated. Practical implementations of the PSML estimator are proposed as well. Finally, I discuss the concept of adaptive sampling in a two-sampling stages scheme of selection and estimation.

WednesdayDec 16, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tomer KorenTitle:The Computational Power of Optimization in Online LearningAbstract:opens in new windowin html    pdfopens in new window

We consider the fundamental problem of prediction with expert advice where the experts are "optimizable": there is a black-box optimization oracle that can be used to compute, in constant time, the leading expert in retrospect at any point in time. In this setting, we give a novel online algorithm that attains vanishing regret with respect to $N$ experts in total $\sqrt{N}$ computation time. We also give a lower bound showing that this running time cannot be improved (up to log factors) in the oracle model, thereby exhibiting a quadratic speedup as compared to the standard, oracle-free setting where the required time for vanishing regret is linear in $N$. These results demonstrate an exponential gap between the power of optimization in online learning and its power in statistical learning: in the latter, an optimization oracle---i.e., an efficient empirical risk minimizer---allows to learn a finite hypothesis class of size $N$ in time $\log{N}$.

We also study the implications of our results to learning in repeated zero-sum games, in a setting where the players have access to oracles that compute, in constant time, their best-response to any mixed strategy of their opponent. We show that the runtime required for approximating the minimax value of the game in this setting is $\sqrt{N}$, yielding again a quadratic improvement upon the oracle-free setting, where linear time in $N$ is known to be tight.

WednesdayDec 02, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Yuval Benjamini Title:Estimating bumps: selective inference for regions in non-stationary spatial dataAbstract:opens in new windowin html    pdfopens in new window

"Circular inference" is a pejorative coined for methods in which a hypothesis is selected after looking at the data, but the inferential procedures treat it as if it was determined in advance. Unfortunately, many throughput screening experiments in genomics or neuroimaging seek to do exactly this: identify regions (bumps) of high signal in the data and evaluate these found regions using the same data. Simple estimators that ignore the selection will be biased; when the data is non-stationary, this bias can vary dramatically between different regions. Nevertheless, methods for evaluating and comparing selected regions are crucial, because typically only a handful of regions can be further explored in tailored follow up studies. 

In this talk I describe a new conditional inference approach for characterizing these found regions by estimating their population parameters. Our method explicitly models the selection procedure, and simulates from the conditional distribution to estimate the underlying parameters. Efficient strategies for providing p-value, estimators and intervals will be discussed, as well as power versus accuracy tradeoffs. I will demonstrate the new method for estimating bumps in a comparison of DNA-methylation patterns across tissue type.

This is joint work with Jonathan Taylor and Rafael Irizarry.

MondayJun 15, 201514:30
Machine Learning and Statistics SeminarRoom 261
Speaker:Alex GoldenshlugerTitle:Nonparametric estimation of service time distribution in the $M/G/\infty$ queue and related estimation problemsAbstract:opens in new windowin html    pdfopens in new windowPLEASE NOTE UNUSUAL DAY/TIME/PLACE

The subject of this talk is the problem of estimating service time distribution of the $M/G/\infty$ queue from incomplete data on the queue. The goal is to estimate $G$ from   observations of the queue--length process at the points of  the regular grid on a fixed time interval. We propose an estimator and analyze its accuracy  over a family of target service time distributions. The original $M/G/\infty$ problem is closely related to the problem of estimating derivatives of the covariance function of a  stationary Gaussian process. We consider the latter problem and derive lower bounds on the minimax risk. The obtained results strongly suggest that the proposed estimator of the service time distribution is rate optimal.

WednesdayMay 27, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Niv BuchbinderTitle:A Primal-Dual Approach to Online Optimization ProblemsAbstract:opens in new windowin html    pdfopens in new window
The primal-dual method is one of the fundamental design methodologies in the areas of linear programming, combinatorial optimization, and approximation algorithms. In this talk I will provide an introduction to the area of online combinatorial optimization. I will then discuss the primal-dual method as a unified framework to the design and analysis of online algorithms. Examples for which this approach is applicable include central online problems such as paging, routing, scheduling, online set cover, and graph connectivity problems. Finally, I will show some connections between online combinatorial optimization and online learning.
WednesdayMay 13, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Tamir HazanTitle:Machine Learning: Is it Artificial Intelligence, Statistics or Algorithms? Abstract:opens in new windowin html    pdfopens in new window
Machine learning originates from different fields: Artificial Intelligence (e.g, the Perceptron), Statistics (e.g., the Vapnik-Chervonenkis dimension) and Algorithms (e.g., Valiant's theory of the learnable). In this talk I will show how machine learning relates and differs from these fields while focusing on learning and inference of high-dimensional structures. Such structures arise in various AI applications, for example objects computer vision, parses in natural language processing and molecular structures in computational biology. I will present perturb-max models, which are high-dimensional probability models that measure the stability of prediction to random shifts of data measurements. These models replace the standard Gibbs distributions and allow efficient sampling, as well as generalization and regret bounds.
WednesdayApr 29, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan RosenblattTitle:On the Optimality of Averaging in Distributed Statistical LearningAbstract:opens in new windowin html    pdfopens in new window

A common approach to statistical learning on big data is to randomly split it among m machines and calculate the parameter of interest by averaging their m individual estimates.

Focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error  incurred by this strategy.
We consider two asymptotic settings: one where the number of samples per machine n->inf but the number of  parameters p is fixed, and a second high-dimensional regime where both p,n-> inf with p/n-> kappa.

Most previous works provided only moment bounds on the error incurred by splitting the data in the fixed p setting. In contrast, we present for both regimes asymptotically exact distributions for this estimation error. In the fixed-p setting, under suitable assumptions, we thus prove that to leading order, averaging is as accurate as the centralized solution. In the high-dimensional setting, we show a qualitatively different behavior: data splitting does incur a first order accuracy loss, which we quantify precisely. In addition, our asymptotic distributions allow the construction of confidence intervals and hypothesis testing on the estimated parameters. 

Our main conclusion is that in both regimes, averaging parallelized estimates is an attractive way to speedup computations and save on memory, while incurring a quantifiable and typically moderate excess error.

WednesdayMar 25, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Itai DattnerTitle:Statistical Inference for Systems of Differential EquationsAbstract:opens in new windowin html    pdfopens in new window
Many processes in biology, chemistry, physics, medicine, and engineering are modeled by systems of differential equations. These systems describe the interrelationships between the variables involved, and depend in a complicated way on unknown quantities (e.g., initial values, constants or time dependent parameters). Most often, the researcher would like to execute important tasks such as testing the validity of a model, analyzing its qualitative behavior or predicting future states of the system. In order to execute these tasks, one usually needs to estimate the unknown quantities of the system from real data. However, in the case of differential equations, the inverse problem of parameter estimation is considered as the bottleneck in modeling dynamical systems and estimating parameters based on observed noisy state variables has a relatively sparse statistical literature. In this talk we focus on the fairly general and often applied class of systems of ordinary differential equations linear in (functions of) the parameters. For such systems we first characterize a necessary and sufficient condition for identifiability of parameters. Then we present a novel estimation approach and support it by a general statistical theory that enables the development of estimators tailored for a variety of experimental scenarios. In particular, we present estimators corresponding to some common experimental setups and discuss their statistical properties. Simulation studies and application of the method to real data will be discussed.
WednesdayJan 21, 201511:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Omri AbendTitle:Machine-Learning the Universal Semantics of Natural LanguagesAbstract:opens in new windowin html    pdfopens in new window
The field of Natural Language Processing (NLP) has recently been pivotal in producing important language technologies such as machine translation and question answering. Such technologies are based on elaborate structural representations of text, detected by statistical methods. However, common approaches to structural representation are language-specific or even domain-specific, limiting the applicability of NLP tools and models. How to represent both the idiosyncrasies of specific domains and languages as well as their commonalities is still an open question. In my talk I will address these questions and propose an approach for learning a level of representation shared by all languages using latent variable models for structured prediction. Under this approach, learning starts from universally-applicable coarse-grained logical structures, which is used to bootstrap the learning of more fine-grained semantic distinctions, as well as the learning of the specifics of individual languages. I will discuss the value of universal semantic structures both to the computational modeling of child language acquisition and to leading NLP applications, focusing on machine translation. Joint work with Ari Rappoport, Shay Cohen and Mark Steedman.
WednesdayDec 24, 201411:15
Machine Learning and Statistics SeminarRoom 1
Speaker:Jonathan BerantTitle:Scalable algorithms for translating natural language to logical formAbstract:opens in new windowin html    pdfopens in new window
Conversational interfaces and virtual assistants such as Apple's Siri, Google Now, and Facebook Graph Search, have led to a rising interest in systems that can translate natural language commands and questions to formal logical forms (like SQL queries) that can be executed against a knowledge base. A major challenge has been to scale these systems, known as semantic parsers, to large knowledge bases. In this talk, I will describe novel algorithms for large scale semantic parsing. A fundamental characteristic of semantic parsing against large knowledge bases is that the space of possible logical forms grows quickly with the length of the input sentence. Our first algorithm learns to efficiently search through this space by explicitly scoring partial logical forms, combining ideas from agenda-based parsing and reinforcement learning. Compared to previous methods, our parser is almost an order of magnitude faster, while maintaining state-of-the-art accuracy. The second algorithm addresses the problem of language variability, that is, the fact that the same logical form can be expressed in a myriad of ways in natural language. We learn to paraphrase an input question ("Where is Obama from?") to a canonical form ("What is the place of birth of Barack Obama?") that can be easily mapped to a logical form. This allows us to exploit the large amounts of free text that are available on the web, leading to a state-of-the-art semantic parser that scales to a knowledge base containing hundreds of millions of facts. This is joint work with Percy Liang.