Vision and AI
As more people work from home or while traveling, new opportunities and challenges arise around mobile office work. On the one hand, people may work flexible hours, independent of traffic limitations; on the other hand, they may need to work in makeshift spaces with less-than-optimal working conditions, where applications are not flexible to their physical and social context and remote collaboration does not account for differences between users' conditions and capabilities.
My research uses a better understanding of the physical and social constraints of users' environments, and the ability to augment users' senses to disconnect them from such constraints, to design applications that flex to fit a changing environment and personalize to each user, while maintaining usability and familiarity.
Bio:
Eyal Ofek received his Ph.D. in computer vision from the Hebrew University of Jerusalem in 2000. He founded two startup companies in the areas of computer graphics and computer vision, and in 1996 he joined the founding group of 3DV Systems, which developed the world's first time-of-flight (TOF) depth camera. The technology he worked on became the basis for the depth sensors later included in augmented reality headsets such as the Microsoft HoloLens and Magic Leap HMDs.
In 2004, Eyal joined Microsoft Research, where for six years he headed Bing Maps & Mobile research. His group developed technologies and services such as the world's first street-side service and the popular Stroke Width Transform text detection, later included in OpenCV. In 2011 Eyal formed a new research group at Microsoft Research centered on augmented reality. The group developed experimental systems for environment-aware layout of augmented reality experiences, used by several HoloLens teams, and served as a basis for the Unity MARS product.
Since 2014, Eyal has focused on Human-Computer Interaction (HCI), Mixed Reality (MR), and haptics.
We analyze three cases where Deep Neural Networks (DNNs) seem to work "sub-optimally". We examine whether these issues can, or should, be fixed.
Convnets were originally designed to be shift-invariant, but aliasing breaks this property. We show how to fix this issue completely using polynomial activations and achieve state-of-the-art performance under adversarial shifts, including fractional shifts.
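As a rough illustration (not the paper's architecture), the two ingredients involved can be sketched as follows: a learnable polynomial activation, whose output bandwidth grows only by a known factor and can therefore be anti-aliased exactly, and a simple circular-shift consistency check. All names and hyperparameters here are illustrative.

```python
import torch

class PolyAct(torch.nn.Module):
    """Illustrative learnable degree-2 polynomial activation: a*x^2 + b*x + c."""
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Parameter(torch.tensor(0.1))
        self.b = torch.nn.Parameter(torch.tensor(1.0))
        self.c = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

@torch.no_grad()
def shift_consistency(model, images, dx=4, dy=4):
    """Fraction of images whose predicted class survives a circular shift."""
    shifted = torch.roll(images, shifts=(dy, dx), dims=(2, 3))
    return (model(images).argmax(1) == model(shifted).argmax(1)).float().mean().item()
```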
DNNs are known to exhibit "catastrophic forgetting" when trained on sequential tasks. We show this can happen even in a linear setting for regression and classification. However, we derive universal bounds which guarantee no catastrophic forgetting in certain cases, such as when tasks are repeated or randomly ordered.
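A toy numerical illustration (with arbitrary dimensions, not the paper's setting or bounds) of forgetting in sequential linear regression: after fitting the second task, the loss on the first one rises sharply.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
tasks = [(rng.normal(size=(50, d)), rng.normal(size=d)) for _ in range(2)]
data = [(X, X @ w_star) for X, w_star in tasks]        # two noiseless regression tasks

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(d)
for X, y in data:                                      # train on the tasks sequentially
    for _ in range(1000):                              # plain gradient descent per task
        w -= 0.05 * 2 * X.T @ (X @ w - y) / len(y)
    print([round(loss(w, *t), 4) for t in data])       # task-1 loss increases after task 2
```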
It was recently observed that DNNs, optimized with gradient descent, operate at the "Edge of Stability", in which monotone convergence is not guaranteed. However, we prove a different potential function ("gradient flow solution sharpness") is monotonically decreasing in scalar networks, observe this holds empirically in DNNs, and discuss the implications.
References:
[1] H. Michaeli, T. Michaeli, D. Soudry, "Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations", CVPR 2023. (https://arxiv.org/abs/2303.08085)
[2] I. Evron, E. Moroshko, G. Buzaglo, M. Khriesh, B. Marjieh, N. Srebro, D. Soudry, "Continual Learning in Linear Classification on Separable Data", ICML 2023. (https://arxiv.org/abs/2306.03534)
[3] I. Kreisler*, M. Shpigel Nacson*, D. Soudry, Yair Carmon, "Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond", ICML 2023. (https://arxiv.org/abs/2305.13064)
AI aims to build systems that interact with their environment, with people, and with other agents in the real world. This vision requires combining perception with reasoning and decision-making. It poses hard algorithmic challenges: from generalizing effectively from few or no samples to adapting to new domains to communicating in ways that are natural to people. I'll discuss our recent research thrusts for facing these challenges. These will include approaches to model the high-level structure of a visual scene; leveraging compositional structures in attribute space to learn from descriptions without any visual samples; and teaching agents new concepts without labels, by using elimination to reason about their environment.
Bio:
Gal Chechik is a Professor at Bar-Ilan University and a director of AI research at NVIDIA. His current research focuses on learning for reasoning and perception. In 2018, Gal joined NVIDIA to found and head NVIDIA's research in Israel. Prior to that, Gal was a staff research scientist at Google Brain and Google Research, developing large-scale algorithms for machine perception, used by millions daily. Gal earned his PhD in 2004 from the Hebrew University and completed his postdoctoral training at the Stanford CS department. In 2009, he started the learning systems lab at the Gonda Brain Research Center of Bar-Ilan University, and was appointed an associate professor in 2013. Gal has authored ~120 refereed publications and ~49 patents, including publications in Nature Biotechnology, Cell and PNAS. His work has won awards at ICML and NeurIPS.
http://chechiklab.biu.ac.il
Image denoising - the removal of additive white Gaussian noise from an image - is one of the oldest and most studied problems in image processing. Extensive work over several decades has produced thousands of papers on this subject and many well-performing algorithms for this task. As expected, the era of deep learning has brought yet another revolution to this subfield and now leads today's ability to suppress noise in images. All this progress has led some researchers to believe that "denoising is dead", in the sense that all that can be achieved has already been done.
Exciting as this story might be, this talk IS NOT ABOUT IT!
Our story focuses on recently discovered abilities and vulnerabilities of image denoisers. In a nutshell, we expose the possibility of using image denoisers to serve other problems, such as regularizing general inverse problems and serving as the engine for image synthesis. We also unveil the (strange?) idea that denoising (and other inverse problems) might not have a unique solution, as common algorithms would have you believe. Instead, we will describe constructive ways to produce randomized and diverse high-perceptual-quality results for inverse problems.
Deep generative models have proved successful for many image-to-image applications. Such models hallucinate information based on their large and diverse training datasets. Therefore, when enhancing or editing a portrait image, the model produces a generic and plausible output, but often it does not depict the person who actually appears in the image.
In this talk, I'll present our latest work, MyStyle - which introduces the notion of a personalized generative model. Trained on ~100 images of the same individual, MyStyle learns a personalized prior, custom to their unique appearance. This prior is then leveraged to solve ill-posed image enhancement and editing tasks - such as super-resolution, inpainting and changing the head pose.
The field of natural language processing is dominated by transformer-based language models (LMs). One of the core building blocks of these models is the feed-forward network (FFN) layers, which typically account for >2/3 of the network parameters. Yet, how these layers are being utilized by the model to build predictions is largely unknown. In this talk, I will share recent findings on the operation of FFN layers in LMs, and demonstrate their utility in real-world applications. First, I will show that FFN layers can be cast as human-interpretable key-value memories, and describe how the output from each layer can be viewed as a collection of updates to the model's output distribution. Then, I will demonstrate the utility of these findings in the context of (a) controlled language generation, where we reduce the toxicity of GPT2 by almost 50%, and (b) improving computation efficiency, through a simple rule for early exit, saving 20% of computation on average.
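A minimal sketch of the key-value reading of an FFN layer described above (the dimensions, the GELU activation, and the unembedding projection are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def ffn_as_memory(x, W_in, W_out):
    """Read an FFN layer as a key-value memory.
    x: (d,) hidden state, W_in: (m, d) rows act as keys, W_out: (m, d) rows act as values."""
    coeff = F.gelu(x @ W_in.T)   # (m,) one coefficient per memory cell
    update = coeff @ W_out       # (d,) weighted sum of value vectors added to the residual
    return update, coeff

def tokens_promoted_by_value(value_vec, unembedding, k=10):
    """Project a single value vector through a (vocab, d) unembedding matrix to
    inspect which output tokens it pushes up in the model's output distribution."""
    return (value_vec @ unembedding.T).topk(k).indices
```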
Structure from Motion (SfM) deals with recovering camera parameters and 3D scene structure from collections of 2D images. SfM is commonly solved by minimizing the non-convex bundle adjustment objective, which generally requires sophisticated initialization. In this talk I will present two approaches to SfM. The first approach involves averaging of essential or fundamental matrices (also called bifocal tensors). Since the bifocal tensors are computed independently from image pairs, they are generally inconsistent with any set of n cameras. We provide a complete algebraic characterization of the manifold of bifocal tensors for n cameras and present an optimization framework to project measured bifocal tensors onto the manifold. Our second approach is an online approach: given n-1 images, I_1,...,I_{n-1}, whose camera matrices have already been recovered, we seek to recover the camera matrix associated with an image I_n. We present a novel solution to the six-point online algorithm to recover the exterior parameters associated with I_n. Our algorithm uses just six corresponding pairs of 2D points, each extracted from I_n and from one of the preceding n-1 images, allowing recovery of the full six degrees of freedom of the n-th camera, and, unlike common methods, does not require tracking feature points in three or more images. We present experiments that demonstrate the utility of both approaches. If time permits, I will briefly present additional recent work on solving SfM using deep neural models.
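For intuition only: the classical projection of a single measured matrix onto the essential manifold (two equal singular values, third zero). The talk's contribution is the joint characterization and projection for all n cameras, which this sketch does not capture.

```python
import numpy as np

def project_to_essential(E):
    """Nearest essential matrix to E (in the Frobenius sense): equalize the two
    leading singular values and zero out the third."""
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt
```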
Visual scene understanding has traditionally focused on identifying objects in images - learning to predict their presence and spatial extent. However, understanding a visual scene often goes beyond recognizing individual objects. In my thesis (guided by Prof. Ullman), I mainly focused on developing a task-dependent (TD) network, which uses processing instructions to guide the functionality of a shared network via an additional task input. In addition, I studied strategies for incorporating relational information into recognition pipelines to efficiently extract structures of interest from the scene. In the scope of high-level scene understanding, which may be dominated by recognizing a rather small number of objects and relations, the task-dependent scheme naturally allows goal-directed scene interpretation, either in a single step or by sequential execution with a series of different TD instructions. It simplifies the use of referring relations, grounds requested visual concepts back into the image plane, and improves combinatorial generalization, which is essential for AI systems, by using structured representations and computations. In the scope of multi-task learning, the scheme offers an alternative to the popular multi-branched architecture, which simultaneously executes all tasks using task-specific branches on top of a shared backbone; it addresses capacity limitations, increases task selectivity, and allows scalability and further task extensions. Results will be shown for various applications: object detection, visual grounding, properties classification, human-object interactions, and general scene interpretation.
Works included:
1. H. Levi and S. Ullman. Efficient coarse-to-fine non-local module for the detection of small objects. BMVC, 2019. https://arxiv.org/abs/1811.12152
2. H. Levi and S. Ullman. Multi-task learning by a top-down control network. ICIP 2021. https://arxiv.org/abs/2002.03335
3. S. Ullman et al. Image interpretation by iterative bottom-up top-down processing. http://arxiv.org/abs/2105.05592
4. A. Arbelle et al. Detector-Free Weakly Supervised Grounding by Separation. https://arxiv.org/abs/2104.09829. Submitted to ICCV.
5. Ongoing work – Human Object Interactions
There is a growing number of tasks that operate directly on point clouds. As the size of the point cloud grows, so do the computational demands of these tasks. A possible solution is to sample the point cloud first. Classic sampling approaches, such as farthest point sampling (FPS), do not consider the downstream task. Recent work showed that learning a task-specific sampling can improve results significantly. However, the proposed technique did not deal with the non-differentiability of the sampling operation and offered a workaround instead. We introduce a novel differentiable relaxation for point cloud sampling that approximates sampled points as a mixture of points in the primary input cloud. Our approximation scheme leads to consistently good results on classification and geometry reconstruction applications. We also show that the proposed sampling method can be used as a front end to a point cloud registration network. This is a challenging task, since sampling must be consistent across two different point clouds for a shared downstream task. In all cases, our approach outperforms existing non-learned and learned sampling alternatives.
Based on the work of: Itai Lang, Oren Dovrat, and Asaf Manor
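A minimal sketch of the kind of differentiable relaxation described above (the soft projection is illustrative; the actual network and loss terms differ): each generated point is replaced by a temperature-weighted convex combination of its nearest neighbors in the input cloud, so gradients can flow through the "sampling" step.

```python
import torch

def soft_project(generated, cloud, k=7, temperature=0.1):
    """generated: (m, 3) points proposed by the network, cloud: (n, 3) input cloud.
    Returns (m, 3) points expressed as mixtures of original points."""
    d = torch.cdist(generated, cloud)               # (m, n) pairwise distances
    dist, idx = d.topk(k, dim=1, largest=False)     # k nearest neighbors per point
    w = torch.softmax(-dist / temperature, dim=1)   # (m, k) mixture weights
    return (w.unsqueeze(-1) * cloud[idx]).sum(dim=1)
```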
https://weizmann.zoom.us/j/91363737177?pwd=bEdNSU9tcVRCQXhjaDRPSXIyWCswUT09
In this talk I'll cover several works on 3D deep learning on point clouds for scene understanding tasks.
First, I'll describe VoteNet (ICCV 2019, best paper nomination): a method for object detection from 3D point cloud input, inspired by the classical generalized Hough voting technique. I'll then explain how we integrated image information into the voting scheme to further boost 3D detection (ImVoteNet, CVPR 2020). In the second part of my talk I'll describe recent studies focusing on reducing supervision for 3D scene understanding tasks, including PointContrast -- a self-supervised representation learning framework for 3D point clouds (ECCV 2020). Our findings in PointContrast are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvements over recent best results in segmentation and detection across 6 different benchmarks covering indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains.
https://weizmann.zoom.us/j/93960308443?pwd=b1BWMkZURjZ0THMxUisrSFpMaHNXdz09
In localization microscopy, the positions of individual nanoscale point emitters (e.g. fluorescent molecules) are determined at high precision from their point-spread functions (PSFs). This enables highly precise single/multiple-particle-tracking, as well as super-resolution microscopy, namely single molecule localization microscopy (SMLM). To obtain 3D localization, we employ PSF engineering – namely, we physically modify the standard PSF of the microscope, to encode the depth position of the emitter. In this talk I will describe how this method enables unprecedented capabilities in localization microscopy; specific applications include dense emitter fitting for super-resolution microscopy, multicolor imaging from grayscale data, volumetric multi-particle tracking/imaging, dynamic surface profiling, and high-throughput in-flow colocalization in live cells. We often combine the optical encoding method with neural nets (deep-learning) for decoding, i.e. image reconstruction; however, our use of neural nets is not limited to image processing - we use nets to design the optimal optical acquisition system in a task-specific manner.
https://weizmann.zoom.us/j/95427957584?pwd=QWhYMURYU3UxaE43TzhEOFJyV2J2UT09
Recent approaches for unsupervised image translation rely strongly on generative adversarial training and ad-hoc architectural locality constraints. Despite their appealing results, it can easily be observed that the learned class and content representations are entangled, which often hurts translation performance. We analyse this task under the framework of image disentanglement into meaningful attributes. We first analyse the simpler setting, where the domain of the image and its other attributes are independent. Using information arguments, we present a non-adversarial approach (LORD) with a carefully designed information bottleneck for class-content disentanglement. Our approach brings attention to several interesting and poorly explored phenomena, particularly the beneficial inductive biases of latent optimization and conditional generators, and it outperforms the top adversarial and non-adversarial class-content disentanglement methods (e.g. DrNet and MLVAE). With further information constraints, we extend our approach to the standard unsupervised image translation task, where the unknown image properties are dependent on the domain. Our full approach surpasses the top unsupervised image translation methods (e.g. FUNIT and StarGAN-v2).
https://weizmann.zoom.us/j/93320818883?pwd=VE1STDVrS3BPMThHenlqZXhxbDg3dz09
People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even if they are not "essential" for the class. This leads to consistent misclassification of samples from a new distribution, like new combinations of known components.
In this talk I will present our work on compositional zero-shot recognition. I will describe a causal approach that asks "which intervention caused the image?"; a probabilistic approach using semantic soft "and-or" side-information; and, last, an approach that is simultaneously effective for both many-shot and zero-shot classes.
https://weizmann.zoom.us/j/94138183037?pwd=amZIRXgzSElQaXozL0UxdlFCRmJuQT09
Since their introduction by Goodfellow et al. in 2014, generative adversarial models seem to have completely transformed Computer Vision and Graphics. In this talk I will address three questions: (1) What did we do before the GAN era (and was it really that different)? (2) Is the way we train GANs in line with the theory (and can we do it better)? (3) How is information about object transformations encoded in a pre-trained generator?
I will start by showing that Contrastive Divergence (CD) learning (Hinton '02), the most widely used method for learning distributions before GANs, is in fact also an adversarial procedure. This settles a long-standing debate regarding the objective that this method actually optimizes, which arose due to an unjustified approximation in the original derivation. Our observation explains CD's great empirical success.
Going back to GANs, I will challenge the common practice of stabilizing training using spectral normalization. Although theoretically motivated by the Wasserstein GAN formulation, I will show that this heuristic works for different reasons and can be significantly improved upon. Our improved approach leads to state-of-the-art results in many common tasks, including super-resolution and image-to-image translation.
Finally, I will address the task of revealing meaningful directions in the latent space of a pre-trained GAN. I will show that such directions can be computed in closed form directly from the generator's weights, without the need for any training or optimization as done in existing works. I will particularly discuss nonlinear trajectories that have natural endpoints and allow controlling whether one transformation is allowed to come at the expense of another (e.g. zoom-in with or without allowing translation to keep the object centered).
* These are joint works with Omer Yair, Idan Kligvasser, Nurit Spingarn, and Ron Banner.
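To illustrate the flavor of closed-form latent directions (this is a generic construction, not necessarily the one used in the talk): the top right-singular vectors of a generator's first-layer weight matrix are the latent directions that perturb that layer's output the most, and they can be read off without any training.

```python
import numpy as np

def closed_form_directions(W, k=5):
    """W: (out_dim, latent_dim) first-layer weight matrix of a generator.
    Returns k candidate latent directions (rows), computed in closed form."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:k]
```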
https://weizmann.zoom.us/j/96344142954?pwd=RlhsVk10cWlMMm5WdGhtenRvZGQyQT09
We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid - a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.
https://arxiv.org/abs/2003.06221
Zoom: https://weizmann.zoom.us/j/96905267652?pwd=MGxhem1WVG1WbzRQSTVnYXZGLy81dz09
The emergence of deep generative models has recently enabled the automatic generation of massive amounts of graphical content, both in 2D and in 3D.
Generative Adversarial Networks (GANs) and style control mechanisms, such as Adaptive Instance Normalization (AdaIN), have proved particularly effective in this context, culminating in the state-of-the-art StyleGAN architecture.
While such models are able to learn diverse distributions, provided a sufficiently large training set, they are not well-suited for scenarios where the distribution of the training data exhibits a multi-modal behavior. In such cases, reshaping a uniform or normal distribution over the latent space into a complex multi-modal distribution in the data domain is challenging, and the generator might fail to sample the target distribution well. Furthermore, existing unsupervised generative models are not able to control the mode of the generated samples independently of the other visual attributes, despite the fact that they are typically disentangled in the training data.
In this work, we introduce UMMGAN, a novel architecture designed to better model multi-modal distributions, in an unsupervised fashion. Building upon the StyleGAN architecture, our network learns multiple modes, in a completely unsupervised manner, and combines them using a set of learned weights. We demonstrate that this approach is capable of effectively approximating a complex distribution as a superposition of multiple simple ones.
We further show that UMMGAN effectively disentangles between modes and style, thereby providing an independent degree of control over the generated content.
This is a joint work with Prof. Dani Lischinski and Prof. Daniel Cohen-Or.
Zoom Link: https://weizmann.zoom.us/j/91197729259
Single image super resolution (SR) has seen major performance leaps in recent years. However, existing methods do not allow exploring the infinitely many plausible reconstructions that might have given rise to the observed low-resolution (LR) image. These different explanations to the LR image may dramatically vary in their textures and fine details, and may often encode completely different semantic information. In this work, we introduce the task of explorable super resolution. We propose a framework comprising a graphical user interface with a neural network backend, allowing editing the SR output so as to explore the abundance of plausible HR explanations to the LR input. At the heart of our method is a novel module that can wrap any existing SR network, analytically guaranteeing that its SR outputs would precisely match the LR input, when downsampled. Besides its importance in our setting, this module is guaranteed to decrease the reconstruction error of any SR network it wraps, and can be used to cope with blur kernels that are different from the one the network was trained for. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics, to graphics.
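A rough sketch of the consistency idea (assuming a simple average-pooling downscaler rather than the general kernels handled in the work): project any SR output back onto the set of images that exactly reproduce the LR input when downsampled.

```python
import torch
import torch.nn.functional as F

def enforce_lr_consistency(sr, lr, scale):
    """sr: (N, C, H, W) network output, lr: (N, C, H/scale, W/scale) input.
    Assumes H, W divisible by scale and a box (average-pool) downscaling kernel.
    After the correction, avg_pool2d(output, scale) equals lr exactly."""
    residual = lr - F.avg_pool2d(sr, scale)
    return sr + F.interpolate(residual, scale_factor=scale, mode="nearest")
```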
Zoom link: https://weizmann.zoom.us/j/96418208541
Robust recovery of lost colors in underwater images remains a challenging problem. We recently showed that this was partly due to the prevalent use of an atmospheric image formation model for underwater images and proposed a physically accurate model. The revised model showed: 1) the attenuation coefficient of the signal is not uniform across the scene but depends on object range and reflectance, 2) the coefficient governing the increase in backscatter with distance differs from the signal attenuation coefficient. Here, we present the first method that recovers color with our revised model, using RGBD images. The Sea-thru method estimates backscatter using the dark pixels and their known range information. Then, it uses an estimate of the spatially varying illuminant to obtain the range-dependent attenuation coefficient. Using more than 1,100 images from two optically different water bodies, which we make available, we show that our method with the revised model outperforms those using the atmospheric model. Consistent removal of water will open up large underwater datasets to powerful computer vision and machine learning algorithms, creating exciting opportunities for the future of underwater exploration and conservation. (Paper published in CVPR 19).
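In simplified form (constant per-channel coefficients, which the actual method estimates from the dark pixels and the spatially varying illuminant), the revised model separates range-dependent backscatter from range-dependent attenuation:

```python
import numpy as np

def remove_backscatter(I, z, B_inf, beta_B):
    """I: (H, W, 3) image, z: (H, W) range map, B_inf, beta_B: per-channel (3,).
    Backscatter grows with range as B(z) = B_inf * (1 - exp(-beta_B * z))."""
    return I - B_inf * (1.0 - np.exp(-beta_B * z[..., None]))

def undo_attenuation(D, z, beta_D):
    """Direct signal D = J * exp(-beta_D * z), so the recovered color is J = D * exp(beta_D * z)."""
    return D * np.exp(beta_D * z[..., None])
```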
Zoom link: https://weizmann.zoom.us/j/99608045732
My goal is to demonstrate how geometric priors can replace the requirement for annotated 3D data. In this context I'll present two of my works. First, I will present a deep learning framework that estimates dense correspondence between articulated 3D shapes without using any ground-truth labeling, which I presented as an oral at CVPR 2019: "Unsupervised Learning of Dense Shape Correspondence". We demonstrated that our method is applicable to full and partial 3D models as well as to realistic scans. The problem of incomplete 3D data is encountered in many other scenarios as well. One such interesting problem of great practical importance is shape completion, to which I'll dedicate the second part of my lecture.
It is common to encounter situations where there is a considerable distinction between the scanned 3D model and the final rendered one. The distinction can be attributed to occlusions or directional view of the target, when the scanning device is localized in space. I will demonstrate that geometric priors can guide learning algorithms in the task of 3D model completion from partial observation. To this end, I'll present our recent work: "The Whole is Greater than the Sum of its Non-Rigid Parts". This famous declaration of Aristotle was adopted to explain human perception by the Gestalt psychology school of thought in the twentieth century. Here, we claim that observing part of an object which was previously acquired as a whole, one could deal with both partial matching and shape completion in a holistic manner. More specifically, given the geometry of a full, articulated object in a given pose, as well as a partial scan of the same object in a different pose, we address the problem of matching the part to the whole while simultaneously reconstructing the new pose from its partial observation. Our approach is data-driven, and takes the form of a Siamese autoencoder without the requirement of a consistent vertex labeling at inference time; as such, it can be used on unorganized point clouds as well as on triangle meshes. We demonstrate the practical effectiveness of our model in the applications of single-view deformable shape completion and dense shape correspondence, both on synthetic and real-world geometric data, where we outperform prior work on these tasks by a large margin.
Processing large point clouds is a challenging task. Therefore, the data is often sampled to a size that can be processed more easily. The question is: how should the data be sampled? A popular sampling technique is Farthest Point Sampling (FPS). However, FPS is agnostic to the downstream application (classification, retrieval, etc.). The underlying assumption seems to be that minimizing the farthest point distance, as done by FPS, is a good proxy for other objective functions.
We show that it is better to learn how to sample. To do that, we propose a deep network to simplify 3D point clouds. The network, termed S-NET, takes a point cloud and produces a smaller point cloud that is optimized for a particular task. The simplified point cloud is not guaranteed to be a subset of the original point cloud. Therefore, we match it to a subset of the original points in a post-processing step. We contrast our approach with FPS by experimenting on two standard data sets and show significantly better results for a variety of applications. Our code is publicly available.
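The matching step can be sketched as a nearest-neighbor snap of the generated points back onto the input cloud (a simplified version of the post-processing described above):

```python
import numpy as np

def match_to_subset(generated, cloud):
    """generated: (m, 3) simplified points, cloud: (n, 3) original points.
    Returns a subset of the original cloud (duplicates removed)."""
    d = ((generated[:, None, :] - cloud[None, :, :]) ** 2).sum(-1)  # (m, n) squared distances
    return cloud[np.unique(d.argmin(axis=1))]
```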
Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.
One of the unresolved questions in deep learning is the nature of the solutions that are being discovered. We investigated the collection of solutions reached by the same neural network (NN) architecture, with different random initialization of weights and random mini-batches. These solutions are shown to be rather similar -- more often than not, each train and test example is either classified correctly by all NN instances, or by none at all. Furthermore, all NNs seem to share the same learning dynamics, whereby initially the same train and test examples are correctly recognized by the learned model, followed by other examples that are learned in roughly the same order. When extending the investigation to heterogeneous collections of NN architectures, once again examples are seen to be learned in the same order irrespective of architecture, although the more powerful architecture may continue to learn and thus achieve higher accuracy. Finally, I will discuss cases where this pattern of similarity breaks down, which show that the reported similarity is not an artifact of optimization by gradient descent.
Recent developments in Natural Language Processing (NLP) allow models to leverage large, unprecedented amounts of raw text, culminating in impressive performance gains in many of the field's long-standing challenges, such as machine translation, question answering, or information retrieval.
In this talk, I will show that despite these advances, state-of-the-art NLP models often fail to capture crucial aspects of text understanding. Instead, they excel by finding spurious patterns in the data, which lead to biased and brittle performance. For example, machine translation models are prone to translate doctors as men and nurses as women, regardless of context. Next, I will discuss an approach that could help overcome these challenges by explicitly representing the underlying meaning of texts in formal data structures. Finally, I will present robust models that use such explicit representations to effectively identify meaningful patterns in real-world texts, even when training data is scarce.
I will present a new approach for image parsing, Pixel Consensus Voting (PCV). The core of PCV is a framework for instance segmentation based on the Generalized Hough transform. Pixels cast discretized, probabilistic votes for the likely regions that contain instance centroids. At the detected peaks that emerge in the voting heatmap, backprojection is applied to collect pixels and produce instance masks. Unlike a sliding window detector that densely enumerates object proposals, our method detects instances as a result of the consensus among pixel-wise votes. We implement vote aggregation and backprojection using native operators of a convolutional neural network. The discretization of centroid voting reduces the training of instance segmentation to pixel labeling, analogous and complementary to fully convolutional network-style semantic segmentation, leading to an efficient and unified architecture that jointly models things and stuff. We demonstrate the effectiveness of our pipeline on COCO and Cityscapes Panoptic Segmentation and obtain competitive results. This joint work with Haochen Wang (TTIC/CMU), Ruotian Luo (TTIC), Michael Maire (TTIC/Uchicago) received an Innovation Award at the COCO/Mapillary workshop at ICCV 2019.
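A toy version of the vote-aggregation step (the actual pipeline uses discretized probabilistic votes implemented with native convolutional operators): each pixel adds its confidence to the cell it believes contains its instance centroid, and peaks of the resulting heatmap suggest instances.

```python
import numpy as np

def aggregate_votes(offsets, confidence):
    """offsets: (H, W, 2) integer (dy, dx) votes toward the predicted centroid,
    confidence: (H, W) vote weights. Returns an (H, W) voting heatmap."""
    H, W = confidence.shape
    ys, xs = np.mgrid[0:H, 0:W]
    ty = np.clip(ys + offsets[..., 0], 0, H - 1)
    tx = np.clip(xs + offsets[..., 1], 0, W - 1)
    heat = np.zeros((H, W))
    np.add.at(heat, (ty.ravel(), tx.ravel()), confidence.ravel())
    return heat
```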
In the talk I will present the Hue-Net - a novel Deep Learning framework for Intensity-based Image-to-Image Translation.
The key idea is a new technique we term network augmentation, which allows a differentiable construction of intensity histograms from images.
We further introduce differentiable representations of (1D) cyclic and joint (2D) histograms and use them for defining loss functions based on cyclic Earth Mover's Distance (EMD) and Mutual Information (MI). While the Hue-Net can be applied to several image-to-image translation tasks, we choose to demonstrate its strength on color transfer problems, where the aim is to paint a source image with the colors of a different target image. Note that the desired output image does not exist and therefore cannot be used for supervised pixel-to-pixel learning.
This is accomplished by using the HSV color-space and defining an intensity-based loss that is built on the EMD between the cyclic hue histograms of the output and the target images. To enforce color-free similarity between the source and the output images, we define a semantic-based loss by a differentiable approximation of the MI of these images.
The incorporation of histogram loss functions in addition to an adversarial loss enables the construction of semantically meaningful and realistic images.
Promising results are presented for different datasets.
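Two of the ingredients behind Hue-Net's losses can be sketched as follows (bin count, kernel shape, and width are illustrative, not the actual design): a differentiable "soft" histogram, and a 1D EMD between normalized histograms computed via cumulative sums (the cyclic variant subtracts the median of the cumulative difference before taking absolute values).

```python
import torch

def soft_histogram(x, bins=64, sigma=0.02):
    """Differentiable histogram of intensities in [0, 1]: every pixel contributes
    to every bin through a Gaussian kernel around the bin center."""
    centers = torch.linspace(0.0, 1.0, bins, device=x.device)
    weights = torch.exp(-0.5 * ((x.reshape(-1, 1) - centers) / sigma) ** 2)
    hist = weights.sum(dim=0)
    return hist / hist.sum()

def emd_1d(p, q):
    """Earth Mover's Distance between two normalized 1D histograms (linear, not cyclic)."""
    return torch.cumsum(p - q, dim=0).abs().sum()
```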
When a very fast dynamic event is recorded with a low framerate camera, the resulting video suffers from severe motion blur (due to exposure time) and motion aliasing (due to low sampling rate in time). True Temporal Super-Resolution (TSR) is more than just Temporal-Interpolation (increasing framerate). It also recovers new high temporal frequencies beyond the temporal Nyquist limit of the input video, thus resolving both motion-blur and motion-aliasing. In this work we propose a "Deep Internal Learning" approach for true TSR. We train a video-specific CNN on examples extracted directly from the low-framerate input video. Our method exploits the strong recurrence of small space-time patches inside a single video sequence, both within and across different spatio-temporal scales of the video. We further observe (for the first time) that small space-time patches recur also across dimensions of the video sequence - i.e., by swapping the spatial and temporal dimensions. In particular, the higher spatial resolution of video frames provides strong examples as to how to increase the temporal resolution of that video. Such internal video-specific examples give rise to strong self-supervision, requiring no data but the input video itself. This results in Zero-Shot Temporal-SR of complex videos, which removes both motion blur and motion aliasing, outperforming previous supervised methods trained on external video datasets.
* Joint work with Shai Bagon, Eyal Naor, George Pisha, Michal Irani
Super resolution (SR) methods typically assume that the low-resolution (LR) image was downscaled from the unknown high-resolution (HR) image by a fixed "ideal" downscaling kernel (e.g. Bicubic downscaling). However, this is rarely the case in real LR images, in contrast to synthetically generated SR datasets. When the assumed downscaling kernel deviates from the true one, the performance of SR methods significantly deteriorates. This gave rise to Blind-SR - namely, SR when the downscaling kernel ("SR-kernel") is unknown. It was further shown that the true SR-kernel is the one that maximizes the recurrence of patches across scales of the LR image. In this paper we show how this powerful cross-scale recurrence property can be realized using Deep Internal Learning. We introduce "KernelGAN", an image-specific Internal-GAN, which trains solely on the LR test image at test time, and learns its internal distribution of patches. Its Generator is trained to produce a downscaled version of the LR test image, such that its Discriminator cannot distinguish between the patch distribution of the downscaled image, and the patch distribution of the original LR image. The Generator, once trained, constitutes the downscaling operation with the correct image-specific SR-kernel. KernelGAN is fully unsupervised, requires no training data other than the input image itself, and leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.
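Assuming (for this sketch only) that the trained generator is a stack of stride-1, bias-free, single-channel linear conv layers, the image-specific kernel it implements can be read off as its impulse response; the actual KernelGAN extraction and constraints differ.

```python
import torch

@torch.no_grad()
def effective_kernel(conv_layers, size=33):
    """Pass a delta impulse through a stack of linear conv layers to recover the
    single effective filter they implement (its impulse response)."""
    delta = torch.zeros(1, 1, size, size)
    delta[0, 0, size // 2, size // 2] = 1.0
    out = delta
    for conv in conv_layers:
        out = conv(out)
    return out.squeeze()
```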
Photos and videos are now a main mode of communication, used to tell stories, share experiences and convey ideas. However, common media editing tools are often either too complex to master, or oversimplified and limited.
In this talk I will present my strategy towards the creation of media editing techniques that are easy to learn, yet expressive enough to reflect unique creative objectives. We will mostly discuss one specific domain --- human heads --- which are both extremely common (i.e. people care about people) and technologically challenging. I will present several works on editing video by editing text, perspective manipulation, and in-camera lighting feedback. I will also discuss exciting future opportunities related to neural rendering and digital representations of humans.
BIO
Ohad Fried is a postdoctoral research scholar at Stanford University. His work lies in the intersection of computer graphics, computer vision, and human-computer interaction. He holds a PhD in computer science from Princeton University, and an M.Sc. in computer science and a B.Sc. in computational biology from The Hebrew University. Ohad's research focuses on tools, algorithms, and new paradigms for photo and video editing. Ohad is the recipient of several awards, including a Siebel Scholarship and a Google PhD Fellowship. If you own a cable modem, there's a non-negligible chance that Ohad's code runs within it, so feel free to blame him for your slow internet connection.
www.ohadf.com
Learning of layered or "deep" representations has recently enabled low-cost sensors for autonomous vehicles and efficient automated analysis of visual semantics in online media. But these models have typically required prohibitive amounts of training data, and thus may only work well in the environment they have been trained in. I'll describe recent methods in adversarial adaptive learning that excel when learning across modalities and domains. Further, these models have been unsatisfying in their complexity--with millions of parameters--and their resulting opacity. I'll report approaches which achieve explainable deep learning models, including both introspective approaches that visualize compositional structure in a deep network, and third-person approaches that can provide a natural language justification for the classification decision of a deep model.
Dropout is a simple yet effective regularization technique that has been applied to various machine learning tasks, including linear classification, matrix factorization and deep learning. However, the theoretical properties of dropout as a regularizer remain quite elusive. This talk will present a theoretical analysis of dropout for single hidden-layer linear neural networks. We demonstrate that dropout is a stochastic gradient descent method for minimizing a certain regularized loss. We show that the regularizer induces solutions that are low-rank, in the sense of minimizing the number of neurons. We also show that the global optimum is balanced, in the sense that the products of the norms of the incoming and outgoing weight vectors are equal across all hidden nodes. Finally, we provide a complete characterization of the optimization landscape induced by dropout.
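A small numerical check of the kind of identity behind this analysis (arbitrary dimensions, no 1/p rescaling; the talk's exact formulation may differ): the expected dropout loss of a single hidden-layer linear network equals a scaled squared loss plus an explicit regularizer built from products of per-neuron incoming and outgoing norms.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, h, d_out, n, p = 5, 8, 3, 20, 0.5
X = rng.normal(size=(d_in, n)); Y = rng.normal(size=(d_out, n))
U = rng.normal(size=(d_out, h)); V = rng.normal(size=(h, d_in))

# Monte Carlo estimate of E_m ||Y - U diag(m) V X||_F^2 / n over Bernoulli(p) masks m
mc = np.mean([np.linalg.norm(Y - U @ np.diag(rng.binomial(1, p, h)) @ V @ X) ** 2 / n
              for _ in range(20000)])

# Closed form: ||Y - p U V X||^2 / n + p(1-p)/n * sum_i ||u_i||^2 ||v_i X||^2
closed = (np.linalg.norm(Y - p * U @ V @ X) ** 2 / n
          + p * (1 - p) / n * sum(np.linalg.norm(U[:, i]) ** 2 * np.linalg.norm(V[i] @ X) ** 2
                                  for i in range(h)))
print(mc, closed)  # the two values should agree up to Monte Carlo error
```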
We present a Monte Carlo rendering framework for the physically-accurate simulation of speckle patterns arising from volumetric scattering of coherent waves. These noise-like patterns are characterized by strong statistical properties, such as the so-called memory effect. These properties are at the core of imaging techniques for applications as diverse as tissue imaging, motion tracking, and non-line-of-sight imaging. Our rendering framework can replicate these properties computationally, in a way that is orders of magnitude more efficient than alternatives based on directly solving the wave equations. At the core of our framework is a path-space formulation for the covariance of speckle patterns arising from a scattering volume, which we derive from first principles. We use this formulation to develop two Monte Carlo rendering algorithms, for computing speckle covariance as well as speckle fields directly. While approaches based on wave equation solvers require knowing the microscopic position of wavelength-sized scatterers, our approach takes as input only bulk parameters describing the statistical distribution of these scatterers inside a volume. We validate the accuracy of our framework by comparing against speckle patterns simulated using wave equation solvers, use it to simulate memory effect observations that were previously only possible through lab measurements, and demonstrate its applicability for computational imaging tasks.
Joint work with Chen Bar, Marina Alterman, Ioannis Gkioulekas
The recurring context in which objects appear holds valuable information that can be employed to predict their existence. This intuitive observation indeed led many researchers to endow appearance-based detectors with explicit reasoning about context. The underlying thesis suggests that stronger contextual relations would facilitate greater improvements in detection capacity. In practice, however, the observed improvement in many cases is modest at best, and often only marginal. In this work we seek to improve our understanding of this phenomenon, in part by pursuing an opposite approach. Instead of attempting to improve detection scores by employing context, we treat the utility of context as an optimization problem: to what extent can detection scores be improved by considering context or any other kind of additional information? With this approach we explore the bounds on improvement by using contextual relations between objects and provide a tool for identifying the most helpful ones. We show that simple co-occurrence relations can often provide large gains, while in other cases a significant improvement is simply impossible or impractical with either co-occurrence or more precise spatial relations. To better understand these results we then analyze the ability of context to handle different types of false detections, revealing that tested contextual information cannot ameliorate localization errors, severely limiting its gains. These and additional insights further our understanding of where and why the utilization of context for object detection succeeds and fails.
A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high resolution sampled images. At the same time, many authors have pointed out that GANs may fail to model the full distribution ("mode collapse") and that using the learned models for anything other than generating samples may be very difficult. In this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how GMMs can be used to generate sharp images.
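In the spirit of the evaluation described above (details such as the number of bins and the feature space are placeholders): define bins by clustering real data and compare the fraction of real versus generated samples falling into each bin.

```python
import numpy as np
from sklearn.cluster import KMeans

def bin_proportions(real, generated, n_bins=50, seed=0):
    """Cluster the real samples into n_bins Voronoi cells and return the per-bin
    proportions of real and generated samples; large gaps indicate missing or
    over-represented modes."""
    km = KMeans(n_clusters=n_bins, n_init=10, random_state=seed).fit(real)
    p_real = np.bincount(km.labels_, minlength=n_bins) / len(real)
    p_gen = np.bincount(km.predict(generated), minlength=n_bins) / len(generated)
    return p_real, p_gen
```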
We consider the geo-localization task - finding the pose (position & orientation) of a camera in a large 3D scene from a single image. We aim to experimentally explore the role of geometry in geo-localization in a convolutional neural network (CNN) solution. We do so by ignoring the often available texture of the scene. We therefore deliberately avoid using texture or rich geometric details and use images projected from a simple 3D model of a city, which we term lean images. Lean images contain solely information that relates to the geometry of the area viewed (edges, faces, or relative depth). We find that the network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions of this work are: (i) providing insight into the role of geometry in the CNN learning process; and (ii) demonstrating the power of CNNs for recovering camera pose using lean images.
This is a joint work with Moti Kadosh & Ariel Shamir
This work studies the relationship between the classification performed by deep neural networks (DNNs) and the decision of various classic classifiers, namely k-nearest neighbors (k-NN), support vector machines (SVM), and logistic regression (LR). This is studied at various layers of the network, providing us with new insights on the ability of DNNs to both memorize the training data and generalize to new data at the same time, where k-NN serves as the ideal estimator that perfectly memorizes the data.
First, we show that DNNs' generalization improves gradually along their layers and that memorization of non-generalizing networks happens only at the last layers.
We also observe that, compared to the linear classifiers SVM and LR, the behavior of DNNs is largely the same on the training and test data, regardless of whether the network generalizes. On the other hand, the similarity to k-NN holds only in the absence of overfitting. This suggests that the k-NN behavior of the network on new data is a good sign of generalization. Moreover, this allows us to use existing k-NN theory for DNNs.
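The layer-wise comparison can be sketched as follows (feature extraction and k are placeholders): fit a k-NN classifier on a layer's training features and measure how often it agrees with the network's own predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_agreement(train_feats, train_labels, test_feats, net_test_preds, k=5):
    """Agreement between a k-NN classifier fit on one layer's features and the
    network's own test predictions; high agreement suggests k-NN-like behavior."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_feats, train_labels)
    return float(np.mean(knn.predict(test_feats) == net_test_preds))
```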
Inverse problems appear in many applications, such as image deblurring, inpainting and super-resolution. The common approach to address them is to design a specific algorithm (or recently - a deep neural network) for each problem. The Plug-and-Play (P&P) framework, which has been recently introduced, allows solving general inverse problems by leveraging the impressive capabilities of existing denoising algorithms. While this fresh strategy has found many applications, a burdensome parameter tuning is often required in order to obtain high-quality results. In this work, we propose an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning (and can also translate into fewer pre-trained denoising neural networks). First, we transform a typical cost function, composed of fidelity and prior terms, into a closely related, novel optimization problem. Then, we propose an efficient minimization scheme with a plug-and-play property, i.e., the prior term is handled solely by a denoising operation. Finally, we present an automatic tuning mechanism to set the method's parameters. We provide a theoretical analysis of the method, and empirically demonstrate its impressive results for image inpainting, deblurring and super-resolution. For the latter, we also present an image-adaptive learning approach that further improves the results.
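A generic plug-and-play sketch of handling the prior purely by denoising (this is not the talk's specific scheme; the operators, step size, inner iterations, and noise-level mapping are placeholders):

```python
import numpy as np

def pnp_restore(y, A, At, denoise, beta=1.0, outer=30, inner=5, step=0.1):
    """Half-quadratic-splitting-style loop for min_x 0.5||A x - y||^2 + prior(x):
    alternate a gradient-based data-fidelity step with a denoising step that
    plays the role of the prior. A, At are function handles (forward / adjoint)."""
    x = At(y)
    v = x.copy()
    for _ in range(outer):
        for _ in range(inner):                       # data fidelity + proximity to v
            x = x - step * (At(A(x) - y) + beta * (x - v))
        v = denoise(x, sigma=np.sqrt(1.0 / beta))    # prior handled by a denoiser
    return x
```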
Modern robotic and vision systems are often equipped with a direct 3D data acquisition device, e.g. a LiDAR or RGBD camera, which provides a rich 3D point cloud representation of the surroundings. Point clouds have been used successfully for localization and mapping tasks, but their use in semantic understanding has not been fully explored. Recent advances in deep learning methods for images along with the growing availability of 3D point cloud data have fostered the development of new 3D deep learning methods that use point clouds for semantic understanding. However, their unstructured and unordered nature makes them an unnatural input to deep learning methods. In this work we propose solutions to three semantic understanding and geometric processing tasks: point cloud classification, segmentation, and normal estimation. We first propose a new global representation for point clouds called the 3D Modified Fisher Vector (3DmFV). The representation is structured and independent of order and sample size. As such, it can be used with 3DmFV-Net, a newly designed 3D CNN architecture for classification. The representation introduces a conceptual change for processing point clouds by using a global and structured spatial distribution. We demonstrate the classification performance on the ModelNet40 CAD dataset and the Sydney outdoor dataset obtained by LiDAR. We then extend the architecture to solve a part segmentation task by performing per-point classification. The results here are demonstrated on the ShapeNet dataset. We use the proposed representation to solve a fundamental and practical geometric processing problem of normal estimation using a new 3D CNN (Nesti-Net). To that end, we propose a local multi-scale representation called Multi Scale Point Statistics (MuPS) and show that using structured spatial distributions is also as effective for local multi-scale analysis as for global analysis. We further show that multi-scale data integrates well with a Mixture of Experts (MoE) architecture. The MoE enables the use of semi-supervised scale prediction to determine the appropriate scale for a given local geometry. We evaluate our method on the PCPNet dataset. For all methods we achieved state-of-the-art performance without using an end-to-end learning approach.
In this talk I will present two recent works:
1) Dynamic Temporal Alignment of Speech to Lips
Many speech segments in movies are re-recorded in a studio during post-production, to compensate for poor sound quality as recorded on location. Manual alignment of the newly-recorded speech with the original lip movements is a tedious task. We present an audio-to-video alignment method for automating speech to lips alignment, stretching and compressing the audio signal to match the lip movements. This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation. Using this shared representation we compute the lip-sync error between every short speech period and every video frame, followed by the determination of the optimal corresponding frame for each short sound period over the entire video clip. We demonstrate successful alignment both quantitatively, using a human perception-inspired metric, as well as qualitatively. The strongest advantage of our audio-to-video approach is in cases where the original voice is unclear, and where a constant shift of the sound can not give perfect alignment. In these cases, state-of-the-art methods will fail.
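One standard way to turn per-frame lip-sync errors into a global alignment (an assumption about the procedure, not necessarily the one used in this work) is a dynamic-programming pass over the cost matrix, as in dynamic time warping:

```python
import numpy as np

def dtw_cost(cost):
    """cost: (T_audio, T_video) matrix of lip-sync errors between short speech
    periods and video frames. Returns the minimal accumulated alignment cost."""
    T, V = cost.shape
    acc = np.full((T + 1, V + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, V + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[T, V]
```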
2) Neural separation of observed and unobserved distributions
Separating mixed distributions is a long-standing challenge for machine learning and signal processing. Most current methods either rely on making strong assumptions on the source distributions or rely on having training samples of each source in the mixture. In this work, we introduce a new method - Neural Egg Separation - to tackle the scenario of extracting a signal from an unobserved distribution additively mixed with a signal from an observed distribution. Our method iteratively learns to separate the known distribution from progressively finer estimates of the unknown distribution. In some settings, Neural Egg Separation is sensitive to initialization; we therefore introduce Latent Mixture Masking, which ensures a good initialization. Extensive experiments on audio and image separation tasks show that our method outperforms current methods that use the same level of supervision, and often achieves similar performance to full supervision.
We introduce a 3D shape generative model based on deep neural networks. A new image-like (i.e., tensor) data representation for genus-zero 3D shapes is devised. It is based on the observation that complicated shapes can be well represented by multiple parameterizations (charts), each focusing on a different part of the shape.
The new tensor data representation is used as input to Generative Adversarial Networks for the task of 3D shape generation. The 3D shape tensor representation is based on a multi-chart structure that enjoys a shape covering property and scale-translation rigidity. Scale-translation rigidity facilitates high-quality 3D shape learning and guarantees unique reconstruction. The multi-chart structure takes as input a dataset of 3D shapes (with arbitrary connectivity) and a sparse correspondence between them.
The output of our algorithm is a generative model that learns the shape distribution and is able to generate novel shapes, interpolate shapes, and explore the generated shape space. The effectiveness of the method is demonstrated for the task of anatomic shape generation including human body and bone (teeth) shape generation.
In recent years, deep neural networks (DNNs) achieved unprecedented performance in many vision tasks. However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make DNNs implementable on platforms with limited resources, it is necessary to weaken the tradeoff between performance and efficiency. In this work, we propose a new activation unit, which is suitable for both high and low level vision problems. In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units through experiments with state-of-the-art nets for classification, denoising, and super resolution. With our approach, we are able to reduce the size of these models by nearly 50% without incurring any degradation in performance.
*Spotlight presentation at CVPR'18 (+ submitted extension)
A joint work with Tamar Rott Shaham and Tomer Michaeli
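As a rough illustration of what an activation unit with learnable spatial connections might look like (in contrast to a fixed per-pixel nonlinearity such as ReLU), the sketch below combines a small depthwise convolution with a learnable per-channel soft threshold. This is only a conceptual stand-in, assuming PyTorch; it is not the unit proposed in the talk.

```python
import torch
import torch.nn as nn

class SpatialActivation(nn.Module):
    """Toy activation with learnable spatial support: a depthwise convolution
    followed by a learnable per-channel soft-thresholding nonlinearity.
    A conceptual stand-in, not the unit proposed in the paper."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
        self.threshold = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        y = self.spatial(x)                                           # learnable spatial mixing per channel
        return torch.sign(y) * torch.relu(y.abs() - self.threshold)   # learnable nonlinearity

x = torch.randn(2, 16, 32, 32)
print(SpatialActivation(16)(x).shape)   # torch.Size([2, 16, 32, 32])
```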
We tackle the problem of multiple image alignment and 3D reconstruction under extreme noise.
Modern alignment schemes, based on similarity measures, feature matching and optical flow, are often pairwise, or assume global alignment. Under extreme noise, however, alignment quality deteriorates sharply, since no single image contains enough information. Yet, when multiple images are well aligned, the signal emerges from the stack.
As the problems of alignment and 3D reconstruction are coupled, we constrain the solution by taking into account only alignments that are geometrically feasible and solve for the entire stack simultaneously. The solution is formulated as a variational problem where only a single variable per pixel, the scene's distance, is solved for. Thus, the complexity of the algorithm is independent of the number of images. Our method outperforms state-of-the-art techniques, as indicated by our simulations and by experiments on real-world scenes.
Finally, I will discuss how this algorithm relates to our current and planned work with underwater robots.
The common practice in image generation is to train a network and fix it to a specific working point. In contrast, in this work we would like to provide control at inference time in a generic way. We propose to add a second training phase, in which we train additional tuning blocks. The goal of these tuning blocks is to provide control over the output image by modifying the latent space representation. I will first present our latest attempt in the domain of generative adversarial networks, where we aim to improve the quality of the results using the discriminator's information; we name this the adversarial feedback loop. Second, I will present Dynamic-Net, where we train a single network such that it can emulate many networks trained with different objectives. The main assumption of both works is that we can learn how to change the latent space in order to achieve a specific goal. Evaluation on many applications shows that the latent space can indeed be manipulated in a way that provides diverse control at inference time.
* Joint work with Firas Shama*, Alon Shoshan* and Lihi Zelnik Manor.
Imbalanced data is problematic for many machine learning applications. Some classifiers, for example, can be strongly biased due to disparities in class populations. In unsupervised learning, imbalanced sampling of ground truth clusters can lead to distortions of the learned clusters. Removing points (undersampling) is a simple solution, but this strategy leads to information loss and can decrease generalization performance. We instead focus on data generation to artificially balance the data.
In this talk, we present a novel data synthesis method, which we call SUGAR (Synthesis Using Geometrically Aligned Random-walks) [1], for generating data in its original feature space while following its intrinsic geometry. This geometry is inferred by a diffusion kernel [2] that captures a data-driven manifold and reveals underlying structure in the full range of the data space -- including undersampled regions that can be augmented by new synthesized data.
We demonstrate the potential advantages of the approach by improving both classification and clustering performance on numerous datasets. Finally, we show that this approach is especially useful in biology, where rare subpopulations and gene-interaction relationships are affected by biased sampling.
[1] Lindenbaum, Ofir, Jay S. Stanley III, Guy Wolf, and Smita Krishnaswamy. "Geometry-Based Data Generation." NIPS (2018).
[2] Coifman, Ronald R., and Stéphane Lafon. "Diffusion maps." Applied and computational harmonic analysis (2006).
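A rough sketch of the density-equalizing idea: estimate local density with a Gaussian kernel, draw more seed points where density is low, and sample each new point from a Gaussian fitted to the seed's neighborhood so that synthesized data follows the local geometry. The actual SUGAR algorithm uses the diffusion kernel of [2] and a principled random-walk construction; the function below is only an illustration and its names are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def generate_balancing_points(X, n_new=200, k=10, bandwidth=None):
    """Synthesize points preferentially in sparsely sampled regions of X.

    X : (N, d) data matrix.  Returns (n_new, d) synthetic points.
    A rough illustration of density-equalizing generation, not the full SUGAR algorithm.
    """
    N, d = X.shape
    nbrs = NearestNeighbors(n_neighbors=k).fit(X)
    dist, idx = nbrs.kneighbors(X)
    bandwidth = bandwidth or np.median(dist[:, -1])
    density = np.exp(-(dist ** 2) / (2 * bandwidth ** 2)).sum(axis=1)   # kernel density estimate
    prob = 1.0 / density                                                # low density -> high sampling probability
    prob /= prob.sum()
    seeds = np.random.choice(N, size=n_new, p=prob)
    new_points = np.empty((n_new, d))
    for t, s in enumerate(seeds):
        neigh = X[idx[s]]                                               # local neighborhood of the seed point
        mu, cov = neigh.mean(0), np.cov(neigh.T) + 1e-6 * np.eye(d)
        new_points[t] = np.random.multivariate_normal(mu, cov)          # follow the local geometry
    return new_points

X = np.random.randn(500, 2)
X_aug = np.vstack([X, generate_balancing_points(X)])   # augmented, more balanced dataset
```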
The goal of style transfer algorithms is to render the content of one image using the style of another. We propose Style Transfer by Relaxed Optimal Transport and Self-Similarity (STROTSS), a new optimization-based style transfer algorithm. We extend our method to allow user-specified point-to-point or region-to-region control over visual similarity between the style image and the output. Such guidance can be used to either achieve a particular visual effect or correct errors made by unconstrained style transfer. In order to quantitatively compare our method to prior work, we conduct a large-scale user study designed to assess the style-content tradeoff across settings in style transfer algorithms. Our results indicate that for any desired level of content preservation, our method provides higher quality stylization than prior work.
Joint work with Nick Kolkin and Jason Salavon.
This talk will describe my past and ongoing work on translating images and words between very different datasets without supervision. Humans often do not require supervision to make connections between very different sources of data, but this is still difficult for machines. Recently, great progress was made using adversarial training, a powerful yet tricky method. Although adversarial methods have had great success, they have critical failings which significantly limit their breadth of applicability and motivate research into alternative, non-adversarial methods. In this talk, I will describe novel non-adversarial methods for unsupervised word translation and for translating images between very different datasets (joint work with Lior Wolf). As image generation models are an important component in our method, I will present a non-adversarial image generation approach, which is often better than current adversarial approaches (joint work with Jitendra Malik).
We all capture the world around us through digital data such as images, videos and sound. However, in many cases, we are interested in certain properties of the data that are either not available or difficult to perceive directly from the input signal. My goal is to "Re-render Reality", i.e., develop algorithms that analyze digital signals and then create a new version of it that allows us to see and hear better. In this talk, I'll present a variety of methodologies aimed at enhancing the way we perceive our world through modified, re-rendered output. These works combine ideas from signal processing, optimization, computer graphics, and machine learning, and address a wide range of applications. More specifically, I'll demonstrate how we can automatically reveal subtle geometric imperfection in images, visualize human motion in 3D, and use visual signals to help us separate and mute interference sound in a video. Finally, I'll discuss some of my future directions and work in progress.
BIO: Tali is a Senior Research Scientist at Google, Cambridge, developing algorithms at the intersection of computer vision and computer graphics. Before Google, she was a Postdoctoral Associate at the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, working with Prof. William T. Freeman. Tali completed her Ph.D. studies at the school of electrical engineering, Tel-Aviv University, Israel, under the supervision of Prof. Shai Avidan and Prof. Yael Moses. Her research interests include computational photography, image synthesis, geometry and 3D reconstruction.
Visual repetitions are abundant in our surrounding physical world: small image patches tend to reoccur within a natural image, and across different rescaled versions thereof. Similarly, semantic repetitions appear naturally inside an object class within image datasets, as a result of different views and scales of the same object. In my thesis I studied deviations from these expected repetitions, and demonstrated how this information can be exploited to tackle both low-level and high-level vision tasks. These include “blind” image reconstruction tasks (e.g. dehazing, deblurring), image classification confidence estimation, and more.
Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, we see these failures to generalize more frequently in more modern networks. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets makes it unlikely that CNNs will learn to be invariant to these transformations. Taken together our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.
Joint work with Aharon Azulay
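The effect is easy to probe in practice: shift an image by a few pixels and watch a pretrained classifier's confidence change. A minimal sketch, assuming a recent torchvision with bundled ImageNet weights and using a random tensor as a stand-in for a real photograph; the paper's evaluation protocol is far more careful.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.IMAGENET1K_V2
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()

img = torch.rand(3, 256, 256)                       # stand-in for a real test image
top_probs = []
for dx in range(8):                                 # translate horizontally by a few pixels
    shifted = torch.roll(img, shifts=dx, dims=2)
    with torch.no_grad():
        p = model(preprocess(shifted).unsqueeze(0)).softmax(-1)
    top_probs.append(p.max().item())
print(["%.3f" % p for p in top_probs])              # the top probability can fluctuate sharply with dx
```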
In typical imaging systems, an image/video is first acquired, then compressed for transmission or storage, and eventually presented to human observers using different and often imperfect display devices. While the resulting quality of the perceived output image may be severely affected by the acquisition and display processes, these degradations are usually ignored in the compression stage, leading to overall sub-optimal system performance. In this work we propose a compression methodology to optimize the system's end-to-end reconstruction error with respect to the compression bit-cost. Using the alternating direction method of multipliers (ADMM) technique, we show that the design of the new globally-optimized compression reduces to a standard compression of a "system adjusted" signal. Essentially, we propose a new practical framework for the information-theoretic problem of remote source coding. The main ideas of our method are further explained using rate-distortion theory for Gaussian signals. We experimentally demonstrate our framework for image and video compression using the state-of-the-art HEVC standard, adjusted to several system layouts including acquisition and rendering models. The experiments establish our method as the best approach for optimizing system performance at high bit-rates from the compression standpoint.
In addition, we relate the proposed approach also to signal restoration using complexity regularization, where the likelihood of candidate solutions is evaluated based on their compression bit-costs.
Using our ADMM-based approach, we present new restoration methods relying on repeated applications of standard compression techniques. Thus, we restore signals by leveraging state-of-the-art models designed for compression. The presented experiments show good results for image deblurring and inpainting using the JPEG2000 and HEVC compression standards.
* Joint work with Prof. Alfred Bruckstein and Prof. Michael Elad.
** More details about the speaker and his research work are available at http://ydar.cswp.cs.technion.ac.il/
In this talk I explore various limitations of sensors and displays, and suggest new ways to overcome them.
These include:
- The limited 2D displays of today's screens - I will show how we can design new displays that enable us to see in 3D *without* any glasses.
- The limited spatial resolution of images - I will discuss the crucial factors for successful Super-Resolution.
- The poor image quality caused by motion blur (due to camera motion or scene motion) - I will present a new approach for Blind Non-Uniform Deblurring.
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Joint work with Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, Bill Freeman and Miki Rubinstein of Google Research.
Information transfer between triangle meshes is of great importance in computer graphics and geometry processing. To facilitate this process, a smooth and accurate map is typically required between the two meshes. While such maps can sometimes be computed between nearly-isometric meshes, the more general case of meshes with diverse geometries remains challenging. This talk describes a novel approach for direct map computation between triangle meshes without mapping to an intermediate domain, which optimizes for the harmonicity and reversibility of the forward and backward maps. Our method is general both in the information it can receive as input, e.g. point landmarks, a dense map or a functional map, and in the diversity of the geometries to which it can be applied. We demonstrate that our maps exhibit lower conformal distortion than the state-of-the-art, while succeeding in correctly mapping key features of the input shapes.
Artificial illumination plays a vital role in human civilization. In computer vision, artificial light is extensively used to recover 3D shape, reflectance, and further scene information. However, in most computer vision applications using artificial light, some additional structure is added to the illumination to facilitate the task at hand. In this work, we show that the ubiquitous alternating current (AC) lights already have a valuable inherent structure that stems from bulb flicker. By passively sensing scene flicker, we reveal new scene information which includes: the type of bulbs in the scene, the phases of the electric grid up to city scale, and the light transport matrix. This information yields unmixing of reflections and semi-reflections, nocturnal high dynamic range, and scene rendering with bulbs not observed during acquisition. Moreover, we provide methods that enable capturing scene flicker using almost any off-the-shelf camera, including smartphones.
In underwater imaging, similar structures are added to illumination sources to overcome the limited imaging range. In this setting, we show that by optimizing camera and light positions while taking the effect of light propagation in scattering media into account we achieve superior imaging of underwater scenes while using simple, unstructured illumination.
Fast and robust three-dimensional reconstruction of facial geometric structure from a single image is a challenging task with numerous applications in computer vision and graphics. We propose to leverage the power of convolutional neural networks (CNNs) to produce highly detailed face reconstruction directly from a single image. For this purpose, we introduce an end-to-end CNN framework which constructs the shape in a coarse-to-fine fashion. The proposed architecture is composed of two main blocks, a network which recovers the coarse facial geometry (CoarseNet), followed by a CNN which refines the facial features of that geometry (FineNet). To alleviate the lack of training data for face reconstruction, we train our model using synthetic data as well as unlabeled facial images collected from the internet. The proposed model successfully recovers facial geometries from real images, even for faces with extreme expressions and under varying lighting conditions. In this talk, I will summarize three papers that were published at 3DV 2016, CVPR 2017 (as an oral presentation), and ICCV 2017.
Bio: Matan Sela holds a Ph.D. in Computer Science from the Technion - Israel Institute of Technology. He received B.Sc. and M.Sc. degrees (both with honors) in electrical engineering, both from the Technion - Israel Institute of Technology, in 2012 and 2015, respectively. During summer 2017, he was a research intern at Google, Mountain View, California. His interests are Machine Learning, Computer Vision, Computer Graphics, Geometry Processing and any combination thereof.
Recent work has shown impressive success in automatically creating new images with desired properties such as transferring painterly style, modifying facial expressions or manipulating the center of attention of the image. In this talk I will discuss two of the standing challenges in image synthesis and how we tackle them:
- I will describe our efforts in making the synthesized images more photo-realistic.
- I will further show how we can broaden the scope of data that can be used for training synthesis networks, and with that provide a solution to new applications.
Recently, there has been increasing interest in leveraging the capabilities of neural networks to analyze data. In particular, new clustering methods that employ deep embeddings have been presented.
In this talk, we depart from centroid-based models and suggest a new framework, called Clustering-driven deep embedding with PAirwise Constraints (CPAC), for non-parametric clustering using a neural network. We present a clustering-driven embedding based on a Siamese network that encourages pairs of data points to output similar representations in the latent space. Our pair-based model allows augmenting the information with labeled pairs to constitute a semi-supervised framework. Our approach is based on analyzing the losses associated with each pair to refine the set of constraints.
We show that clustering performance increases when using this scheme, even with a limited amount of user queries.
We present state-of-the-art results on different types of datasets and compare our performance to parametric and non-parametric techniques.
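A minimal sketch of the pair-based ingredient: a Siamese encoder trained with a contrastive-style loss that pulls must-link pairs together and pushes cannot-link pairs apart in the latent space. The architecture, loss form, and names below are illustrative assumptions and not the exact CPAC formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, in_dim, latent_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # embed on the unit sphere

def pairwise_constraint_loss(z1, z2, same, margin=0.5):
    """same = 1 for must-link pairs, 0 for cannot-link pairs."""
    d = (z1 - z2).pow(2).sum(-1)
    return (same * d + (1 - same) * F.relu(margin - d)).mean()

enc = Encoder(in_dim=32)
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
x1, x2 = torch.randn(128, 32), torch.randn(128, 32)      # toy pairs of data points
same = torch.randint(0, 2, (128,)).float()                # toy constraints (would come from user queries)
loss = pairwise_constraint_loss(enc(x1), enc(x2), same)
loss.backward(); opt.step()
```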
The harnessing of modern computational abilities for many-body wave-function representations is naturally placed as a prominent avenue in contemporary condensed matter physics. Specifically, highly expressive computational schemes that are able to efficiently represent the entanglement properties which characterize many-particle quantum systems are of interest. In the seemingly unrelated field of machine learning, deep network architectures have exhibited an unprecedented ability to tractably encompass the convoluted dependencies which characterize hard learning tasks such as image classification or speech recognition. However, theory is still lagging behind these rapid empirical advancements, and key questions regarding deep learning architecture design have no adequate answers. In the presented work, we establish a Tensor Network (TN) based common language between the two disciplines, which allows us to offer bidirectional contributions. By showing that many-body wave-functions are structurally equivalent to mappings of convolutional and recurrent arithmetic circuits, we construct their TN descriptions in the form of Tree and Matrix Product State TNs, and bring forth quantum entanglement measures as natural quantifiers of dependencies modeled by such networks. Accordingly, we propose a novel entanglement based deep learning design scheme that sheds light on the success of popular architectural choices made by deep learning practitioners, and suggests new practical prescriptions. Specifically, our analysis provides prescriptions regarding connectivity (pooling geometry) and parameter allocation (layer widths) in deep convolutional networks, and allows us to establish a first of its kind theoretical assertion for the exponential enhancement in long term memory brought forth by depth in recurrent networks. In the other direction, we identify that an inherent re-use of information in state-of-the-art deep learning architectures is a key trait that distinguishes them from TN based representations. Therefore, we suggest a new TN manifestation of information re-use, which enables TN constructs of powerful architectures such as deep recurrent networks and overlapping convolutional networks. This allows us to theoretically demonstrate that the entanglement scaling supported by state-of-the-art deep learning architectures can surpass that of commonly used expressive TNs in one dimension, and can support volume law entanglement scaling in two dimensions with an amount of parameters that is a square root of that required by Restricted Boltzmann Machines. We thus provide theoretical motivation to shift trending neural-network based wave-function representations closer to state-of-the-art deep learning architectures.
Image restoration algorithms are typically evaluated by some distortion measure (e.g. PSNR, SSIM, IFC, VIF) or by human opinion scores that quantify perceptual quality. In this work, we prove mathematically that distortion and perceptual quality are at odds with each other. Specifically, we study the optimal probability for correctly discriminating the outputs of an image restoration algorithm from real images. We show that as the mean distortion decreases, this probability must increase (indicating worse perceptual quality). Contrary to common belief, this result holds true for any distortion measure, and is not only a problem of the PSNR or SSIM criteria. However, as we show experimentally, for some measures it is less severe (e.g. distance between VGG features). We also show that generative adversarial networks (GANs) provide a principled way to approach the perception-distortion bound. This constitutes theoretical support for their observed success in low-level vision tasks. Based on our analysis, we propose a new methodology for evaluating image restoration methods, and use it to perform an extensive comparison between recent super-resolution algorithms. Our study reveals which methods are currently closest to the theoretical perception-distortion bound.
* Joint work with Yochai Blau.
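A toy numerical illustration of the tension (not the paper's general proof): for a Gaussian signal in Gaussian noise, the MMSE estimate minimizes distortion but shrinks the output variance, so its distribution is easily distinguished from that of real signals, whereas an estimate that matches the signal distribution necessarily incurs a higher MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 1.0
x = rng.normal(0, 1, n)                            # "real" signals, x ~ N(0, 1)
y = x + rng.normal(0, sigma, n)                    # noisy observations

x_mmse = y / (1 + sigma**2)                        # MMSE estimate: minimal distortion, shrunken variance
x_matched = y / np.sqrt(1 + sigma**2)              # variance-matched estimate: right distribution, worse MSE

for name, est in [("MMSE", x_mmse), ("distribution-matched", x_matched)]:
    mse = np.mean((est - x) ** 2)
    print(f"{name:>22}:  MSE = {mse:.3f},  output variance = {est.var():.3f} (real signals have variance 1.0)")
```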
Scattering effects in images, including those related to haze, fog, and appearance of clouds, are fundamentally dictated by microphysical characteristics of the scatterers. We define and derive recovery of these characteristics, in a three-dimensional heterogeneous medium. Recovery is based on a novel tomography approach. Multiview (multi-angular) and multi-spectral data are linked to the underlying microphysics using 3D radiative transfer, accounting for multiple-scattering. Despite the nonlinearity of the tomography model, inversion is enabled using a few approximations that we describe. As a case study, we focus on passive remote sensing of the atmosphere, where scatterer retrieval can benefit modeling and forecasting of weather, climate, and pollution.
Variational problems in geometry, fabrication, learning, and related applications lead to structured numerical optimization problems for which generic algorithms exhibit inefficient or unstable performance. Rather than viewing the particularities of these problems as barriers to scalability, in this talk we embrace them as added structure that can be leveraged to design large-scale and efficient techniques specialized to applications with geometrically structured variables. We explore this theme through the lens of case studies in surface parameterization, optimal transport, and multi-objective design.
Multi-finger caging offers a robust approach to object grasping. To securely grasp an object, the fingers are first placed in caging regions surrounding a desired immobilizing grasp. This prevents the object from escaping the hand, and allows for great position uncertainty of the fingers relative to the object. The hand is then closed until the desired immobilizing grasp is reached.
While efficient computation of two-finger caging grasps for polygonal objects is well developed, the computation of three-finger caging grasps has remained a challenging open problem. We will discuss the caging of polygonal objects using three-finger hands that maintain similar triangle finger formations during the grasping process. While the configuration space of such hands is four dimensional, their contact space which represents all two and three finger contacts along the grasped object's boundary forms a two-dimensional stratified manifold.
We will present a caging graph that can be constructed in the hand's relatively simple contact space. Starting from a desired immobilizing grasp of the object by a specific triangular finger formation, the caging graph is searched for the largest formation scale value that ensures a three-finger cage about the object. This value determines the caging regions, and if the formation scale is kept below this value, any finger placement within the caging regions will guarantee a robust object grasping.
In unsupervised domain mapping, the learner is given two unmatched datasets A and B. The goal is to learn a mapping G_AB that translates a sample in A to the analogous sample in B. Recent approaches have shown that when both G_AB and the inverse mapping G_BA are learned simultaneously, convincing mappings are obtained. In this work, we present a method for learning G_AB without learning G_BA. This is done by learning a mapping that maintains the distance between a pair of samples. Moreover, good mappings are obtained even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results showing that the new method not only allows one-sided mapping learning, but also leads to preferable numerical results over the existing circularity-based constraint.
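A minimal sketch of the distance-preservation idea: for a batch of samples from A, penalize the discrepancy between their pairwise distances (normalized per domain) and the pairwise distances of their images under G_AB. The exact normalization, the pairing with an adversarial loss, and the variant that compares parts of the same sample follow the paper; the code below is only an illustration with made-up names, assuming PyTorch.

```python
import torch

def distance_preservation_loss(x_a, g_of_x_a, eps=1e-8):
    """x_a: batch of samples from domain A; g_of_x_a: G_AB(x_a), the mapped batch.
    Encourages |x_i - x_j| in A to be proportional to |G(x_i) - G(x_j)| in B."""
    def pairwise(z):
        z = z.flatten(1)
        d = torch.cdist(z, z, p=2)
        return d / (d.mean() + eps)                # normalize so the two domains are comparable
    return (pairwise(x_a) - pairwise(g_of_x_a)).abs().mean()

x_a = torch.randn(16, 3, 64, 64)                   # toy batch from domain A
g_x = torch.randn(16, 3, 64, 64)                   # stand-in for G_AB(x_a)
print(distance_preservation_loss(x_a, g_x))
```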
In recent years, robots have played an active role in everyday life: medical robots assist in complex surgeries, low-cost commercial robots clean houses and fleets of robots are used to efficiently manage warehouses. A key challenge in these systems is motion planning, where we are interested in planning a collision-free path for a robot in an environment cluttered with obstacles. While the general problem has been studied for several decades now, these new applications introduce an abundance of new challenges.
In this talk I will describe some of these challenges as well as algorithms developed to address them. I will overview general challenges such as compression and graph-search algorithms in the context of motion planning. I will show why traditional Computer Science tools are ill-suited for these problems and introduce alternative algorithms that leverage the unique characteristics of robot motion planning. In addition, I will describe domain-specific challenges such as those that arise when planning for assistive robots and for humanoid robots, and overview algorithms tailored for these specific domains.
Image captioning -- automatic production of text describing a visual scene -- has received a lot of attention recently. However, the objective of captioning, evaluation metrics, and the training protocol remain somewhat unsettled. The general goal seems to be for machines to describe visual signals like humans do. We pursue this goal by incorporating a discriminability loss in training caption generators. This loss is explicitly "aware" of the need for the caption to convey information, rather than merely appear fluent or reflect word distribution in the human captions. Specifically, the loss in our work is tied to discriminative tasks: producing a referring expression (text that allows a recipient to identify a region in the given image) or producing a discriminative caption which allows the recipient to identify an image within a set of images. In both projects, use of the discriminability loss does not require any additional human annotations, and relies on collaborative training between the caption generator and a comprehension model, which is a proxy for a human recipient. In experiments on standard benchmarks, we show that adding discriminability objectives not only improves the discriminative quality of the generated image captions, but, perhaps surprisingly, also makes the captions better under a variety of traditional metrics.
Joint work with Ruotian Luo (TTIC), Brian Price and Scott Cohen (Adobe).
Deep Learning has led to a dramatic leap in Super-Resolution (SR) performance in the past few years. However, being supervised, these SR methods are restricted to specific training data, where the acquisition of the low-resolution (LR) images from their high-resolution (HR) counterparts is predetermined (e.g., bicubic downscaling), without any distracting artifacts (e.g., sensor noise, image compression, non-ideal PSF, etc.). Real LR images, however, rarely obey these restrictions, resulting in poor SR results by SotA (State of the Art) methods. In this paper we introduce "Zero-Shot" SR, which exploits the power of Deep Learning, but does not rely on prior training. We exploit the internal recurrence of information inside a single image, and train a small image-specific CNN at test time, on examples extracted solely from the input image itself. As such, it can adapt itself to different settings per image. This allows it to perform SR on real old photos, noisy images, biological data, and other images where the acquisition process is unknown or non-ideal. On such images, our method outperforms SotA CNN-based SR methods, as well as previous unsupervised SR methods. To the best of our knowledge, this is the first unsupervised CNN-based SR method.
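A minimal sketch of the zero-shot recipe: build LR-HR training pairs from the test image itself by downscaling it further, train a tiny image-specific CNN on those pairs, and then apply it to the test image. Random crops, augmentations, kernel estimation, and the gradual SR scheme of the actual method are omitted; this is an illustration in PyTorch, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySRNet(nn.Module):
    """A small image-specific CNN that predicts a residual over bicubic upsampling."""
    def __init__(self, ch=64, depth=5):
        super().__init__()
        layers = [nn.Conv2d(3, ch, 3, padding=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(ch, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, lr, scale):
        up = F.interpolate(lr, scale_factor=scale, mode="bicubic", align_corners=False)
        return up + self.body(up)

scale = 2
test_img = torch.rand(1, 3, 128, 128)              # the only image available; no external training set
net = TinySRNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Train on LR-HR pairs derived from the test image itself (augmentations omitted).
lr_child = F.interpolate(test_img, scale_factor=1 / scale, mode="bicubic", align_corners=False)
for step in range(200):
    loss = F.l1_loss(net(lr_child, scale), test_img)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    sr = net(test_img, scale)                      # finally, super-resolve the actual test image
```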
We present a general approach to video understanding, inspired by semantic transfer techniques that have been successfully used for 2D image analysis. Our method considers a video to be a 1D sequence of clips, each one associated with its own semantics. The nature of these semantics -- natural language captions or other labels -- depends on the task at hand. A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video. We describe two matching methods, both designed to ensure that (a) reference clips appear similar to test clips and (b), taken together, the semantics of the selected reference clips is consistent and maintains temporal coherence. We use our method for video captioning on the LSMDC'16 benchmark, video summarization on the SumMe and TVSum benchmarks, Temporal Action Detection on the Thumos2014 benchmark, and sound prediction on the Greatest Hits benchmark. Our method not only surpasses the state of the art, in four out of five benchmarks, but importantly, it is the only single method we know of that was successfully applied to such a diverse range of tasks.
Digital geometry processing (DGP) is one of the core topics of computer graphics, and has been an active line of research for over two decades. On one hand, the field introduces theoretical studies in topics such as vector-field design, preservative maps and deformation theory. On the other hand, the tools and algorithms developed by this community are applicable in fields ranging from computer-aided design, to multimedia, to computational biology and medical imaging. Throughout my work, I have sought to bridge the gap between the theoretical aspects of DGP and their applications. In this talk, I will demonstrate how DGP concepts can be leveraged to facilitate real-life applications with the right adaptation. More specifically, I will show how I have employed deformation theory to support problems in animation and augmented reality. I will share my thoughts and first steps taken to enlist DGP to the aid of machine learning, and, perhaps most excitingly, I will discuss my own and the graphics community's contributions to the computational fabrication field, as well as my vision for its future.
Bio: Dr. Amit H. Bermano is a Postdoctoral Researcher at the Princeton Graphics Group, hosted by Professor Szymon Rusinkiewicz and Professor Thomas Funkhouser. Previously, he was a postdoctoral researcher at Disney Research Zurich in the computational materials group, led by Dr. Bernhard Thomaszewski. He conducted his doctoral studies at ETH Zurich under the supervision of Prof. Dr. Markus Gross, in collaboration with Disney Research Zurich. His Master's and Bachelor's degrees were obtained at The Technion - Israel Institute of Technology under the supervision of Prof. Craig Gotsman. His research focuses on connecting the geometry processing field with other fields in computer graphics and vision, mainly by using geometric methods to facilitate other applications. His interests in this context include computational fabrication, animation, augmented reality, medical imaging and machine learning.
Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments, using a single microphone. For example, video conferences from home or office are disturbed by other voices, TV reporting from city streets is mixed with traffic noise, etc. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. Face motions captured in the video are used to estimate the speaker's voice, which is applied as a filter on the input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model.
In the first part of this talk, I will describe a few techniques to predict speech signals by a silent video of a speaking person. In the second part of the talk, I will propose a method to separate overlapping speech of several people speaking simultaneously (known as the cocktail-party problem), based on the speech predictions generated by video-to-speech system.
When dealing with a highly variable problem such as action recognition, focusing on a small but highly informative area, such as the hand region, makes the problem more manageable and lets us invest a relatively large amount of interpretation resources where they matter most. In order to detect this region of interest and properly analyze it, I have built a pipeline with several steps. It starts with a state-of-the-art hand detector that combines detection of the hand by appearance with estimation of the human body pose. The hand detector is built upon a fully convolutional neural network, detecting hands efficiently and accurately. The body-pose estimation starts with a state-of-the-art head detector and continues with a novel approach in which each location in the image votes for the position of each body keypoint, utilizing information from the whole image. Using dense, multi-target votes enables us to compute image-dependent joint keypoint probabilities by consensus voting and to accurately estimate the body pose. Once the hands are detected, an additional segmentation step labels each hand pixel using a dense fully convolutional network. Finally, a further step segments and identifies the held object. Understanding the hand-object interaction is an important step toward understanding the action taking place in the image. Together, these steps enable fine interpretation of hand-object interaction images as an essential step towards understanding human-object interaction and recognizing human activities.
Recent progress in imaging technologies leads to a continuous growth in biomedical data, which can provide better insight into important clinical and biological questions. Advanced machine learning techniques, such as artificial neural networks, are brought to bear on addressing fundamental medical image computing challenges such as segmentation, classification and reconstruction, required for meaningful analysis of the data. Nevertheless, the main bottleneck, which is the lack of annotated examples or 'ground truth' to be used for training, still remains.
In my talk, I will give a brief overview of some biomedical image analysis problems we aim to address, and suggest how prior information about the problem at hand can be utilized to compensate for insufficient, or even absent, ground-truth data. I will then present a framework based on deep neural networks for the denoising of dynamic contrast-enhanced MRI (DCE-MRI) sequences of the brain. DCE-MRI is an imaging protocol in which MRI scans are acquired repetitively throughout the injection of a contrast agent; it is mainly used for quantitative assessment of blood-brain barrier (BBB) permeability. BBB dysfunctionality is associated with numerous brain pathologies including stroke, tumor, traumatic brain injury, and epilepsy. Existing techniques for DCE-MRI analysis are error-prone, as the dynamic scans are subject to non-white, spatially-dependent and anisotropic noise. To address DCE-MRI denoising challenges we use an ensemble of expert DNNs constructed as deep autoencoders, where each is trained on a specific subset of the input space to accommodate different noise characteristics and dynamic patterns. Since clean DCE-MRI sequences (ground truth) for training are not available, we present a sampling scheme for generating realistic training sets with nonlinear dynamics that faithfully model clean DCE-MRI data and account for spatial similarities. The proposed approach has been successfully applied to full and even temporally down-sampled DCE-MRI sequences from two different databases, of stroke and brain tumor patients, and is shown to compare favorably to state-of-the-art denoising methods.
The recent success of convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to achieve similar success for geometric tasks. One of the main challenges in applying CNNs to surfaces is defining a natural convolution operator on surfaces. In this paper we present a method for applying deep learning to sphere-type shapes using a global seamless parameterization to a planar flat-torus, for which the convolution operator is well defined. As a result, the standard deep learning framework can be readily applied for learning semantic, high-level properties of the shape. An indication of our success in bridging the gap between images and surfaces is the fact that our algorithm succeeds in learning semantic information from an input of raw low-dimensional feature vectors.
We demonstrate the usefulness of our approach by presenting two applications: human body segmentation, and automatic landmark detection on anatomical surfaces. We show that our algorithm compares favorably with competing geometric deep-learning algorithms for segmentation tasks, and is able to produce meaningful correspondences on anatomical surfaces where hand-crafted features are bound to fail.
Joint work with: Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G. Kim and Yaron Lipman.
Image denoising is the most fundamental problem in image enhancement, and it is largely solved: It has reached impressive heights in performance and quality -- almost as good as it can ever get. But interestingly, it turns out that we can solve many other problems using the image denoising "engine". I will describe the Regularization by Denoising (RED) framework: using the denoising engine in defining the regularization of any inverse problem. The idea is to define an explicit image-adaptive regularization functional directly using a high performance denoiser. Surprisingly, the resulting regularizer is guaranteed to be convex, and the overall objective functional is explicit, clear and well-defined. With complete flexibility to choose the iterative optimization procedure for minimizing this functional, RED is capable of incorporating any image denoising algorithm as a regularizer, treat general inverse problems very effectively, and is guaranteed to converge to the globally optimal result.
* Joint work with Peyman Milanfar (Google Research) and Yaniv Romano (EE-Technion).
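A minimal sketch of RED-style steepest descent for a linear inverse problem y = A x + noise: under the conditions stated in the paper, the gradient of the regularizer rho(x) = 0.5 * x^T (x - D(x)) is simply x - D(x), so each iteration only requires one call to the denoising engine. Here a Gaussian blur from SciPy stands in for the "high performance denoiser", and the operator, step size, and names are toy choices for illustration only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def denoiser(x):
    """Stand-in denoising engine; any stronger denoiser can be plugged in here."""
    return gaussian_filter(x, sigma=1.0)

def red_steepest_descent(y, A, At, lam=0.2, mu=0.5, iters=200):
    """Minimize 0.5*||A x - y||^2 + lam * 0.5 * x^T (x - D(x)) by explicit gradient steps."""
    x = At(y)
    for _ in range(iters):
        grad = At(A(x) - y) + lam * (x - denoiser(x))   # RED regularizer gradient: x - D(x)
        x = x - mu * grad
    return x

# Toy deblurring example: the forward operator A is itself a Gaussian blur,
# whose adjoint At equals A because the kernel is symmetric.
blur = lambda z: gaussian_filter(z, sigma=2.0)
clean = np.zeros((64, 64)); clean[24:40, 24:40] = 1.0
y = blur(clean) + 0.01 * np.random.randn(64, 64)
x_hat = red_steepest_descent(y, A=blur, At=blur)
```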
We present a novel approach to template matching that is efficient, can handle partial occlusions, and is equipped with provable performance guarantees. A key component of the method is a reduction that transforms the problem of searching a nearest neighbor among N high-dimensional vectors, to searching neighbors among two sets of order sqrt(N) vectors, which can be done efficiently using range search techniques. This allows for a quadratic improvement in search complexity, that makes the method scalable when large search spaces are involved.
For handling partial occlusions, we develop a hashing scheme based on consensus set maximization within the range search component. The resulting scheme can be seen as a randomized hypothesize-and-test algorithm, that comes with guarantees regarding the number of iterations required for obtaining an optimal solution with high probability.
The predicted matching rates are validated empirically and the proposed algorithm shows a significant improvement over the state-of-the-art in both speed and robustness to occlusions.
Joint work with Stefano Soatto.
We study the ecological use of analogies in AI. Specifically, we address the problem of transferring a sample in one domain to an analogous sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given representation function f, which accepts inputs in either domain, would remain unchanged. Other than f, the training data is unsupervised and consists of a set of samples from each domain, without any mapping between them. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f-preserving component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains including digits and face images and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity.
Joint work with Yaniv Taigman and Adam Polyak
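A schematic sketch of how such a compound loss can be assembled, assuming a feature extractor f, a decoder g (so the transfer function applies g to f(x)), and a ternary discriminator D that distinguishes mapped source samples, mapped target samples, and real target samples. The class indexing, weights, and toy module shapes below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def dtn_generator_loss(G, D, f, x_s, x_t, alpha=15.0, beta=15.0):
    """Compound generator loss (schematic): multiclass GAN term, f-preserving term
    on source samples, and an identity regularizer on target samples."""
    g_s, g_t = G(f(x_s)), G(f(x_t))               # mapped source and target samples
    # Adversarial term: G wants both mapped batches classified as "real target" (class 2 here).
    lbl_s = torch.full((g_s.size(0),), 2, dtype=torch.long)
    lbl_t = torch.full((g_t.size(0),), 2, dtype=torch.long)
    l_gan = F.cross_entropy(D(g_s), lbl_s) + F.cross_entropy(D(g_t), lbl_t)
    l_const = F.mse_loss(f(g_s), f(x_s))          # f-preserving component
    l_tid = F.mse_loss(g_t, x_t)                  # encourages G to map target samples to themselves
    return l_gan + alpha * l_const + beta * l_tid

# Toy shapes only: f maps images to features, g decodes features back to images, D outputs 3 logits.
f = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
g = torch.nn.Sequential(torch.nn.Linear(128, 3 * 32 * 32), torch.nn.Unflatten(1, (3, 32, 32)))
D = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 3))
loss = dtn_generator_loss(g, D, f, torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32))
```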
Image processing algorithms often involve a data fidelity penalty, which encourages the solution to comply with the input data. Existing fidelity measures (including perceptual ones) are very sensitive to slight misalignments in the locations and shapes of objects. This is in sharp contrast to the human visual system, which is typically indifferent to such variations. In this work, we propose a new error measure, which is insensitive to small smooth deformations and is very simple to incorporate into existing algorithms. We demonstrate our approach in lossy image compression. As we show, optimal encoding under our criterion boils down to determining how to best deform the input image so as to make it "more compressible". Surprisingly, it turns out that very minor deformations (almost imperceptible in some cases) suffice to make a huge visual difference in methods like JPEG and JPEG2000. Thus, by slightly sacrificing geometric integrity, we gain a significant improvement in preservation of visual information.
We also show how our approach can be used to visualize image priors. This is done by determining how images should be deformed so as to best conform to any given image model. By doing so, we highlight the elementary geometric structures to which the prior resonates. Using this method, we reveal interesting behaviors of popular priors, which were not noticed in the past.
Finally, we illustrate how deforming images to possess desired properties can be used for image "idealization" and for detecting deviations from perfect regularity.
Joint work with Tamar Rott Shaham, Tali Dekel, Michal Irani, and Bill Freeman.
Within the wide field of sparse approximation, convolutional sparse coding (CSC) has gained increasing attention in recent years. This model assumes a structured dictionary built as a union of banded circulant matrices. Most attention has been devoted to the practical side of CSC, proposing efficient algorithms for the pursuit problem, and identifying applications that benefit from this model. Interestingly, a systematic theoretical understanding of CSC seems to have been left aside, with the assumption that the existing classical results are sufficient.
In this talk we start by presenting a novel analysis of the CSC model and its associated pursuit. Our study is based on the observation that while being global, this model can be characterized and analyzed locally. We show that uniqueness of the representation, its stability with respect to noise, and successful greedy or convex recovery are all guaranteed assuming that the underlying representation is locally sparse. These new results are much stronger and informative, compared to those obtained by deploying the classical sparse theory.
Armed with these new insights, we proceed by proposing a multi-layer extension of this model, ML-CSC, in which signals are assumed to emerge from a cascade of CSC layers. This, in turn, is shown to be tightly connected to Convolutional Neural Networks (CNN), so much so that the forward-pass of the CNN is in fact the Thresholding pursuit serving the ML-CSC model. This connection brings a fresh view to CNN, as we are able to attribute to this architecture theoretical claims such as uniqueness of the representations throughout the network, and their stable estimation, all guaranteed under simple local sparsity conditions. Lastly, identifying the weaknesses in the above scheme, we propose an alternative to the forward-pass algorithm, which is both tightly connected to deconvolutional and recurrent neural networks, and has better theoretical guarantees.
One of the most exciting possibilities opened by deep neural networks is end-to-end learning: the ability to learn tasks without the need for feature engineering or breaking down into sub-tasks. This talk will present three cases illustrating how end-to-end learning can operate in machine perception across the senses (Hearing, Vision) and even for the entire perception-cognition-action cycle.
The talk begins with speech recognition, showing how acoustic models can be learned end-to-end. This approach skips the feature extraction pipeline, carefully designed for speech recognition over decades.
Proceeding to vision, a novel application is described: identification of photographers of wearable video cameras. Such video was previously considered anonymous as it does not show the photographer.
The talk concludes by presenting a new task, encompassing the full perception-cognition-action cycle: visual learning of arithmetic operations using only pictures of numbers. This is done without using or learning the notions of numbers, digits, and operators.
The talk is based on the following papers:
Speech Acoustic Modeling From Raw Multichannel Waveforms, Y. Hoshen, R.J. Weiss, and K.W. Wilson, ICASSP'15
An Egocentric Look at Video Photographer Identity, Y. Hoshen, S. Peleg, CVPR'16
Visual Learning of Arithmetic Operations, Y. Hoshen, S. Peleg, AAAI'16
Computer science and optics are usually studied separately -- separate people, in separate departments, meet at separate conferences. This is changing. The exciting promise of technologies like virtual reality and self-driving cars demands solutions that draw from the best aspects of computer vision, computer graphics, and optics. Previously, it has proved difficult to bridge these communities. For instance, the laboratory setups in optics are often designed to image millimeter-size scenes in a vibration-free darkroom.
This talk is centered around time of flight imaging, a growing area of research in computational photography. A time of flight camera works by emitting amplitude modulated (AM) light and performing correlations on the reflected light. The frequency of AM is in the radio frequency range (like a Doppler radar system), but the carrier signal is optical, overcoming the diffraction-limited challenges of full RF systems while providing optical contrast. The obvious use of such cameras is to acquire 3D geometry. By spatially, temporally and spectrally coding light transport we show that it may be possible to go "beyond depth", demonstrating new forms of imaging like photography through scattering media, fast relighting of photographs, real-time tracking of occluded objects in the scene (like an object around a corner), and even the potential to distinguish between biological molecules using fluorescence. We discuss the broader impact of this design paradigm on the future of 3D depth sensors, interferometers, computational photography, medical imaging and many other applications.
This Thursday we will have two SIGGRAPH rehearsal talks in the Vision Seminar, one by Netalee Efrat and one by Meirav Galun. Abstracts are below. Each talk will be about 15 minutes (with NO interruptions), followed by 10 minutes feedback.
Talk1 (Netalee Efrat): Cinema 3D: Large scale automultiscopic display
While 3D movies are gaining popularity, viewers in a 3D cinema still need to wear cumbersome glasses in order to enjoy them. Automultiscopic displays provide a better alternative to the display of 3D content, as they present multiple angular images of the same scene without the need for special eyewear. However, automultiscopic displays cannot be directly implemented in a wide cinema setting due to variants of two main problems: (i) The range of angles at which the screen is observed in a large cinema is usually very wide, and there is an unavoidable tradeoff between the range of angular images supported by the display and its spatial or angular resolutions. (ii) Parallax is usually observed only when a viewer is positioned at a limited range of distances from the screen. This work proposes a new display concept, which supports automultiscopic content in a wide cinema setting. It builds on the typical structure of cinemas, such as the fixed seat positions and the fact that different rows are located on a slope at different heights. Rather than attempting to display many angular images spanning the full range of viewing angles in a wide cinema, our design only displays the narrow angular range observed within the limited width of a single seat. The same narrow range content is then replicated to all rows and seats in the cinema. To achieve this, it uses an optical construction based on two sets of parallax barriers, or lenslets, placed in front of a standard screen. This paper derives the geometry of such a display, analyzes its limitations, and demonstrates a proof-of-concept prototype.
*Joint work with Piotr Didyk, Mike Foshey, Wojciech Matusik, Anat Levin
Talk 2 (Meirav Galun): Accelerated Quadratic Proxy for Geometric Optimization
We present the Accelerated Quadratic Proxy (AQP) - a simple first-order algorithm for the optimization of geometric energies defined over triangular and tetrahedral meshes. The main pitfall encountered in the optimization of geometric energies is slow convergence. We observe that this slowness is in large part due to a Laplacian-like term existing in these energies. Consequently, we suggest to exploit the underlying structure of the energy and to locally use a quadratic polynomial proxy, whose Hessian is taken to be the Laplacian. This improves stability and convergence, but more importantly allows incorporating acceleration in an almost universal way that is independent of mesh size and of the specific energy considered. Experiments with AQP show it is rather insensitive to mesh resolution and requires a nearly constant number of iterations to converge; this is in strong contrast to other popular optimization techniques used today such as Accelerated Gradient Descent and Quasi-Newton methods, e.g., L-BFGS. We have tested AQP for mesh deformation in 2D and 3D as well as for surface parameterization, and found it to provide a considerable speedup over common baseline techniques.
*Joint work with Shahar Kovalsky and Yaron Lipman
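A minimal sketch of the acceleration-plus-Laplacian-proxy iteration for a generic energy E with gradient grad_E: each step extrapolates (Nesterov-style) and then minimizes a quadratic proxy whose Hessian is a fixed, prefactorized Laplacian. Line search, the handling of linear constraints, and the specific geometric energies are omitted; all names are illustrative and the toy usage below uses a path-graph Laplacian.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def aqp_minimize(x0, grad_E, L, iters=100, theta=0.9):
    """Accelerated quadratic proxy iterations (sketch).

    x0     : (n, d) initial vertex positions.
    grad_E : callable returning the energy gradient at x (same shape as x0).
    L      : (n, n) sparse mesh Laplacian used as the fixed proxy Hessian.
    """
    n, d = x0.shape
    solve = spla.factorized((L + 1e-8 * sp.eye(n)).tocsc())   # prefactorize once, reuse every iteration
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + theta * (x - x_prev)                          # acceleration (extrapolation) step
        g = grad_E(y)
        p = np.column_stack([solve(g[:, j]) for j in range(d)])  # proxy step: solve (Laplacian) p = gradient
        x_prev, x = x, y - p
    return x

# Toy usage: minimize the quadratic "membrane" energy 0.5 * tr(x^T L x) on a path graph.
n, d = 50, 2
main = 2.0 * np.ones(n); main[0] = main[-1] = 1.0
L = sp.diags([main, -np.ones(n - 1), -np.ones(n - 1)], [0, -1, 1]).tocsr()
x = aqp_minimize(np.random.randn(n, d), grad_E=lambda z: L @ z, L=L, iters=50)
```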
Children may learn about the world by pushing, banging, and manipulating things, watching and listening as materials make their distinctive sounds: dirt makes a thud; ceramic makes a clink. These sounds reveal physical properties of the objects, as well as the force and motion of the physical interaction.
We've explored a toy version of that learning-through-interaction by recording audio and video while we hit many things with a drumstick. We developed an algorithm to predict sounds from silent videos of the drumstick interactions. The algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We demonstrate that the sounds generated by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that the task of predicting sounds allows our system to learn about material properties in the scene.
Joint work with:
Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson
http://arxiv.org/abs/1512.08512
While co-addition and subtraction of astronomical images stand at the heart of observational astronomy, the existing solutions for them lack rigorous argumentation, do not achieve maximal sensitivity, and are often slow. Moreover, there is no widespread agreement on how they should be done, and often different methods are used for different scientific applications. I am going to present rigorous solutions to these problems, deriving them from the most basic statistical principles. These solutions are proved optimal under well-defined and practically acceptable assumptions, and in many cases substantially improve the performance of the most basic operations in astronomy.
For coaddition, we present a coadd image that:
a) is sufficient for any further statistical decision or measurement on the underlying constant sky, making the entire data set redundant.
b) improves both survey speed (by 5-20%) and the effective spatial resolution of past and future astronomical surveys.
c) substantially improves imaging-through-turbulence applications.
d) is much faster to compute than many of the currently used coaddition solutions.
For subtraction, we present a subtraction image that is:
a) optimal for transient detection under the assumption of spatially uniform noise.
b) sufficient for any further statistical decision on the differences between the images, including the identification of cosmic rays and other image artifacts.
c) free of subtraction artifacts, allowing (for the first time) robust transient identification in real time, opening new avenues for scientific exploration.
d) orders of magnitude faster than past subtraction methods.
A common approach to face recognition relies on using deep learning for extracting a signature. All leading work on the subject uses stupendous amounts of processing power and data. In this work we present a method for efficient and compact learning of a metric embedding. The core idea allows a more accurate estimation of the global gradient and hence fast and robust convergence. In order to avoid the need for huge amounts of data we include an explicit alignment phase in the network, thereby greatly reducing the number of parameters. These insights allow us to efficiently train a compact deep learning model for face recognition in only 12 hours on a single GPU, which can then fit on a mobile device.
Joint work with: Oren Tadmor, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua
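For readers unfamiliar with metric embeddings, the snippet below shows a generic training step with a triplet margin loss; the sizes are hypothetical, and the talk's specific loss, global-gradient estimation, and built-in alignment phase are not reproduced here:

```python
import torch
import torch.nn as nn

# Generic metric-embedding training step (triplet margin loss), for illustration only.
embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 128))  # hypothetical sizes
loss_fn = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.SGD(embed.parameters(), lr=0.01)

def train_step(anchor, positive, negative):
    """anchor/positive share an identity, negative does not; pull the first pair
    together in embedding space and push the negative away by a margin."""
    a, p, n = embed(anchor), embed(positive), embed(negative)
    loss = loss_fn(a, p, n)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```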
Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather than traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies.
In particular, we will present a method that extends epipolar geometry to predict the location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely the photo-sequencing, is given. We will briefly describe our method for computing the photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component of dynamic scene analysis. Finally, we will consider a new vision of developing a collaborative CrowdCam, and a first step toward this goal.
This talk will be based on joint works with Tali Dekel, Adi Dafni, Mor Dar, Lior Talker, Ilan Shimshoni, and Shai Avidan.
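For context, the standard epipolar constraint that this work builds on can be sketched as follows; extending it to predict where a moving feature may appear, given the photo-sequencing, is the contribution of the work and is not shown here:

```python
import numpy as np

def epipolar_line(F, x):
    """Given a fundamental matrix F (image A -> image B) and a point x = (u, v) in A,
    return the epipolar line l' = F x in B (homogeneous line coefficients).
    A static scene point seen at x in A must project onto this line in B."""
    x_h = np.array([x[0], x[1], 1.0])
    l = F @ x_h
    return l / np.linalg.norm(l[:2])   # normalize so point-line distance is in pixels

def point_line_distance(l, y):
    """Distance (in pixels) of point y = (u, v) in image B from the epipolar line l."""
    y_h = np.array([y[0], y[1], 1.0])
    return abs(l @ y_h)
```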
Stochastic analysis of real-world signals consists of three main parts: mathematical representation, probabilistic modeling, and statistical inference. For it to be effective, we need mathematically principled and practical computational tools that take into account not only each of these components by itself but also their interplay. This is especially true for a large class of computer-vision and machine-learning problems that involve certain mathematical structures; the latter may be a property of the data or encoded in the representation/model to ensure mathematically desired properties and computational tractability. For concreteness, this talk will center on structures that are geometric, hierarchical, or topological.
Structures present challenges. For example, on nonlinear spaces, most statistical tools are not directly applicable, and, moreover, computations can be expensive. As another example, in mixture models, topological constraints break statistical independence. Once we overcome the difficulties, however, structures offer many benefits. For example, respecting and exploiting the structure of Riemannian manifolds and/or Lie groups yield better probabilistic models that also support consistent synthesis. The latter is crucial for the employment of analysis-by-synthesis inference methods used within, e.g., a generative Bayesian framework. Likewise, imposing a certain structure on velocity fields yields highly-expressive diffeomorphisms that are also simple and computationally tractable; particularly, this facilitates MCMC inference, traditionally viewed as too expensive in this context.
Time permitting, throughout the talk I will also briefly touch upon related applications such as statistical shape models, transfer learning on manifolds, image warping/registration, time warping, superpixels, 3D-scene analysis, nonparametric Bayesian clustering of spherical data, multi-metric learning, and new machine-learning applications of diffeomorphisms. Lastly, we also applied the (largely model-based) ideas above to propose the first learned data augmentation scheme; as it turns out, when compared with the state-of-the-art schemes, this improves the performance of classifiers of the deep-net variety.
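As a small, self-contained illustration of why structure matters (my own toy example, unrelated to the speaker's specific models): on a nonlinear space such as the unit sphere one cannot simply average data points, and even the mean must be computed intrinsically, e.g., as a Fréchet (Karcher) mean via the exponential and logarithm maps:

```python
import numpy as np

def sphere_log(p, q):
    """Log map on the unit sphere: tangent vector at p pointing toward q."""
    d = q - np.dot(p, q) * p
    nd = np.linalg.norm(d)
    theta = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))
    return np.zeros_like(p) if nd < 1e-12 else theta * d / nd

def sphere_exp(p, v):
    """Exp map on the unit sphere: follow the geodesic from p in direction v."""
    nv = np.linalg.norm(v)
    return p if nv < 1e-12 else np.cos(nv) * p + np.sin(nv) * v / nv

def frechet_mean(points, iters=100, step=1.0):
    """Intrinsic mean of unit vectors via Riemannian gradient descent;
    note that the naive Euclidean average would not even lie on the sphere."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        g = np.mean([sphere_log(mu, q / np.linalg.norm(q)) for q in points], axis=0)
        mu = sphere_exp(mu, step * g)
    return mu
```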
I will describe recent work on building and using rich representations aimed at automatic analysis of visual scenes. In particular, I will describe methods for semantic segmentation (labeling regions of an image according to the category they belong to) and for semantic boundary detection (recovering accurate boundaries of semantically meaningful regions, such as those corresponding to objects). We focus on feed-forward architectures for these tasks, leveraging recent advances in the art of training deep neural networks. Our approach aims to shift the burden of inducing desirable constraints from explicit structure in the model to the implicit structure inherent in computing richer, context-aware representations. I will describe experiments on standard benchmark data sets that demonstrate the success of this approach.
Joint work with Mohammadreza Mostajabi, Payman Yadollahpour, and Harry Yang.
Many sequence prediction tasks---such as automatic speech recognition and video analysis---benefit from long-range temporal features. One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields. Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations. We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding. In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video. I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.
This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel
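To make the computational issue concrete, here is a generic semi-Markov (segmental) Viterbi decoder; the `score` function is a placeholder that, in a segmental model of the kind described, would be driven by neural segment embeddings:

```python
import numpy as np

def segmental_viterbi(score, T, num_labels, max_len):
    """Best-scoring segmentation of a length-T sequence.

    score(t, d, y): score of a segment covering frames [t, t+d) with label y.
    Complexity is O(T * max_len * num_labels), which is the cost of
    'considering all possible segmentations' mentioned in the abstract.
    """
    best = np.full(T + 1, -np.inf)
    best[0] = 0.0
    back = [None] * (T + 1)
    for t in range(1, T + 1):
        for d in range(1, min(max_len, t) + 1):
            for y in range(num_labels):
                s = best[t - d] + score(t - d, d, y)
                if s > best[t]:
                    best[t], back[t] = s, (t - d, y)
    # trace back the chosen segments
    segs, t = [], T
    while t > 0:
        start, y = back[t]
        segs.append((start, t, y))
        t = start
    return best[T], segs[::-1]
```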
It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical architectures than with shallow ones. Despite the vast empirical evidence, formal arguments to date are limited and do not capture the kind of networks used in practice. Using tensor factorization, we derive a universal hypothesis space implemented by an arithmetic circuit over functions applied to local data structures (e.g. image patches). The resulting networks first pass the input through a representation layer, and then proceed with a sequence of layers comprising sum followed by product-pooling, where sum corresponds to the widely used convolution operator. The hierarchical structure of networks is born from factorizations of tensors based on the linear weights of the arithmetic circuits. We show that a shallow network corresponds to a rank-1 decomposition, whereas a deep network corresponds to a Hierarchical Tucker (HT) decomposition. Log-space computation for numerical stability transforms the networks into SimNets.
In its basic form, our main theoretical result shows that the set of polynomially sized rank-1 decomposable tensors has measure zero in the parameter space of polynomially sized HT decomposable tensors. In deep learning terminology, this amounts to saying that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require an exponential size if one wishes to implement (or approximate) them with a shallow network. Our construction and theory shed new light on various practices and ideas employed by the deep learning community, and in that sense bear a paradigmatic contribution as well.
Joint work with Or Sharir and Amnon Shashua.
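For orientation, the general construction underlying this line of work can be written as follows (standard notation from the tensor-analysis view of convolutional arithmetic circuits; the talk's exact indexing and conventions may differ):

```latex
% Hypothesis over local structures x_1,\dots,x_N (e.g., image patches),
% with M representation functions f_{\theta_1},\dots,f_{\theta_M}:
h_y(x_1,\dots,x_N) = \sum_{d_1,\dots,d_N=1}^{M} \mathcal{A}^y_{d_1,\dots,d_N}\,\prod_{i=1}^{N} f_{\theta_{d_i}}(x_i)

% Shallow (single hidden layer) network: CP decomposition with Z rank-1 terms
\mathcal{A}^y = \sum_{z=1}^{Z} a^y_z\;\mathbf{a}^{z,1}\otimes\cdots\otimes\mathbf{a}^{z,N}

% Deep network: Hierarchical Tucker (HT) decomposition of \mathcal{A}^y
```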
In view of the recent huge interest in image classification and object recognition problems and the spectacular success of deep learning and random forests in solving these tasks, it seems astonishing that much more modest efforts are being invested into related, and often more difficult, problems of image and multimodal content-based retrieval, and, more generally, similarity assessment in large-scale databases. These problems, arising as primitives in many computer vision tasks, are becoming increasingly important in the era of exponentially increasing information. Semantic and similarity-preserving hashing methods have recently received considerable attention to address such a need, in part due to their significant memory and computational advantage over other representations.
In this talk, I will overview some of my recent attempts to construct efficient semantic hashing schemes based on deep neural networks and random forests.
Based on joint works with Qiang Qiu, Guillermo Sapiro, Michael Bronstein, and Jonathan Masci.
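As a generic illustration of why similarity-preserving hashing is attractive (not the speaker's scheme): binarizing learned embeddings yields compact codes whose Hamming distances can be ranked very cheaply:

```python
import numpy as np

def hash_codes(embeddings):
    """Binarize real-valued embeddings (e.g., the output of a deep net or forest)
    into compact binary codes, one bit per dimension."""
    return (embeddings > 0).astype(np.uint8)

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to the query code;
    this is the cheap retrieval step that motivates hashing-based search."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.argsort(dists)
```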
What is it that enables learning with multi-layer networks? What causes the network to generalize well? What makes it possible to optimize the error, despite the problem being hard in the worst case? In this talk I will attempt to address these questions and relate them to one another, highlighting the important role of optimization in deep learning. I will then use this insight to motivate the study of novel optimization methods, and will present Path-SGD, a novel optimization approach for multi-layer ReLU networks that yields better optimization and better generalization.
Joint work with Behnam Neyshabur, Ryota Tomioka and Russ Salakhutdinov.
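One ingredient worth spelling out is the path norm on which Path-SGD is based; below is a minimal sketch of computing it for a plain fully connected ReLU network (biases omitted, and the rescaled Path-SGD update itself is not shown):

```python
import numpy as np

def path_norm_squared(weights):
    """Squared path-norm of a fully connected ReLU network: the sum over all
    input-to-output paths of the product of squared weights along the path.
    It can be computed by forward-propagating an all-ones vector through the
    element-wise squared weight matrices."""
    v = np.ones(weights[0].shape[1])     # weights[l] has shape (out_l, in_l)
    for W in weights:
        v = (W ** 2) @ v
    return v.sum()

# Example usage with two hypothetical layers (10 -> 20 -> 3):
# path_norm_squared([np.random.randn(20, 10), np.random.randn(3, 20)])
```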
Conventional cameras record all light falling onto their sensor regardless of the path that light followed to get there. In this talk I will present an emerging family of video cameras that can be programmed to record just a fraction of the light coming from a controllable source, based on the actual 3D path followed. Live video from these cameras offers a very unconventional view of our everyday world in which refraction and scattering can be selectively blocked or enhanced, visual structures too subtle to notice with the naked eye can become apparent, and object appearance can depend on depth.
I will discuss the unique optical properties and power efficiency of these "transport-aware" cameras, as well as their use for 3D shape acquisition, robust time-of-flight imaging, material analysis, and scene understanding. Last but not least, I will discuss their potential to become our field's "outdoor Kinect" sensor---able to operate robustly even in direct sunlight with very low power.
Kyros Kutulakos is a Professor of Computer Science at the University of Toronto. He received his PhD degree from the University of Wisconsin-Madison in 1994 and his BS degree from the University of Crete in 1988, both in Computer Science. In addition to the University of Toronto, he has held appointments at the University of Rochester (1995-2001) and Microsoft Research Asia (2004-05 and 2011-12). He is the recipient of an Alfred P. Sloan Fellowship, an Ontario Premier's Research Excellence Award, a Marr Prize in 1999, a Marr Prize Honorable Mention in 2005, and three other paper awards (CVPR 1994, ECCV 2006, CVPR 2014). He also served as Program Co-Chair of CVPR 2003, ICCP 2010 and ICCV 2013.
Many scientific and engineering problems are challenged by the fact that they involve functions of a very large number of variables. Such problems arise naturally in signal recovery, image processing, learning theory, etc. In addition to the numerical difficulties due to the so-called curse of dimensionality, the resulting optimization problems are often nonsmooth and nonconvex.
We shall survey some of our recent results, illustrating how these difficulties may be handled in the context of well-structured optimization models, highlighting the ways in which problem structures and data information can be beneficially exploited to devise and analyze simple and efficient algorithms.
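As one classical example of exploiting composite (smooth plus simple nonsmooth) structure, shown purely for illustration and not as a summary of the talk's results, consider an accelerated proximal gradient method for l1-regularized least squares:

```python
import numpy as np

def fista(A, b, lam, iters=200):
    """Accelerated proximal gradient (FISTA-style) method for
        min_x 0.5 * ||A x - b||^2 + lam * ||x||_1,
    a standard example of a well-structured nonsmooth problem."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth gradient
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(iters):
        g = A.T @ (A @ y - b)              # gradient of the smooth part at y
        z = y - g / L
        x_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold (prox)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + ((t - 1) / t_new) * (x_new - x)                 # acceleration step
        x, t = x_new, t_new
    return x
```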
The goal of this work is to produce a 'full interpretation' of object images, namely to identify and localize all semantic features and parts that are recognized by human observers. We develop a novel approach and tools to study this challenging task by dividing the interpretation of the complete object into the interpretation of so-called 'minimal recognizable configurations': severely reduced but recognizable local regions that are minimal in the sense that any further reduction would make them unrecognizable. We show that for the task of full interpretation, such minimal images have unique properties which make them particularly useful.
For modeling interpretation, we identify primitive components and relations that play a useful role in the interpretation of minimal images by humans, and incorporate them in a structured prediction algorithm. The structure elements can be point, contour, or region primitives, while the relations between them range from standard unary and binary potentials based on relative location to more complex, high-dimensional relations. We present experimental results and compare them to human performance. We discuss the implications of 'full' interpretation for difficult visual tasks, such as recognizing human activities or interactions.
The field of Machine Learning has been making huge strides recently. Problems such as visual recognition and classification, which were believed to be open only a few years ago, now seem solvable. The best performers use Artificial Neural Networks, in their reincarnation as "Deep Learning", where huge networks are trained on large amounts of data. One bottleneck in current schemes is the huge amount of computation required during both training and testing. This limits the usability of these methods when power is an issue, such as with wearable devices.
As a step towards a deeper understanding of deep learning mechanisms, I will show how correct conditioning of the back-propagation training iterations results in much improved convergence. This reduces training time and provides better results. It also allows us to train smaller models, which are otherwise harder to optimize.
In this talk I will also discuss the challenges, and describe some of the solutions, in applying Machine Learning on a mobile device that fits in your pocket. The OrCam is a wearable camera that speaks to you. It reads anything, learns and recognizes faces, and much more. It is ready to help throughout the day, all with a simple pointing gesture. It is already improving the lives of many blind and visually impaired people.
Many types of multi-dimensional data have a natural division into two "views", such as audio and video or images and text. Multi-view learning includes a variety of techniques that use multiple views of data to learn improved models for each of the views. The views can be multiple measurement modalities (as in the examples above), but can also be different types of information extracted from the same source (words + context, document text + links) or any division of the data dimensions into subsets satisfying certain learning assumptions. Theoretical and empirical results show that multi-view techniques can improve over single-view ones in certain settings. In many cases multiple views help by reducing noise in some sense (what is noise in one view is not in the other). In this talk, I will focus on multi-view learning of representations (features), especially using canonical correlation analysis (CCA) and related techniques. I will give a tutorial overview of CCA and its relationship with other techniques such as partial least squares (PLS) and linear discriminant analysis (LDA). I will also present extensions developed by ourselves and others, such as kernel, deep, and generalized ("many-view") CCA. Finally, I will give recent results on speech and language tasks, and demonstrate our publicly available code.
Based on joint work with Raman Arora, Weiran Wang, Jeff Bilmes, Galen Andrew, and others.
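For readers who want the linear case spelled out, a minimal CCA implementation via whitening and an SVD of the cross-covariance is sketched below (a textbook construction with a small regularizer, not the speakers' released code):

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Linear CCA: find k-dimensional projections of X (n x dx) and Y (n x dy)
    whose projected coordinates are maximally correlated."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # inverse square root of a symmetric positive-definite matrix
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Cxx) @ U[:, :k]   # projection matrix for view X
    B = inv_sqrt(Cyy) @ Vt[:k].T   # projection matrix for view Y
    return A, B, s[:k]             # s[:k] are the canonical correlations
```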
Small image patches tend to recur at multiple scales within high-quality natural images.
This fractal-like behavior has been used in the past for various tasks, including image compression, super-resolution, and denoising. In this talk, I will show that this phenomenon can also be harnessed for "blind deblurring" and for "blind super-resolution", that is, for removing blur or increasing resolution without a priori knowledge of the associated blur kernel. It turns out that the cross-scale patch recurrence property is strong only in images taken under ideal imaging conditions, and diminishes significantly when the imaging conditions deviate from ideal ones. Therefore, the deviations from ideal patch recurrence actually provide information about the unknown camera blur kernel.
More specifically, we show that the correct blur kernel is the one that maximizes the similarity between patches across scales of the image. Extensive experiments indicate that our approach leads to state-of-the-art results, both in deblurring and in super-resolution.
Joint work with Michal Irani.
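A toy illustration of the cross-scale patch recurrence measure that the approach builds on (my own simplification, intended for small images; the actual kernel-estimation algorithm is considerably more involved):

```python
import numpy as np
from scipy.ndimage import zoom

def _patches(img, p=7, stride=4):
    """Collect vectorized p x p patches on a stride grid."""
    H, W = img.shape
    return np.array([img[i:i + p, j:j + p].ravel()
                     for i in range(0, H - p + 1, stride)
                     for j in range(0, W - p + 1, stride)])

def cross_scale_recurrence(img, p=7, stride=4):
    """Average distance from each patch of `img` to its nearest neighbor among the
    patches of a 2x downscaled copy. In a sharp natural image this distance tends
    to be small (strong cross-scale recurrence); blur weakens it, which is the cue
    a kernel search could exploit."""
    small = zoom(img, 0.5)
    P, Q = _patches(img, p, stride), _patches(small, p, stride)
    # squared Euclidean distances without forming a huge 3D array
    d2 = (P ** 2).sum(1)[:, None] + (Q ** 2).sum(1)[None, :] - 2.0 * P @ Q.T
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1).mean()
```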
We present a practical method for establishing dense correspondences between two images with similar content, but possibly different 3D scenes. One of the challenges in designing such a system is handling the local scale differences of objects appearing in the two images. Previous methods often considered only small subsets of image pixels, matching only pixels for which stable scales can be reliably estimated. More recently, others have considered dense correspondences, but with substantial costs associated with generating, storing, and matching scale-invariant descriptors.
Our work here is motivated by the observation that pixels in the image have contexts -- the pixels around them -- which may be exploited in order to estimate local scales reliably and repeatably. In practice, we demonstrate that scales estimated at sparse interest points can be propagated to neighboring pixels where this information cannot be reliably determined. Doing so allows scale-invariant descriptors to be extracted anywhere in the image, not just at detected interest points. As a consequence, accurate dense correspondences are obtained even between very different images, with little computational cost beyond that required by existing methods.
This is joint work with Moria Tau from the Open University of Israel.
On my last visit in 2012, I posed a number of open questions, including how to achieve joint segmentation and tracking, and how to obtain uncertainty estimates for a segmentation.
We have been able to solve some of these questions [Schiegg ICCV 2013; Schiegg Bioinformatics 2014; Fiaschi CVPR 2014], and I would like to report on this progress.
Given that I will be at Weizmann for another four months, I will also pose new open questions on image processing problems that require a combination of combinatorial optimization and (structured) learning, as an invitation to work together.
The astronomical community's largest technical challenge is coping with the Earth's atmosphere. In this talk, I will present the popular methods for performing scientific measurements from the ground while coping with the time-dependent distortions generated by the atmosphere. We will talk about the following topics:
1) Scientific motivation for eliminating the effect of the atmosphere.
2) The statistics of turbulence: the basis of all methods is a deep understanding of atmospheric turbulence.
3) Wave-front sensing + adaptive optics: a way to correct the distortions in hardware.
4) Lucky imaging + speckle interferometry: ways to computationally extract scientifically valuable data despite the turbulent atmosphere (a toy lucky-imaging sketch follows this list).
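A toy sketch of the lucky-imaging idea from topic 4, under crude assumptions (peak brightness as a sharpness proxy, integer-pixel recentring); real pipelines use better sharpness metrics and sub-pixel registration:

```python
import numpy as np

def lucky_imaging(frames, keep_frac=0.1):
    """Keep the sharpest fraction of short exposures, recentre each on its
    brightest pixel, and average them (shift-and-add)."""
    peaks = [f.max() for f in frames]                    # crude sharpness proxy
    order = np.argsort(peaks)[::-1]
    keep = order[: max(1, int(keep_frac * len(frames)))]
    H, W = frames[0].shape
    acc = np.zeros((H, W))
    for i in keep:
        y, x = np.unravel_index(np.argmax(frames[i]), frames[i].shape)
        acc += np.roll(np.roll(frames[i], H // 2 - y, axis=0), W // 2 - x, axis=1)
    return acc / len(keep)
```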