Vision and Robotics Seminar

Monday, Sep 04, 2017, 14:00
Vision and Robotics Seminar, Room 1
Speaker: Ita Lifshitz
Title: Hand-object interaction: a step towards action recognition
NOTE THE UNUSUAL TIME AND DAY

When dealing with a highly variable problem such as action recognition, focusing on a small area, such as the hand region, makes the problem more manageable and lets us invest a relatively large amount of interpretation resources in a small but highly informative part of the image. To detect and properly analyze this region of interest, I have built a pipeline of several steps, starting with a state-of-the-art hand detector that combines detection of the hand by appearance with estimation of the human body pose. The hand detector is built upon a fully convolutional neural network, detecting hands efficiently and accurately. The body pose estimation starts with a state-of-the-art head detector and continues with a novel approach in which each location in the image votes for the position of each body keypoint, utilizing information from the whole image. Using dense, multi-target votes enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting, and to estimate the body pose accurately. Once the hands are detected, an additional step segments the hand and fingers, labeling each hand pixel using a dense fully convolutional network. Finally, a further step segments and identifies the held object. Understanding the hand-object interaction is an important step toward understanding the action taking place in the image. Together, these steps enable fine interpretation of hand-object interaction images as an essential step towards understanding human-object interaction and recognizing human activities.
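The consensus-voting step can be illustrated with a small sketch (a toy stand-in for the learned voting described above, not the actual implementation): every image location casts a noisy vote for a keypoint's position, the votes are accumulated into a heatmap, and the heatmap's mode gives the consensus estimate. The grid size, noise level, and keypoint location below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 32x32 "image": every location casts a vote for where it believes a
# keypoint lies.  Here each vote is the true position plus Gaussian noise,
# standing in for the learned per-location vote distributions.
H, W = 32, 32
true_kp = (20, 11)  # hypothetical ground-truth keypoint (row, col)

heatmap = np.zeros((H, W))
for y in range(H):
    for x in range(W):
        vy = int(np.clip(np.rint(true_kp[0] + rng.normal(0, 2)), 0, H - 1))
        vx = int(np.clip(np.rint(true_kp[1] + rng.normal(0, 2)), 0, W - 1))
        heatmap[vy, vx] += 1.0  # accumulate this location's vote

# Consensus: the mode of the accumulated votes is the keypoint estimate.
est_kp = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

With 1,024 votes the mode lands at or next to the true keypoint; per-keypoint heatmaps of this kind are what the joint keypoint probabilities are read from.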

Thursday, Jul 06, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tammy Riklin-Raviv
Title: Big data - small training sets: biomedical image analysis bottlenecks, some strategies and applications

Recent progress in imaging technologies is leading to continuous growth in biomedical data, which can provide better insight into important clinical and biological questions. Advanced machine learning techniques, such as artificial neural networks, are brought to bear on fundamental medical image computing challenges such as segmentation, classification and reconstruction, required for meaningful analysis of the data. Nevertheless, the main bottleneck remains the lack of annotated examples, or 'ground truth', to be used for training.

In my talk, I will give a brief overview of some biomedical image analysis problems we aim to address, and suggest how prior information about the problem at hand can be utilized to compensate for insufficient, or even absent, ground-truth data. I will then present a framework based on deep neural networks for denoising dynamic contrast-enhanced MRI (DCE-MRI) sequences of the brain. DCE-MRI is an imaging protocol in which MRI scans are acquired repetitively throughout the injection of a contrast agent; it is mainly used for quantitative assessment of blood-brain barrier (BBB) permeability. BBB dysfunction is associated with numerous brain pathologies, including stroke, tumor, traumatic brain injury, and epilepsy. Existing techniques for DCE-MRI analysis are error-prone, as the dynamic scans are subject to non-white, spatially-dependent and anisotropic noise. To address these denoising challenges we use an ensemble of expert DNNs constructed as deep autoencoders, each trained on a specific subset of the input space to accommodate different noise characteristics and dynamic patterns. Since clean DCE-MRI sequences (ground truth) for training are not available, we present a sampling scheme for generating realistic training sets with nonlinear dynamics that faithfully model clean DCE-MRI data and account for spatial similarities. The proposed approach has been successfully applied to full and even temporally down-sampled DCE-MRI sequences of stroke and brain-tumor patients from two different databases, and is shown to compare favorably to state-of-the-art denoising methods.

Thursday, Jun 29, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Shai Avidan
Title: Co-occurrence Filter
Co-occurrence Filter (CoF) is a boundary-preserving filter. It is based on the Bilateral Filter (BF), but instead of using a Gaussian on the range values to preserve edges, it relies on a co-occurrence matrix. Pixel values that co-occur frequently in the image (i.e., inside textured regions) will have a high weight in the co-occurrence matrix. This, in turn, means that such pixel pairs will be averaged and hence smoothed, regardless of their intensity differences. On the other hand, pixel values that rarely co-occur (i.e., across texture boundaries) will have a low weight in the co-occurrence matrix. As a result, they will not be averaged and the boundary between them will be preserved. The CoF therefore extends the BF to deal with boundaries, not just edges. It learns co-occurrences directly from the image. We can achieve various filtering results by directing it to learn the co-occurrence matrix from a part of the image, or from a different image. We give the definition of the filter, discuss how to use it with color images and show several use cases. Joint work with Roy Jevnisek.
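A minimal single-channel sketch of the idea (illustrative only; the paper's filter handles color images and uses details omitted here): build a co-occurrence matrix of quantized gray levels, normalize it by the level marginals, and use it in place of the bilateral filter's range Gaussian.

```python
import numpy as np

def cooccurrence_filter(img, n_levels=8, win=2, sigma_s=1.5):
    """Toy single-channel co-occurrence filter (a sketch, not the paper's code).

    Pixel pairs whose quantized values co-occur often (textures) get high
    weights and are smoothed together; rare co-occurrences (boundaries)
    get low weights, so the boundary survives.
    """
    q = np.clip((img * n_levels).astype(int), 0, n_levels - 1)
    H, W = img.shape

    # Collect co-occurrence counts of quantized values inside local windows.
    C = np.zeros((n_levels, n_levels))
    for y in range(H):
        for x in range(W):
            for dy in range(-win, win + 1):
                for dx in range(-win, win + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        C[q[y, x], q[yy, xx]] += 1

    # Normalize by the level marginals so frequent levels are not favored.
    C /= C.sum()
    marg = C.sum(axis=1)
    M = C / np.maximum(np.outer(marg, marg), 1e-12)

    # Bilateral-style filtering with M replacing the range Gaussian.
    out = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            acc = wsum = 0.0
            for dy in range(-win, win + 1):
                for dx in range(-win, win + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        w = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
                        w *= M[q[y, x], q[yy, xx]]
                        acc += w * img[yy, xx]
                        wsum += w
            out[y, x] = acc / wsum
    return out

# Two flat regions with mild texture, separated by a step: the step survives.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
img = np.clip(img + 0.05 * np.sin(np.arange(8))[None, :], 0, 1)
smoothed = cooccurrence_filter(img)
```

The texture within each region is smoothed, while the step between the two regions keeps most of its height because cross-boundary value pairs are rare in the co-occurrence matrix.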
Thursday, Jun 22, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Haggai Maron
Title: Convolutional Neural Networks on Surfaces via Seamless Toric Covers

The recent success of convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to achieve similar success for geometric tasks. One of the main challenges in applying CNNs to surfaces is defining a natural convolution operator on surfaces. In this paper we present a method for applying deep learning to sphere-type shapes using a global seamless parameterization to a planar flat-torus, for which the convolution operator is well defined. As a result, the standard deep learning framework can be readily applied for learning semantic, high-level properties of the shape. An indication of our success in bridging the gap between images and surfaces is the fact that our algorithm succeeds in learning semantic information from an input of raw low-dimensional feature vectors. 

We demonstrate the usefulness of our approach by presenting two applications: human body segmentation, and automatic landmark detection on anatomical surfaces. We show that our algorithm compares favorably with competing geometric deep-learning algorithms for segmentation tasks, and is able to produce meaningful correspondences on anatomical surfaces where hand-crafted features are bound to fail.

Joint work with: Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G. Kim and Yaron Lipman.

Thursday, Jun 15, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Ron Kimmel
Title: On Learning Invariants and Representation Spaces of Shapes and Forms
We study the power of the Laplace-Beltrami Operator (LBO) in processing and analyzing geometric information. The decomposition of the LBO at one end, and the heat operator at the other end, provide us with efficient tools for dealing with images and shapes. Denoising, segmentation, filtering, and exaggeration are just a few of the problems for which the LBO provides an efficient solution. We review the optimality of a truncated basis provided by the LBO, and a selection of relevant metrics by which such optimal bases are constructed. A specific example is the scale-invariant metric for surfaces, which we argue is a natural choice for the study of articulated shapes and forms. In contrast to geometry understanding, there is the newly emerging field of deep learning. Learning systems are rapidly dominating the areas of audio, textual, and visual analysis. Recent efforts to carry these successes over to geometry processing indicate that encoding geometric intuition into modeling, training, and testing is a non-trivial task. It appears as if approaches based on geometric understanding are orthogonal to those of data-heavy computational learning. We propose to unify these two methodologies by computationally learning geometric representations and invariants, and thereby take a small step towards a new perspective on geometry processing. I will present examples of shape matching, facial surface reconstruction from a single image, reading facial expressions, shape representation, and finally the definition and computation of invariant operators and signatures.
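The optimality of a truncated LBO basis has a simple discrete analogue that can be sketched in a few lines (a toy illustration, not from the talk): on a 1-D path graph, the Laplacian eigenvectors are the smoothest available modes, and projecting a noisy signal onto the first k of them denoises it.

```python
import numpy as np

n, k = 128, 10

# Path-graph Laplacian: the discrete 1-D analogue of the LBO.
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
L[0, 0] = L[-1, -1] = 1                    # Neumann-type boundary
evals, evecs = np.linalg.eigh(L)           # eigenvalues in ascending order

t = np.linspace(0, 1, n)
clean = np.sin(2 * np.pi * t)
rng = np.random.default_rng(1)
noisy = clean + 0.3 * rng.standard_normal(n)

# Project onto the k smoothest modes: the truncated spectral basis.
B = evecs[:, :k]
denoised = B @ (B.T @ noisy)

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
```

The smooth signal is well represented by the low-frequency modes, while most of the noise lives in the discarded high-frequency ones, so the projection error drops.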
Thursday, Jun 08, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Nadav Cohen
Title: Expressive Efficiency and Inductive Bias of Convolutional Networks: Analysis and Design through Hierarchical Tensor Decompositions
JOINT VISION AND MACHINE LEARNING SEMINAR
The driving force behind convolutional networks, the most successful deep learning architecture to date, is their expressive power. Despite its wide acceptance and vast empirical evidence, formal analyses supporting this belief are scarce. The primary notions for formally reasoning about expressiveness are efficiency and inductive bias. Efficiency refers to the ability of a network architecture to realize functions that would require an alternative architecture to be much larger. Inductive bias refers to the prioritization of some functions over others, given prior knowledge regarding the task at hand. Through an equivalence to hierarchical tensor decompositions, we study the expressive efficiency and inductive bias of various architectural features in convolutional networks (depth, width, pooling geometry and more). Our results shed light on the demonstrated effectiveness of convolutional networks, and in addition provide new tools for network design. The talk is based on a series of works published in COLT, ICML, CVPR and ICLR (as well as several new preprints), with collaborators Or Sharir, Ronen Tamari, David Yakira, Yoav Levine and Amnon Shashua.
Thursday, Jun 01, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Nir Sharon
Title: Synchronization over Cartan motion groups
The mathematical problem of group synchronization deals with the question of how to estimate unknown group elements from a set of their mutual relations. This problem appears as an important step in solving many real-world problems in vision, robotics, tomography, and more. In this talk, we present a novel solution for synchronization over the class of Cartan motion groups, which includes the important special case of rigid motions. Our method is based on the idea of group contraction, an algebraic notion originating in relativistic mechanics.
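To make the synchronization problem concrete, here is the classical spectral relaxation in the simplest setting, SO(2) (planar rotations). This baseline is for intuition only and is not the contraction-based method of the talk; the sizes and noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
theta = rng.uniform(0, 2 * np.pi, n)          # unknown absolute rotations

# Hermitian measurement matrix H[i, j] ~ exp(i * (theta_i - theta_j)).
z = np.exp(1j * theta)
H = np.outer(z, z.conj())
H = H + 0.2 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
H = (H + H.conj().T) / 2                      # re-symmetrize after adding noise

# Leading eigenvector recovers the angles up to one global rotation.
evals, evecs = np.linalg.eigh(H)
est = np.angle(evecs[:, -1])

# Fix the global gauge, then measure the angular error.
gauge = np.angle(np.sum(np.exp(1j * est) * np.conj(z)))
mean_err = np.abs(np.angle(np.exp(1j * (est - gauge - theta)))).mean()
```

Absolute elements are only identifiable up to a global group element, which is why the gauge must be fixed before comparing to the ground truth.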
Thursday, May 25, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Rafi Malach
Title: Neuronal "Ignitions" underlying stable representations in a dynamic visual environment
The external world is in a constant state of flow, posing a major challenge to neuronal representations of the visual system, which necessitate sufficient time for integration and perceptual decisions. In my talk I will discuss the hypothesis that one solution to this challenge is implemented by breaking the neuronal responses into a series of discrete and stable states. I will propose that these stable points are likely implemented through relatively long-lasting "ignitions" of recurrent neuronal activity. Such ignitions are a prerequisite for the emergence of a perceptual image in the mind of the observer. The self-sustained nature of the ignitions endows them with stability despite the dynamically changing inputs. Results from intracranial recordings in patients, conducted for clinical diagnostic purposes, during rapid stimulus presentations, ecological settings, blinks and saccadic eye movements will be presented in support of this hypothesis.
Thursday, May 18, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Michael Elad
Title: Regularization by Denoising (RED)

Image denoising is the most fundamental problem in image enhancement, and it is largely solved: it has reached impressive heights in performance and quality, almost as good as it can ever get. But interestingly, it turns out that we can solve many other problems using the image denoising "engine". I will describe the Regularization by Denoising (RED) framework: using the denoising engine to define the regularization of any inverse problem. The idea is to define an explicit image-adaptive regularization functional directly using a high-performance denoiser. Surprisingly, the resulting regularizer is guaranteed to be convex, and the overall objective functional is explicit, clear and well-defined. With complete flexibility to choose the iterative optimization procedure for minimizing this functional, RED is capable of incorporating any image denoising algorithm as a regularizer, treating general inverse problems very effectively, and it is guaranteed to converge to the globally optimal result.

* Joint work with Peyman Milanfar (Google Research) and Yaniv Romano (EE-Technion).
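A 1-D sketch of the RED idea (illustrative only; a crude moving-average filter stands in for the "high performance denoiser", and RED's guarantees assume stronger conditions on f than this toy filter satisfies): the regularizer is rho(x) = 0.5 * x^T (x - f(x)), whose gradient under RED's conditions is simply x - f(x), so an inverse problem can be attacked with plain gradient descent.

```python
import numpy as np

def f_denoise(x, k=5):
    # Crude moving-average "denoising engine" (a stand-in, not a real denoiser).
    return np.convolve(x, np.ones(k) / k, mode="same")

rng = np.random.default_rng(2)
clean = np.repeat([0.0, 1.0, 0.3], 40)             # piecewise-constant signal
y = clean + 0.2 * rng.standard_normal(clean.size)  # noisy measurement (H = I)

# Minimize 0.5*||x - y||^2 + lam * 0.5 * x^T (x - f(x)) by gradient descent;
# the data-term gradient is x - y and the RED-term gradient is x - f(x).
lam, step = 0.5, 0.5
x = y.copy()
for _ in range(100):
    grad = (x - y) + lam * (x - f_denoise(x))
    x -= step * grad

err_before = np.linalg.norm(y - clean)
err_after = np.linalg.norm(x - clean)
```

Swapping f_denoise for a stronger denoiser, or H = I for a blur or subsampling operator, changes nothing in the structure of the iteration; that plug-in flexibility is the point of the framework.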

Thursday, Apr 27, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tamar Flash
Title: Motion compositionality and timing: combined geometrical and optimization approaches
In my talk I will discuss several recent research directions that we have taken to explore the different principles underlying the construction and control of complex human upper-arm and gait movements. One important topic is motor compositionality: the nature of the motor primitives underlying the construction of complex movements at different levels of the motor hierarchy. The second topic we have focused on is motion timing, investigating which principles dictate the durations of complex sequential behaviors, both at the level of the internal timing of different motion segments and at the level of the total durations of different types of movement. Finally, I will discuss motor coordination and the mapping between end-effector and joint motions during both arm and leg movements, using various dimension-reduction approaches. The mathematical models we have used to study these topics combine geometrical approaches with optimization models to derive motion invariants, optimal control principles and different conservation laws.
Thursday, Apr 20, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Lihi Zelnik-Manor
Title: Separating the Wheat from the Chaff in Visual Data
By far, most of the bits in the world are image and video data. YouTube alone gets 300 hours of video uploaded every minute. Adding to that personal pictures, videos, TV channels and the gazillion security cameras shooting 24/7, one quickly sees that the amount of visual data being recorded is colossal. In the first part of this talk I will discuss the problem of "saliency prediction": separating the important parts of images/videos (the "wheat") from the less important ones (the "chaff"). I will review work done over the last decade and its achievements. In the second part of the talk I will discuss one particular application of saliency prediction that our lab is interested in: making images and videos accessible to the visually impaired. Our plan is to convert images and videos into tactile surfaces that can be "viewed" by touch. As it turns out, saliency estimation and manipulation both play a key role in this task.
Thursday, Apr 06, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Simon Korman
Title: Occlusion-Aware Template Matching via Consensus Set Maximization

We present a novel approach to template matching that is efficient, can handle partial occlusions, and is equipped with provable performance guarantees. A key component of the method is a reduction that transforms the problem of searching for a nearest neighbor among N high-dimensional vectors into searching for neighbors among two sets of order sqrt(N) vectors, which can be done efficiently using range-search techniques. This allows for a quadratic improvement in search complexity, making the method scalable when large search spaces are involved.
For handling partial occlusions, we develop a hashing scheme based on consensus set maximization within the range search component. The resulting scheme can be seen as a randomized hypothesize-and-test algorithm, that comes with guarantees regarding the number of iterations required for obtaining an optimal solution with high probability. 
The predicted matching rates are validated empirically and the proposed algorithm shows a significant improvement over the state-of-the-art in both speed and robustness to occlusions.
Joint work with Stefano Soatto.

Thursday, Mar 30, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Lior Wolf
Title: Unsupervised Cross-Domain Image Generation

We study the ecological use of analogies in AI. Specifically, we address the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given representation function f, which accepts inputs in either domain, remains unchanged. Other than f, the training data is unsupervised and consists of a set of samples from each domain, without any mapping between them. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f-preserving component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains, including digits and face images, and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity.

Joint work with Yaniv Taigman and Adam Polyak.

Thursday, Feb 09, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tomer Michaeli
Title: Deformation-aware image processing

Image processing algorithms often involve a data fidelity penalty, which encourages the solution to comply with the input data. Existing fidelity measures (including perceptual ones) are very sensitive to slight misalignments in the locations and shapes of objects. This is in sharp contrast to the human visual system, which is typically indifferent to such variations. In this work, we propose a new error measure, which is insensitive to small smooth deformations and is very simple to incorporate into existing algorithms. We demonstrate our approach in lossy image compression. As we show, optimal encoding under our criterion boils down to determining how to best deform the input image so as to make it "more compressible". Surprisingly, it turns out that very minor deformations (almost imperceptible in some cases) suffice to make a huge visual difference in methods like JPEG and JPEG2000. Thus, by slightly sacrificing geometric integrity, we gain a significant improvement in preservation of visual information.

We also show how our approach can be used to visualize image priors. This is done by determining how images should be deformed so as to best conform to any given image model. By doing so, we highlight the elementary geometric structures to which the prior resonates. Using this method, we reveal interesting behaviors of popular priors, which were not noticed in the past.

Finally, we illustrate how deforming images to possess desired properties can be used for image "idealization" and for detecting deviations from perfect regularity.


Joint work with Tamar Rott Shaham, Tali Dekel, Michal Irani, and Bill Freeman.

Thursday, Jan 26, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Vardan Papyan
Title: Signal Modeling: From Convolutional Sparse Coding to Convolutional Neural Networks

Within the wide field of sparse approximation, convolutional sparse coding (CSC) has gained increasing attention in recent years. This model assumes a structured dictionary built as a union of banded circulant matrices. Most attention has been devoted to the practical side of CSC, proposing efficient algorithms for the pursuit problem and identifying applications that benefit from this model. Interestingly, a systematic theoretical understanding of CSC seems to have been left aside, under the assumption that the existing classical results are sufficient.
In this talk we start by presenting a novel analysis of the CSC model and its associated pursuit. Our study is based on the observation that, while global, this model can be characterized and analyzed locally. We show that uniqueness of the representation, its stability with respect to noise, and successful greedy or convex recovery are all guaranteed assuming that the underlying representation is locally sparse. These new results are much stronger and more informative than those obtained by deploying the classical sparse theory.
Armed with these new insights, we proceed by proposing a multi-layer extension of this model, ML-CSC, in which signals are assumed to emerge from a cascade of CSC layers. This, in turn, is shown to be tightly connected to Convolutional Neural Networks (CNNs), so much so that the forward pass of the CNN is in fact the thresholding pursuit serving the ML-CSC model. This connection brings a fresh view to CNNs, as we are able to attribute to this architecture theoretical claims such as uniqueness of the representations throughout the network and their stable estimation, all guaranteed under simple local sparsity conditions. Lastly, identifying the weaknesses in the above scheme, we propose an alternative to the forward-pass algorithm, which is both tightly connected to deconvolutional and recurrent neural networks and has better theoretical guarantees.
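The forward-pass/pursuit connection can be checked in two lines of linear algebra: a layer computing relu(D^T x - lambda) is exactly one step of non-negative soft-thresholding pursuit for the code of x in dictionary D. A minimal numerical check, with dimensions and lambda chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(3)

def soft_threshold_nonneg(v, lam):
    # Non-negative soft thresholding: the pursuit step of layered ML-CSC.
    return np.maximum(v - lam, 0.0)

D = rng.standard_normal((20, 50))     # dictionary = layer weights (made up)
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
x = rng.standard_normal(20)           # input signal
lam = 0.5                             # threshold = negative bias

relu_out = np.maximum(D.T @ x - lam, 0.0)          # a "CNN layer"
pursuit_out = soft_threshold_nonneg(D.T @ x, lam)  # one pursuit step

match = np.allclose(relu_out, pursuit_out)
```

Stacking such layers gives the cascade the talk analyzes: the CNN forward pass computes a (layered) sparse code, which is what makes the sparsity-based guarantees applicable.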

Thursday, Jan 19, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: David Held
Title: Robots in Clutter: Learning to Understand Environmental Changes
Robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If robots can learn how the world changes over time, they can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I will show how we can use this idea to improve performance for every step of the robotic perception pipeline: object segmentation, tracking, velocity estimation, and classification. I will also present some preliminary work on learning to manipulate objects, using a similar framework of learning environmental changes. By learning how the environment can change over time, we can enable robots to operate in the complex, cluttered environments of our daily lives.
Thursday, Jan 05, 2017, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Shai Avidan
Title: Taking Pictures in Scattering Media
Pictures taken under bad weather conditions or underwater often suffer from low contrast and limited visibility. Restoring the colors of images taken in such conditions is extremely important for consumer applications, computer vision tasks, and marine research. The common physical phenomena in these scenarios are scattering and absorption: the imaging is done either under water or in a medium that contains suspended particles, e.g. dust (haze) and water droplets (fog). As a result, the colors of captured objects are attenuated, as well as veiled by light scattered by the suspended particles. The amount of attenuation and scattering depends on the objects' distance from the camera, and therefore the color distortion cannot be globally corrected. We propose a new prior, termed Haze-Line, and use it to correct these types of images. First, we show how it can be used to clean images taken under bad weather conditions such as haze or fog. Then we show how to use it to automatically estimate the air light. Finally, we extend it to deal with underwater images as well. The proposed algorithm is completely automatic and quite efficient in practice. Joint work with Dana Berman (TAU) and Tali Treibitz (U. of Haifa).
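For background, the standard image-formation model such methods invert is I = J*t + A*(1 - t), with scene radiance J, transmission t (falling with distance), and global air light A. The sketch below inverts the model assuming t and A are known; estimating them from a single image is exactly what the Haze-Line prior is for, and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Scattering image-formation model: I = J * t + A * (1 - t).
J = rng.uniform(0, 1, (16, 16, 3))        # true scene radiance
t = rng.uniform(0.3, 1.0, (16, 16, 1))    # per-pixel transmission
A = np.array([0.8, 0.85, 0.9])            # global air light (assumed known)

I_hazy = J * t + A * (1 - t)              # synthesize the hazy image

# Invert the model; transmission is lower-bounded to avoid amplifying noise.
t_safe = np.maximum(t, 0.1)
J_rec = (I_hazy - A * (1 - t_safe)) / t_safe

err = np.abs(J_rec - J).max()
```

The division by t is why distant (low-transmission) pixels are the hard case: any error in t or A there is strongly amplified, which is what makes good priors essential.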
Thursday, Dec 22, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Greg Shakhnarovich
Title: Image colorization and its role in visual learning
I will present our recent and ongoing work on fully automatic image colorization. Our approach exploits both low-level and semantic representations during colorization. As many scene elements naturally appear according to multimodal color distributions, we train our model to predict per-pixel color histograms. This intermediate output can be used to automatically generate a color image, or it can be further manipulated prior to image formation to "push" the image in a desired direction. Our system achieves state-of-the-art results under a variety of metrics. Moreover, it provides a vehicle to explore the role the colorization task can play as a proxy for visual understanding, providing a self-supervision mechanism for learning representations. I will describe the use of our self-supervised network in several contexts, such as classification and semantic segmentation. On VOC segmentation and classification tasks, we present results that are state-of-the-art among methods not using ImageNet labels for pretraining. Joint work with Gustav Larsson and Michael Maire.
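The decoding step from per-pixel histograms to an image can be sketched as follows (with random logits standing in for the network's predictions and a made-up 1-D color axis; the actual system's histogram parameterization differs):

```python
import numpy as np

rng = np.random.default_rng(5)

H, W, B = 4, 4, 16
bin_centers = np.linspace(0.0, 1.0, B)    # hypothetical 1-D color-bin centers

# Random logits stand in for the network's per-pixel histogram predictions.
logits = rng.standard_normal((H, W, B))
hist = np.exp(logits)
hist /= hist.sum(axis=-1, keepdims=True)  # per-pixel color histogram (softmax)

expected = hist @ bin_centers             # decode by expectation
mode = bin_centers[hist.argmax(axis=-1)]  # or decode by the histogram mode
```

Expectation decoding averages over modes and can wash out saturated colors, while mode decoding commits to one; keeping the full histogram around is what enables the "push in a desired direction" manipulation mentioned above.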
Thursday, Dec 15, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Gil Ben-Artzi
Title: Calibration of Multi-Camera Systems by Global Constraints on the Motion of Silhouettes
Computing the epipolar geometry between cameras with very different viewpoints is often problematic, as matching points are hard to find. In these cases, it has been proposed to use information from dynamic objects in the scene to suggest point and line correspondences. We introduce an approach that improves performance over state-of-the-art methods by two orders of magnitude, by significantly reducing the number of outliers in the putative matches. Our approach is based on (a) a new temporal signature, the motion barcode, which is used to recover corresponding epipolar lines across views, and (b) a formulation of the correspondence problem as constrained flow optimization, requiring small differences between the coordinates of corresponding points over consecutive frames. Our method was validated on four standard datasets, providing accurate calibrations across very different viewpoints.
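A toy version of motion-barcode matching (illustrative only; real barcodes record, per frame, whether a moving silhouette crosses an epipolar line, and are simulated here as random binary signatures with observation noise):

```python
import numpy as np

rng = np.random.default_rng(6)
n_lines, n_frames = 10, 200

# Barcodes in view A: bit f is 1 if a silhouette crosses the line in frame f.
bar_a = (rng.random((n_lines, n_frames)) < 0.3).astype(float)

# View B sees the same events on corresponding lines, in an unknown order,
# with 5% of the bits flipped to model detection noise.
perm = rng.permutation(n_lines)
flip = rng.random((n_lines, n_frames)) < 0.05
bar_b = np.abs(bar_a[perm] - flip)

def normalize(b):
    b = b - b.mean(axis=1, keepdims=True)
    return b / np.linalg.norm(b, axis=1, keepdims=True)

# Normalized correlation between every pair of barcodes across the views.
corr = normalize(bar_b) @ normalize(bar_a).T
est_perm = corr.argmax(axis=1)            # best-matching line in view A

accuracy = (est_perm == perm).mean()
```

Because corresponding lines witness the same crossing events, their barcodes correlate strongly while unrelated lines do not, so the correspondence is recovered from correlation alone.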
Thursday, Dec 01, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Michael (Miki) Lustig
Title: Applications of Subspace and Low-Rank Methods for Dynamic and Multi-Contrast Magnetic Resonance Imaging
There has been much work in recent years on developing methods for recovering signals from insufficient data. One very successful direction is subspace methods, which constrain the data to live in a lower-dimensional space. These approaches are motivated by theoretical results on recovering incomplete low-rank matrices, as well as by the natural redundancy of multidimensional signals. In this talk I will present our research group's efforts in this area. I will start by describing a new decomposition that represents dynamic images as a sum of multi-scale low-rank matrices, which can very efficiently capture spatial and temporal correlations at multiple scales. I will then describe and show results from applications using subspace and low-rank methods for highly accelerated multi-contrast MR imaging and for motion correction.
Monday, Nov 21, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Emanuele Rodolà, Or Litany
Title: Spectral Approaches to Partial Shape Matching
In this talk we will present our recent line of work on (deformable) partial shape correspondence in the spectral domain. We will first introduce Partial Functional Maps (PFM), showing how to robustly formulate the shape correspondence problem under missing geometry with the language of functional maps. We use perturbation analysis to show how removal of shape parts changes the Laplace-Beltrami eigenfunctions, and exploit it as a prior on the spectral representation of the correspondence. We will show further extensions to deal with the presence of clutter (deformable object-in-clutter) and multiple pieces (non-rigid puzzles). In the second part of the talk, we will introduce a novel approach to the same problem which operates completely in the spectral domain, avoiding the cumbersome alternating optimization used in the previous approaches. This allows matching shapes with constant complexity independent of the number of shape vertices, and yields state-of-the-art results on challenging correspondence benchmarks in the presence of partiality and topological noise.
Thursday, Nov 10, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yedid Hoshen
Title: End-to-End Learning: Applications in Speech, Vision and Cognition

One of the most exciting possibilities opened by deep neural networks is end-to-end learning: the ability to learn tasks without the need for feature engineering or breaking down into sub-tasks. This talk will present three cases illustrating how end-to-end learning can operate in machine perception across the senses (Hearing, Vision) and even for the entire perception-cognition-action cycle.

The talk begins with speech recognition, showing how acoustic models can be learned end-to-end. This approach skips the feature-extraction pipeline that was carefully designed for speech recognition over decades.

Proceeding to vision, a novel application is described: identification of photographers of wearable video cameras. Such video was previously considered anonymous as it does not show the photographer.

The talk concludes by presenting a new task, encompassing the full perception-cognition-action cycle: visual learning of arithmetic operations using only pictures of numbers. This is done without using or learning the notions of numbers, digits, and operators.

The talk is based on the following papers:

Speech Acoustic Modeling From Raw Multichannel Waveforms, Y. Hoshen, R.J. Weiss, and K.W. Wilson, ICASSP'15

An Egocentric Look at Video Photographer Identity, Y. Hoshen, S. Peleg, CVPR'16

Visual Learning of Arithmetic Operations, Y. Hoshen, S. Peleg, AAAI'16

Monday, Sep 26, 2016, 14:00
Vision and Robotics Seminar, Room 1
Speaker: Achuta Kadambi
Title: From the Optics Lab to Computer Vision
NOTE UNUSUAL DAY AND TIME

Computer science and optics are usually studied separately -- separate people, in separate departments, meet at separate conferences. This is changing. The exciting promise of technologies like virtual reality and self-driving cars demands solutions that draw from the best aspects of computer vision, computer graphics, and optics. Previously, it has proved difficult to bridge these communities. For instance, the laboratory setups in optics are often designed to image millimeter-size scenes in a vibration-free darkroom.

This talk is centered around time-of-flight imaging, a growing area of research in computational photography. A time-of-flight camera works by emitting amplitude-modulated (AM) light and performing correlations on the reflected light. The frequency of the AM is in the radio-frequency range (like a Doppler radar system), but the carrier signal is optical, overcoming the diffraction-limited challenges of full RF systems while providing optical contrast. The obvious use of such cameras is to acquire 3D geometry. By spatially, temporally and spectrally coding light transport, we show that it may be possible to go "beyond depth", demonstrating new forms of imaging like photography through scattering media, fast relighting of photographs, real-time tracking of occluded objects in the scene (like an object around a corner), and even the potential to distinguish between biological molecules using fluorescence. We discuss the broader impact of this design paradigm on the future of 3D depth sensors, interferometers, computational photography, medical imaging and many other applications.

Thursday, Sep 08, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tali Dekel
Title: Exploring and Modifying Spatial Variations in a Single Image
Structures and objects captured in image data are often idealized by the viewer. For example, buildings may seem to be perfectly straight, or repeating structures such as corn kernels may seem almost identical. In reality, however, such flawless behavior hardly exists. The goal in this line of work is to detect spatial imperfection, i.e., the departure of objects from their idealized models, given only a single image as input, and to render a new image in which the deviations from the model are either reduced or magnified. Reducing the imperfections allows us to idealize/beautify images, and can be used as a graphic tool for creating more visually pleasing images. Alternatively, increasing the spatial irregularities allows us to reveal useful and surprising information that is hard to perceive with the naked eye (such as the sagging of a house's roof). I will consider this problem under two distinct definitions of an idealized model: (i) ideal parametric geometries (e.g., line segments, circles), which can be automatically detected in the input image; (ii) perfect repetitions of structures, which relies on the redundancy of patches in a single image. Each of these models has led to a new algorithm with a wide range of applications in civil engineering, astronomy, design, and material defect inspection.
Thursday, Aug 04, 2016, 11:30
Vision and Robotics Seminar, Room 1
Speaker: Michael Rabinovich
Title: Scalable Locally Injective Mappings
We present a scalable approach for the optimization of flip-preventing energies in the general context of simplicial mappings, and specifically for mesh parameterization. Our iterative minimization is based on the observation that many distortion energies can be optimized indirectly by minimizing a simpler proxy energy and compensating for the difference with a reweighting scheme. Our algorithm is simple to implement and scales to datasets with millions of faces. We demonstrate our approach for the computation of maps that minimize a conformal or isometric distortion energy, both in two and three dimensions. In addition to mesh parameterization, we show that our algorithm can be applied to mesh deformation and mesh quality improvement.
Thursday, Jul 21, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Ethan Fetaya
Title: PhD Thesis Defense: Learning with limited supervision
The task of supervised learning, making predictions based on a given labeled dataset, is well understood theoretically, and many practical algorithms exist for it. In general, the more complex the hypothesis space is, the larger the number of samples we will need so that we do not overfit. The main issue is that obtaining a large labeled dataset is a costly and tedious process. An interesting and important question is what can be done when only a small amount of labeled data, or no labeled data, is available. I will go over several approaches, including learning with a single positive example, as well as unsupervised representation learning.
Monday, Jul 18, 2016, 11:30
Vision and Robotics Seminar, Room 155
Speaker: Emanuel A. Lazar
Title: Voronoi topology analysis of structure in spatial point sets
Atomic systems are regularly studied as large sets of point-like particles, and so understanding how particles can be arranged in such systems is a very natural problem. However, aside from perfect crystals and ideal gases, describing this kind of "structure" in an insightful yet tractable manner can be challenging. Analysis of the configuration space of local arrangements of neighbors, with some help from the Borsuk-Ulam theorem, helps explain limitations of continuous metric approaches to this problem, and motivates the use of Voronoi cell topology. Several short examples from materials research help illustrate strengths of this approach.
Thursday, Jul 14, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speakers: Netalee Efrat and Meirav Galun
Title: SIGGRAPH Dry-Runs

This Thursday we will have two SIGGRAPH rehearsal talks in the Vision Seminar, one by Netalee Efrat and one by Meirav Galun. Abstracts are below. Each talk will be about 15 minutes (with NO interruptions), followed by 10 minutes of feedback.

Talk 1 (Netalee Efrat): Cinema 3D: Large-scale automultiscopic display

While 3D movies are gaining popularity, viewers in a 3D cinema still need to wear cumbersome glasses in order to enjoy them. Automultiscopic displays provide a better alternative to the display of 3D content, as they present multiple angular images of the same scene without the need for special eyewear. However, automultiscopic displays cannot be directly implemented in a wide cinema setting due to variants of two main problems: (i) The range of angles at which the screen is observed in a large cinema is usually very wide, and there is an unavoidable tradeoff between the range of angular images supported by the display and its spatial or angular resolutions. (ii) Parallax is usually observed only when a viewer is positioned at a limited range of distances from the screen. This work proposes a new display concept, which supports automultiscopic content in a wide cinema setting. It builds on the typical structure of cinemas, such as the fixed seat positions and the fact that different rows are located on a slope at different heights. Rather than attempting to display many angular images spanning the full range of viewing angles in a wide cinema, our design only displays the narrow angular range observed within the limited width of a single seat. The same narrow range content is then replicated to all rows and seats in the cinema. To achieve this, it uses an optical construction based on two sets of parallax barriers, or lenslets, placed in front of a standard screen. This paper derives the geometry of such a display, analyzes its limitations, and demonstrates a proof-of-concept prototype.

*Joint work with Piotr Didyk, Mike Foshey, Wojciech Matusik, Anat Levin

Talk 2 (Meirav Galun): Accelerated Quadratic Proxy for Geometric Optimization

We present the Accelerated Quadratic Proxy (AQP) - a simple first-order algorithm for the optimization of geometric energies defined over triangular and tetrahedral meshes. The main pitfall encountered in the optimization of geometric energies is slow convergence. We observe that this slowness is in large part due to a Laplacian-like term present in these energies. Consequently, we propose to exploit the underlying structure of the energy and to locally use a quadratic polynomial proxy, whose Hessian is taken to be the Laplacian. This improves stability and convergence, but more importantly allows incorporating acceleration in an almost universal way that is independent of mesh size and of the specific energy considered. Experiments with AQP show it is rather insensitive to mesh resolution and requires a nearly constant number of iterations to converge; this is in strong contrast to other popular optimization techniques used today such as Accelerated Gradient Descent and Quasi-Newton methods, e.g., L-BFGS. We have tested AQP for mesh deformation in 2D and 3D as well as for surface parameterization, and found it to provide a considerable speedup over common baseline techniques.
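The accelerated-proxy template can be illustrated on a toy problem (a minimal sketch, not the authors' mesh implementation; the energy, momentum constant and names below are mine):

```python
import numpy as np

def aqp_minimize(grad, H, x0, theta=0.9, iters=200):
    """Accelerated quadratic-proxy iteration: a Nesterov-style momentum
    step followed by a step that minimizes a quadratic proxy whose fixed
    Hessian H (Laplacian-like) can be factored once up front."""
    Hinv = np.linalg.inv(H)          # in practice: prefactor (e.g. Cholesky)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        y = x + theta * (x - x_prev)         # acceleration (momentum) step
        x_prev, x = x, y - Hinv @ grad(y)    # proxy (preconditioned) step
    return x

# Toy energy with a dominant Laplacian term: f(x) = 0.5 x^T L x + 0.5||x-b||^2
n = 5
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D path-graph Laplacian
b = np.linspace(0.0, 1.0, n)
grad = lambda x: L @ x + (x - b)
H = L + np.eye(n)         # exact Hessian of this toy; AQP uses the Laplacian
x = aqp_minimize(grad, H, np.zeros(n))
```

On this quadratic toy the proxy step is exact, so the iteration converges immediately; the point of the scheme is that the same fixed factorization keeps working for nonquadratic geometric energies.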

*Joint work with Shahar Kovalsky and Yaron Lipman

Thursday, Jun 16, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yair Weiss
Title: Neural Networks, Graphical Models and Image Restoration
This is an invited talk I gave last year at a workshop on "Deep Learning for Vision". It discusses some of the history of graphical models and neural networks and speculates on the future of both fields with examples from the particular problem of image restoration.
Thursday, Jun 02, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Omri Azencot
Title: Advection-based Function Matching on Surfaces
A tangent vector field on a surface is the generator of a smooth family of maps from the surface to itself, known as the flow. Given a scalar function on the surface, it can be transported, or advected, by composing it with a vector field's flow. Such transport is exhibited by many physical phenomena, e.g., in fluid dynamics. In this paper, we are interested in the inverse problem: given source and target functions, compute a vector field whose flow advects the source to the target. We propose a method for addressing this problem, by minimizing an energy given by the advection constraint together with a regularizing term for the vector field. Our approach is inspired by a similar method in computational anatomy, known as LDDMM, yet leverages the recent framework of functional vector fields for discretizing the advection and the flow as operators on scalar functions. The latter allows us to efficiently generalize LDDMM to curved surfaces, without explicitly computing the flow lines of the vector field we are optimizing for. We show two approaches for the solution: using linear advection with multiple vector fields, and using non-linear advection with a single vector field. We additionally derive an approximated gradient of the corresponding energy, which is based on a novel vector field transport operator. Finally, we demonstrate applications of our machinery to intrinsic symmetry analysis, function interpolation and map improvement.
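Schematically, the inverse problem described above can be written as an energy over the unknown vector field (notation mine, not the paper's):

```latex
% Find a vector field v whose time-1 flow \Phi_v advects the source
% function f_0 to the target f_1; R regularizes the field.
\min_{v}\; \left\| f_0 \circ \Phi_v^{-1} - f_1 \right\|_{L^2}^2
          \;+\; \alpha\, R(v)
% The functional-vector-field framework lets both the advection and the
% flow act as operators on scalar functions, avoiding explicit flow lines.
```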
Wednesday, May 25, 2016, 11:15
Vision and Robotics Seminar, Room 1
Speaker: Bill Freeman
Title: Visually Indicated Sounds
Joint seminar with Machine Learning & Statistics.

Children may learn about the world by pushing, banging, and manipulating things, watching and listening as materials make their distinctive sounds-- dirt makes a thud; ceramic makes a clink. These sounds reveal physical properties of the objects, as well as the force and motion of the physical interaction.

We've explored a toy version of that learning-through-interaction by recording audio and video while hitting many things with a drumstick. We developed an algorithm that predicts sounds from silent videos of the drumstick interactions. The algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We demonstrate that the sounds generated by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that the task of predicting sounds allows our system to learn about material properties in the scene.

Joint work with:
Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson

Monday, May 09, 2016, 14:00
Vision and Robotics Seminar, Room 1
Speaker: Nikos Paragios
Title: Visual Perception through Hyper Graphs
Note: unusual day and time.
Computational vision, visual computing and biomedical image analysis have made tremendous progress in the past decade. This is mostly due to the development of efficient learning and inference algorithms which allow better and richer modeling of visual perception tasks. Hyper-graph representations are among the most prominent tools for addressing such perception, through the casting of perception as a graph optimization problem. In this talk, we briefly introduce the interest of such representations, discuss their strengths and limitations, provide appropriate strategies for their inference and learning, and present their application to a variety of problems in visual computing.
Thursday, Apr 14, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Barak Zackay
Title: Proper astronomical image processing - Solving the problems of image co-addition and image subtraction

While co-addition and subtraction of astronomical images stand at the heart of observational astronomy, the existing solutions for them lack rigorous argumentation, do not achieve maximal sensitivity, and are often slow. Moreover, there is no widespread agreement on how they should be done, and often different methods are used for different scientific applications. I am going to present rigorous solutions to these problems, deriving them from the most basic statistical principles. These solutions are proved optimal, under well-defined and practically acceptable assumptions, and in many cases substantially improve the performance of the most basic operations in astronomy.

For coaddition, we present a coadd image that:
a) is sufficient for any further statistical decision or measurement on the underlying constant sky, making the entire data set redundant;
b) improves both survey speed (by 5-20%) and the effective spatial resolution of past and future astronomical surveys;
c) substantially improves imaging-through-turbulence applications;
d) is much faster to compute than many of the currently used coaddition solutions.
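For contrast, the classical per-pixel baseline that such work improves on can be sketched directly (a simple inverse-variance weighted mean; this is not the talk's "proper" coadd, which additionally weights each exposure by its PSF, and the function name is mine):

```python
import numpy as np

def inverse_variance_coadd(images, sigmas):
    """Per-pixel inverse-variance weighted mean of registered exposures
    with per-image background noise levels sigmas. Returns the coadd and
    its per-pixel noise variance."""
    images = np.asarray(images, dtype=float)         # shape (J, H, W)
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2   # one weight per exposure
    coadd = np.tensordot(w, images, axes=1) / w.sum()  # weighted mean
    var = 1.0 / w.sum()                              # variance of the coadd
    return coadd, var
```

Noisier exposures are automatically down-weighted; the talk's point is that even this optimal per-pixel scheme discards information that a PSF-aware (Fourier-domain) coadd retains.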

For subtraction, we present a subtraction image that:
a) is optimal for transient detection under the assumption of spatially uniform noise;
b) is sufficient for any further statistical decision on the differences between the images, including the identification of cosmic rays and other image artifacts;
c) is free of subtraction artifacts, allowing (for the first time) robust transient identification in real time, opening new avenues for scientific exploration;
d) is orders of magnitude faster than past subtraction methods.

Thursday, Apr 07, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yoni Wexler
Title: Fast Face Recognition with Multi-Batch

A common approach to face recognition relies on using deep learning to extract a signature. All leading work on the subject uses stupendous amounts of processing power and data. In this work we present a method for efficient and compact learning of a metric embedding. The core idea allows a more accurate estimation of the global gradient and hence fast and robust convergence. In order to avoid the need for huge amounts of data, we include an explicit alignment phase in the network, greatly reducing the number of parameters. These insights allow us to efficiently train a compact deep learning model for face recognition in only 12 hours on a single GPU, which can then fit on a mobile device.
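The gradient-estimation idea can be made concrete with a small sketch: score every pair inside one batch of k embeddings (giving k(k-1)/2 pairs) rather than a few sampled pairs. The loss form and margin below are illustrative placeholders, not the paper's exact objective:

```python
import numpy as np

def multibatch_pair_loss(emb, labels, margin=1.0):
    """Contrastive-style loss over all pairs in one batch of k embeddings:
    same-identity pairs are pulled together, different-identity pairs are
    pushed beyond a margin. Using every pair of the batch yields a
    lower-variance estimate of the global gradient than pair sampling."""
    k = len(emb)
    D = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)  # (k, k) distances
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(k, 1)                  # each unordered pair once
    pos = D[iu][same[iu]]                        # same identity: pull together
    neg = np.maximum(0.0, margin - D[iu][~same[iu]])  # different: push apart
    return (np.sum(pos ** 2) + np.sum(neg ** 2)) / len(iu[0])
```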

Joint work with: Oren Tadmor, Tal Rosenwein, Shai Shalev-Schwartz, Amnon Shashua

Thursday, Mar 31, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yael Moses
Title: Dynamic Scene Analysis Using CrowdCam Data

Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather than traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies.

In particular, we will present a method to extend epipolar geometry to predict location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation.  We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.

This talk will be based on joint works with Tali Dekel, Adi Dafni, Mor Dar, Lior Talker, Ilan Shimshoni, and Shai Avidan.

Monday, Mar 28, 2016, 11:00
Vision and Robotics Seminar, Room 141
Speaker: Dan Raviv
Title: Stretchable non-rigid structures
Note: unusual room, day and time.
Geometrical understanding of bendable and stretchable structures is crucial for many applications where comparison, inference and reconstruction play an important role. Moreover, it is the first step in quantifying normal and abnormal phenomena in non-rigid domains. Moving from Euclidean (straight) distances toward intrinsic (geodesic) measures revolutionized the way we handle bendable structures, but did not take stretching into account. Human organs, such as the heart, lungs and kidneys, are great examples of such models. In this lecture I will show that stretching can be accounted for at the atomic (local) level, in closed form, using higher derivatives of the data. I further show that invariants can play a critical part in modern learning systems, used for statistical analysis of non-rigid structures, and assist in fabricating soft models. The lecture will be self-contained and no prior knowledge is needed.
Thursday, Jan 21, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yoav Schechner
Title: Clouds in 4D
The spatially varying and temporally dynamic atmosphere presents significant, exciting and fundamentally new problems for imaging and computer vision. Some problems must tackle the complexity of radiative transfer models in 3D multiply-scattering media, to achieve reconstruction based on the models. This aspect can also be used in other scattering media. Nevertheless, the huge scale of the atmosphere and its dynamics call for multiview imaging using unprecedented distributed camera systems, on the ground or in orbit. These new configurations require generalizations of traditional triangulation, radiometric calibration, background estimation, lens-flare and compression questions. This focus can narrow uncertainties in climate-change forecasts, as we explain.
Thursday, Jan 14, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Oren Freifeld
Title: From representation to inference: respecting and exploiting mathematical structures in computer vision and machine learning

Stochastic analysis of real-world signals consists of 3 main parts: mathematical representation; probabilistic modeling; statistical inference. For it to be effective, we need mathematically-principled and practical computational tools that take into consideration not only each of these components by itself but also their interplay. This is especially true for a large class of computer-vision and machine-learning problems that involve certain mathematical structures; the latter may be a property of the data or encoded in the representation/model to ensure mathematically-desired properties and computational tractability. For concreteness, this talk will center on structures that are geometric, hierarchical, or topological.

Structures present challenges. For example, on nonlinear spaces, most statistical tools are not directly applicable, and, moreover, computations can be expensive. As another example, in mixture models, topological constraints break statistical independence. Once we overcome the difficulties, however, structures offer many benefits. For example, respecting and exploiting the structure of Riemannian manifolds and/or Lie groups yield better probabilistic models that also support consistent synthesis. The latter is crucial for the employment of analysis-by-synthesis inference methods used within, e.g., a generative Bayesian framework. Likewise, imposing a certain structure on velocity fields yields highly-expressive diffeomorphisms that are also simple and computationally tractable; particularly, this facilitates MCMC inference, traditionally viewed as too expensive in this context.

Time permitting, throughout the talk I will also briefly touch upon related applications such as statistical shape models, transfer learning on manifolds, image warping/registration, time warping, superpixels, 3D-scene analysis, nonparametric Bayesian clustering of spherical data, multi-metric learning, and new machine-learning applications of diffeomorphisms. Lastly, we also applied the (largely model-based) ideas above to propose the first learned data augmentation scheme; as it turns out, when compared with the state-of-the-art schemes, this improves the performance of classifiers of the deep-net variety.

Thursday, Jan 07, 2016, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Greg Shakhnarovich
Title: Rich Representations for Parsing Visual Scenes

I will describe recent work on building and using rich representations aimed at automatic analysis of visual scenes. In particular, I will describe methods for semantic segmentation (labeling regions of an image according to the categories they belong to) and for semantic boundary detection (recovering accurate boundaries of semantically meaningful regions, such as those corresponding to objects). We focus on feed-forward architectures for these tasks, leveraging recent advances in the art of training deep neural networks. Our approach aims to shift the burden of inducing desirable constraints from explicit structure in the model to implicit structure inherent in computing richer, context-aware representations. I will describe experiments on standard benchmark data sets that demonstrate the success of this approach.

Joint work with Mohammadreza Mostajabi, Payman Yadollahpour, and Harry Yang.

Wednesday, Jan 06, 2016, 11:15
Vision and Robotics Seminar, Room 1
Speaker: Karen Livescu
Title: Segmental Sequence Models in the Neural Age
Joint Vision and Machine Learning seminar. Note: unusual day and time.

Many sequence prediction tasks---such as automatic speech recognition and video analysis---benefit from long-range temporal features.  One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields.  Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations.  We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding.  In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video.  I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.

This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel

Thursday, Dec 31, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Shai Shalev-Shwartz
Title: Deep Learning: The theoretical-practical gap
I will describe two contradicting lines of work. On one hand, practical work on autonomous driving that I have been doing at Mobileye, in which deep learning is one of the key ingredients. On the other hand, recent theoretical works showing very strong hardness-of-learning results. Bridging this gap is a great challenge. I will describe some approaches toward a solution.
Thursday, Dec 24, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Shai Avidan
Title: Best-Buddies Similarity for Robust Template Matching
We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs): pairs of points in the source and target sets where each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset.

Joint work with Tali Dekel, Shaul Oron, Miki Rubinstein and Bill Freeman.
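The counting at the heart of BBS is simple enough to sketch directly (a minimal version; normalizing the count by the smaller set size follows the paper's definition as I understand it):

```python
import numpy as np

def best_buddies_similarity(P, Q):
    """Best-Buddies Similarity between point sets P (n,d) and Q (m,d):
    count pairs (p, q) where p's nearest neighbor in Q is q AND q's
    nearest neighbor in P is p, normalized by min(n, m)."""
    # Pairwise squared Euclidean distances, shape (n, m).
    D = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    nn_P_to_Q = D.argmin(axis=1)   # for each p, index of its nearest q
    nn_Q_to_P = D.argmin(axis=0)   # for each q, index of its nearest p
    # p_i and q_j are best buddies iff j = nn_P_to_Q[i] and i = nn_Q_to_P[j].
    buddies = sum(1 for i, j in enumerate(nn_P_to_Q) if nn_Q_to_P[j] == i)
    return buddies / min(len(P), len(Q))
```

Because a point participates in at most one mutual-nearest-neighbor pair, outliers with no counterpart simply contribute nothing, which is the source of the measure's robustness.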
Thursday, Dec 03, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Ariel Shamir
Title: Creating Visual Stories
Similar to text, the amount of visual data in the form of videos and images is growing enormously. One of the key challenges is to understand this data, arrange it, and create content which is semantically meaningful. In this talk I will present several such efforts to "bridge the semantic gap" using humans as "agents": capturing and utilizing eye movements, body movement or gaze direction. This enables re-editing of existing videos, tracking of sports highlights, creating one coherent video from multiple sources, and more.
Thursday, Nov 26, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Nadav Cohen
Title: On the Expressive Power of Deep Learning: A Tensor Analysis

It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical architectures than with shallow ones.  Despite the vast empirical evidence, formal arguments to date are limited and do not capture the kind of networks used in practice. Using tensor factorization, we derive a universal hypothesis space implemented by an arithmetic circuit over functions applied to local data structures (e.g. image patches). The resulting networks first pass the input through a representation layer, and then proceed with a sequence of layers comprising sum followed by product-pooling, where sum corresponds to the widely used convolution operator. The hierarchical structure of networks is born from factorizations of tensors based on the linear weights of the arithmetic circuits. We show that a shallow network corresponds to a rank-1 decomposition, whereas a deep network corresponds to a Hierarchical Tucker (HT) decomposition. Log-space computation for numerical stability transforms the networks into SimNets.

In its basic form, our main theoretical result shows that the set of polynomially sized rank-1 decomposable tensors has measure zero in the parameter space of polynomially sized HT decomposable tensors. In deep learning terminology, this amounts to saying that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require an exponential size if one wishes to implement (or approximate) them with a shallow network. Our construction and theory shed new light on various practices and ideas employed by the deep learning community, and in that sense bear a paradigmatic contribution as well.
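The shallow/deep dichotomy in the result above can be stated schematically (indices and notation mine, not the paper's):

```latex
% Shallow network <-> rank-1 tensor, an outer product of vectors:
\mathcal{A}_{d_1,\dots,d_N} \;=\; \prod_{i=1}^{N} v^{(i)}_{d_i}
% Deep network <-> Hierarchical Tucker (HT) decomposition: the tensor is
% factored recursively over a tree, pairing groups of modes at each level;
% the depth of the tree mirrors the depth of the network.
```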

Joint work with Or Sharir and Amnon Shashua.

Thursday, Nov 19, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Alex Bronstein
Title: Learning to hash

In view of the recent huge interest in image classification and object recognition problems, and the spectacular success of deep learning and random forests in solving these tasks, it seems astonishing that much more modest efforts are being invested in the related, and often more difficult, problems of image and multimodal content-based retrieval and, more generally, similarity assessment in large-scale databases. These problems, arising as primitives in many computer vision tasks, are becoming increasingly important in the era of exponentially increasing information. Semantic and similarity-preserving hashing methods have recently received considerable attention to address this need, in part due to their significant memory and computational advantage over other representations.

In this talk, I will overview some of my recent attempts to construct efficient semantic hashing schemes based on deep neural networks and random forests.

Based on joint works with Qiang Qiu, Guillermo Sapiro, Michael Bronstein, and Jonathan Masci.

Thursday, Nov 12, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Nathan Srebro
Title: Optimization, Regularization and Generalization in Multilayer Networks
Joint Machine Learning & Vision Seminar.

What is it that enables learning with multi-layer networks?  What causes the network to generalize well?  What makes it possible to optimize the error, despite the problem being hard in the worst case?  In this talk I will attempt to address these questions and relate them to one another, highlighting the important role of optimization in deep learning.  I will then use this insight to suggest studying novel optimization methods, and will present Path-SGD, a novel optimization approach for multi-layer ReLU networks that yields better optimization and better generalization.
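The quantity underlying Path-SGD, the path regularizer of a layered ReLU net (the sum over all input-to-output paths of the product of squared weights along the path), is cheap to compute; a sketch for a plain fully connected stack (helper name mine):

```python
import numpy as np

def path_norm(weights):
    """Squared path regularizer of a layered network given weight
    matrices W1 (d0,d1), ..., WL (d_{L-1},dL): sum over all paths of the
    product of squared weights, computed in one pass by pushing a vector
    of ones through the elementwise-squared matrices."""
    v = np.ones(weights[0].shape[0])
    for W in weights:
        v = v @ (W ** 2)   # accumulate per-unit path mass layer by layer
    return v.sum()
```

This rescaling-invariant quantity is what Path-SGD's update approximately steepest-descends with respect to, rather than the raw Euclidean norm of the weights.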

Joint work with Behnam Neyshabur, Ryota Tomioka and Russ Salakhutdinov.

Thursday, Oct 22, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Michael Bronstein
Title: Deep learning on geometric data
The past decade of computer vision research has witnessed the re-emergence of "deep learning", and in particular convolutional neural network techniques, allowing task-specific features to be learned from examples and achieving breakthrough performance in a wide range of applications. However, in the geometry processing and computer graphics communities, these methods are practically unknown. One of the reasons stems from the fact that 3D shapes (typically modeled as Riemannian manifolds) are not shift-invariant spaces, hence the very notion of convolution is rather elusive. In this talk, I will show some recent works from our group that try to bridge this gap. Specifically, I will show the construction of intrinsic convolutional neural networks on meshes and point clouds, with applications such as finding dense correspondence between deformable shapes and shape retrieval.
Thursday, Jul 02, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Kyros Kutulakos
Title: Transport-Aware Cameras

Conventional cameras record all light falling onto their sensor regardless of the path that light followed to get there. In this talk I will present an emerging family of video cameras that can be programmed to record just a fraction of the light coming from a controllable source, based on the actual 3D path followed. Live video from these cameras offers a very unconventional view of our everyday world in which refraction and scattering can be selectively blocked or enhanced, visual structures too subtle to notice with the naked eye can become apparent, and object appearance can depend on depth.
I will discuss the unique optical properties and power efficiency of  these "transport-aware" cameras, as well as their use for 3D shape acquisition, robust time-of-flight imaging, material analysis, and scene understanding. Last but not least, I will discuss their potential to become our field's "outdoor Kinect" sensor---able to operate robustly even in direct sunlight with very low power.

Kyros Kutulakos is a Professor of Computer Science at the University of Toronto. He received his PhD degree from the University of Wisconsin-Madison in 1994 and his BS degree from the University of Crete in 1988, both in Computer Science. In addition to the University of Toronto, he has held appointments at the University of Rochester (1995-2001) and Microsoft Research Asia (2004-05 and 2011-12). He is the recipient of an Alfred P. Sloan Fellowship, an Ontario Premier's Research Excellence Award, a Marr Prize in 1999, a Marr Prize Honorable Mention in 2005, and three other paper awards (CVPR 1994, ECCV 2006, CVPR 2014). He also served as Program Co-Chair of CVPR 2003, ICCP 2010 and ICCV 2013.

Thursday, Jun 18, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Marc Teboulle
Title: Elementary Algorithms for High Dimensional Structured Optimization

Many scientific and engineering problems are challenged by the fact that they involve functions of a very large number of variables. Such problems arise naturally in signal recovery, image processing, learning theory, etc. In addition to the numerical difficulties caused by the so-called curse of dimensionality, the resulting optimization problems are often nonsmooth and nonconvex.

We shall survey some of our recent results, illustrating how these difficulties may be handled in the context of well-structured optimization models, highlighting the ways in which problem structures and data information can be beneficially exploited to devise and analyze simple and efficient algorithms.

Thursday, Jun 04, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Rene Vidal
Title: Algebraic, Sparse and Low Rank Subspace Clustering
In the era of data deluge, the development of methods for discovering structure in high-dimensional data is becoming increasingly important. Traditional approaches such as PCA often assume that the data is sampled from a single low-dimensional manifold. However, in many applications in signal/image processing, machine learning and computer vision, data in multiple classes lie in multiple low-dimensional subspaces of a high-dimensional ambient space. In this talk, I will present methods from algebraic geometry, sparse representation theory and rank minimization for clustering and classification of data in multiple low-dimensional subspaces. I will show how these methods can be extended to handle noise, outliers as well as missing data. I will also present applications of these methods to video segmentation and face clustering.
Wednesday, May 20, 2015, 13:00
Vision and Robotics Seminar, Room 1
Speaker: Thomas Brox
Title: Will ConvNets render computer vision research obsolete?
Deep learning based on convolutional network architectures has revolutionized the field of visual recognition in the last two years. There is hardly a classification task left where ConvNets do not define the state of the art. Outside recognition, deep learning seems to be of lesser importance, yet this could be a fallacy. In this talk I will present our recent work on convolutional networks and show that they can learn to solve computer vision problems that are not typically assigned to the field of recognition. I will present a network that has learned to perform descriptor matching, another that can create new images of chairs, and two networks that have learned to estimate optical flow. I will conclude with some arguments why, despite all this, computer vision will remain a serious research field.
Thursday, May 14, 2015, 12:15
Vision and Robotics Seminar, Room 141
Speaker: Guy Ben-Yosef
Title: Full interpretation of minimal images
Please note the unusual location.

The goal of this work is to produce a 'full interpretation' of object images, namely to identify and localize all semantic features and parts that are recognized by human observers. We develop a novel approach and tools to study this challenging task by dividing the interpretation of the complete object into the interpretation of so-called 'minimal recognizable configurations': severely reduced but recognizable local regions that are minimal in the sense that any further reduction would render them unrecognizable. We show that for the task of full interpretation such minimal images have unique properties, which make them particularly useful.

For modeling interpretation, we identify primitive components and relations that play a useful role in the interpretation of minimal images by humans, and incorporate them in a structured prediction algorithm. The structure elements can be point, contour, or region primitives, while relations between them range from standard unary and binary potentials based on relative location, to more complex and high dimensional relations. We show experimental results and match them to human performance. We discuss implications of ‘full’ interpretation for difficult visual tasks, such as recognizing human activities or interactions.

Thursday, Apr 02, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Yonatan Wexler
Title: Machine Learning In Your Pocket

The field of machine learning has been making huge strides recently. Problems such as visual recognition and classification, which were believed to be open only a few years ago, now seem solvable. The best performers use artificial neural networks in their reincarnation as "deep learning", where huge networks are trained over large amounts of data. One bottleneck in current schemes is the huge amount of computation required during both training and testing. This limits the usability of these methods when power is an issue, such as with wearable devices.

As a step towards a deeper understanding of deep learning mechanisms, I will show how correct conditioning of the back-propagation training iterations results in much improved convergence. This reduces training time while providing better results. It also allows us to train smaller models, which are otherwise harder to optimize.

In this talk I will also discuss the challenges, and describe some of the solutions, in applying machine learning on a mobile device that fits in your pocket. The OrCam is a wearable camera that speaks to you. It reads anything, learns and recognizes faces, and much more. It is ready to help throughout the day, all with a simple pointing gesture. It is already improving the lives of many blind and visually impaired people.

Thursday, Mar 26, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Lior Wolf
Title: Image Annotation using Deep Learning and Fisher Vectors
We present a system for solving the holy grail of computer vision -- matching images and text and describing an image by an automatically generated text. Our system is based on combining deep learning tools for images and text, namely Convolutional Neural Networks, word2vec, and Recurrent Neural Networks, with a classical computer vision tool, the Fisher Vector. The Fisher Vector is modified to support hybrid distributions that are a much better fit for the text data. Our method proves to be extremely potent, outperforming all competing methods by a significant margin.
Thursday, Mar 19, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Simon Korman
Title: Inverting RANSAC: Global Model Detection via Inlier Rate Estimation
This work presents a novel approach for detecting inliers in a given set of correspondences (matches). It does so without explicitly identifying any consensus set, based on a method for inlier rate estimation (IRE). Given such an estimator for the inlier rate, we also present an algorithm that detects a globally optimal transformation. We provide a theoretical analysis of the IRE method using a stochastic generative model on the continuous spaces of matches and transformations. This model allows rigorous investigation of the limits of our IRE method for the case of 2D translation, further giving bounds and insights for the more general case. Our theoretical analysis is validated empirically and is shown to hold in practice for the more general case of 2D affinities. In addition, we show that the combined framework works on challenging cases of 2D homography estimation, with very few and possibly noisy inliers, where RANSAC generally fails. Joint work with Roee Litman, Alex Bronstein and Shai Avidan.
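For contrast with the global approach above, here is a minimal sketch of the vanilla RANSAC baseline for the 2D translation case, where a single match is a minimal sample. The function name, defaults, and thresholds are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ransac_translation(src, dst, iters=500, tol=2.0, seed=0):
    """Vanilla RANSAC for a 2D translation model.

    src, dst: (n, 2) arrays of putative point matches. Repeatedly
    hypothesizes a translation from one random match and keeps the
    hypothesis with the largest consensus (inlier) set.
    """
    rng = np.random.default_rng(seed)
    best_t, best_inliers = None, -1
    for _ in range(iters):
        i = rng.integers(len(src))               # minimal sample: 1 match
        t = dst[i] - src[i]                      # hypothesized translation
        resid = np.linalg.norm(src + t - dst, axis=1)
        n_in = int((resid < tol).sum())          # consensus set size
        if n_in > best_inliers:
            best_t, best_inliers = t, n_in
    return best_t, best_inliers
```

Unlike this sampling scheme, the IRE approach described in the abstract estimates the inlier rate itself and searches for a globally optimal transformation, which is what lets it cope with very low inlier rates where random sampling becomes hopeless.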
Thursday, Jan 29, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Avishay Gal-Yam and Barak Zackay
Title: New ways to look at the sky
We present a general review of astronomical observation, with emphasis on the ways it differs from conventional imaging or photography. We then describe emerging trends in this area driven mainly by advances in detector technology and computing power. Having set a broad context, we then describe the new multiplexed imaging technique we have developed. This method uses the sparseness of typical astronomical data in order to image large areas of target sky using a physically small detector.
Monday, Jan 26, 2015, 14:00
Vision and Robotics Seminar, Room 141
Speaker: Greg Shakhnarovich
Title: Feedforward semantic segmentation with zoom-out features
Note the unusual day, time, and room.
We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead, superpixels are classified by a feedforward multilayer network. Our architecture achieves new state-of-the-art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set. Joint work with Mohammadreza Mostajabi and Payman Yadollahpour.
Monday, Jan 12, 2015, 14:00
Vision and Robotics Seminar, Room 141
Speaker: Karen Livescu
Title: Multi-view representation learning: A tutorial introduction and applications to speech and language
Note the unusual room, day, and time.

Many types of multi-dimensional data have a natural division into two "views", such as audio and video or images and text. Multi-view learning includes a variety of techniques that use multiple views of data to learn improved models for each of the views. The views can be multiple measurement modalities (like the examples above), but can also be different types of information extracted from the same source (words + context, document text + links) or any division of the data dimensions into subsets satisfying certain learning assumptions. Theoretical and empirical results show that multi-view techniques can improve over single-view ones in certain settings. In many cases multiple views help by reducing noise in some sense (what is noise in one view is not in the other). In this talk, I will focus on multi-view learning of representations (features), especially using canonical correlation analysis (CCA) and related techniques. I will give a tutorial overview of CCA and its relationship with other techniques such as partial least squares (PLS) and linear discriminant analysis (LDA). I will also present extensions developed by ourselves and others, such as kernel, deep, and generalized ("many-view") CCA. Finally, I will give recent results on speech and language tasks, and demonstrate our publicly available code.

Based on joint work with Raman Arora, Weiran Wang, Jeff Bilmes, Galen Andrew, and others.
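For readers new to CCA, here is a compact numpy sketch of the classical linear algorithm: whiten each view, then take the SVD of the whitened cross-covariance. This is a generic textbook construction for illustration, not the speaker's code; the regularization constant is an assumption added for numerical stability.

```python
import numpy as np

def linear_cca(X, Y, k=1, reg=1e-6):
    """Classical linear CCA.

    X: (n, dx) and Y: (n, dy) hold n paired samples from two views.
    Returns projections A (dx, k), B (dy, k) and the top-k canonical
    correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)             # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}
    T = np.linalg.solve(Lx, np.linalg.solve(Ly, Cxy.T).T)
    U, s, Vt = np.linalg.svd(T)
    A = np.linalg.solve(Lx.T, U[:, :k])      # map back to the original spaces
    B = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return A, B, s[:k]
```

The kernel, deep, and generalized variants mentioned in the talk replace the linear maps A and B with feature maps or neural networks while keeping the same correlation objective.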

Thursday, Jan 08, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tomer Michaeli
Title: Blind deblurring and blind super-resolution using internal patch recurrence

Small image patches tend to recur at multiple scales within high-quality natural images. This fractal-like behavior has been used in the past for various tasks including image compression, super-resolution and denoising. In this talk, I will show that this phenomenon can also be harnessed for "blind deblurring" and "blind super-resolution", that is, for removing blur or increasing resolution without a priori knowledge of the associated blur kernel. It turns out that the cross-scale patch recurrence property is strong only in images taken under ideal imaging conditions, and diminishes significantly as the imaging conditions deviate from ideal ones. The deviations from ideal patch recurrence therefore provide information on the unknown camera blur kernel. More specifically, we show that the correct blur kernel is the one that maximizes the similarity between patches across scales of the image. Extensive experiments indicate that our approach leads to state-of-the-art results, both in deblurring and in super-resolution.

Joint work with Michal Irani.
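The cross-scale recurrence property itself is easy to probe numerically. Below is a toy measurement of it (brute-force nearest-neighbor search between patches of an image and a crudely subsampled copy), intended only to illustrate the phenomenon the talk builds on; it is not the authors' kernel estimator, and the patch size and chunking are arbitrary choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def cross_scale_recurrence(img, p=5):
    """Mean distance from each p x p patch of img to its nearest
    neighbor among the patches of a 2x-subsampled copy; lower values
    indicate stronger cross-scale patch recurrence."""
    P = sliding_window_view(img, (p, p)).reshape(-1, p * p)
    Q = sliding_window_view(img[::2, ::2], (p, p)).reshape(-1, p * p)
    best = np.empty(len(P))
    for i in range(0, len(P), 256):          # chunked brute-force NN search
        diff = P[i:i + 256, None, :] - Q[None, :, :]
        best[i:i + 256] = np.sqrt((diff ** 2).sum(-1)).min(axis=1)
    return float(best.mean())
```

On a sharp step-edge image the score is essentially zero, since every patch recurs exactly at the coarser scale; blurring the image first raises the score, and this gap is the kind of signal the talk exploits to recover the blur kernel.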

Thursday, Jan 01, 2015, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Tal Hassner
Title: Towards Dense Correspondences Between Any Two Images

We present a practical method for establishing dense correspondences between two images with similar content, but possibly different 3D scenes. One of the challenges in designing such a system is the local scale differences of objects appearing in the two images. Previous methods often considered only small subsets of image pixels, matching only pixels for which stable scales may be reliably estimated. More recently, others have considered dense correspondences, but with substantial costs associated with generating, storing and matching scale-invariant descriptors.
Our work here is motivated by the observation that pixels in the image have contexts -- the pixels around them -- which may be exploited in order to estimate local scales reliably and repeatably. In practice, we demonstrate that scales estimated at sparse interest points may be propagated to neighboring pixels where this information cannot be reliably determined. Doing so allows scale-invariant descriptors to be extracted anywhere in the image, not just at detected interest points. As a consequence, accurate dense correspondences are obtained even between very different images, at little computational cost beyond that required by existing methods.

This is joint work with Moria Tau from the Open University of Israel.

Thursday, Dec 25, 2014, 13:00
Vision and Robotics Seminar, Room 1
Speaker: Hadar Elor
Title: RingIt: Ring-ordering Casual Photos of a Dynamic Event
The multitude of cameras constantly present nowadays has redefined the meaning of capturing an event and of sharing it with others. The images are frequently uploaded to a common platform, and the image-navigation challenge naturally arises. In this talk I will present RingIt, a novel technique for sorting an unorganized set of casual photographs taken along a general ring, where the cameras capture a dynamic event at the center of the ring. We assume a nearly instantaneous event, e.g., an interesting moment in a performance captured by the digital cameras and smartphones of the surrounding crowd. The ordering method extracts the k-nearest neighbors (KNN) of each image from a rough all-pairs dissimilarity estimate. The KNN dissimilarities are refined to form a sparse weighted Laplacian, and a spectral analysis reveals the spatial ordering of the images, allowing for a sequential display of the captured object.
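The spectral step described above can be illustrated in a few lines: build a KNN graph from the dissimilarities, form its Laplacian, and read the cyclic order off the angle in the 2D embedding given by the first two nontrivial eigenvectors. This is a bare-bones sketch under idealized assumptions (clean dissimilarities, unit edge weights), not the authors' refined pipeline.

```python
import numpy as np

def ring_order(D, k=2):
    """Recover a cyclic ordering of n items from an n x n dissimilarity
    matrix D: KNN graph -> graph Laplacian -> angular position in the
    plane spanned by the two lowest nontrivial eigenvectors."""
    n = len(D)
    W = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]       # k nearest neighbors (skip self)
        W[i, nn] = W[nn, i] = 1.0            # symmetric, unit weights
    L = np.diag(W.sum(axis=1)) - W           # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    x, y = vecs[:, 1], vecs[:, 2]            # 2D spectral embedding
    return np.argsort(np.arctan2(y, x))      # order around the ring
```

For cameras evenly spaced on a circle this recovers the ground-truth order up to a rotation and reflection of the ring, which is the inherent ambiguity of the problem.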
Thursday, Dec 11, 2014, 12:15
Vision and Robotics Seminar, Room 1
Speaker: Boaz Nadler
Title: Edge Detection under computational constraints: a sublinear approach
Edge detection is an important task in image analysis. Various applications require real-time detection of long edges in large noisy images. Motivated by such settings, in this talk we'll address the following question: how well can one detect long edges under severe computational constraints that allow only a fraction of all image pixels to be processed? We present fundamental lower bounds on edge detection in this setup, a sublinear algorithm for long edge detection, and a theoretical analysis of the inevitable tradeoff between its detection performance and the allowed computational budget. The competitive performance of our algorithm will be illustrated on both simulated and real images. Joint work with Inbal Horev, Meirav Galun, Ronen Basri (Weizmann) and Ery Arias-Castro (UCSD).
Thursday, Dec 04, 2014, 12:00
Vision and Robotics Seminar, Room 1
Speaker: Shai Avidan
Title: Extended Lucas-Kanade Tracking
Lucas-Kanade (LK) is a classic tracking algorithm exploiting target structural constraints through template matching. Extended Lucas-Kanade, or ELK, casts the original LK algorithm as a maximum likelihood optimization and then extends it by considering pixel object/background likelihoods in the optimization. Template matching and pixel-based object/background segregation are tied together by a unified Bayesian framework. In this framework two log-likelihood terms related to pixel object/background affiliation are introduced in addition to the standard LK template matching term. Tracking is performed using an EM algorithm, in which the E-step corresponds to pixel object/background inference and the M-step to parameter optimization. The final algorithm, implemented using a classifier for object/background modeling and equipped with simple template-update and occlusion-handling logic, is evaluated on two challenging datasets containing 50 sequences each. The first is a recently published benchmark on which ELK ranks 3rd among 30 evaluated tracking methods. On the second dataset, of vehicles undergoing severe viewpoint changes, ELK ranks 1st, outperforming state-of-the-art methods. Joint work with Shaul Oron (Tel-Aviv University) and Aharon Bar-Hillel (Microsoft).
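As background for the template-matching term, here is the classic Lucas-Kanade building block that ELK starts from, reduced to its simplest form: one least-squares step estimating a small global translation from the linearized brightness-constancy equation. A toy sketch for illustration, not the ELK tracker itself.

```python
import numpy as np

def lk_translation_step(I1, I2):
    """One Lucas-Kanade step for a small global shift d = (u, v).

    Linearizing I2(x) ~ I1(x) - grad(I1) . d gives the least-squares
    system  [Ix Iy] d = I1 - I2  over all pixels."""
    Iy, Ix = np.gradient(I1)                 # gradients along rows, columns
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = (I1 - I2).ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d                                 # estimated (u, v)
```

A full LK tracker iterates this step while warping the image toward the template; ELK augments the matching cost with the per-pixel object/background log-likelihood terms described above.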
Thursday, Nov 27, 2014, 12:00
Vision and Robotics Seminar, Room 1
Speaker: Fred Hamprecht
Title: Joint segmentation and tracking, and new unsolved problems

On my last visit in 2012, I posed a number of open questions, including how to achieve joint segmentation and tracking, and how to obtain uncertainty estimates for a segmentation.

Some of these questions we have been able to solve [Schiegg ICCV 2013, Schiegg Bioinformatics 2014, Fiaschi CVPR 2014], and I would like to report on this progress.

Given that I will be at Weizmann for another four months, I will also pose new open questions on image processing problems that require a combination of combinatorial optimization and (structured) learning, as an invitation to work together.

Thursday, Nov 20, 2014, 12:00
Vision and Robotics Seminar, Room 1
Speaker: Marina Alterman
Title: Vision Through Random Refractive Distortions
Random dynamic distortions naturally affect images taken through atmospheric turbulence or wavy water. We show how computer vision can function under such effects, and even exploit them, relying on physical, geometric and statistical models of refractive disturbances. We make good use of distortions created by atmospheric turbulence: distorted multi-view videos lead to tomographic reconstruction of large-scale turbulence fields, outdoors. We also demonstrate several approaches to a 'virtual periscope', to view airborne scenes from submerged cameras: (a) multiple submerged views enable stochastic localization of airborne objects in 3D; (b) the wavy water surface (and hence the distortion) can be passively estimated instantly, using a special sensor analogous to those in modern astronomical telescopes; and (c) we show how airborne moving objects can be automatically detected, despite dynamic distortions affecting the entire scene. In all these works, exploiting physical models in new ways leads to novel imaging tasks, while the approaches we take are demonstrated in field experiments.
Thursday, Nov 13, 2014, 12:00
Vision and Robotics Seminar, Room 1
Speaker: Barak Zackay
Title: Imaging through turbulence: a long quest of innovative computational photography in astronomy

The astronomical community's largest technical challenge is coping with the Earth's atmosphere. In this talk, I will present the popular methods for performing scientific measurements from the ground while coping with the time-dependent distortions generated by the Earth's atmosphere. We will talk about the following topics:

1) Scientific motivation for eliminating the effect of the atmosphere.

2) The statistics of turbulence: the basis for all methods is a deep understanding of atmospheric turbulence.

3) Wave-front sensing and adaptive optics: a way to correct for the atmosphere in hardware.

4) Lucky imaging and speckle interferometry: ways to computationally extract scientifically valuable data despite the turbulent atmosphere.

Tuesday, Nov 04, 2014, 10:00
Vision and Robotics Seminar, Room 141
Speaker: Rob Fergus
Title: Learning to Discover Efficient Mathematical Identities
Note the unusual time and place.
In this talk, I will describe how machine learning techniques can be applied to the discovery of efficient mathematical identities. We introduce an attribute grammar framework for representing symbolic expressions. Given a set of grammar rules, we build trees that combine different rules, looking for branches which yield compositions that are analytically equivalent to a target expression but of lower computational complexity. However, as the size of the trees grows exponentially with the complexity of the target expression, brute-force search is impractical for all but the simplest of expressions. Consequently, we explore two learning approaches that are able to learn from simpler expressions to guide the tree search: the first is a simple n-gram model, the other a recursive neural network. We show how these approaches enable us to derive complex identities beyond the reach of brute-force search or human derivation. Joint work with Wojciech Zaremba and Karol Kurach.