Vision and AI

Thursday, Aug 03, 2023, 12:15
Vision and AI, Room 1
Speaker: Eyal Ofek
Title: Work-Verse: Using augmentation of user’s senses, and scene understanding to enable a more inclusive workspace
Abstract:

As more people work from home or while traveling, new opportunities and challenges arise around mobile office work. On the one hand, people may work at flexible hours, independent of traffic limitations. On the other hand, they may need to work in makeshift spaces with less-than-optimal working conditions; applications are not flexible to their physical and social context, and remote collaboration does not account for the differences between users' conditions and capabilities.

Using a better understanding of the physical and social constraints of the user's environment, and the ability to augment users' senses to disconnect them from such constraints, my research looked at designing applications that can flexibly fit a changing environment and personalize to each user, while maintaining usability and familiarity.


Eyal Ofek received his Ph.D. in computer vision from the Hebrew University of Jerusalem in 2000. He founded two startup companies, in areas of computer graphics and computer vision, and in 1996 he joined the founding group of 3DV Systems, developing the world's first time-of-flight (TOF) depth camera. The technology that he worked on became the basis for the depth sensors, later included in Augmented Reality headsets such as Microsoft HoloLens and Magic Leap HMDs.

In 2004, Eyal joined Microsoft Research and for 6 years was the head of Bing Maps & Mobile research. His group developed technologies and services such as the world's first street-side service, and the popular Stroke Width Transform text detection, later included in OpenCV. In 2011 Eyal formed a new research group at Microsoft Research, centered on augmented reality. The group has developed experimental systems for an environment-aware layout of augmented reality experiences, used by several HoloLens teams and was a base for the Unity Mars product.

Since 2014, Eyal has focused on Human-Computer Interaction (HCI), Mixed Reality (MR), and Haptics.

Sunday, Jul 16, 2023, 12:15
Vision and AI, Room 155
Speaker: Tomer Weiss
Title: Deep Learning Approaches for Inverse Problems in Computational Imaging and Chemistry
*** Please Note the Unusual Day & Place ***
Abstract:
In this talk, I will present two chapters from my Ph.D. thesis. The core of my research focuses on methods that utilize the power of modern neural networks not only for their conventional tasks, such as prediction or reconstruction, but also use the information they “learned” (usually in the form of their gradients) to optimize some end-task, draw insight from the data, or even guide a generative model. The first part of the talk is dedicated to computational imaging and shows how to apply joint optimization of the forward and inverse models to improve the end performance. We demonstrate these methods on three different tasks in the fields of Magnetic Resonance Imaging (MRI) and Multiple Input Multiple Output (MIMO) radar imaging. In the second part, we show a novel method for molecular inverse design that utilizes the power of neural networks to propose molecules with desired properties. We developed a guided diffusion model that uses the gradients of a pre-trained prediction model to guide a pre-trained unconditional diffusion model toward the desired properties. In general, this method allows any unconditional diffusion model to be transformed into a conditional generative model.
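As a rough illustration of the guidance idea described in this abstract, the identity below is the generic Bayes-rule decomposition that underlies gradient-guided sampling. It is not necessarily the exact formulation used in the thesis. The symbols are assumptions introduced here: $x_t$ is the noisy sample, $y$ the desired property, $p_\phi$ the pre-trained predictor, and $s$ a guidance scale.

```latex
\nabla_{x_t} \log p(x_t \mid y)
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)
\quad\Longrightarrow\quad
\tilde{g}(x_t, y)
  = \nabla_{x_t} \log p(x_t) + s \, \nabla_{x_t} \log p_\phi(y \mid x_t)
```

The first term is approximated by the unconditional diffusion model and the second by the gradient of the predictor; setting $s = 1$ recovers the exact identity, while larger $s$ pushes samples harder toward the desired property.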
Thursday, Jul 13, 2023, 12:15
Vision and AI, Room 1
Speaker: Deborah Levy
Title: SeaThru-NeRF: neural radiance fields in scattering media
Abstract:
Research on neural radiance fields (NeRFs) for novel view generation is exploding with new models and extensions. However, a question that remains unanswered is what happens in underwater or foggy scenes where the medium strongly influences the appearance of objects. Thus far, NeRF and its variants have ignored these cases. However, since the NeRF framework is based on volumetric rendering, it has inherent capability to account for the medium’s effects, once modeled appropriately. We develop a new rendering model for NeRFs in scattering media, which is based on the SeaThru image formation model, and suggest a suitable architecture for learning both scene information and medium parameters. We demonstrate the strength of our method using simulated and real-world scenes, correctly rendering novel photorealistic views underwater. Even more excitingly, we can render clear views of these scenes, removing the medium between the camera and the scene and reconstructing the appearance and depth of far objects, which are severely occluded by the medium. I will also briefly show several other projects from our lab.
Thursday, Jul 06, 2023, 12:15
Vision and AI, Room 1
Speaker: Daniella Horan
Title: When is Unsupervised Disentanglement Possible?
Abstract:
A common assumption in many domains is that high dimensional data are a smooth nonlinear function of a small number of independent factors. When is it possible to recover the factors from unlabeled data? In the context of deep models this problem is called "disentanglement" and was recently shown to be impossible without additional strong assumptions. In this work, we show that the assumption of local isometry together with non-Gaussianity of the factors, is sufficient to provably recover disentangled representations from data. We leverage recent advances in deep generative models to construct manifolds of highly realistic images for which the ground truth latent representation is known, and test whether modern and classical methods succeed in recovering the latent factors. For many different manifolds, we find that a spectral method that explicitly optimizes local isometry and non-Gaussianity consistently finds the correct latent factors, while baseline deep autoencoders do not. We propose how to encourage deep autoencoders to find encodings that satisfy local isometry and show that this helps them discover disentangled representations. Overall, our results suggest that in some realistic settings, unsupervised disentanglement is provably possible, without any domain-specific assumptions.
Thursday, Jun 29, 2023, 11:15
Vision and AI, Room 1
Speaker: Hadar Averbuch-Elor
Title: Marrying Vision and Language: A Mutually Beneficial Relationship?
*** Please Note the Unusual Time ***
Abstract:
Foundation models that connect vision and language have recently shown great promise for a wide array of tasks such as text-to-image generation. Significant attention has been devoted towards utilizing the visual representations learned from these powerful vision and language models. In this talk, I will present an ongoing line of research that focuses on the other direction, aiming at understanding what knowledge language models acquire through exposure to images during pretraining. We first consider in-distribution text and demonstrate how multimodally trained text encoders, such as that of CLIP, outperform models trained in a unimodal vacuum, such as BERT, over tasks that require implicit visual reasoning. Expanding to out-of-distribution text, we address a phenomenon known as sound symbolism, which studies non-trivial correlations between particular sounds and meanings across languages and demographic groups, and demonstrate the presence of this phenomenon in vision and language models such as CLIP and Stable Diffusion. Our work provides new angles for understanding what is learned by these vision and language foundation models, offering principled guidelines for designing models for tasks involving visual reasoning. Bio: Hadar Averbuch-Elor is an Assistant Professor at the School of Electrical Engineering in Tel Aviv University. Before that, Hadar was a postdoctoral researcher at Cornell-Tech. She completed her PhD in Electrical Engineering at Tel-Aviv University. Hadar is a recipient of several awards including the Zuckerman Postdoctoral Scholar Fellowship, the Schmidt Postdoctoral Award for Women in Mathematical and Computing Sciences, and the Alon Fellowship for the Integration of Outstanding Faculty. She was also selected as a Rising Star in EECS in 2020. Hadar's research interests lie in the intersection of computer graphics and computer vision, particularly in combining pixels with more structured modalities, such as natural language and 3D geometry.
Sunday, Jun 25, 2023, 12:15
Vision and AI, Room 155
Speaker: Yftah Ziser
Title: Extending the Reach of NLP: Overcoming the Data Bottleneck
*** Joint with Machine Learning and Statistics Seminar *** *** Please note the unusual day and time ***
Abstract:
Transformer-based models have revolutionized natural language processing (NLP) and significantly improved various NLP tasks. However, many researchers make implicit assumptions about their training setups, assuming that the train and test sets are drawn from the same distribution. This assumption can limit the applicability of these models across different languages and domains. The high cost of training state-of-the-art NLP models using various languages and domains has resulted in training them for only a subset of languages and domains, leading to a significant performance gap in excluded domains and languages. This performance gap marginalizes many individuals from accessing useful models. This talk will address the challenges, approaches, and opportunities for democratizing NLP across different languages and domains. Finally, we will explore future directions for making these models accessible to a broader audience.
Thursday, Jun 15, 2023, 12:15
Vision and AI, Room 1
Speaker: Daniel Soudry
Title: Are deep networks broken, and should we fix them?
*** Joint with Machine Learning & Statistics Seminar ***
Abstract:

We analyze three cases where Deep Neural Networks (DNNs) seem to work "sub-optimally". We examine whether these issues can, or should, be fixed.
Convnets were originally designed to be shift-invariant, but this does not hold because of aliasing. We show how to completely fix this issue using polynomial activations and achieve state-of-the-art performance under adversarial shifts, even for fractional shifts.
DNNs are known to exhibit "catastrophic forgetting" when trained on sequential tasks. We show this can happen even in a linear setting for regression and classification. However, we derive universal bounds which guarantee no catastrophic forgetting in certain cases, such as when tasks are repeated or randomly ordered.
It was recently observed that DNNs, optimized with gradient descent, operate at the "Edge of Stability", in which monotone convergence is not guaranteed. However, we prove that a different potential function ("gradient flow solution sharpness") is monotonically decreasing in scalar networks, observe that this holds empirically in DNNs, and discuss the implications.
[1] H. Michaeli, T. Michaeli, D. Soudry, "Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations", CVPR 2023.
[2] I. Evron, E. Moroshko, G. Buzaglo, M. Khriesh, B. Marjieh, N. Srebro, D. Soudry, "Continual Learning in Linear Classification on Separable Data", ICML 2023.
[3] I. Kreisler*, M. Shpigel Nacson*, D. Soudry, Y. Carmon, "Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond", ICML 2023.

Thursday, Jun 08, 2023, 12:15
Vision and AI, Room 1
Speaker: Shiran Zada
Title: Imagic: Text-Based Real Image Editing with Diffusion Models
Abstract:
Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently either limited to specific editing types (e.g., object overlay, style transfer), or apply to synthetically generated images, or require multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-guided semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down or jump, cause a bird to spread its wings, etc., each within a single high-resolution natural image provided by the user. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, which we call "Imagic", leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of our method on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework.
Thursday, Jun 01, 2023, 12:15
Vision and AI, Room 1
Speaker: Yoav Schechner
Title: Spaceborne multi-view computational tomography (CT)
Abstract:
We describe new computer vision tasks stemming from upcoming multiview tomography from space. Solutions involve both novel imaging hardware and computational algorithms, based on machine learning and differential rendering. This can transform climate research and medical X-ray CT. The key idea is that advanced computing can enable computed tomography of volumetric scenes, based on scattered radiation. We describe an upcoming space mission (CloudCT, funded by the ERC). It has 10 nano-satellites that will fly in an unprecedented formation, to capture the same scene (cloud fields) from multiple views simultaneously, using special cameras. The satellites and cameras are built now. They - and the algorithms - are specified to meet computer vision tasks, including geometric and polarimetric self-calibration in orbit, and estimation of 3D volumetric distribution of matter and microphysical properties. Deep learning and differential rendering enable analysis to scale to big data downlinked from orbit. Core ideas are generalized for medical X-ray imaging, to enable significant reduction of dose and acquisition time, while extracting chemical properties per voxel. The creativity of the computer vision and graphics communities can assist in critical needs for society, and this talk points out relevant challenges.
Thursday, May 11, 2023, 12:15
Vision and AI
Speaker: Guy Tevet
Title: Human Motion Diffusion Model
*** Please Note The Unusual Location *** *** The Ullman 101 auditorium is Left from the main entrance ***
Abstract:
Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion.
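To make the "predict the sample rather than the noise" remark concrete, here is a minimal sketch under the standard diffusion forward process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise. The function names, the one-dimensional toy "motion", and the simple velocity loss are all hypothetical illustrations and are not taken from MDM. The point is only that the two parameterisations carry the same information, while a loss on positions or velocities is most naturally written on the predicted sample.

```python
import math


def add_noise(x0, noise, alpha_bar):
    """Forward diffusion step: mix the clean sample with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]


def sample_from_noise(x_t, noise, alpha_bar):
    """Convert a noise prediction into the equivalent clean-sample prediction."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - b * e) / a for xt, e in zip(x_t, noise)]


def velocity_loss(pred, target):
    """Hypothetical geometric loss: mean squared error on frame-to-frame velocity."""
    pred_vel = [q - p for p, q in zip(pred, pred[1:])]
    target_vel = [q - p for p, q in zip(target, target[1:])]
    return sum((p - t) ** 2 for p, t in zip(pred_vel, target_vel)) / len(pred_vel)


# Toy 1-D "motion" of four frames and a fixed noise draw (illustrative values).
x0 = [0.0, 0.1, 0.3, 0.6]
noise = [0.5, -1.2, 0.3, 0.8]
alpha_bar = 0.3

x_t = add_noise(x0, noise, alpha_bar)
recovered = sample_from_noise(x_t, noise, alpha_bar)
```

With a perfect noise prediction, `recovered` matches `x0` up to floating-point error, and `velocity_loss` can be applied directly to it.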
Thursday, May 04, 2023, 12:15
Vision and AI, Room 1
Speaker: Shai Avidan
Title: Matching 3D Point Clouds
Abstract:
I will present three deep learning algorithms for registering 3D point clouds in different settings. The first is designed to find a rigid transformation between point clouds and is based on the concept of best buddies similarity. The second algorithm offers a fast method for non-rigid dense correspondence between point clouds based on structured shape construction. Finally, I extend the second algorithm to handle scene flow estimation that can be learned on a small amount of data without employing ground-truth flow supervision.
Thursday, Apr 27, 2023, 12:15
Vision and AI, Room 1
Speaker: Omri Avrahami
Title: SpaText: Spatio-Textual Representation for Controllable Image Generation
Abstract:
Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText - a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
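For readers unfamiliar with classifier-free guidance, the sketch below shows one common single-condition parameterisation, in which the guided noise estimate extrapolates from the unconditional prediction toward the conditional one. It does not reproduce the multi-conditional extension proposed in the talk; the function name and the toy vectors are assumptions made for illustration.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    scale = 0 returns the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (illustrative values, not model outputs).
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

unguided = classifier_free_guidance(eps_uncond, eps_cond, 0.0)
conditional = classifier_free_guidance(eps_uncond, eps_cond, 1.0)
amplified = classifier_free_guidance(eps_uncond, eps_cond, 2.0)
```

Some papers write the same rule as (1 + w) * eps_cond - w * eps_uncond, which is equivalent with scale = 1 + w.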
Thursday, Apr 20, 2023, 12:15
Vision and AI, Room 1
Speaker: Niv Haim
Title: Training Set Reconstruction and Single-Video Generation
Abstract:
Over the past decade, deep learning has made significant strides in the fields of computer vision and machine learning. However, there is still a lack of understanding regarding how these machines store and utilize training samples to generalize to unseen data. In my thesis (guided by Prof. Irani), I investigated how neural networks encode training samples in their parameters and how such samples can sometimes be reconstructed. Additionally, I examined the capabilities of generative models in learning and generalizing from a single video. Specifically, I explored the effectiveness of patch-based methods and diffusion models in generating diverse output samples, and how such models can utilize the motion and dynamics of a single input video to learn and generalize.
Thursday, Feb 16, 2023, 12:15
Vision and AI, Room 1
Speaker: Gal Chechik
Title: Perceive, reason, act
Abstract:

AI aims to build systems that interact with their environment, with people, and with other agents in the real world. This vision requires combining perception with reasoning and decision-making. It poses hard algorithmic challenges: from generalizing effectively from few or no samples to adapting to new domains to communicating in ways that are natural to people. I'll discuss our recent research thrusts for facing these challenges. These will include approaches to model the high-level structure of a visual scene; leveraging compositional structures in attribute space to learn from descriptions without any visual samples; and teaching agents new concepts without labels, by using elimination to reason about their environment.



Gal Chechik is a Professor at Bar-Ilan University and a director of AI research at NVIDIA. His current research focuses on learning for reasoning and perception. In 2018, Gal joined NVIDIA to found and head NVIDIA's research in Israel. Prior to that, Gal was a staff research scientist at Google Brain and Google Research, developing large-scale algorithms for machine perception, used by millions daily. Gal earned his PhD in 2004 from the Hebrew University, and completed his postdoctoral training at the Stanford CS department. In 2009, he started the learning systems lab at the Gonda brain center of Bar-Ilan University, and was appointed an associate professor in 2013. Gal has authored ~120 refereed publications and ~49 patents, including publications in Nature Biotechnology, Cell and PNAS. His work won awards at ICML and NeurIPS.

Thursday, Feb 09, 2023, 12:15
Vision and AI, Room 1
Speaker: Tomer Michaeli
Title: The implicit bias of SGD: A Minima stability analysis
*** Joint ML/Statistics & Vision/AI Seminar ***
Abstract:
One of the puzzling phenomena in deep learning is that neural networks tend to generalize well even when they are highly overparameterized. Recent works linked this behavior with implicit biases of the algorithms used to train networks (like SGD). Here we analyze the implicit bias of SGD from the standpoint of minima stability, focusing on shallow ReLU networks trained with a quadratic loss. Specifically, it is known that SGD can stably converge only to minima that are flat enough w.r.t. its step size. Here we show that this property forces the predictor function to become smoother as the step size increases, thus significantly regularizing the solution. Furthermore, we analyze the representation power of stable solutions. Particularly, we prove a depth-separation result: There exist functions that cannot be approximated by depth-2 networks corresponding to stable minima, no matter how small the step size is taken to be, but which can be implemented with depth-3 networks corresponding to stable minima. We show how our theoretical findings explain behaviors observed in practical settings. (Joint works with Rotem Mulayoff, Mor Shpigel Nacson, Greg Ongie, Daniel Soudry).
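The claim that only sufficiently flat minima are stable for a given step size can be illustrated in the simplest possible setting: full-batch gradient descent on a one-dimensional quadratic f(x) = 0.5 * curvature * x^2, where the iterates contract only if curvature < 2 / step_size. This toy example is an assumption-laden sketch and does not reproduce the SGD analysis of the talk; the function name and the numeric values are chosen purely for illustration.

```python
def run_gradient_descent(curvature, step_size, x0=1.0, steps=100):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        gradient = curvature * x
        x = x - step_size * gradient  # each step multiplies x by (1 - step_size * curvature)
    return x


step_size = 0.1
stability_threshold = 2.0 / step_size  # minima sharper than this are unstable

# A flat minimum (curvature below the threshold) attracts the iterates...
flat_result = abs(run_gradient_descent(curvature=5.0, step_size=step_size))
# ...while a sharp minimum (curvature above the threshold) repels them.
sharp_result = abs(run_gradient_descent(curvature=25.0, step_size=step_size))
```

Increasing `step_size` lowers `stability_threshold`, so fewer (and flatter) minima remain stable, which is the mechanism the abstract connects to smoother predictors.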
Thursday, Jan 26, 2023, 12:15
Vision and AI, Room 1
Speaker: Lihi Zelnik-Manor
Title: Digitizing Touch
Abstract:
Imagine being able to touch virtual objects, interact physically with computer games, or feel items that are located elsewhere on the globe. The applications of such haptic technology would be diverse and broad. Interestingly, while excellent visual and auditory feedback devices exist, cutaneous feedback devices are still in their infancy. In this talk I will present a brief introduction to the world of haptic feedback devices and the challenges it poses. Then I will present HUGO, a device designed in a human-centered process, triggering the mechanoreceptors in our skin thus enabling people to experience the touch of digitized surfaces "in-the-wild". This talk is likely to leave us with many open questions that require research to answer. Bio: Prof. Lihi Zelnik-Manor is a Full Professor and Vice Dean of Graduate Studies in the Faculty of Electrical and Computer Engineering at the Technion. From 2018 to 2021 she was a Senior Director and the General Manager of Alibaba's R&D center in Israel. Prior to that she was a visiting Associate Professor at CornellTech during its establishment years, and a Post-doctoral scholar at Caltech. Her main area of expertise is Computer Vision, in which she performs research as well as holds industry advisory roles. Prof. Zelnik-Manor has contributed extensively to the community, serving as General Chair of CVPR'21 and ECCV'22, Program Chair of CVPR'16, Associate Editor at TPAMI, and multiple times as Area Chair at CVPR and ECCV, and she was on the award committee of ACCV'18, CVPR'19 and CVPR'22. Looking forward, she will serve as Program Chair of ICCV'25.
Thursday, Jan 12, 2023, 12:15
Vision and AI, Room 1
Speaker: Mark Sheinin
Title: Computational Imaging for Enabling Vision Beyond Human Perception
Abstract:
From minute surface vibrations to very fast-occurring events, the world is rich with phenomena humans cannot perceive. Likewise, most computer vision systems are primarily based on 'conventional' cameras, which were designed to mimic the imaging principle of the human eye, and therefore are equally blind to these ubiquitous phenomena. In this talk, I will show that we can capture these hidden phenomena by creatively building novel vision systems composed of common off-the-shelf components (i.e., cameras and optics) coupled with cutting-edge algorithms. Specifically, I will cover three projects using computational imaging to sense hidden phenomena. First, I will describe the ACam - a camera designed to capture the minute flicker of electric lights ubiquitous in our modern environments. I will show that bulb flicker is a powerful visual cue that enables various applications like scene light source unmixing, reflection separation, and remote analyses of the electric grid itself. Second, I will describe Diffraction Line Imaging, a novel imaging principle that exploits diffractive optics to capture sparse 2D scenes with 1D (line) sensors. The method's applications include capturing fast motions (e.g., actors and particles within a fast-flowing liquid) and structured light 3D scanning with line illumination and line sensing. Lastly, I will present a new approach for sensing minute high-frequency surface vibrations (up to 63kHz) for multiple scene sources simultaneously, using "slow" sensors rated for only 130Hz. Applications include capturing vibration caused by audio sources (e.g., speakers, human voice, and musical instruments) and localizing vibration sources (e.g., the position of a knock on the door). Bio: Mark Sheinin is a Post-doctoral Research Associate at Carnegie Mellon University's Robotic Institute at the Illumination and Imaging Laboratory. He received his Ph.D. in Electrical Engineering from the Technion - Israel Institute of Technology in 2019. 
His work has received the Best Student Paper Award at CVPR 2017 and the Best Paper Honorable Mention Award at CVPR 2022. He received the Porat Award for Outstanding Graduate Students, the Jacobs-Qualcomm Fellowship in 2017, and the Jacobs Distinguished Publication Award in 2018. His research interests include computational photography and computer vision.
Wednesday, Jan 11, 2023, 11:15
Vision and AI, Room 1
Speaker: Sagie Benaim
Title: Towards a Controllable Generation of the 3D World
*** Joint Vision and Machine Learning Seminar *** PLEASE NOTE THE UNUSUAL DAY AND TIME
Abstract:
Recent breakthroughs in Generative AI have enabled the controllable generation of diverse and photorealistic 2D imagery, resulting in transformative applications in areas such as art and design. As human perception is inherently three-dimensional, the ability to generate 3D content in a controllable manner could unlock numerous applications in virtual and augmented reality, healthcare, autonomous vehicles, robotics, and more, and have wide-reaching implications. However, we are yet to witness the same success of 2D generation in 3D. In this talk, I will outline three important challenges on the path to closing this gap. The first challenge is that of representing the 3D world in a controllable, expressive, and compact manner. To this end, I will describe a novel approach for representing signals (such as 3D objects or scenes) in a decomposable and interpretable manner that allows constraints to be imposed on the signal with provable guarantees. The second challenge is in remodeling the 3D world in a controllable manner from limited 2D observations. To this end, I will describe a framework for decomposing and manipulating objects in a 3D scene as well as for generating them from novel views, given only 2D training data. The third challenge is in providing an intuitive and flexible interface for humans to create 3D content in a controllable manner. To this end, I will describe a method for intuitively stylizing 3D objects using textual descriptions. Lastly, I will conclude my talk with future directions on using controllable 3D generation for augmented reality, photorealistic simulations for applications such as autonomous vehicles, as well as enabling machines to better understand the world. Short bio: Sagie Benaim is a postdoctoral researcher at the Pioneer Center for AI, University of Copenhagen, advised by Prof. Serge Belongie. His research interests are in generative modeling, 3D content creation, few-shot learning, and representation learning. 
He received his PhD from the Computer Science department at Tel Aviv University, advised by Prof. Lior Wolf. Previously, he did his MSc at the University of Oxford and BSc at Imperial College London. During his PhD, he spent time as a research intern at Google.
Thursday, Jan 05, 2023, 12:15
Vision and AI, Room 1
Speaker: Haggai Maron
Title: Subgraph-based networks for expressive, efficient, and domain-independent graph learning
Abstract:
While message-passing neural networks (MPNNs) are the most popular architectures for graph learning, their expressive power is inherently limited. In order to gain increased expressive power while retaining efficiency, several recent works apply MPNNs to subgraphs of the original graph. As a starting point, the talk will introduce the Equivariant Subgraph Aggregation Networks (ESAN) architecture, which is a representative framework for this class of methods. In ESAN, each graph is represented as a set of subgraphs, selected according to a predefined policy. The sets of subgraphs are then processed using an equivariant architecture designed specifically for this purpose. I will then present a recent follow-up work that revisits the symmetry group suggested in ESAN and suggests that a more precise choice can be made if we restrict our attention to a specific popular family of subgraph selection policies. We will see that using this observation, one can make a direct connection between subgraph GNNs and Invariant Graph Networks (IGNs), thus providing new insights into subgraph GNNs' expressive power and design space. The talk is based on our ICLR and NeurIPS 2022 papers (spotlight and oral presentations accordingly). Bio: Haggai is a Senior Research Scientist at NVIDIA Research and a member of NVIDIA's TLV lab. His main field of interest is machine learning in structured domains. In particular, he works on applying deep learning to sets, graphs, point clouds, and surfaces, usually by leveraging their symmetry structure. He completed his Ph.D. in 2019 at the Weizmann Institute of Science under the supervision of Prof. Yaron Lipman. Haggai will be joining the Faculty of Electrical and Computer Engineering at the Technion as an Assistant Professor in 2023.
Tuesday, Jan 03, 2023, 12:30
Vision and AI, Room 1
Speaker: Yuval Bahat
Title: Neural Volume Super-Resolution
*** PLEASE NOTE THE UNUSUAL DAY AND TIME ***
Abstract:
Neural volumetric representations have become a widely adopted model for radiance fields in 3D scenes. These representations are fully implicit or hybrid function approximators of the instantaneous volumetric radiance in a scene, which are typically learned from multi-view captures of the scene. We investigate the new task of neural volume super-resolution - rendering high-resolution views corresponding to a scene captured at low resolution. To this end, we propose a neural super-resolution network that operates directly on the volumetric representation of the scene. This approach allows us to exploit an advantage of operating in the volumetric domain, namely the ability to guarantee consistent super-resolution across different viewing directions. To realize our method, we devise a novel 3D representation that hinges on multiple 2D feature planes. This allows us to super-resolve the 3D scene representation by applying 2D convolutional networks on the 2D feature planes. We validate the proposed method's capability of super-resolving multi-view consistent views both quantitatively and qualitatively on a diverse set of unseen 3D scenes, demonstrating a significant advantage over existing approaches. Bio: Yuval holds a joint postdoctoral researcher position at the computational imaging lab in Princeton and the ZESS center at the University of Siegen. His research interests lie at the intersection of computer vision and computational photography with machine learning. He was previously a postdoctoral researcher at Prof. Tomer Michaeli's lab at the Technion, after completing his PhD at the Weizmann Institute of Science, advised by Prof. Michal Irani. Prior to that he completed his M.Sc. at the Technion with Prof. Yoav Y. Schechner.
Thursday, Dec 29, 2022, 12:15
Vision and AI, Room 1
Speaker: Roi Herzig
Title: Towards Compositionality in Video Understanding
Abstract:
Our understanding of the world is naturally hierarchical and structured, and when new concepts are introduced, humans tend to decompose the familiar parts to reason about what they do not know. This leads to the hypothesis that intelligent machines would need to develop a compositional understanding that is robust and generalizable. In this talk, I will discuss our work on compositionality in video understanding from CVPR2022 and NeurIPS2022, as well as a recent preprint, which includes: (i) an object-centric model [1] that directly incorporates object representations into video transformers; (ii) a model [2] that utilizes the structure of a small set of images, whether they are within or outside the domain of interest, available only during training for a video downstream task; (iii) a model [3] that leverages a multi-task prompt learning approach for video transformers, where a shared transformer backbone is enhanced with task-specific prompts. [1] [2] [3] Bio: Roi is a 4th-year CS Ph.D. student at Tel Aviv University and a visiting scholar at Berkeley AI Research Lab (BAIR), working with Prof. Amir Globerson and Prof. Trevor Darrell. Roi is also affiliated as a research scientist at IBM Research AI.
Thursday, Dec 22, 2022, 12:15
Vision and AI, Room 1
Speaker: Rinon Gal
Title: Personalizing Text-to-Image Generation
Abstract:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In this talk, I will outline Textual Inversion and other recent methods that enable such control. We will discuss their strengths and limitations, and leverage them to provide some insights into the structure of the word embedding space. Bio: Rinon Gal is a Ph.D. student at Tel Aviv University where he is supervised by Prof. Daniel Cohen-Or and Dr. Amit Bermano. His research focuses on generative models, few-shot and unsupervised approaches, and on combining vision and language.
Thursday, Dec 15, 2022, 12:15
Vision and AI, Room 1
Speaker: Yossi Gandelsman
Title: Test-time Training with Self-Supervision
Abstract:
Test-Time Training is a general approach for improving the performance of predictive models when training and test data come from different distributions. It adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision before making the prediction. This method improves generalization on many real-world visual benchmarks for distribution shifts. In this talk, I will present the recent progress in the test-time training paradigm. I will show how masked auto-encoding overcomes the shortcomings of previously used self-supervised tasks and improves results by a large margin. In addition, I will demonstrate how test-time training extends to videos - instead of just testing each frame in temporal order, the model is first fine-tuned on the recent past before making a prediction and only then proceeding to the next frame.
Thursday, Dec 08, 2022, 12:15
Vision and AI, Lecture Hall
Speaker: Niv Cohen
Title: "This is my unicorn, Fluffy": Personalizing frozen vision-language representations
Abstract:
Large Vision & Language models pretrained on web-scale data provide representations invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. We introduce a new learning setup called Personalized Vision & Language (PerVL) with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts in the wild. We propose an architecture for solving PerVL that operates by extending the input vocabulary of a pretrained model with new word embeddings for the new personalized concepts. We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation using rich textual queries. (Published as an oral presentation at ECCV2022. This work was done during an internship at NVIDIA Research Tel Aviv.) Bio: Niv is a Ph.D. student at The Hebrew University of Jerusalem, advised by Dr. Yedid Hoshen. He received his B.Sc. in mathematics with physics, and M.Sc. in physics, both from the Technion. He's interested in computer vision and representation learning with a focus on anomaly detection and scientific data.
Thursday, Dec 01, 2022, 17:00
Vision and AI
Speaker: Rana Hanocka
Title: Data-Driven Geometry Processing - without 3D Data
*** Zoom Only Seminar *** *** Please Note the Unusual Hour ***
Zoom Link: Zoom recording: Passcode: 5kN9pi
Abstract:
Much of the current success of deep learning has been driven by massive amounts of curated data, whether annotated or unannotated. Compared to image datasets, developing large-scale 3D datasets is either prohibitively expensive or impractical. In this talk, I will present several works which harness the power of data-driven deep learning for tasks in geometry processing, without any 3D datasets. I will discuss works which reconstruct surfaces from noisy point cloud data without any 3D datasets. In addition, I will demonstrate that it is possible to learn to edit 3D geometry using large image datasets. Bio: Rana Hanocka is an Assistant Professor at the University of Chicago and holds a courtesy appointment at the Toyota Technological Institute at Chicago (TTIC). Rana founded and directs the 3DL (Threedle) research collective, comprised of enthusiastic researchers passionate about 3D, machine learning, and visual computing. Rana’s research interests span computer graphics, computer vision, and machine learning. Rana completed her Ph.D. at Tel Aviv University under the supervision of Daniel Cohen-Or and Raja Giryes. Her Ph.D. research focused on building neural networks for irregular 3D data and applying them to problems in geometry processing.
Thursday, Nov 24, 2022, 12:15
Vision and AI, Lecture Hall
Speaker: Yedid Hoshen
Title: Anomaly detection requires better representations
Abstract:
Anomaly detection aims to discover data which differ from the norm in a semantically meaningful manner. The task is difficult as anomalies are rare and unexpected. Perhaps the most challenging aspect is its subjective nature: a sample can be an important anomaly to one person and an uninteresting statistical outlier to another. While the task has been studied for decades, deep learning methods have recently brought substantial gains, particularly for image anomaly detection. In this talk, I will present the hypothesis that anomaly detection requires strong representations of data and simple density estimators. This paradigm will be substantiated by our state-of-the-art results on a large number of modalities, including images, point clouds, video, and time series. The success of representation-based anomaly detection and the task's subjective nature make it paramount to develop user-guided representations. I will therefore describe our preliminary approach that uses a new representation disentanglement technique for guiding representations in anomaly detection. I will conclude by describing the outstanding challenges for representations in anomaly detection.
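The "strong representations plus simple density estimators" hypothesis can be pictured with a small sketch. It assumes, for illustration only, that the density estimator is a k-nearest-neighbour distance and that feature vectors have already been extracted by some pretrained encoder (here they are hard-coded toy values); neither choice is stated in the abstract.

```python
import math


def knn_anomaly_score(query, normal_features, k=2):
    """Mean Euclidean distance from `query` to its k nearest normal feature vectors."""
    distances = sorted(math.dist(query, feature) for feature in normal_features)
    return sum(distances[:k]) / k


# Toy 2D "features" of normal samples (assumed to come from a pretrained encoder).
normal_features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

typical_score = knn_anomaly_score((0.5, 0.5), normal_features)
outlier_score = knn_anomaly_score((5.0, 5.0), normal_features)
```

A sample far from all normal features receives a larger score, so the quality of the result rests almost entirely on how well the representation separates normal from anomalous data.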
Thursday, Nov 17, 2022, 16:30
Vision and AI
Speaker: Phillip Isola
Title: Learning vision from procedural image programs
*** Zoom Only Seminar *** *** Please Note the Unusual Time ***
Zoom Link:
Abstract:
Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this talk I will consider if we can go a step further and do away with real images entirely, instead learning from synthetic images sampled from procedural image programs. I will present our work on several ways of doing this: 1) learning from simple statistical models of images such as dead leaves, and 2) learning from thousands of video programs collected from the "demoscene" community of algorithmic art. I will talk about the lessons we have learned as to what kinds of synthetic processes make for effective training data, and will touch on related work in adjacent communities such as NLP, where there is growing evidence that pretraining on random processes can lead to strong representations.
Thursday, Nov 03, 2022, 12:15
Vision and AI, Room 1
Speaker: Aviad Levis
Title: How novel imaging algorithms could reveal new structures around the black hole in our galactic center
Abstract:
In this talk, I want to take you on a journey toward our galactic center where a bright radio source called Sagittarius A* (Sgr A*) is located. In 2017 this radio source was observed by the Event Horizon Telescope (EHT) - a virtual instrument made of radio-telescopes around the world. Even though Sgr A* was observed at the same period as the first black hole image of M87*, it took an extra three years to analyze the data. One of the key challenges we faced was the dynamic nature of Sgr A* which evolves on the timescale of acquisition. Computationally, this is analogous to an MRI patient that refuses to sit still while being imaged. Furthermore, I will give you a peek into the future, where new computational algorithms we are developing could reveal new structures beyond a 2D image. Could we reveal the dynamic evolution? Could we look at the 3D structure? These are the type of imaging questions and computational algorithms we are working on for the next generation of EHT observations. Bio: Aviad Levis is a postdoctoral scholar in the Department of Computing and Mathematics at Caltech, working with Prof. Katherine Bouman. Currently, as part of the Event Horizon Telescope collaboration, his work focuses on developing novel computational methods for imaging black hole dynamics. Prior to that, he received his Ph.D. (2020) from the Technion and his B.Sc. (2013) from Ben-Gurion University. Notably, his Ph.D. research into 3D remote sensing of clouds is a key enabler in a novel interdisciplinary space mission (CloudCT) funded by the ERC and led by Yoav Schechner, Ilan Koren, and Klaus Schilling. Aviad is a recipient of the Zuckerman and the Viterbi postdoctoral fellowships.
Thursday, Jun 30, 2022, 12:15
Vision and AI, Room 1
Speaker: Yael Vinker
Title: CLIPasso: Semantically-Aware Object Sketching
Abstract:
Abstraction is at the heart of sketching due to the simple and minimal nature of line drawings. Abstraction entails identifying the essential visual properties of an object or scene, which requires semantic understanding and prior knowledge of high-level concepts. Abstract depictions are therefore challenging for artists, and even more so for machines. We present CLIPasso, an object sketching method that can achieve different levels of abstraction, guided by geometric and semantic simplifications. While sketch generation methods often rely on explicit sketch datasets for training, we utilize the remarkable ability of CLIP (Contrastive-Language-Image-Pretraining) to distill semantic concepts from sketches and images alike. We define a sketch as a set of Bézier curves and use a differentiable rasterizer to optimize the parameters of the curves directly with respect to a CLIP-based perceptual loss. The abstraction degree is controlled by varying the number of strokes. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and essential visual components of the subject drawn.
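As a small illustration of "a sketch as a set of Bézier curves", the following evaluates points along one stroke. It assumes cubic curves with four 2D control points each; the helper name and the control-point values are hypothetical, and the differentiable rasterizer and CLIP-based loss mentioned in the abstract are not shown.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bézier curve with 2D control points at parameter t in [0, 1]."""
    s = 1.0 - t
    weights = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)  # Bernstein basis
    points = (p0, p1, p2, p3)
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return (x, y)


# One stroke; a sketch would be a list of such control-point tuples.
stroke = ((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))
samples = [cubic_bezier(*stroke, t=i / 4) for i in range(5)]
```

Because every sampled point is a smooth function of the control points, the stroke parameters can in principle be adjusted by gradient descent, and using fewer strokes gives a more abstract drawing.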
Wednesday, Jun 15, 2022, 13:15
Vision and AI
Speaker: Ivan Skorokhodov
Title: Deep Generative Models over Continuous Data
*** Please Note The Unusual Date & Time *** The lecture will be in Zoom Only at:
Abstract:
Recently, deep generative modeling has attracted a lot of interest from the research community, and the developed methods now enjoy many practical applications. In computer vision, people commonly build such models for images, videos, or discretized 3D shapes, which in reality are just subsampled versions of some continuous signals. While this continuous nature is typically neglected by the community, it can yield some useful inductive biases in practice, like interpolation and extrapolation capabilities or better geometric priors. Besides, it can serve as an interesting way to unify the generators designed for different types of data or compress the underlying signals. In the talk, we will explore how one can represent continuous data via neural networks, how to build a generator over it, and what benefits this can provide.
Monday, Jun 06, 2022, 16:30
Vision and AI, Room 1
Speaker: Michael Lustig
Title: Adventures in computational Magnetic Resonance Imaging
** Please note the unusual day and time ** The meeting will also be in Zoom at:
Abstract:
Magnetic resonance imaging (MRI) is a powerful, ionizing-radiation-free medical imaging modality. The wide range of physical and physiological parameters to which MRI is sensitive makes it possible to visualize both structure and function in the body. However, the prolonged time necessary to capture the information in this large parameter space remains a major limitation of this phenomenal modality, which the field of computational MRI aims to address. By computational MRI we refer to the joint optimization of the imaging system hardware, the data encoding, the data acquisition and the image reconstruction together. In this talk I will describe some of the efforts my group has been engaged in towards mitigating the motion and dynamics that occur during MRI scanning, in particular when performing body imaging of pediatric patients. Specifically, I will focus on unsupervised and supervised methods for dynamic 3D imaging, learning-based high-fidelity reconstructions of fine structures and textures, and an exciting new approach for sensing, estimating and correcting for motion using external radio-frequency radar within the MRI bore.
Thursday, May 26, 2022, 12:15
Vision and AI, Room 1
Speaker: Miki Elad
Title: Image Denoising - Not What You Think
Abstract:

Image denoising - removal of additive white Gaussian noise from an image - is one of the oldest and most studied problems in image processing. Extensive work over several decades has led to thousands of papers on this subject, and to many well-performing algorithms for this task. As expected, the era of deep learning has brought yet another revolution to this subfield, and has taken the lead in today's ability to suppress noise in images. All this progress has led some researchers to believe that "denoising is dead", in the sense that all that can be achieved is already done.
Exciting as this story might be, this talk IS NOT ABOUT IT!
Our story focuses on recently discovered abilities and vulnerabilities of image denoisers. In a nutshell, we expose the possibility of using image denoisers to serve other problems, such as regularizing general inverse problems and serving as the engine for image synthesis. We also unveil the (strange?) idea that denoising (and other inverse problems) might not have a unique solution, as common algorithms would have you believe. Instead, we will describe constructive ways to produce randomized and diverse high perceptual quality results for inverse problems.

Thursday, May 12, 2022, 12:15
Vision and AI, Room 1
Speaker: Yotam Nitzan
Title: MyStyle - A Personalized Generative Prior
Abstract:

Deep generative models have proved to be successful for many image-to-image applications. Such models hallucinate information based on their large and diverse training datasets. Therefore, when enhancing or editing a portrait image, the model produces a generic and plausible output, but often it isn't the person who actually appears in the image.

In this talk, I'll present our latest work, MyStyle - which introduces the notion of a personalized generative model. Trained on ~100 images of the same individual, MyStyle learns a personalized prior, custom to their unique appearance. This prior is then leveraged to solve ill-posed image enhancement and editing tasks - such as super-resolution, inpainting and changing the head pose.

Wednesday, Apr 27, 2022, 11:15
Vision and AI, Room 1
Speaker: Mor Geva Pipek
Title: A Trip Down Memory Lane: How Transformer Feed-Forward Layers Build Predictions in Language Models
*** A joint Seminar with Machine Learning. ***
Abstract:

The field of natural language processing is dominated by transformer-based language models (LMs). The feed-forward network (FFN) layers are one of the core building blocks of these models and typically account for >2/3 of the network parameters. Yet, how these layers are utilized by the model to build predictions is largely unknown. In this talk, I will share recent findings on the operation of FFN layers in LMs, and demonstrate their utility in real-world applications. First, I will show that FFN layers can be cast as human-interpretable key-value memories, and describe how the output from each layer can be viewed as a collection of updates to the model's output distribution. Then, I will demonstrate the utility of these findings in the context of (a) controlled language generation, where we reduce the toxicity of GPT-2 by almost 50%, and (b) improving computation efficiency, through a simple rule for early exit, saving 20% of computation on average.
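The key-value memory view of an FFN layer can be written out as a short sketch. It assumes a two-matrix FFN with a ReLU nonlinearity and no bias terms, where each row of the first matrix acts as a key and the matching column of the second as a value; the function name, dimensions and numbers are illustrative only.

```python
def ffn_as_memory(x, keys, values):
    """Compute FFN(x) = sum_i ReLU(x . k_i) * v_i as a key-value memory lookup.

    Returns the per-memory coefficients and the resulting output vector.
    """
    coeffs = [max(0.0, sum(a * b for a, b in zip(x, key))) for key in keys]
    dim = len(values[0])
    output = [sum(c * value[j] for c, value in zip(coeffs, values)) for j in range(dim)]
    return coeffs, output


# Three memories: 2D keys and 3D values (toy numbers).
keys = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
values = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
coeffs, output = ffn_as_memory((2.0, 0.5), keys, values)
```

Each nonzero coefficient marks a memory that was triggered by the input, and the layer output is the coefficient-weighted sum of the corresponding values, which is what allows the output to be read as a collection of separate updates.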

Thursday, Apr 07, 2022, 12:15
Vision and AI, Room 1
Speaker: Rita Schmidt
Title: Developing new strategies for high resolution imaging at ultra-high field human MRI
Abstract:
Today, ultra-high-field MRI is at the forefront of the development of high-precision non-invasive imaging, capable of distinguishing between brain layers. In our lab, we utilize ultra-high-field MRI to develop methods to increase the spatial and temporal resolution in scans that capture the structure and function of the human brain. We explore methods that can control the signal we collect, optimizing MRI signal encoding, scan acceleration and signal acquisition strategies, followed by computational approaches to improve the final image. I will show three examples. One is developing methods to increase temporal resolution in functional MRI, exploring strategies to capture the delay between neural processes. Another is developing a quantitative method to characterize neurovascular and physiological changes in the human brain. With this technique, we are interested in following the changes in the brain with age, gender and population, thus providing new insights for basic neuroscience and long-term personalized medicine. The last example shows how we can minimize a major sensitivity that is common to 3D whole brain acquisition – sensitivity to fluid movement during the scan, which results in severe artifacts in the images. This includes fluid in the ventricles that beats with cardiac pulsation, fluid movement in the eyes as we move our gaze, and the blood flow in small vessels.
Thursday, Mar 31, 2022, 12:15
Vision and AI, Room 1
Speaker: Avi Ben-Cohen & Emanuel Ben Baruch
Title: Multi-label classification: dealing with bias, diversity, and lack of data
Abstract:
Photos of everyday life are inherently multi-label in nature. Hence, multi-label classification is commonly used to analyze their content. In this talk, we discuss major challenges in multi-label classification and present an overview of three novel techniques that aim at tackling them. First, as a typical photo contains a few positive labels and many negative ones, a negative-positive imbalance may harm the optimization process. To this end, we introduce an asymmetric loss (ASL) which dynamically down-weights easy negative samples. Second, large-scale multi-label classification datasets are commonly, and perhaps inevitably, partially annotated. That is, only a small subset of labels is annotated per sample. To handle partial annotation, we propose a selective approach that treats each class differently in the loss function. Finally, we deal with zero-shot learning for multi-label classification and propose an end-to-end model training that supports the semantic diversity of images and labels by using an embedding matrix with multiple principal embedding vectors. Related papers: Asymmetric Loss For Multi-Label Classification (ICCV 2021); Semantic Diversity Learning for Zero-Shot Multi-label Classification (ICCV 2021, oral); Multi-label Classification with Partial Annotations using Class-aware Selective Loss (CVPR 2022).
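The idea of dynamically down-weighting easy negatives can be sketched for a single label as follows. This is a reading of the commonly cited form of an asymmetric focal-style loss with a probability margin, not a verified copy of the authors' implementation; the hyperparameter values (`gamma_pos`, `gamma_neg`, `margin`) are illustrative assumptions, and `p` is assumed to lie strictly between 0 and 1.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Per-label loss for predicted probability p and binary target y (assumed form)."""
    if y == 1:
        # Positives: focal-style weighting, usually with a small exponent.
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    # Negatives: shift the probability down by a margin, then weight by a
    # power of the shifted probability so confident (easy) negatives fade out.
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(1.0 - p_shifted)


def binary_cross_entropy(p, y):
    """Plain per-label cross-entropy, for comparison."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


easy_negative = asymmetric_loss(0.03, 0)   # below the margin: contributes nothing
hard_negative = asymmetric_loss(0.90, 0)   # still contributes a clear penalty
```

With these assumed settings a negative predicted below the margin is ignored entirely, while a confidently wrong negative still produces a substantial loss, so the many easy negatives in a photo no longer dominate the gradient.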
Thursday, Jan 20, 2022, 15:00
Vision and AI
Speaker: Dor Verbin
Title: Images as Fields of Junctions
** PLEASE NOTE THE UNUSUAL TIME ** The seminar will be ONLY in Zoom:
Abstract:
Which pixels belong to a segment, and where are the boundaries between segments? My talk revisits these long-lasting questions about grouping, this time knowing that approximate segments and boundaries can often be provided by deep feedforward networks. In this context I focus on exactness and robustness, aiming to localize boundaries, including corners and junctions, with high spatial precision and with enhanced stability under noise. The approach is to represent the appearance of each small receptive field by a low-parameter, piecewise smooth model (a “generalized junction”), and to iteratively estimate these local parameters using parallel mean-field updates. I introduce initialization and refinement algorithms that allow this to succeed, despite the problem’s non-convexity, and I show experimentally that the resulting approach extracts precise edges, curves, corners, junctions and boundary-aware smoothing---all at the same time. I also show that it exhibits unprecedented resilience to noise, providing stable output at high noise levels where previous methods fail. (Work done at Harvard University)
Thursday, Jan 13, 2022, 12:15
Vision and AI
Speaker: Amir Globerson
Title: On the implicit bias of SGD in deep learning
THE MEETING WILL BE ONLY IN ZOOM!
Abstract:
Artificial neural networks have recently revolutionized the field of machine learning. However, we still do not have sufficient theoretical understanding of how such models can be successfully learned. Two specific questions in this context are: how can neural nets be learned despite the non-convexity of the learning problem, and how can they generalize well despite often having more parameters than training data. I will describe our recent work showing that gradient-descent optimization indeed leads to "simpler" models, where simplicity is captured by lower weight norm and in some cases clustering of weight vectors. We demonstrate this for several teacher and student architectures, including learning linear teachers with ReLU networks, learning boolean functions and learning convolutional pattern detection architectures. I will also discuss our results on fine-tuning neural nets, which has become common practice for large language models.
Wednesday, Jan 12, 2022, 12:15
Vision and AI, Room 1
Speaker: Mark Sheinin
Title: Computational Imaging for Enabling Vision Beyond Human Perception
Abstract:
From minute surface vibrations to very fast-occurring events, the world is rich with phenomena humans cannot perceive. Likewise, most computer vision systems are primarily based on 'conventional' cameras, which were designed to mimic the imaging principle of the human eye, and therefore are equally blind to these ubiquitous phenomena. In this talk, I will show that we can capture these hidden phenomena by creatively building novel vision systems composed of common off-the-shelf components (i.e., cameras and optics) coupled with cutting-edge algorithms. Specifically, I will cover three projects using computational imaging to sense hidden phenomena. First, I will describe the ACam - a camera designed to capture the minute flicker of electric lights ubiquitous in our modern environments. I will show that bulb flicker is a powerful visual cue that enables various applications like scene light source unmixing, reflection separation, and remote analyses of the electric grid itself. Second, I will describe Diffraction Line Imaging, a novel imaging principle that exploits diffractive optics to capture sparse 2D scenes with 1D (line) sensors. The method's applications include capturing fast motions (e.g., actors and particles within a fast-flowing liquid) and structured light 3D scanning with line illumination and line sensing. Lastly, I will present a new approach for sensing minute high-frequency surface vibrations (up to 63kHz) for multiple scene sources simultaneously, using "slow" sensors rated for only 130Hz. Applications include capturing vibration caused by audio sources (e.g., speakers, human voice, and musical instruments) and localizing vibration sources (e.g., the position of a knock on the door). Bio: Mark Sheinin is a Post-doctoral Research Associate at Carnegie Mellon University's Robotics Institute at the Illumination and Imaging Laboratory. He received his Ph.D. in Electrical Engineering from the Technion - Israel Institute of Technology in 2019.
His work has received the Best Student Paper Award at CVPR 2017 and the Best Paper Honorable Mention Award at CVPR 2022. He received the Porat Award for Outstanding Graduate Students, the Jacobs-Qualcomm Fellowship in 2017, and the Jacobs Distinguished Publication Award in 2018. His research interests include computational photography and computer vision.
Thursday, Jan 06, 2022, 12:15
Vision and AI
Speaker: Tavi Halperin
Title: Endless Loops: Detecting and Animating Periodic Patterns in Still Images
Abstract:
We present an algorithm for producing a seamless animated loop from a single image. The algorithm detects periodic structures, such as the windows of a building or the steps of a staircase, and generates a non-trivial displacement vector field that maps each segment of the structure onto a neighboring segment along a user- or auto-selected main direction of motion. This displacement field is used, together with suitable temporal and spatial smoothing, to warp the image and produce the frames of a continuous animation loop. Our cinemagraphs are created in under a second on a mobile device. Over 140,000 users downloaded our app and exported over 350,000 cinemagraphs. Moreover, we conducted two user studies that show that users prefer our method for creating surreal and structured cinemagraphs compared to more manual approaches and compared to previous methods. Zoom:
Thursday, Dec 30, 2021, 12:15
Vision and AI
Speaker: Dror Moran
Title: Deep Permutation Equivariant Structure from Motion
In Zoom ONLY:
Abstract:
Existing deep methods produce highly accurate 3D reconstructions in stereo and multiview stereo settings, i.e., when cameras are both internally and externally calibrated. Nevertheless, the challenge of simultaneous recovery of camera poses and 3D scene structure in multiview settings with deep networks is still outstanding. Inspired by projective factorization for Structure from Motion (SFM) and by deep matrix completion techniques, we propose a neural network architecture that, given a set of point tracks in multiple images of a static scene, recovers both the camera parameters and a (sparse) scene structure by minimizing an unsupervised reprojection loss. Our network architecture is designed to respect the structure of the problem: the sought output is equivariant to permutations of both cameras and scene points. Notably, our method does not require initialization of camera parameters or 3D point locations. We test our architecture in two setups: (1) single scene reconstruction and (2) learning from multiple scenes. Our experiments, conducted on a variety of datasets in both internally calibrated and uncalibrated settings, indicate that our method accurately recovers pose and structure, on par with classical state of the art methods. Additionally, we show that a pre-trained network can be used to reconstruct novel scenes using inexpensive fine-tuning with no loss of accuracy.
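The reprojection objective mentioned in the abstract above can be illustrated with a toy sketch. It assumes 3x4 projective camera matrices, a mean Euclidean pixel error, and `None` entries for points not seen by a camera; all names and numbers are hypothetical, and the network that would predict the cameras and points is not shown.

```python
def project(camera, point):
    """Apply a 3x4 camera matrix to a 3D point and divide by the depth term."""
    homogeneous = (*point, 1.0)
    u, v, w = (sum(c * h for c, h in zip(row, homogeneous)) for row in camera)
    return (u / w, v / w)


def reprojection_loss(cameras, points, tracks):
    """Mean distance between observed track points and projected 3D points.

    tracks[i][j] is the 2D observation of point j in camera i, or None if unseen.
    """
    total, count = 0.0, 0
    for camera, row in zip(cameras, tracks):
        for point, observed in zip(points, row):
            if observed is None:
                continue
            x, y = project(camera, point)
            total += ((x - observed[0]) ** 2 + (y - observed[1]) ** 2) ** 0.5
            count += 1
    return total / count


# Two toy cameras (identity pose, and a unit shift along x) and two 3D points.
cameras = [
    ((1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)),
    ((1.0, 0.0, 0.0, -1.0), (0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0)),
]
points = [(1.0, 2.0, 2.0), (0.0, 0.0, 4.0)]
tracks = [[(0.5, 1.0), (0.0, 0.0)], [(0.0, 1.0), None]]
loss = reprojection_loss(cameras, points, tracks)

# Perturbing one observation by one unit makes the loss nonzero.
noisy_tracks = [[(0.5, 2.0), (0.0, 0.0)], [(0.0, 1.0), None]]
noisy_loss = reprojection_loss(cameras, points, noisy_tracks)
```

The loss needs only the 2D tracks themselves, which is why it can serve as an unsupervised training signal; reordering the cameras (with their track rows) or the points (with their track columns) leaves the value unchanged, in line with the permutation structure the abstract describes.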
Thursday, Dec 23, 2021, 15:00
Vision and AI
Speaker: Mark Sheinin
Title: Computational Imaging for Sensing High-speed Phenomena
*** Seminar will take place on ZOOM ONLY *** *** Please note the UNUSUAL TIME ***
Abstract:
Despite recent advances in sensor technology, capturing high-speed video at high spatial resolutions remains a challenge. This is because, in a conventional camera, the available bandwidth limits either the maximum sampling frequency or the captured spatial resolution. In this talk, I am going to cover our recent works that use computational imaging to allow high-speed high-resolution imaging under certain conditions. First, I will describe Diffraction Line Imaging, a novel imaging principle that combines diffractive optics with 1D (line) sensors to allow high-speed positioning of light sources (e.g., motion capture markers, car headlights) as well as structured light 3D scanning with line illumination and line sensing. Second, I will present a recent work that generalizes Diffraction Line Imaging to handle a new class of scenes, resulting in new application domains such as high-speed imaging for Particle Image Velocimetry and imaging combustible particles. Lastly, I will present a novel method for sensing vibrations at high speeds (up to 63kHz), for multiple scene sources at once, using sensors rated for only 130Hz operation. I will present results from our method that include capturing vibration caused by audio sources (e.g., speakers, human voice, and musical instruments) and analyzing the vibration modes of a tuning fork.
Thursday, Dec 16, 2021, 12:15
Vision and AI
Speaker: Alex Bronstein
Title: Learning to see in the Data Age
*** Seminar will take place on ZOOM ONLY ***
Abstract:
Recent spectacular advances in machine learning techniques allow solving complex computer vision tasks -- all the way down to vision-based decision making. However, the input image itself is still produced by imaging systems that were built to produce human-intelligible pictures, which are not necessarily optimal for the end task. In this talk, I will entertain the idea of including the camera hardware (optics and electronics) among the learnable degrees of freedom. I will show examples from optical, ultrasound, and magnetic resonance imaging demonstrating that simultaneously learning the "software" and the "hardware" parts of an imaging system is beneficial for the end task.
Thursday, Dec 09, 2021, 12:15
Vision and AI, Room 1
Speaker: Or Patashnik
Title: Leveraging StyleGAN for Image Editing and Manipulation
Abstract:
StyleGAN has recently been established as the state-of-the-art unconditional generator, synthesizing images of phenomenal realism and fidelity, particularly for human faces. With its rich semantic space, many works have attempted to understand and control StyleGAN’s latent representations with the goal of performing image manipulations. To perform manipulations on real images, however, one must learn to “invert” the GAN and encode the image into StyleGAN’s latent space, which remains a challenge. In this talk, I will discuss recent techniques and advancements in GAN Inversion and explore their importance for real image editing applications. In addition, going beyond the inversion task, I will demonstrate how StyleGAN can be used for performing a wide range of image editing tasks.
Thursday, Nov 25, 2021, 12:15
Vision and AI, Room 1
Speaker: Yossi Adi
Title: A Textless Approach for Generative Spoken Language Modeling
Zoom:
Abstract:
An open question for AI research is creating systems that learn from natural interactions as infants learn their first language(s): spontaneously and without access to text or expert labels. Current NLP systems require large amounts of text, which excludes plenty of the world’s languages that have little textual resources or no widely used written form. In addition, textual features do not encode speaker-specific speech properties beyond content (e.g., identity, style, emotion, etc.), as well as structured signals that are part of natural human interaction (intonation, hesitation, laughter, etc.) which are important in the oral form. In this talk, I'll present our recent studies in developing a Textless Approach for Generative Spoken Language Modeling. The proposed framework is comprised of a pseudo-text Encoder, Sequential modeling, and Speech generation components, all of which were trained in an unsupervised fashion. Lastly, I will present various applications which can benefit from such modeling together with future research directions.
Thursday, Nov 18, 2021, 12:15
Vision and AI
Speaker: Or Perel
Title: SAPE: Spatially-Adaptive Progressive Encoding for Neural Optimization
The meeting will be only in ZOOM today:
Abstract:
Multilayer perceptrons (MLPs) are known to struggle with learning high-frequency functions, and in particular functions with wide frequency bands. We present a spatially adaptive progressive encoding (SAPE) scheme for input signals of MLP networks, which enables them to better fit a wide range of frequencies without sacrificing training stability or requiring any domain-specific preprocessing. SAPE gradually unmasks signal components with increasing frequencies as a function of time and space. The progressive exposure of frequencies is monitored by a feedback loop throughout the neural optimization process, allowing changes to propagate at different rates among local spatial portions of the signal space. We demonstrate the advantage of SAPE on a variety of domains and applications, including regression of low-dimensional signals and images, representation learning of occupancy networks, and a geometric task of mesh transfer between 3D shapes. Project page:
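The idea of gradually unmasking frequency bands can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the SAPE implementation: the function name `masked_fourier_encoding`, the power-of-two frequency ladder, and the linear ramp mask are all hypothetical choices, and the feedback loop that drives the schedule in the abstract is not modeled. Passing a per-sample `alpha` only hints at how the unmasking could vary across space.

```python
import numpy as np

def masked_fourier_encoding(x, num_bands, alpha):
    """Fourier-feature encoding of 1D inputs with a progressive band mask.

    alpha may be a scalar (same schedule everywhere) or one value per sample.
    Band k is fully masked while alpha <= k and fully open once alpha >= k + 1.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)          # (N, 1)
    alpha = np.asarray(alpha, dtype=float).reshape(-1, 1)  # (1, 1) or (N, 1)
    bands = np.arange(num_bands)                           # (K,)
    freqs = (2.0 ** bands) * np.pi                         # assumed frequency ladder
    mask = np.clip(alpha - bands, 0.0, 1.0)                # (1, K) or (N, K)
    angles = x * freqs                                     # (N, K)
    # Sine features first, then cosine features: output shape (N, 2K).
    return np.concatenate([mask * np.sin(angles), mask * np.cos(angles)], axis=1)
```

Increasing `alpha` from 0 to `num_bands` during training exposes the network to low frequencies first and high frequencies later.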
Sunday, Aug 15, 2021, 11:00
Vision and AI
Speaker: Yoni Kasten
Title: Algebraic Characterization of Relational Camera Pose Measurements in Multiple Images
*** Please note the Unusual Time ***
Abstract:

Structure from Motion (SfM) deals with recovering camera parameters and 3D scene structure from collections of 2D images. SfM is commonly solved by minimizing the non-convex bundle adjustment objective, which generally requires sophisticated initialization. In this talk I will present two approaches to SfM. The first approach involves averaging of essential or fundamental matrices (also called bifocal tensors). Since the bifocal tensors are computed independently from image pairs, they are generally inconsistent with any set of n cameras. We provide a complete algebraic characterization of the manifold of bifocal tensors for n cameras and present an optimization framework to project measured bifocal tensors onto the manifold. Our second approach is an online approach: given n-1 images, I_1,...,I_{n-1}, whose camera matrices have already been recovered, we seek to recover the camera matrix associated with an image I_n. We present a novel solution to the six-point online algorithm to recover the exterior parameters associated with I_n. Our algorithm uses just six corresponding pairs of 2D points, extracted each from I_n and from any of the preceding n-1 images, allowing the recovery of the full six degrees of freedom of the n'th camera, and unlike common methods, does not require tracking feature points in three or more images. We present experiments that demonstrate the utility of both our approaches. If time permits, I will briefly present additional recent work for solving SfM using deep neural models.

Thursday, Jun 17, 2021, 12:15
Vision and AI
Speaker: Hila Levi
Title: Combining bottom-up and top-down computations for image interpretation
Abstract:

Visual scene understanding has traditionally focused on identifying objects in images, learning to predict their presence and spatial extent. However, understanding a visual scene often goes beyond recognizing individual objects. In my thesis (guided by Prof. Ullman), I mainly focused on developing a 'task-dependent network' that uses processing instructions to guide the functionality of a shared network via an additional task input. In addition, I studied strategies for incorporating relational information into recognition pipelines to efficiently extract structures of interest from the scene. In the scope of high-level scene understanding, which might be dominated by recognizing a rather small number of objects and relations, the above task-dependent scheme naturally allows goal-directed scene interpretation, either in a single step or by sequential execution with a series of different TD instructions. It simplifies the use of referring relations, grounds requested visual concepts back into the image plane, and improves combinatorial generalization, essential for AI systems, by using structured representations and computations. In the scope of 'multi-task learning', the above scheme offers an alternative to the popular 'multi-branched architecture', which simultaneously executes all tasks using task-specific branches on top of a shared backbone. The scheme challenges capacity limitations, increases task selectivity, and allows scalability and further task extensions. Results will be shown in various applications: object detection, visual grounding, properties classification, human-object interactions and general scene interpretation.
Works included:
1. H. Levi and S. Ullman. Efficient coarse-to-fine non-local module for the detection of small objects. BMVC, 2019.
2. H. Levi and S. Ullman. Multi-task learning by a top-down control network. ICIP 2021.
3. S. Ullman et al. Image interpretation by iterative bottom-up top-down processing.
4. A. Arbelle et al. Detector-Free Weakly Supervised Grounding by Separation. Submitted to ICCV.
5. Ongoing work – Human Object Interactions


Thursday, Jun 10, 2021, 12:15
Vision and AI
Speaker: Niv Granot
Title: Drop the GAN: In Defense of Patches Nearest Neighbors as Single Image Generative Models
Abstract:
Single image generative models perform synthesis and manipulation tasks by capturing the distribution of patches within a single image. The classical (pre Deep Learning) prevailing approaches for these tasks are based on an optimization process that maximizes patch similarity between the input and generated output. Recently, however, Single Image GANs were introduced both as a superior solution for such manipulation tasks and as a means for remarkable novel generative tasks. Despite their impressiveness, single image GANs require long training time (usually hours) for each image and each task. They often suffer from artifacts and are prone to optimization issues such as mode collapse. We show that all of these tasks can be performed without any training, within several seconds, in a unified, surprisingly simple framework. The "good-old" patch-based methods are revisited and cast into a novel optimization-free framework. This allows generating random novel images better and much faster than GANs. We further demonstrate a wide range of applications, such as image editing and reshuffling, retargeting to different sizes, structural analogies, image collage and a newly introduced task of conditional inpainting. Not only is our method faster than a GAN (by a factor of 10^3 to 10^4), it also produces superior results (confirmed by quantitative and qualitative evaluation), with fewer artifacts and more realistic global structure than any of the previous approaches (whether GAN-based or classical patch-based).
Thursday, Jun 03, 2021, 12:15
Vision and AI
Speaker: Zongze Wu
Title: StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation
Abstract:
We explore and analyze the latent style space of StyleGAN2, a state-of-the-art architecture for image generation, using models pretrained on several different datasets. We first show that StyleSpace, the space of channel-wise style parameters, is significantly more disentangled than the other intermediate latent spaces explored by previous works. Next, we describe a method for discovering a large collection of style channels, each of which is shown to control a distinct visual attribute in a highly localized and disentangled manner. Third, we propose a simple method for identifying style channels that control a specific attribute, using a pretrained classifier or a small number of example images. Manipulation of visual attributes via these StyleSpace controls is shown to be better disentangled than via those proposed in previous works. To show this, we make use of a newly proposed Attribute Dependency metric. Finally, we demonstrate the applicability of StyleSpace controls to the manipulation of real images. Our findings pave the way to semantically meaningful and well-disentangled image manipulations via simple and intuitive interfaces. This is a joint work with Dani Lischinski and Eli Shechtman. paper(CVPR 2021 oral): code: video:
Thursday, May 13, 2021, 12:15
Vision and AI
Speaker: Assaf Shocher
Title: Deep Internal Learning
Abstract:
Deep Learning has always been divided into two phases: Training and Inference. The common practice for Deep Learning is training big networks on huge datasets. While very successful, such networks are only applicable to the type of data they were trained for and require huge amounts of annotated data, which in many cases are not available. In my thesis (guided by Prof. Irani), I invented "Deep Internal Learning". Instead of learning to generally solve a task for all inputs, we perform "ad hoc" learning for a specific input. We train an image-specific network at test time, on the test input only, in an unsupervised manner (no labels or ground truth). In this regime, training is actually a part of the inference: no additional data or prior training is involved. I will demonstrate how we applied this framework to various challenges: Super-Resolution, Segmentation, Dehazing, Transparency-Separation, Watermark removal. I will also show how this approach can be incorporated into Generative Adversarial Networks by training a GAN on a single image. If time permits I will also cover some partially related works. Links to papers:
Thursday, May 06, 2021, 12:15
Vision and AI
Speaker: Nathan Srebro
Title: What, How and When can we Learn Adversarially Robustly?
A Joint Computer Vision & Machine Learning seminar
Abstract:
In this talk we will discuss the problem of learning an adversarially robust predictor from clean training data. That is, learning a predictor that performs well not only on future test instances, but also when these instances are corrupted adversarially. There has been much empirical interest in this question, and in this talk we will take a theoretical perspective and see how it leads to practically relevant insights, including: the need to depart from an empirical (robust) risk minimization approach, and thinking of what kind of accesses and reductions can allow learning. Joint work with Omar Montasser and Steve Hanneke.
Thursday, Apr 29, 2021, 12:15
Vision and AI
Speaker: Raja Giryes
Title: Robustifying neural networks
Abstract:
In this talk I will survey several techniques to make neural networks more robust. While neural networks achieve groundbreaking results in many applications, they depend strongly on the availability of good training data and the assumption that the data in the test time will resemble the one at train time. In this talk, we will survey different techniques that we developed for improving the network robustness and/or adapting it to the data at hand.
Thursday, Apr 22, 2021, 12:15
Vision and AI
Speaker: Adi Shamir
Title: A New Theory of Adversarial Examples in Machine Learning
Schmidt Hall and via Zoom:
** Note that attendance is limited to 50 people under the 'green and purple badges' **
Covid-19 instructions for Schmidt hall:
1. Only vaccinated/recovered people can enter.
2. At least one empty chair is required between each two participants.
3. Sitting is not allowed in the first two rows.
4. All participants should wear a mask (also during the lecture).
Abstract:
The extreme fragility of deep neural networks when presented with tiny perturbations in their inputs was independently discovered by several research groups in 2013. Due to their mysterious properties and major security implications, these adversarial examples had been studied extensively over the last eight years, but in spite of enormous effort they remained a baffling phenomenon with no clear explanation. In particular, it was not clear why a tiny distance away from almost any cat image there are images which are recognized with a very high level of confidence as cars, planes, frogs, horses, or any other desired class, why the adversarial modification which turns a cat into a car does not look like a car at all, and why a network which was adversarially trained with randomly permuted labels (so that it never saw any image which looks like a cat being called a cat) still recognizes most cat images as cats. The goal of this talk is to introduce a new theory of adversarial examples, which we call the Dimpled Manifold Model. It can easily explain in a simple and intuitive way why they exist and why they have all the bizarre properties mentioned above. In addition, it sheds new light on broader issues in machine learning such as what happens to deep neural networks during regular and during adversarial training. Experimental support for this theory, obtained jointly with Oriel Ben Shmuel and Odelia Melamed, will be presented and discussed in the last part of the talk.
Thursday, Apr 08, 2021, 12:15
Vision and AI
Speaker: Shai Bagon
Title: Ultrasound. Lung. Deep Learning
Abstract:
Lung ultrasound (LUS) is a cheap, safe and non-invasive imaging modality that can be performed at the patient's bedside. However, to date LUS is not widely adopted due to the lack of trained personnel required for interpreting the acquired LUS frames. In this work we propose a framework for training deep artificial neural networks for interpreting LUS, which may promote broader use of LUS. In our framework we explicitly address the issue of incorporating domain-specific prior knowledge into DL models. We propose to provide a deep neural network not only with the raw LUS frames as input, but also to explicitly inform it of important anatomical features and artifacts in the form of additional channels containing pleural and vertical artifact masks. By explicitly supplying this domain knowledge in this form to deep models, standard off-the-shelf neural networks can be rapidly and efficiently finetuned to perform well on various tasks on LUS data, such as frame classification or semantic segmentation. Our framework allows for a unified treatment of LUS frames captured by either convex or linear probes. We evaluated our proposed framework on the task of COVID-19 severity assessment using the ICLUS dataset. In particular, we finetuned simple image classification models to predict a per-frame COVID-19 severity score. We also trained a semantic segmentation model to predict per-pixel COVID-19 severity annotations. Using the combined raw LUS frames and the detected lines for both tasks, our off-the-shelf models performed better than complicated models specifically designed for these tasks, exemplifying the efficacy of our framework.
Thursday, Jan 28, 2021, 12:15
Vision and AI
Speaker: Shai Avidan
Title: Learning to Sample
Abstract:

There is a growing number of tasks that work directly on point clouds. As the size of the point cloud grows, so do the computational demands of these tasks. A possible solution is to sample the point cloud first. Classic sampling approaches, such as farthest point sampling (FPS), do not consider the downstream task. A recent work showed that learning a task-specific sampling can improve results significantly. However, the proposed technique did not deal with the non-differentiability of the sampling operation and offered a workaround instead. We introduce a novel differentiable relaxation for point cloud sampling that approximates sampled points as a mixture of points in the primary input cloud. Our approximation scheme leads to consistently good results on classification and geometry reconstruction applications. We also show that the proposed sampling method can be used as a front end to a point cloud registration network. This is a challenging task since sampling must be consistent across two different point clouds for a shared downstream task. In all cases, our approach outperforms existing non-learned and learned sampling alternatives.

Based on the work of: Itai Lang, Oren Dovrat, and Asaf Manor
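For reference, the classic farthest point sampling baseline named in the abstract can be sketched in a few lines of NumPy. This shows only the non-learned, task-agnostic baseline, not the differentiable sampler presented in the talk; the function name and the `start` parameter are illustrative choices.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen.

    points: (N, D) array. Returns the indices of the k selected points.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)
```

The selection depends only on geometry, and the `argmax` step is not differentiable, which is the limitation the abstract points to.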

Thursday, Jan 21, 2021, 18:00
Vision and AI
Speaker: Or Litany
Title: Learning on Pointclouds for 3D Scene Understanding
Abstract:

In this talk I'll be covering several works on the topic of 3D deep learning on pointclouds for scene understanding tasks.
First, I'll describe VoteNet (ICCV 2019, best paper nomination): a method for object detection from 3D pointcloud input, inspired by the classical generalized Hough voting technique. I'll then explain how we integrated image information into the voting scheme to further boost 3D detection (ImVoteNet, CVPR 2020). In the second part of my talk I'll describe recent studies focusing on reducing supervision for 3D scene understanding tasks, including PointContrast -- a self-supervised representation learning framework for 3D pointclouds (ECCV 2020). Our findings in PointContrast are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains.

Thursday, Jan 14, 2021, 12:15
Vision and AI
Speaker: Yoav Shechtman
Title: Next generation localization microscopy - or - how and why to ruin a perfectly good microscope
Abstract:

In localization microscopy, the positions of individual nanoscale point emitters (e.g. fluorescent molecules) are determined at high precision from their point-spread functions (PSFs). This enables highly precise single/multiple-particle-tracking, as well as super-resolution microscopy, namely single molecule localization microscopy (SMLM). To obtain 3D localization, we employ PSF engineering – namely, we physically modify the standard PSF of the microscope, to encode the depth position of the emitter.  In this talk I will describe how this method enables unprecedented capabilities in localization microscopy; specific applications include dense emitter fitting for super-resolution microscopy, multicolor imaging from grayscale data, volumetric multi-particle tracking/imaging, dynamic surface profiling, and high-throughput in-flow colocalization in live cells. We often combine the optical encoding method with neural nets (deep-learning) for decoding, i.e. image reconstruction; however, our use of neural nets is not limited to image processing - we use nets to design the optimal optical acquisition system in a task-specific manner.

Thursday, Jan 07, 2021, 12:15
Vision and AI
Speaker: Guy Gaziv
Title: Decoding visual experience from brain activity
Abstract:
Deep Learning introduced powerful research tools for studying visual representation in the human brain. Here, we harnessed those tools for two branches of research: 1. The primary branch focuses on brain decoding: reconstructing and semantically classifying observed natural images from novel (unknown) fMRI brain recordings. This is a very difficult task due to the scarce supervised "paired" training examples (images with their corresponding fMRI recordings) that are available, even in the largest image-fMRI datasets. We present a self-supervised deep learning approach that overcomes this barrier. This is obtained by enriching the scarce paired training data with additional easily accessible "unpaired" data from both domains (i.e., images without fMRI, and fMRI without images). Our approach achieves state-of-the-art results in image reconstruction from fMRI responses, as well as unprecedented large-scale (1000-way) semantic classification of never-before-seen classes. 2. The secondary branch of research focuses on face representation in the human brain. We studied whether the unique structure of the face-space geometry, which is defined by pairwise similarities in activation patterns to different face images, constitutes a critical aspect in face perception. To test this, we compared the pairwise similarity between responses to face images of human-brain and of artificial Deep Convolutional Neural Networks (DCNN) that achieve human-level face recognition performance. Our results revealed a stark match between neural and intermediate DCNN layers' face-spaces. Our findings support the importance of face-space geometry in enabling face perception as well as a pictorial function of high-order face-selective regions of the human visual cortex.
Thursday, Dec 31, 2020, 12:15
Vision and AI
Speaker: Yedid Hoshen
Title: Demystifying Unsupervised Image Translation through the Lens of Attribute Disentanglement
Abstract:

Recent approaches for unsupervised image translation are strongly reliant on generative adversarial training and ad-hoc architectural locality constraints. Despite their appealing results, it can be easily observed that the learned class and content representations are entangled, which often hurts the translation performance. We analyse this task under the framework of image disentanglement into meaningful attributes. We first analyse the simpler setting, where the domain of the image and its other attributes are independent. Using information arguments, we present a non-adversarial approach (LORD) with a carefully designed information bottleneck for class-content disentanglement. Our approach brings attention to several interesting and poorly explored phenomena, particularly the beneficial inductive biases of latent optimization and conditional generators, and it outperforms the top adversarial and non-adversarial class-content disentanglement methods (e.g., DrNet and MLVAE). With further information constraints, we extend our approach to the standard unsupervised image translation task, where the unknown image properties are dependent on the domain. Our full approach surpasses the top unsupervised image translation methods (e.g., FUNIT and StarGAN-v2).


Thursday, Dec 17, 2020, 12:15
Vision and AI
Speaker: Yuval Atzmon
Title: A causal view of compositional zero-shot recognition
Abstract:

People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language because the long tail of new combinations dominates the distribution. Unfortunately, learning systems struggle with compositional generalization because they often build on features that are correlated with class labels even if they are not "essential" for the class. This leads to consistent misclassification of samples from a new distribution, like new combinations of known components.

In this talk I will present our work on compositional zero-shot recognition. I will describe a causal approach that asks "which intervention caused the image?"; a probabilistic approach using semantic soft "and-or" side-information; and last, an approach that is simultaneously effective for both many-shot and zero-shot classes.


Thursday, Dec 10, 2020, 12:15
Vision and AI
Speaker: Tomer Michaeli
Title: GANs: Origins, Effective Training, and Steering
Abstract:

Since their introduction by Goodfellow et al. in 2014, generative adversarial models seem to have completely transformed Computer Vision and Graphics. In this talk I will address three questions: (1) What did we do before the GAN era (and was it really that different)? (2) Is the way we train GANs in line with the theory (and can we do it better)? (3) How is information about object transformations encoded in a pre-trained generator?

I will start by showing that Contrastive Divergence (CD) learning (Hinton ‘02), the most widely used method for learning distributions before GANs, is in fact also an adversarial procedure. This settles a long standing debate regarding the objective that this method actually optimizes, which arose due to an unjustified approximation in the original derivation. Our observation explains CD’s great empirical success.

Going back to GANs, I will challenge the common practice of stabilizing training using spectral normalization. Although this heuristic is theoretically motivated by the Wasserstein GAN formulation, I will show that it works for different reasons and can be significantly improved upon. Our improved approach leads to state-of-the-art results in many common tasks, including super-resolution and image-to-image translation.

Finally, I will address the task of revealing meaningful directions in the latent space of a pre-trained GAN. I will show that such directions can be computed in closed form directly from the generator's weights, without the need for any training or optimization as done in existing works. I will particularly discuss nonlinear trajectories that have natural endpoints and allow controlling whether one transformation is allowed to come at the expense of another (e.g. zoom-in with or without allowing translation to keep the object centered).

* These are joint works with Omer Yair, Idan Kligvasser, Nurit Spingarn, and Ron Banner.


Thursday, Jul 30, 2020, 12:15
Vision and AI
Speaker: Assaf Shocher
Title: Semantic Pyramid for Image Generation

We present a novel GAN-based model that utilizes the space of deep features learned by a pre-trained classification model. Inspired by classical image pyramid representations, we construct our model as a Semantic Generation Pyramid - a hierarchical framework which leverages the continuum of semantic information encapsulated in such deep features; this ranges from low level information contained in fine features to high level, semantic information contained in deeper features. More specifically, given a set of features extracted from a reference image, our model generates diverse image samples, each with matching features at each semantic level of the classification model. We demonstrate that our model results in a versatile and flexible framework that can be used in various classic and novel image generation tasks. These include: generating images with a controllable extent of semantic similarity to a reference image, and different manipulation tasks such as semantically-controlled inpainting and compositing; all achieved with the same model, with no further training.



Thursday, Jun 25, 2020, 12:15
Vision and AI
Speaker: Omry Sendik
Title: Unsupervised multi-modal Styled Content Generation

The emergence of deep generative models has recently enabled the automatic generation of massive amounts of graphical content, both in 2D and in 3D.
Generative Adversarial Networks (GANs) and style control mechanisms, such as Adaptive Instance Normalization (AdaIN), have proved particularly effective in this context, culminating in the state-of-the-art StyleGAN architecture.
While such models are able to learn diverse distributions, provided a sufficiently large training set, they are not well-suited for scenarios where the distribution of the training data exhibits a multi-modal behavior. In such cases, reshaping a uniform or normal distribution over the latent space into a complex multi-modal distribution in the data domain is challenging, and the generator might fail to sample the target distribution well. Furthermore, existing unsupervised generative models are not able to control the mode of the generated samples independently of the other visual attributes, despite the fact that they are typically disentangled in the training data.
In this work, we introduce UMMGAN, a novel architecture designed to better model multi-modal distributions, in an unsupervised fashion. Building upon the StyleGAN architecture, our network learns multiple modes, in a completely unsupervised manner, and combines them using a set of learned weights. We demonstrate that this approach is capable of effectively approximating a complex distribution as a superposition of multiple simple ones.
We further show that UMMGAN effectively disentangles between modes and style, thereby providing an independent degree of control over the generated content.

This is joint work with Prof. Dani Lischinski and Prof. Daniel Cohen-Or.

Thursday, Jun 11, 2020, 12:15
Vision and AI
Speaker: Yuval Bahat
Title: Explorable Super Resolution

Single image super resolution (SR) has seen major performance leaps in recent years. However, existing methods do not allow exploring the infinitely many plausible reconstructions that might have given rise to the observed low-resolution (LR) image. These different explanations of the LR image may vary dramatically in their textures and fine details, and may often encode completely different semantic information. In this work, we introduce the task of explorable super resolution. We propose a framework comprising a graphical user interface with a neural network backend, which lets the user edit the SR output so as to explore the abundance of plausible high-resolution (HR) explanations of the LR input. At the heart of our method is a novel module that can wrap any existing SR network, analytically guaranteeing that its SR outputs precisely match the LR input when downsampled. Besides its importance in our setting, this module is guaranteed to decrease the reconstruction error of any SR network it wraps, and can be used to cope with blur kernels that are different from the one the network was trained for. We illustrate our approach in a variety of use cases, ranging from medical imaging and forensics to graphics.
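The consistency guarantee described in the abstract can be illustrated with a toy example. The sketch below is not the speaker's module: it assumes 1D signals and a hypothetical downsampling operator that averages non-overlapping pairs, and it corrects an arbitrary SR guess by orthogonally projecting it onto the set of signals that downsample exactly to the LR input. The names `downsample` and `enforce_consistency` are illustrative only.

```python
def downsample(hr):
    """Average non-overlapping pairs (an assumed, very simple downscaling operator)."""
    return [(hr[i] + hr[i + 1]) / 2.0 for i in range(0, len(hr), 2)]


def enforce_consistency(sr, lr):
    """Project `sr` onto the set of signals whose downsampled version equals `lr`.

    For pair averaging, the orthogonal projection simply adds each LR residual
    to both HR samples of the corresponding pair.
    """
    if len(sr) != 2 * len(lr):
        raise ValueError("sr must be exactly twice as long as lr")
    residual = [x - y for x, y in zip(lr, downsample(sr))]
    return [v + residual[i // 2] for i, v in enumerate(sr)]


lr = [1.0, 3.0]
sr_guess = [0.0, 1.0, 2.0, 5.0]            # output of some hypothetical SR network
sr_fixed = enforce_consistency(sr_guess, lr)
```

Because the true HR signal itself lies in the consistent set, such a projection cannot move the estimate farther from it, which matches the abstract's remark about reconstruction error in this simplified setting.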

Thursday, Jun 04, 2020, 12:15
Vision and AI
Speaker: Tali Treibitz
Title: A Method For Removing Water From Underwater Images

Robust recovery of lost colors in underwater images remains a challenging problem. We recently showed that this was partly due to the prevalent use of an atmospheric image formation model for underwater images and proposed a physically accurate model. The revised model showed: 1) the attenuation coefficient of the signal is not uniform across the scene but depends on object range and reflectance, 2) the coefficient governing the increase in backscatter with distance differs from the signal attenuation coefficient. Here, we present the first method that recovers color with our revised model, using RGBD images. The Sea-thru method estimates backscatter using the dark pixels and their known range information. Then, it uses an estimate of the spatially varying illuminant to obtain the range-dependent attenuation coefficient. Using more than 1,100 images from two optically different water bodies, which we make available, we show that our method with the revised model outperforms those using the atmospheric model. Consistent removal of water will open up large underwater datasets to powerful computer vision and machine learning algorithms, creating exciting opportunities for the future of underwater exploration and conservation. (Paper published in CVPR 19).
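As a rough guide to the two properties listed in the abstract, the revised image formation model can be written with separate coefficients for the direct signal and for backscatter. The symbols below are assumptions chosen for illustration and are not given in the abstract: $I_c$ is the captured intensity in color channel $c$, $J_c$ the unattenuated scene color, $z$ the range, $B_c^{\infty}$ the backscatter at infinite range, and $\beta_c^{D}$, $\beta_c^{B}$ the attenuation and backscatter coefficients.

```latex
I_c = J_c \, e^{-\beta_c^{D} z} + B_c^{\infty} \left( 1 - e^{-\beta_c^{B} z} \right),
\qquad \beta_c^{D} \neq \beta_c^{B}
% Once backscatter and \beta_c^{D} are estimated, the color is recovered as
J_c = \left( I_c - B_c^{\infty} \left( 1 - e^{-\beta_c^{B} z} \right) \right) e^{\beta_c^{D} z}
```

Under this reading, the atmospheric model corresponds to the special case where one coefficient is shared by both terms.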

Thursday, Jan 30, 2020, 12:15
Vision and AI, Room 1
Speaker: Oshri Halimi
Title: Handling the Unknown with Non-Rigid Geometric Invariants

My goal is to demonstrate how geometric priors can replace the requirement for annotated 3D data. In this context I'll present two of my works. First, I will present "Unsupervised Learning of Dense Shape Correspondence", a CVPR 2019 oral: a deep learning framework that estimates dense correspondence between articulated 3D shapes without using any ground-truth labeling. We demonstrated that our method is applicable to full and partial 3D models as well as to realistic scans. Incomplete 3D data also arises in many other scenarios. One such problem of great practical importance is shape completion, to which I'll dedicate the second part of my lecture.
It is common to encounter situations where there is a considerable distinction between the scanned 3D model and the final rendered one. The distinction can be attributed to occlusions or directional view of the target, when the scanning device is localized in space. I will demonstrate that geometric priors can guide learning algorithms in the task of 3D model completion from partial observation. To this end, I'll present our recent work: "The Whole is Greater than the Sum of its Non-Rigid Parts". This famous declaration of Aristotle was adopted to explain human perception by the Gestalt psychology school of thought in the twentieth century. Here, we claim that observing part of an object which was previously acquired as a whole, one could deal with both partial matching and shape completion in a holistic manner. More specifically, given the geometry of a full, articulated object in a given pose, as well as a partial scan of the same object in a different pose, we address the problem of matching the part to the whole while simultaneously reconstructing the new pose from its partial observation. Our approach is data-driven, and takes the form of a Siamese autoencoder without the requirement of a consistent vertex labeling at inference time; as such, it can be used on unorganized point clouds as well as on triangle meshes. We demonstrate the practical effectiveness of our model in the applications of single-view deformable shape completion and dense shape correspondence, both on synthetic and real-world geometric data, where we outperform prior work on these tasks by a large margin.

Thursday, Jan 23, 2020, 12:15
Vision and AI, Room 1
Speaker: Oren Dovrat
Title: Learning to Sample

Processing large point clouds is a challenging task. Therefore, the data is often sampled to a size that can be processed more easily. The question is how to sample the data. A popular sampling technique is Farthest Point Sampling (FPS). However, FPS is agnostic to the downstream application (classification, retrieval, etc.). The underlying assumption seems to be that minimizing the farthest point distance, as done by FPS, is a good proxy for other objective functions.

We show that it is better to learn how to sample. To do that, we propose a deep network to simplify 3D point clouds. The network, termed S-NET, takes a point cloud and produces a smaller point cloud that is optimized for a particular task. The simplified point cloud is not guaranteed to be a subset of the original point cloud. Therefore, we match it to a subset of the original points in a post-processing step. We contrast our approach with FPS by experimenting on two standard data sets and show significantly better results for a variety of applications. Our code is publicly available.
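For readers unfamiliar with the baseline, Farthest Point Sampling can be sketched in a few lines. This is a minimal greedy version written for illustration (it is the task-agnostic baseline, not S-NET); the function name, the `start` parameter and the toy point cloud are assumptions.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices; each new point is the one farthest from all picked so far."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0),
         (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
sample = farthest_point_sampling(cloud, 3)
```

Note that the selection depends only on geometry: nothing in the loop knows whether the sampled cloud will be used for classification or retrieval, which is the limitation the abstract points out.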

Thursday, Jan 16, 2020, 12:15
Vision and AI, Room 1
Speaker: Idan Schwartz
Title: Factor Graph Attention

Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.
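The MRR figures quoted above refer to mean reciprocal rank. As a quick reminder of how that metric is computed, here is a minimal sketch; it assumes each question comes with a 1-based rank that the model assigned to the correct answer among its candidates, and the example ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based rank given to the correct answer of question i."""
    if not ranks:
        raise ValueError("need at least one rank")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Correct answer ranked 1st, 2nd and 4th on three hypothetical questions.
example_mrr = mean_reciprocal_rank([1, 2, 4])
```

A model that always ranks the correct answer first scores 1.0, and the score drops quickly as correct answers slip down the list.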

Thursday, Jan 09, 2020, 12:15
Vision and AI, Room 1
Speaker: Daphna Weinshall
Title: Beyond Accuracy: Neural Networks Show Similar Learning Dynamics Across Architectures

One of the unresolved questions in deep learning is the nature of the solutions that are being discovered. We investigated the collection of solutions reached by the same neural network (NN) architecture, with different random initialization of weights and random mini-batches. These solutions are shown to be rather similar -- more often than not, each train and test example is either classified correctly by all NN instances, or by none at all. Furthermore, all NNs seem to share the same learning dynamics, whereby initially the same train and test examples are correctly recognized by the learned model, followed by other examples that are learned in roughly the same order. When extending the investigation to heterogeneous collections of NN architectures, once again examples are seen to be learned in the same order irrespective of architecture, although the more powerful architecture may continue to learn and thus achieve higher accuracy. Finally, I will discuss cases where this pattern of similarity breaks down, which show that the reported similarity is not an artifact of optimization by gradient descent.

Thursday, Jan 02, 2020, 12:15
Vision and AI, Room 1
Speaker: Gabriel Stanovsky
Title: Meaning Representation in Natural Language Tasks
JOINT Machine Learning and Computer Vision seminar

Recent developments in Natural Language Processing (NLP) allow models to leverage large, unprecedented amounts of raw text, culminating in impressive performance gains in many of the field's long-standing challenges, such as machine translation, question answering, or information retrieval.

In this talk, I will show that despite these advances, state-of-the-art NLP models often fail to capture crucial aspects of text understanding. Instead, they excel by finding spurious patterns in the data, which lead to biased and brittle performance. For example, machine translation models are prone to translate doctors as men and nurses as women, regardless of context. Following that, I will discuss an approach that could help overcome these challenges by explicitly representing the underlying meaning of texts in formal data structures. Finally, I will present robust models that use such explicit representations to effectively identify meaningful patterns in real-world texts, even when training data is scarce.

Thursday, Dec 26, 2019, 12:15
Vision and AI, Room 1
Speaker: Greg Shakhnarovich
Title: Pixel Consensus Voting for Panoptic Segmentation

I will present a new approach for image parsing, Pixel Consensus Voting (PCV). The core of PCV is a framework for instance segmentation based on the Generalized Hough transform. Pixels cast discretized, probabilistic votes for the likely regions that contain instance centroids. At the detected peaks that emerge in the voting heatmap, backprojection is applied to collect pixels and produce instance masks. Unlike a sliding window detector that densely enumerates object proposals, our method detects instances as a result of the consensus among pixel-wise votes. We implement vote aggregation and backprojection using native operators of a convolutional neural network. The discretization of centroid voting reduces the training of instance segmentation to pixel labeling, analogous and complementary to fully convolutional network-style semantic segmentation, leading to an efficient and unified architecture that jointly models things and stuff. We demonstrate the effectiveness of our pipeline on COCO and Cityscapes Panoptic Segmentation and obtain competitive results. This joint work with Haochen Wang (TTIC/CMU), Ruotian Luo (TTIC), Michael Maire (TTIC/Uchicago) received an Innovation Award at the COCO/Mapillary workshop at ICCV 2019.

Thursday, Dec 19, 2019, 12:15
Vision and AI, Room 1
Speaker: Tammy Riklin Raviv
Title: Hue-Net: Intensity-based Image-to-Image Translation with Differentiable Histogram Loss Functions

In the talk I will present the Hue-Net - a novel Deep Learning framework for Intensity-based Image-to-Image Translation.

The key idea is a new technique we term network augmentation which allows a differentiable construction of intensity histograms from images.

We further introduce differentiable representations of (1D) cyclic and joint (2D) histograms and use them for defining loss functions based on cyclic Earth Mover's Distance (EMD) and Mutual Information (MI). While the Hue-Net can be applied to several image-to-image translation tasks, we choose to demonstrate its strength on color transfer problems, where the aim is to paint a source image with the colors of a different target image. Note that the desired output image does not exist and therefore cannot be used for supervised pixel-to-pixel learning.

This is accomplished by using the HSV color-space and defining an intensity-based loss that is built on the EMD between the cyclic hue histograms of the output and the target images. To enforce color-free similarity between the source and the output images, we define a semantic-based loss by a differentiable approximation of the MI of these images.  

The incorporation of histogram loss functions in addition to an adversarial loss enables the construction of semantically meaningful and realistic images.

Promising results are presented for different datasets.
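To make the cyclic EMD on hue histograms concrete, the sketch below computes it for two normalized histograms whose bins lie on a circle, with unit distance between adjacent bins. It is the plain, non-differentiable computation and does not reproduce Hue-Net's differentiable histogram construction; the function name and the four-bin examples are illustrative assumptions. It uses the known property that on a circle the optimal transport cost equals the L1 deviation of the cumulative difference from its median.

```python
def cyclic_emd(p, q):
    """EMD between two normalized histograms defined on a circle of equally spaced bins."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    # Running difference between the two cumulative distributions.
    diffs, running = [], 0.0
    for a, b in zip(p, q):
        running += a - b
        diffs.append(running)
    # On a circle the CDF origin is arbitrary; subtracting a median of the
    # running differences picks the offset with minimal total transport cost.
    median = sorted(diffs)[len(diffs) // 2]
    return sum(abs(d - median) for d in diffs)


# Hue bins 0 and 3 are neighbours on the circle, so the cost is 1 bin, not 3.
wraparound = cyclic_emd([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
opposite = cyclic_emd([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])
```

The wrap-around case shows why a cyclic distance matters for hue: a linear EMD would treat the first and last hue bins as maximally far apart.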

Wednesday, Dec 04, 2019, 11:15
Vision and AI, Room 1
Speaker: Liad Pollak
Title: Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning
PLEASE NOTE THE UNUSUAL DAY AND TIME

When a very fast dynamic event is recorded with a low-framerate camera, the resulting video suffers from severe motion blur (due to exposure time) and motion aliasing (due to low sampling rate in time). True Temporal Super-Resolution (TSR) is more than just temporal interpolation (increasing framerate). It also recovers new high temporal frequencies beyond the temporal Nyquist limit of the input video, thus resolving both motion blur and motion aliasing. In this work we propose a "Deep Internal Learning" approach for true TSR. We train a video-specific CNN on examples extracted directly from the low-framerate input video. Our method exploits the strong recurrence of small space-time patches inside a single video sequence, both within and across different spatio-temporal scales of the video. We further observe (for the first time) that small space-time patches also recur across dimensions of the video sequence, i.e., by swapping the spatial and temporal dimensions. In particular, the higher spatial resolution of video frames provides strong examples as to how to increase the temporal resolution of that video. Such internal video-specific examples give rise to strong self-supervision, requiring no data but the input video itself. This results in Zero-Shot Temporal-SR of complex videos, which removes both motion blur and motion aliasing, outperforming previous supervised methods trained on external video datasets.

* Joint work with Shai Bagon, Eyal Naor, George Pisha, Michal Irani

Thursday, Nov 28, 2019, 12:15
Vision and AI, Room 155
Speaker: Sefi Bell Kligler
Title: Blind Super-Resolution Kernel Estimation using an Internal-GAN
PLEASE NOTE THE UNUSUAL PLACE

Super resolution (SR) methods typically assume that the low-resolution (LR) image was downscaled from the unknown high-resolution (HR) image by a fixed "ideal" downscaling kernel (e.g. Bicubic downscaling). However, this is rarely the case in real LR images, in contrast to synthetically generated SR datasets. When the assumed downscaling kernel deviates from the true one, the performance of SR methods significantly deteriorates. This gave rise to Blind-SR - namely, SR when the downscaling kernel ("SR-kernel") is unknown. It was further shown that the true SR-kernel is the one that maximizes the recurrence of patches across scales of the LR image. In this paper we show how this powerful cross-scale recurrence property can be realized using Deep Internal Learning. We introduce "KernelGAN", an image-specific Internal-GAN, which trains solely on the LR test image at test time, and learns its internal distribution of patches. Its Generator is trained to produce a downscaled version of the LR test image, such that its Discriminator cannot distinguish between the patch distribution of the downscaled image, and the patch distribution of the original LR image. The Generator, once trained, constitutes the downscaling operation with the correct image-specific SR-kernel. KernelGAN is fully unsupervised, requires no training data other than the input image itself, and leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.

Thursday, Nov 21, 2019, 12:15
Vision and AI, Room 1
Speaker: Ohad Fried
Title: Tools for visual expression and communication.

Photos and videos are now a main mode of communication, used to tell stories, share experiences and convey ideas. However, common media editing tools are often either too complex to master, or oversimplified and limited. 
In this talk I will present my strategy towards the creation of media editing techniques that are easy to learn, yet expressive enough to reflect unique creative objectives. We will mostly discuss one specific domain --- human heads --- which are both extremely common (i.e. people care about people) and technologically challenging. I will present several works on editing video by editing text, perspective manipulation, and in-camera lighting feedback. I will also discuss exciting future opportunities related to neural rendering and digital representations of humans.

Ohad Fried is a postdoctoral research scholar at Stanford University. His work lies at the intersection of computer graphics, computer vision, and human-computer interaction. He holds a PhD in computer science from Princeton University, and an M.Sc. in computer science and a B.Sc. in computational biology from The Hebrew University. Ohad's research focuses on tools, algorithms, and new paradigms for photo and video editing. Ohad is the recipient of several awards, including a Siebel Scholarship and a Google PhD Fellowship. If you own a cable modem, there's a non-negligible chance that Ohad's code runs within it, so feel free to blame him for your slow internet connection.

Sunday, Sep 08, 2019, 11:15
Vision and AI, Room 1
Speaker: Trevor Darrell
Title: Adapting and Explaining Deep Learning for Autonomous Systems
Special Seminar

Learning of layered or "deep" representations has recently enabled low-cost sensors for autonomous vehicles and efficient automated analysis of visual semantics in online media. But these models have typically required prohibitive amounts of training data, and thus may only work well in the environment they have been trained in.  I'll describe recent methods in adversarial adaptive learning that excel when learning across modalities and domains. Further, these models have been unsatisfying in their complexity--with millions of parameters--and their resulting opacity. I'll report approaches which achieve explainable deep learning models, including both introspective approaches that visualize compositional structure in a deep network, and third-person approaches that can provide a natural language justification for the classification decision of a deep model.

Sunday, Aug 04, 2019, 13:15
Vision and AI, Room 1
Speaker: Rene Vidal
Title: On the Implicit Bias of Dropout

Dropout is a simple yet effective regularization technique that has been applied to various machine learning tasks, including linear classification, matrix factorization and deep learning. However, the theoretical properties of dropout as a regularizer remain quite elusive. This talk will present a theoretical analysis of dropout for single hidden-layer linear neural networks. We demonstrate that dropout is a stochastic gradient descent method for minimizing a certain regularized loss. We show that the regularizer induces solutions that are low-rank, in the sense of minimizing the number of neurons. We also show that the global optimum is balanced, in the sense that the product of the norms of the incoming and outgoing weight vectors is equal across all hidden nodes. Finally, we provide a complete characterization of the optimization landscape induced by dropout.
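A short worked identity shows why dropout behaves like a regularized loss in this setting. The notation is assumed for illustration and is not taken from the talk: the network is $x \mapsto U V^\top x$ with $r$ hidden nodes, $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$ (outgoing and incoming weights of node $i$), and each node is kept independently with probability $\theta$ and rescaled by $1/\theta$.

```latex
\mathbb{E}_{b}\left\| y - \tfrac{1}{\theta}\, U \operatorname{diag}(b)\, V^\top x \right\|^2
  = \left\| y - U V^\top x \right\|^2
  + \frac{1-\theta}{\theta} \sum_{i=1}^{r} \|u_i\|^2 \left( v_i^\top x \right)^2,
\qquad b_i \sim \mathrm{Bernoulli}(\theta)
```

The identity follows because $b_i/\theta$ has mean $1$ and variance $(1-\theta)/\theta$, and the $b_i$ are independent. If the inputs additionally have identity covariance (a further assumption), the expected penalty becomes $\frac{1-\theta}{\theta}\sum_i \|u_i\|^2 \|v_i\|^2$, a sum of products of squared incoming and outgoing norms, which is the quantity the balancedness statement in the abstract is about.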

Thursday, Jul 11, 2019, 12:15
Vision and AI, Room 1
Speaker: Anat Levin
Title: A Monte Carlo Framework for Rendering Speckle Statistics in Scattering Media

We present a Monte Carlo rendering framework for the physically-accurate simulation of speckle patterns arising from volumetric scattering of coherent waves. These noise-like patterns are characterized by strong statistical properties, such as the so-called memory effect. These properties are at the core of imaging techniques for applications as diverse as tissue imaging, motion tracking, and non-line-of-sight imaging. Our rendering framework can replicate these properties computationally, in a way that is orders of magnitude more efficient than alternatives based on directly solving the wave equations. At the core of our framework is a path-space formulation for the covariance of speckle patterns arising from a scattering volume, which we derive from first principles. We use this formulation to develop two Monte Carlo rendering algorithms: one for computing speckle covariance, and one for computing speckle fields directly. While approaches based on wave equation solvers require knowing the microscopic position of wavelength-sized scatterers, our approach takes as input only bulk parameters describing the statistical distribution of these scatterers inside a volume. We validate the accuracy of our framework by comparing against speckle patterns simulated using wave equation solvers, use it to simulate memory effect observations that were previously only possible through lab measurements, and demonstrate its applicability for computational imaging tasks.

Joint work with Chen Bar, Marina Alterman, Ioannis Gkioulekas 

Thursday, Jul 04, 2019, 12:15
Vision and AI, Room 1
Speaker: Nadav Dym
Title: Linearly Converging Quasi Branch and Bound Algorithms for Global Rigid Registration

Thursday, Jun 27, 2019, 12:15
Vision and AI, Room 1
Speaker: Ehud Barnea
Title: Exploring the Bounds of the Utility of Context for Object Detection

The recurring context in which objects appear holds valuable information that can be employed to predict their existence. This intuitive observation indeed led many researchers to endow appearance-based detectors with explicit reasoning about context. The underlying thesis suggests that stronger contextual relations would facilitate greater improvements in detection capacity. In practice, however, the observed improvement in many cases is modest at best, and often only marginal. In this work we seek to improve our understanding of this phenomenon, in part by pursuing an opposite approach. Instead of attempting to improve detection scores by employing context, we treat the utility of context as an optimization problem: to what extent can detection scores be improved by considering context or any other kind of additional information? With this approach we explore the bounds on improvement by using contextual relations between objects and provide a tool for identifying the most helpful ones. We show that simple co-occurrence relations can often provide large gains, while in other cases a significant improvement is simply impossible or impractical with either co-occurrence or more precise spatial relations. To better understand these results we then analyze the ability of context to handle different types of false detections, revealing that the tested contextual information cannot ameliorate localization errors, which severely limits its gains. These and additional insights further our understanding of where and why utilization of context for object detection succeeds and fails.

Thursday, Jun 13, 2019, 12:15
Vision and AI, Room 1
Speaker: Eitan Richardson
Title: On GANs and GMMs

A longstanding problem in machine learning is to find unsupervised methods that can learn the statistical structure of high dimensional signals. In recent years, GANs have gained much attention as a possible solution to the problem, and in particular have shown the ability to generate remarkably realistic high resolution sampled images. At the same time, many authors have pointed out that GANs may fail to model the full distribution ("mode collapse") and that using the learned models for anything other than generating samples may be very difficult. In this paper, we examine the utility of GANs in learning statistical models of images by comparing them to perhaps the simplest statistical model, the Gaussian Mixture Model. First, we present a simple method to evaluate generative models based on relative proportions of samples that fall into predetermined bins. Unlike previous automatic methods for evaluating models, our method does not rely on an additional neural network nor does it require approximating intractable computations. Second, we compare the performance of GANs to GMMs trained on the same datasets. While GMMs have previously been shown to be successful in modeling small patches of images, we show how to train them on full sized images despite the high dimensionality. Our results show that GMMs can generate realistic samples (although less sharp than those of GANs) but also capture the full distribution, which GANs fail to do. Furthermore, GMMs allow efficient inference and explicit representation of the underlying statistical structure. Finally, we discuss how GMMs can be used to generate sharp images.

Thursday, May 30, 2019, 12:15
Vision and AI, Room 1
Speaker: Yael Moses
Title: On the Role of Geometry in Geo-Localization

We consider the geo-localization task: finding the pose (position & orientation) of a camera in a large 3D scene from a single image. We aim to explore experimentally the role of geometry in geo-localization in a convolutional neural network (CNN) solution. We do so by ignoring the often available texture of the scene. We therefore deliberately avoid using texture or rich geometric details and use images projected from a simple 3D model of a city, which we term lean images. Lean images contain solely information that relates to the geometry of the area viewed (edges, faces, or relative depth). We find that the network is capable of estimating the camera pose from the lean images, and it does so not by memorization but by some measure of geometric learning of the geographical area. The main contributions of this work are: (i) providing insight into the role of geometry in the CNN learning process; and (ii) demonstrating the power of CNNs for recovering camera pose using lean images.
This is joint work with Moti Kadosh and Ariel Shamir.

Thursday, May 16, 2019, 12:15
Vision and AI, Room 155
Speaker: Gilad Cohen
Title: The Connection between DNNs and Classic Classifiers: Generalize, Memorize, or Both?

This work studies the relationship between the classification performed by deep neural networks (DNNs) and the decision of various classic classifiers, namely k-nearest neighbors (k-NN), support vector machines (SVM), and logistic regression (LR). This is studied at various layers of the network, providing us with new insights on the ability of DNNs to both memorize the training data and generalize to new data at the same time, where k-NN serves as the ideal estimator that perfectly memorizes the data.
First, we show that DNNs' generalization improves gradually along their layers and that memorization of non-generalizing networks happens only at the last layers.
We also observe that the behavior of DNNs compared to the linear classifiers SVM and LR is largely the same on the training and test data, regardless of whether the network generalizes. On the other hand, the similarity to k-NN holds only in the absence of overfitting. This suggests that the k-NN behavior of the network on new data is a good sign of generalization. Moreover, this allows us to use existing k-NN theory for DNNs.

Thursday, Apr 11, 2019, 12:15
Vision and AI, Room 1
Speaker: Leonid Karlinsky
Title: Few-Shot Object X
Abstract:

Learning to classify and localize instances of objects that belong to new categories, while training on just one or very few examples, is a long-standing challenge in modern computer vision. This problem is generally referred to as 'few-shot learning'. It is particularly challenging for modern deep-learning based methods, which tend to be notoriously hungry for training data. In this talk I will cover several of our recent research papers offering advances on these problems using example synthesis (hallucination) and metric learning techniques and achieving state-of-the-art results on known and new few-shot benchmarks. In addition to covering the relatively well studied few-shot classification task, I will show how our approaches can address the yet under-studied few-shot localization and multi-label few-shot classification tasks.

Thursday, Apr 04, 2019, 12:15
Vision and AI, Room 1
Speaker: Tom Tirer
Title: Image Restoration by Iterative Denoising and Backward Projections
Abstract:

Inverse problems appear in many applications, such as image deblurring, inpainting and super-resolution. The common approach to address them is to design a specific algorithm (or, more recently, a deep neural network) for each problem. The Plug-and-Play (P&P) framework, which has been recently introduced, allows solving general inverse problems by leveraging the impressive capabilities of existing denoising algorithms. While this fresh strategy has found many applications, burdensome parameter tuning is often required in order to obtain high-quality results. In this work, we propose an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning (which can also translate into fewer pre-trained denoising neural networks). First, we transform a typical cost function, composed of fidelity and prior terms, into a closely related, novel optimization problem. Then, we propose an efficient minimization scheme with a plug-and-play property, i.e., the prior term is handled solely by a denoising operation. Finally, we present an automatic tuning mechanism to set the method's parameters. We provide a theoretical analysis of the method, and empirically demonstrate its impressive results for image inpainting, deblurring and super-resolution. For the latter, we also present an image-adaptive learning approach that further improves the results.
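The plug-and-play property described above (the prior is handled solely by a denoising operation) can be illustrated on a toy 1-D inpainting problem. This is only a sketch under stated assumptions: the moving-average `box_denoise` stands in for a real off-the-shelf denoiser, the back-projection step simply re-imposes the known samples, and none of the cost function or automatic tuning from the talk is reproduced.

```python
def box_denoise(signal, radius=1):
    """Stand-in denoiser (assumption): a moving average over a small window."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out

def inpaint(observed, mask, iterations=50):
    """Alternate a back-projection onto the measurements with a denoising step.

    mask[i] is True where observed[i] is a known sample.
    """
    known = [v for v, m in zip(observed, mask) if m]
    estimate = [sum(known) / len(known)] * len(observed)
    for _ in range(iterations):
        # Back-projection: agree with the measurements wherever they exist.
        projected = [o if m else e for o, m, e in zip(observed, mask, estimate)]
        # Prior step: handled only by the (pluggable) denoiser.
        estimate = box_denoise(projected)
    return estimate

# A linear ramp with one missing sample at index 5 (the 0.0 there is a placeholder).
ramp = [float(i) for i in range(10)]
mask = [i != 5 for i in range(10)]
observed = [v if m else 0.0 for v, m in zip(ramp, mask)]
restored = inpaint(observed, mask)
```

With this stand-in denoiser the missing sample converges to the average of its neighbours, which is the value a ramp would have there; swapping in a stronger denoiser changes the prior without touching the loop.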

Thursday, Mar 28, 2019, 12:15
Vision and AI, Room 1
Speaker: Yizhak Ben-Shabat
Title: Classification, Segmentation, and Normal Estimation of 3D Point Clouds using Deep Learning
Abstract:

Modern robotic and vision systems are often equipped with a direct 3D data acquisition device, e.g. a LiDAR or RGBD camera, which provides a rich 3D point cloud representation of the surroundings. Point clouds have been used successfully for localization and mapping tasks, but their use in semantic understanding has not been fully explored. Recent advances in deep learning methods for images along with the growing availability of 3D point cloud data have fostered the development of new 3D deep learning methods that use point clouds for semantic understanding. However, their unstructured and unordered nature makes them an unnatural input to deep learning methods. In this work we propose solutions to three semantic understanding and geometric processing tasks: point cloud classification, segmentation, and normal estimation. We first propose a new global representation for point clouds called the 3D Modified Fisher Vector (3DmFV). The representation is structured and independent of order and sample size. As such, it can be used with 3DmFV-Net, a newly designed 3D CNN architecture for classification. The representation introduces a conceptual change for processing point clouds by using a global and structured spatial distribution. We demonstrate the classification performance on the ModelNet40 CAD dataset and the Sydney outdoor dataset obtained by LiDAR. We then extend the architecture to solve a part segmentation task by performing per point classification. The results here are demonstrated on the ShapeNet dataset. We use the proposed representation to solve a fundamental and practical geometric processing problem of normal estimation using a new 3D CNN (Nesti-Net). To that end, we propose a local multi-scale representation called Multi Scale Point Statistics (MuPS) and show that using structured spatial distributions is as effective for local multi-scale analysis as for global analysis.
We further show that multi-scale data integrates well with a Mixture of Experts (MoE) architecture. The MoE enables the use of semi-supervised scale prediction to determine the appropriate scale for a given local geometry. We evaluate our method on the PCPNet dataset. For all methods we achieved state-of-the-art performance without using an end-to-end learning approach.

Thursday, Feb 07, 2019, 12:15
Vision and AI, Room 1
Speaker: Tavi Halperin
Title: Using visual and auditory cues for audio enhancement
Abstract:

In this talk I will present two recent works:

1) Dynamic Temporal Alignment of Speech to Lips

Many speech segments in movies are re-recorded in a studio during post-production, to compensate for poor sound quality as recorded on location. Manual alignment of the newly-recorded speech with the original lip movements is a tedious task. We present an audio-to-video alignment method for automating speech to lips alignment, stretching and compressing the audio signal to match the lip movements. This alignment is based on deep audio-visual features, mapping the lips video and the speech signal to a shared representation. Using this shared representation we compute the lip-sync error between every short speech period and every video frame, followed by the determination of the optimal corresponding frame for each short sound period over the entire video clip. We demonstrate successful alignment both quantitatively, using a human perception-inspired metric, and qualitatively. The strongest advantage of our audio-to-video approach is in cases where the original voice is unclear, and where a constant shift of the sound cannot give perfect alignment. In these cases, state-of-the-art methods will fail.

2) Neural separation of observed and unobserved distributions

Separating mixed distributions is a long standing challenge for machine learning and signal processing. Most current methods either rely on making strong assumptions on the source distributions or rely on having training samples of each source in the mixture. In this work, we introduce a new method - Neural Egg Separation - to tackle the scenario of extracting a signal from an unobserved distribution additively mixed with a signal from an observed distribution. Our method iteratively learns to separate the known distribution from progressively finer estimates of the unknown distribution. In some settings, Neural Egg Separation is initialization sensitive; we therefore introduce Latent Mixture Masking, which ensures a good initialization. Extensive experiments on audio and image separation tasks show that our method outperforms current methods that use the same level of supervision, and often achieves similar performance to full supervision.
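For the first work above, the step that picks the best video frame for every short speech period, given a matrix of lip-sync errors, can be sketched as a dynamic-programming alignment. The abstract does not name the optimization used; the classic dynamic time warping recursion below is an assumption chosen for illustration, and the error matrix is a toy one rather than one computed from audio-visual features.

```python
def align(cost):
    """Find a monotonic, minimal-cost path through an error matrix.

    cost[i][j] is the mismatch between audio segment i and video frame j.
    Returns the path as a list of (audio_index, frame_index) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,                 # stretch the video
                acc[i][j - 1] if j > 0 else inf,                 # stretch the audio
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # advance both
            )
            acc[i][j] = cost[i][j] + best
    # Backtrack from the last cell to recover the path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path

# Toy 1-D "features": the audio lingers on the middle value for two segments.
audio = [0, 1, 1, 2]
video = [0, 1, 2]
cost = [[abs(a - v) for v in video] for a in audio]
path = align(cost)
```

Here both middle audio segments map to the same video frame, which is the kind of local stretching that a constant time shift cannot express.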

Thursday, Jan 31, 2019, 12:15
Vision and AI, Room 1
Speaker: Heli Ben-Hamu
Title: Multi-chart Generative Surface Modeling
Abstract:

We introduce a 3D shape generative model based on deep neural networks. A new image-like (i.e., tensor) data representation for genus-zero 3D shapes is devised. It is based on the observation that complicated shapes can be well represented by multiple parameterizations (charts), each focusing on a different part of the shape.

The new tensor data representation is used as input to Generative Adversarial Networks for the task of 3D shape generation. The 3D shape tensor representation is based on a multi-chart structure that enjoys a shape covering property and scale-translation rigidity. Scale-translation rigidity facilitates high quality 3D shape learning and guarantees unique reconstruction. The multi-chart structure uses as input a dataset of 3D shapes (with arbitrary connectivity) and a sparse correspondence between them.

The output of our algorithm is a generative model that learns the shape distribution and is able to generate novel shapes, interpolate shapes, and explore the generated shape space. The effectiveness of the method is demonstrated for the task of anatomic shape generation including human body and bone (teeth) shape generation.

Thursday, Jan 24, 2019, 12:15
Vision and AI, Room 1
Speaker: Idan Kligvasser
Title: xUnit: Learning a Spatial Activation Function for Efficient Visual Inference
Abstract:

In recent years, deep neural networks (DNNs) achieved unprecedented performance in many vision tasks. However, state-of-the-art results are typically achieved by very deep networks, which can reach tens of layers with tens of millions of parameters. To make DNNs implementable on platforms with limited resources, it is necessary to weaken the tradeoff between performance and efficiency. In this work, we propose a new activation unit, which is suitable for both high and low level vision problems. In contrast to the widespread per-pixel activation units, like ReLUs and sigmoids, our unit implements a learnable nonlinear function with spatial connections. This enables the net to capture much more complex features, thus requiring a significantly smaller number of layers in order to reach the same performance. We illustrate the effectiveness of our units through experiments with state-of-the-art nets for classification, denoising, and super resolution. With our approach, we are able to reduce the size of these models by nearly 50% without incurring any degradation in performance.

*Spotlight presentation at CVPR'18 (+ submitted extension)

A joint work with Tamar Rott Shaham and Tomer Michaeli
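The contrast between a per-pixel activation and an activation with spatial connections can be sketched in a few lines. The abstract only states that the unit implements a learnable nonlinear function with spatial connections; the specific form below (a Gaussian gate computed from a convolution of the rectified input, multiplied with the input) and all names are assumptions for illustration, shown in 1-D with a fixed kernel in place of learned weights.

```python
import math

def relu(x):
    """Per-pixel activation: each output depends on one input only."""
    return [max(0.0, v) for v in x]

def conv1d_same(x, kernel):
    """Zero-padded 1-D correlation with an odd-length kernel."""
    r = len(kernel) // 2
    out = []
    for i in range(len(x)):
        s = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(x):
                s += w * x[j]
        out.append(s)
    return out

def spatial_gate_activation(x, kernel):
    """Assumed form: y_i = x_i * exp(-d_i^2), with d = conv(relu(x), kernel).

    The gate at position i depends on neighbouring inputs, so the
    nonlinearity has spatial connections; `kernel` would be learned.
    """
    d = conv1d_same(relu(x), kernel)
    return [xi * math.exp(-di * di) for xi, di in zip(x, d)]

x = [1.0, 1.0]
looks_left = [1.0, 0.0, 0.0]   # gate at i reads the input at i - 1
gated = spatial_gate_activation(x, looks_left)
```

With the `looks_left` kernel the first sample passes unchanged (it has no left neighbour) while the second is attenuated by its neighbour, something no per-pixel function of a single input can do.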

Thursday, Jan 17, 2019, 12:15
Vision and AI, Room 1
Speaker: Tali Treibitz
Title: Variational Plane-Sweeping for Multi-Image Alignment Under Extreme Noise and How is that Related to Underwater Robots?
Abstract:

We tackle the problem of multiple image alignment and 3D reconstruction under extreme noise.
Modern alignment schemes, based on similarity measures, feature matching and optical flow are often pairwise, or assume global alignment. Nevertheless, under extreme noise, the alignment success sharply deteriorates, since each image does not contain enough information. Yet, when multiple images are well aligned, the signal emerges from the stack.
As the problems of alignment and 3D reconstruction are coupled, we constrain the solution by taking into account only alignments that are geometrically feasible and solve for the entire stack simultaneously. The solution is formulated as a variational problem where only a single variable per pixel, the scene's distance, is solved for. Thus, the complexity of the algorithm is independent of the number of images. Our method outperforms state-of-the-art techniques, as indicated by our simulations and by experiments on real-world scenes.
Finally, I will discuss how this algorithm is related to our current and planned work with underwater robots.

Thursday, Jan 10, 2019, 12:15
Vision and AI, Room 1
Speaker: Roey Mechrez
Title: Controlling the Latent Space at Inference Time
Abstract:

The common practice in image generation is to train a network and fix it to a specific working point. In contrast, in this work we would like to provide control at inference time in a generic way. We propose to add a second training phase, where we train an additional tuning block. The goal of these tuning blocks is to provide us with control over the output image by modifying the latent space representation. I will first present our latest attempt in the domain of generative adversarial networks, where we aim to improve the quality of the results using the discriminator information -- we name this the adversarial feedback loop. Second, I will present Dynamic-Net, where we train a single network such that it can emulate many networks trained with different objectives. The main assumption of both works is that we can learn how to change the latent space in order to achieve a specific goal. Evaluation on many applications shows that the latent space can be manipulated such that it allows us to have diversified control at inference time.
* Joint work with Firas Shama*, Alon Shoshan* and Lihi Zelnik Manor.

Thursday, Jan 03, 2019, 12:15
Vision and AI, Room 1
JOINT MACHINE LEARNING & VISION SEMINAR
Speaker: Ofir Lindenbaum
Title: Geometry Based Data Generation
Abstract:

Imbalanced data is problematic for many machine learning applications. Some classifiers, for example, can be strongly biased due to diversity of class populations. In unsupervised learning, imbalanced sampling of ground truth clusters can lead to distortions of the learned clusters. Removing points (undersampling) is a simple solution, but this strategy leads to information loss and can decrease generalization performance. We instead focus on data generation to artificially balance the data.
In this talk, we present a novel data synthesis method, which we call SUGAR (Synthesis Using Geometrically Aligned Random-walks) [1], for generating data in its original feature space while following its intrinsic geometry. This geometry is inferred by a diffusion kernel [2] that captures a data-driven manifold and reveals underlying structure in the full range of the data space -- including undersampled regions that can be augmented by new synthesized data.
We demonstrate the potential advantages of the approach by improving both classification and clustering performance on numerous datasets. Finally, we show that this approach is especially useful in biology, where rare subpopulations and gene-interaction relationships are affected by biased sampling.

[1] Lindenbaum, Ofir, Jay S. Stanley III, Guy Wolf, and Smita Krishnaswamy. "Geometry-Based Data Generation." NIPS (2018).

[2] Coifman, Ronald R., and Stéphane Lafon. "Diffusion maps." Applied and computational harmonic analysis (2006).

Thursday, Dec 27, 2018, 12:15
Vision and AI, Room 1
Speaker: Greg Shakhnarovich
Title: Style Transfer by Relaxed Optimal Transport and Self-Similarity
Abstract:

The goal of style transfer algorithms is to render the content of one image using the style of another. We propose Style Transfer by Relaxed Optimal Transport and Self-Similarity (STROTSS), a new optimization-based style transfer algorithm. We extend our method to allow user-specified point-to-point or region-to-region control over visual similarity between the style image and the output. Such guidance can be used to either achieve a particular visual effect or correct errors made by unconstrained style transfer. In order to quantitatively compare our method to prior work, we conduct a large-scale user study designed to assess the style-content tradeoff across settings in style transfer algorithms. Our results indicate that for any desired level of content preservation, our method provides higher quality stylization than prior work.
Joint work with Nick Kolkin and Jason Salavon.

Thursday, Dec 20, 2018, 12:15
Vision and AI, Room 1
Speaker: Yedid Hoshen
Title: Non-Adversarial Machine Learning for Unsupervised Translation across Languages and Images
Abstract:

This talk will describe my past and ongoing work on translating images and words between very different datasets without supervision. Humans often do not require supervision to make connections between very different sources of data, but this is still difficult for machines. Recently, great progress was made by using adversarial training, a powerful yet tricky method. Although adversarial methods have had great success, they have critical failings which significantly limit their breadth of applicability and motivate research into alternative non-adversarial methods. In this talk, I will describe novel non-adversarial methods for unsupervised word translation and for translating images between very different datasets (joint work with Lior Wolf). As image generation models are an important component in our method, I will present a non-adversarial image generation approach, which is often better than current adversarial approaches (joint work with Jitendra Malik).

Thursday, Dec 13, 2018, 12:15
Vision and AI, Room 1
Speaker: Tali Dekel
Title: Re-rendering Reality
Abstract:

We all capture the world around us through digital data such as images, videos and sound. However, in many cases, we are interested in certain properties of the data that are either not available or difficult to perceive directly from the input signal. My goal is to "Re-render Reality", i.e., develop algorithms that analyze digital signals and then create new versions of them that allow us to see and hear better. In this talk, I'll present a variety of methodologies aimed at enhancing the way we perceive our world through modified, re-rendered output. These works combine ideas from signal processing, optimization, computer graphics, and machine learning, and address a wide range of applications. More specifically, I'll demonstrate how we can automatically reveal subtle geometric imperfections in images, visualize human motion in 3D, and use visual signals to help us separate and mute interfering sounds in a video. Finally, I'll discuss some of my future directions and work in progress.

BIO: Tali is a Senior Research Scientist at Google, Cambridge, developing algorithms at the intersection of computer vision and computer graphics. Before Google, she was a Postdoctoral Associate at the Computer Science and Artificial Intelligence Lab (CSAIL) at MIT, working with Prof. William T. Freeman. Tali completed her Ph.D. studies at the School of Electrical Engineering, Tel-Aviv University, Israel, under the supervision of Prof. Shai Avidan and Prof. Yael Moses. Her research interests include computational photography, image synthesis, geometry and 3D reconstruction.

Thursday, Dec 06, 2018, 12:15
Vision and AI, Room 1
Speaker: Yuval Bahat
Title: Exploiting Deviations from Ideal Visual Recurrence
Abstract:

Visual repetitions are abundant in our surrounding physical world: small image patches tend to reoccur within a natural image, and across different rescaled versions thereof. Similarly, semantic repetitions appear naturally inside an object class within image datasets, as a result of different views and scales of the same object. In my thesis I studied deviations from these expected repetitions, and demonstrated how this information can be exploited to tackle both low-level and high-level vision tasks. These include “blind” image reconstruction tasks (e.g. dehazing, deblurring), image classification confidence estimation, and more.

Thursday, Nov 29, 2018, 12:15
Vision and AI, Room 155
Speaker: Yair Weiss
Title: Why do deep convolutional networks generalize so poorly to small image transformations?
Abstract:

Deep convolutional network architectures are often assumed to guarantee generalization for small image translations and deformations. In this paper we show that modern CNNs (VGG16, ResNet50, and InceptionResNetV2) can drastically change their output when an image is translated in the image plane by a few pixels, and that this failure of generalization also happens with other realistic small image transformations. Furthermore, we see these failures to generalize more frequently in more modern networks. We show that these failures are related to the fact that the architecture of modern CNNs ignores the classical sampling theorem, so that generalization is not guaranteed. We also show that biases in the statistics of commonly used image datasets make it unlikely that CNNs will learn to be invariant to these transformations. Taken together, our results suggest that the performance of CNNs in object recognition falls far short of the generalization capabilities of humans.
Joint work with Aharon Azulay
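The link between subsampling and sensitivity to small translations can be seen on a toy 1-D signal. This is a minimal sketch and not the experiments from the talk: the stride-2 `subsample` stands in for a strided layer, the circular `shift` stands in for a one-pixel translation, and the `[1, 2, 1] / 4` `blur` is an assumed low-pass filter of the kind the sampling theorem calls for before subsampling.

```python
def subsample(signal, stride=2):
    """Keep every `stride`-th sample, as a strided layer would."""
    return signal[::stride]

def shift(signal, k=1):
    """Circular shift to the right by k samples."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

def blur(signal):
    """Circular [1, 2, 1] / 4 low-pass filter applied before subsampling."""
    n = len(signal)
    return [(signal[i - 1] + 2 * signal[i] + signal[(i + 1) % n]) / 4
            for i in range(n)]

# The highest-frequency pattern a sampled signal can carry.
signal = [0, 1, 0, 1, 0, 1, 0, 1]

plain = subsample(signal)                  # sees only the zeros
plain_shifted = subsample(shift(signal))   # sees only the ones

smooth = subsample(blur(signal))
smooth_shifted = subsample(blur(shift(signal)))
```

Without the low-pass step, a one-sample shift flips the subsampled output from all zeros to all ones; with it, both versions agree. The filter removes exactly the frequency content that the coarser grid cannot represent.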

Thursday, Nov 22, 2018, 12:15
Vision and AI, Room 1
Speaker: Yehuda Dar
Title: System-Aware Compression: Optimizing Imaging Systems from the Compression Standpoint
Abstract:

In typical imaging systems, an image/video is first acquired, then compressed for transmission or storage, and eventually presented to human observers using different and often imperfect display devices. While the resulting quality of the perceived output image may severely be affected by the acquisition and display processes, these degradations are usually ignored in the compression stage, leading to an overall sub-optimal system performance. In this work we propose a compression methodology to optimize the system's end-to-end reconstruction error with respect to the compression bit-cost. Using the alternating direction method of multipliers (ADMM) technique, we show that the design of the new globally-optimized compression reduces to a standard compression of a "system adjusted" signal. Essentially, we propose a new practical framework for the information-theoretic problem of remote source coding. The main ideas of our method are further explained using rate-distortion theory for Gaussian signals. We experimentally demonstrate our framework for image and video compression using the state-of-the-art HEVC standard, adjusted to several system layouts including acquisition and rendering models. The experiments established our method as the best approach for optimizing the system performance at high bit-rates from the compression standpoint.
In addition, we also relate the proposed approach to signal restoration using complexity regularization, where the likelihood of candidate solutions is evaluated based on their compression bit-costs.
Using our ADMM-based approach, we present new restoration methods relying on repeated applications of standard compression techniques. Thus, we restore signals by leveraging state-of-the-art models designed for compression. The presented experiments show good results for image deblurring and inpainting using the JPEG2000 and HEVC compression standards.
* Joint work with Prof. Alfred Bruckstein and Prof. Michael Elad.
** More details about the speaker and his research work are available at



Thursday, Jul 05, 2018, 12:15
Vision and AI, Room 1
Speaker: Netalee Efrat
Title: Beyond the limitations of sensors and displays
Abstract:

In this talk I explore various limitations of sensors and displays, and suggest new ways to overcome them.
These include:

  1. The limited 2D displays of today's screens - I will show how we can design new displays that enable us to see in 3D *without* any glasses.
  2. The limited spatial resolution of images - I will discuss the crucial factors for successful Super-Resolution.
  3. The poor image quality due to motion blur (due to camera motion or scene motion) - I will present a new approach for Blind Non-Uniform Deblurring.

Thursday, Jun 28, 2018, 12:15
Vision and AI, Room 1
Speaker: Ariel Ephrat
Title: Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Abstract:

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).

Joint work with Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, Bill Freeman and Miki Rubinstein of Google Research.

Thursday, Jun 21, 2018, 12:15
Vision and AI, Room 1
Speaker: Danielle Ezuz
Title: Reversible Harmonic Maps
Abstract:

Information transfer between triangle meshes is of great importance in computer graphics and geometry processing. To facilitate this process, a smooth and accurate map is typically required between the two meshes. While such maps can sometimes be computed between nearly-isometric meshes, the more general case of meshes with diverse geometries remains challenging. This talk describes a novel approach for direct map computation between triangle meshes without mapping to an intermediate domain, which optimizes for the harmonicity and reversibility of the forward and backward maps. Our method is general both in the information it can receive as input, e.g. point landmarks, a dense map or a functional map, and in the diversity of the geometries to which it can be applied. We demonstrate that our maps exhibit lower conformal distortion than the state-of-the-art, while succeeding in correctly mapping key features of the input shapes.

Thursday, Jun 07, 2018, 12:15
Vision and AI, Room 1
Speaker: Mark Sheinin
Title: Leveraging Hidden Structure in Unstructured Illumination
Abstract:

Artificial illumination plays a vital role in human civilization. In computer vision, artificial light is extensively used to recover 3D shape, reflectance, and further scene information. However, in most computer vision applications using artificial light, some additional structure is added to the illumination to facilitate the task at hand. In this work, we show that the ubiquitous alternating current (AC) lights already have a valuable inherent structure that stems from bulb flicker. By passively sensing scene flicker, we reveal new scene information which includes: the type of bulbs in the scene, the phases of the electric grid up to city scale, and the light transport matrix. This information yields unmixing of reflections and semi-reflections, nocturnal high dynamic range, and scene rendering with bulbs not observed during acquisition. Moreover, we provide methods that enable capturing scene flicker using almost any off-the-shelf camera, including smartphones.

In underwater imaging, similar structures are added to illumination sources to overcome the limited imaging range. In this setting, we show that by optimizing camera and light positions while taking the effect of light propagation in scattering media into account we achieve superior imaging of underwater scenes while using simple, unstructured illumination.

Thursday, May 17, 2018, 12:15
Vision and AI, Room 1
Speaker: Matan Sela
Title: Deep Semi-supervised 3D Face Reconstruction
Abstract:

Fast and robust three-dimensional reconstruction of facial geometric structure from a single image is a challenging task with numerous applications in computer vision and graphics. We propose to leverage the power of convolutional neural networks (CNNs) to produce highly detailed face reconstruction directly from a single image. For this purpose, we introduce an end-to-end CNN framework which constructs the shape in a coarse-to-fine fashion. The proposed architecture is composed of two main blocks, a network which recovers the coarse facial geometry (CoarseNet), followed by a CNN which refines the facial features of that geometry (FineNet). To alleviate the lack of training data for face reconstruction, we train our model using synthetic data as well as unlabeled facial images collected from the internet. The proposed model successfully recovers facial geometries from real images, even for faces with extreme expressions and under varying lighting conditions. In this talk, I will summarize three papers that were published at 3DV 2016, CVPR 2017 (as an oral presentation), and ICCV 2017.

Bio: Matan Sela holds a Ph.D. in Computer Science from the Technion - Israel Institute of Technology. He received his B.Sc. and M.Sc. (both with honors) in electrical engineering from the Technion in 2012 and 2015, respectively. During summer 2017, he was a research intern at Google, Mountain View, California. His interests are Machine Learning, Computer Vision, Computer Graphics, Geometry Processing and any combination thereof.

Thursday, May 10, 2018, 12:15
Vision and AI, Room 1
Speaker: Lihi Zelnik-Manor
Title: Maintaining Internal Image Statistics in Synthesized Images
Abstract:

Recent work has shown impressive success in automatically creating new images with desired properties such as transferring painterly style, modifying facial expressions or manipulating the center of attention of the image. In this talk I will discuss two of the standing challenges in image synthesis and how we tackle them:
- I will describe our efforts in making the synthesized images more photo-realistic.
- I will further show how we can broaden the scope of data that can be used for training synthesis networks, and with that provide a solution to new applications.

Thursday, Apr 26, 2018, 12:15
Vision and AI, Room 1
Speaker: Sharon Fogel (Tel-Aviv University) and Hadar Averbuch-Elor (Tel Aviv University and Amazon AI)
Title: Clustering-driven Deep Embedding with Pairwise Constraints
Abstract:

Recently, there has been increasing interest in leveraging the competence of neural networks to analyze data. In particular, new clustering methods that employ deep embeddings have been presented.

In this talk, we depart from centroid-based models and suggest a new framework, called Clustering-driven deep embedding with PAirwise Constraints (CPAC), for non-parametric clustering using a neural network. We present a clustering-driven embedding based on a Siamese network that encourages pairs of data points to output similar representations in the latent space. Our pair-based model allows augmenting the information with labeled pairs to constitute a semi-supervised framework. Our approach is based on analyzing the losses associated with each pair to refine the set of constraints.
We show that clustering performance increases when using this scheme, even with a limited amount of user queries.
We present state-of-the-art results on different types of datasets and compare our performance to parametric and non-parametric techniques.

Thursday, Apr 12, 2018, 12:15
Vision and AI, Room 1
Speaker: Yoav Levine
Title: Bridging Many-Body Quantum Physics and Deep Learning via Tensor Networks
Abstract:

The harnessing of modern computational abilities for many-body wave-function representations is naturally placed as a prominent avenue in contemporary condensed matter physics. Specifically, highly expressive computational schemes that are able to efficiently represent the entanglement properties which characterize many-particle quantum systems are of interest. In the seemingly unrelated field of machine learning, deep network architectures have exhibited an unprecedented ability to tractably encompass the convoluted dependencies which characterize hard learning tasks such as image classification or speech recognition. However, theory is still lagging behind these rapid empirical advancements, and key questions regarding deep learning architecture design have no adequate answers. In the presented work, we establish a Tensor Network (TN) based common language between the two disciplines, which allows us to offer bidirectional contributions. By showing that many-body wave-functions are structurally equivalent to mappings of convolutional and recurrent arithmetic circuits, we construct their TN descriptions in the form of Tree and Matrix Product State TNs, and bring forth quantum entanglement measures as natural quantifiers of dependencies modeled by such networks. Accordingly, we propose a novel entanglement based deep learning design scheme that sheds light on the success of popular architectural choices made by deep learning practitioners, and suggests new practical prescriptions. Specifically, our analysis provides prescriptions regarding connectivity (pooling geometry) and parameter allocation (layer widths) in deep convolutional networks, and allows us to establish a first of its kind theoretical assertion for the exponential enhancement in long term memory brought forth by depth in recurrent networks. 
In the other direction, we identify that an inherent re-use of information in state-of-the-art deep learning architectures is a key trait that distinguishes them from TN based representations. Therefore, we suggest a new TN manifestation of information re-use, which enables TN constructs of powerful architectures such as deep recurrent networks and overlapping convolutional networks. This allows us to theoretically demonstrate that the entanglement scaling supported by state-of-the-art deep learning architectures can surpass that of commonly used expressive TNs in one dimension, and can support volume law entanglement scaling in two dimensions with a number of parameters that is the square root of that required by Restricted Boltzmann Machines. We thus provide theoretical motivation to shift trending neural-network based wave-function representations closer to state-of-the-art deep learning architectures.

Thursday, Mar 29, 2018, 12:15
Vision and AI, Room 1
Speaker: Tomer Michaeli
Title: The Perception-Distortion Tradeoff
Abstract:

Image restoration algorithms are typically evaluated by some distortion measure (e.g. PSNR, SSIM, IFC, VIF) or by human opinion scores that quantify perceptual quality. In this work, we prove mathematically that distortion and perceptual quality are at odds with each other. Specifically, we study the optimal probability of correctly discriminating the outputs of an image restoration algorithm from real images. We show that as the mean distortion decreases, this probability must increase (indicating worse perceptual quality). Contrary to common belief, this result holds true for any distortion measure, and is not only a problem of the PSNR or SSIM criteria. However, as we show experimentally, for some measures it is less severe (e.g. distance between VGG features). We also show that generative adversarial nets (GANs) provide a principled way to approach the perception-distortion bound. This constitutes theoretical support for their observed success in low-level vision tasks. Based on our analysis, we propose a new methodology for evaluating image restoration methods, and use it to perform an extensive comparison between recent super-resolution algorithms. Our study reveals which methods are currently closest to the theoretical perception-distortion bound.
* Joint work with Yochai Blau.
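
As a concrete instance of the distortion measures named in the abstract, the following is a minimal sketch of PSNR for 8-bit grayscale images stored as nested lists. The function names `mse` and `psnr` and the list-of-rows image format are assumptions made for illustration; the sketch covers only the distortion side of the tradeoff and says nothing about perceptual quality.

```python
import math

def mse(reference, restored):
    """Mean squared error between two equally sized grayscale images (lists of rows)."""
    total, count = 0.0, 0
    for ref_row, res_row in zip(reference, restored):
        for r, s in zip(ref_row, res_row):
            total += (r - s) ** 2
            count += 1
    return total / count

def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB; a higher value means lower distortion."""
    err = mse(reference, restored)
    if err == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(peak ** 2 / err)
```

A restoration method can score well on such a measure and still be easy to tell apart from real images, which is the tension the talk formalizes.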

Sunday, Mar 18, 2018, 12:15
Vision and AI, Room 1
Speaker: Amir Rosenfeld
Title: Striving for Adaptive Representations in Neural Networks
Please note the Special Date!!
Abstract:
Some sub-problems in visual recognition already enjoy very impressive performance. However, the deep-learning solutions that underlie them require large training data, are brittle to domain shift and incur a large cost in parameters for adapting to new domains - all in stark contrast to what is observed in human beings. I will talk about my recent work in this area: (1) I will introduce a new dataset on which the strongest learned representations perform very poorly in mimicking human perceptual similarity; (2) I will discuss recent results hinting that the parameters in neural networks are under-utilized, and show an alternative method for transfer learning without forgetting at a small parameter cost; and (3) I will show some recent work on conditional computation, inspired by the psychophysical phenomenon of visual priming in humans.

Thursday, Mar 15, 2018, 12:15
Vision and AI, Room 1
Speaker: Aviad Levis
Title: Three-Dimensional Scatterer Tomography
Abstract:

Scattering effects in images, including those related to haze, fog, and appearance of clouds, are fundamentally dictated by microphysical characteristics of the scatterers. We define and derive recovery of these characteristics, in a three-dimensional heterogeneous medium. Recovery is based on a novel tomography approach. Multiview (multi-angular) and multi-spectral data are linked to the underlying microphysics using 3D radiative transfer, accounting for multiple-scattering. Despite the nonlinearity of the tomography model, inversion is enabled using a few approximations that we describe. As a case study, we focus on passive remote sensing of the atmosphere, where scatterer retrieval can benefit modeling and forecasting of weather, climate, and pollution.

Thursday, Feb 01, 2018, 12:15
Vision and AI, Room 1
Speaker: Justin Solomon
Title: Geometric Optimization Algorithms for Variational Problems
Abstract:

Variational problems in geometry, fabrication, learning, and related applications lead to structured numerical optimization problems for which generic algorithms exhibit inefficient or unstable performance. Rather than viewing the particularities of these problems as barriers to scalability, in this talk we embrace them as added structure that can be leveraged to design large-scale and efficient techniques specialized to applications with geometrically structured variables. We explore this theme through the lens of case studies in surface parameterization, optimal transport, and multi-objective design.

Thursday, Jan 25, 2018, 12:15
Vision and AI, Room 1
Speaker: Hallel Bunis
Title: Caging Polygonal Objects Using Minimalistic Three-Finger Hands
Abstract:

Multi-finger caging offers a robust approach to object grasping. To securely grasp an object, the fingers are first placed in caging regions surrounding a desired immobilizing grasp. This prevents the object from escaping the hand, and allows for great position uncertainty of the fingers relative to the object. The hand is then closed until the desired immobilizing grasp is reached.

While efficient computation of two-finger caging grasps for polygonal objects is well developed, the computation of three-finger caging grasps has remained a challenging open problem. We will discuss the caging of polygonal objects using three-finger hands that maintain similar triangle finger formations during the grasping process. While the configuration space of such hands is four dimensional, their contact space which represents all two and three finger contacts along the grasped object's boundary forms a two-dimensional stratified manifold.

We will present a caging graph that can be constructed in the hand's relatively simple contact space. Starting from a desired immobilizing grasp of the object by a specific triangular finger formation, the caging graph is searched for the largest formation scale value that ensures a three-finger cage about the object. This value determines the caging regions, and if the formation scale is kept below this value, any finger placement within the caging regions will guarantee a robust grasp of the object.

Thursday, Jan 18, 2018, 12:15
Vision and AI, Room 1
Speaker: Sagie Benaim
Title: One-Sided Unsupervised Domain Mapping via Distance Correlations
Abstract:

In unsupervised domain mapping, the learner is given two unmatched datasets A and B. The goal is to learn a mapping G_AB that translates a sample in A to the analogous sample in B. Recent approaches have shown that when learning simultaneously both G_AB and the inverse mapping G_BA, convincing mappings are obtained. In this work, we present a method of learning G_AB without learning G_BA. This is done by learning a mapping that maintains the distance between a pair of samples. Moreover, good mappings are obtained, even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results showing that the new method not only allows for one-sided mapping learning, but also leads to preferable numerical results over the existing circularity-based constraint.
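
The distance-maintaining idea can be illustrated with a small sketch. It measures how much a mapping changes pairwise L1 distances within a batch of samples. The names `l1_distance` and `distance_preservation_loss` are hypothetical, samples are plain lists of numbers, and any normalization of distances across the two domains is omitted, so this is a simplified illustration and not the loss used in the work.

```python
def l1_distance(u, v):
    """L1 distance between two samples given as equal-length lists of numbers."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_preservation_loss(samples, mapping):
    """Mean absolute change in pairwise L1 distance when `mapping` is applied.

    A value of zero means the mapping keeps every pairwise distance intact.
    """
    mapped = [mapping(x) for x in samples]
    total, pairs = 0.0, 0
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            before = l1_distance(samples[i], samples[j])
            after = l1_distance(mapped[i], mapped[j])
            total += abs(before - after)
            pairs += 1
    return total / pairs
```

In a learning setting, a term of this kind would be minimized over the parameters of the mapping alongside an adversarial term, with no need for the inverse mapping.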

Thursday, Jan 11, 2018, 12:15
Vision and AI, Room 1
Speaker: Oren Salzman
Title: Computational Challenges and Algorithms in Planning for Robotic Systems
Abstract:

In recent years, robots have played an active role in everyday life: medical robots assist in complex surgeries, low-cost commercial robots clean houses and fleets of robots are used to efficiently manage warehouses. A key challenge in these systems is motion planning, where we are interested in planning a collision-free path for a robot in an environment cluttered with obstacles. While the general problem has been studied for several decades now, these new applications introduce an abundance of new challenges.

In this talk I will describe some of these challenges as well as algorithms developed to address them. I will review general challenges such as compression and graph-search algorithms in the context of motion planning. I will show why traditional Computer Science tools are ill-suited for these problems and introduce alternative algorithms that leverage the unique characteristics of robot motion planning. In addition, I will describe domain-specific challenges, such as those that arise when planning for assistive robots and for humanoid robots, and review algorithms tailored for these specific domains.

Thursday, Jan 04, 2018, 12:15
Vision and AI, Room 1
Speaker: Guy Gilboa
Title: Processing images using nonlinear transforms
Abstract:
Recent studies of convex functionals and their related nonlinear eigenvalue problems show surprising analogies to harmonic analysis based on classical transforms (e.g. Fourier). In this talk the total-variation transform will be introduced along with some theoretical results. Applications related to image decomposition, texture processing and face fusion will be shown. Extensions to graphs and a new interpretation of gradient descent will also be discussed.

Thursday, Dec 28, 2017, 12:15
Vision and AI, Room 1
Speaker: Greg Shakhnarovich
Title: Discriminability loss for learning to generate descriptive image captions
Abstract:

Image captioning -- automatic production of text describing a visual scene -- has received a lot of attention recently. However, the objective of captioning, evaluation metrics, and the training protocol remain somewhat unsettled. The general goal seems to be for machines to describe visual signals as humans do. We pursue this goal by incorporating a discriminability loss in training caption generators. This loss is explicitly "aware" of the need for the caption to convey information, rather than appear fluent or reflect word distribution in the human captions. Specifically, the loss in our work is tied to discriminative tasks: producing a referring expression (text that allows a recipient to identify a region in the given image) or producing a discriminative caption which allows the recipient to identify an image within a set of images. In both projects, use of the discriminability loss does not require any additional human annotations, and relies on collaborative training between the caption generator and a comprehension model, which is a proxy for a human recipient. In experiments on standard benchmarks, we show that adding discriminability objectives not only improves the discriminative quality of the generated image captions, but, perhaps surprisingly, also makes the captions better under a variety of traditional metrics.

Joint work with Ruotian Luo (TTIC), Brian Price and Scott Cohen (Adobe).

Thursday, Dec 21, 2017, 12:15
Vision and AI, Room 1
Speaker: Assaf Shocher
Title: “Zero-Shot” Super-Resolution using Deep Internal Learning
Abstract:

Deep Learning has led to a dramatic leap in Super-Resolution (SR) performance in the past few years. However, being supervised, these SR methods are restricted to specific training data, where the acquisition of the low-resolution (LR) images from their high-resolution (HR) counterparts is predetermined (e.g., bicubic downscaling), without any distracting artifacts (e.g., sensor noise, image compression, non-ideal PSF, etc.). Real LR images, however, rarely obey these restrictions, resulting in poor SR results by SotA (State of the Art) methods. In this paper we introduce "Zero-Shot" SR, which exploits the power of Deep Learning, but does not rely on prior training. We exploit the internal recurrence of information inside a single image, and train a small image-specific CNN at test time, on examples extracted solely from the input image itself. As such, it can adapt itself to different settings per image. This allows SR to be performed on real old photos, noisy images, biological data, and other images where the acquisition process is unknown or non-ideal. On such images, our method outperforms SotA CNN-based SR methods, as well as previous unsupervised SR methods. To the best of our knowledge, this is the first unsupervised CNN-based SR method.
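
The phrase "examples extracted solely from the input image itself" can be made concrete with a small sketch: the input image is downscaled, and the pair (downscaled image, input image) serves as a low-resolution/high-resolution training example. The 2x2 averaging kernel, the nested-list image format, and the names `downscale_by_2` and `internal_training_pair` are assumptions made for illustration; the image-specific CNN, augmentations, and the actual downscaling kernel are not shown.

```python
def downscale_by_2(image):
    """Average each 2x2 block of a grayscale image (list of rows).

    Assumes the height and width are both even.
    """
    h, w = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]

def internal_training_pair(image):
    """Return (low_res_input, high_res_target), both derived from `image` alone."""
    return downscale_by_2(image), image
```

A network trained to map the first element of such pairs to the second can then be applied to the input image itself to produce a higher-resolution output.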

Thursday, Dec 07, 2017, 12:15
Vision and AI, Room 1
Speaker: Dotan Kaufman
Title: Temporal Tessellation: A Unified Approach for Video Analysis
Abstract:

We present a general approach to video understanding, inspired by semantic transfer techniques that have been successfully used for 2D image analysis. Our method considers a video to be a 1D sequence of clips, each one associated with its own semantics. The nature of these semantics -- natural language captions or other labels -- depends on the task at hand. A test video is processed by forming correspondences between its clips and the clips of reference videos with known semantics, following which, reference semantics can be transferred to the test video. We describe two matching methods, both designed to ensure that (a) reference clips appear similar to test clips and (b), taken together, the semantics of the selected reference clips is consistent and maintains temporal coherence. We use our method for video captioning on the LSMDC'16 benchmark, video summarization on the SumMe and TVSum benchmarks, Temporal Action Detection on the Thumos2014 benchmark, and sound prediction on the Greatest Hits benchmark. Our method not only surpasses the state of the art, in four out of five benchmarks, but importantly, it is the only single method we know of that was successfully applied to such a diverse range of tasks.
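
One standard way to balance the two criteria, per-clip similarity (a) and temporal coherence (b), is dynamic programming over the clip sequence. The sketch below is a generic Viterbi-style selection and is offered only as an illustration of that trade-off; the cost tables `unary` and `pairwise` and the function name `tessellate` are hypothetical, and the abstract does not specify how the two matching methods are implemented.

```python
def tessellate(unary, pairwise):
    """Pick one reference clip per test clip, minimizing the total cost.

    unary[t][r]:    dissimilarity between test clip t and reference clip r.
    pairwise[r][s]: incoherence cost of following reference clip r with s.
    Returns the chosen reference index for every test clip.
    """
    num_clips, num_refs = len(unary), len(unary[0])
    cost = list(unary[0])  # best cost of any path ending in each reference
    back = []              # back-pointers, one list per time step after the first
    for t in range(1, num_clips):
        new_cost, pointers = [], []
        for s in range(num_refs):
            best_r = min(range(num_refs), key=lambda r: cost[r] + pairwise[r][s])
            new_cost.append(cost[best_r] + pairwise[best_r][s] + unary[t][s])
            pointers.append(best_r)
        cost = new_cost
        back.append(pointers)
    # Trace the cheapest path backwards from the last clip.
    path = [min(range(num_refs), key=lambda r: cost[r])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a zero `pairwise` table the selection reduces to independent nearest-neighbor matching per clip; a large switching cost forces the chosen references to stay consistent over time.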

Tuesday, Dec 05, 2017, 11:00
Vision and AI, Room 290C
Speaker: Amit Bermano
Title: Geometry Processing Methods and Their Real-Life Applications
NOTE UNUSUAL DAY AND TIME
Abstract:

Digital geometry processing (DGP) is one of the core topics of computer graphics, and has been an active line of research for over two decades. On one hand, the field introduces theoretical studies in topics such as vector-field design, preservative maps and deformation theory. On the other hand, the tools and algorithms developed by this community are applicable in fields ranging from computer-aided design, to multimedia, to computational biology and medical imaging. Throughout my work, I have sought to bridge the gap between the theoretical aspects of DGP and their applications. In this talk, I will demonstrate how DGP concepts can be leveraged to facilitate real-life applications with the right adaptation. More specifically, I will portray how I have employed deformation theory to support problems in animation and augmented reality. I will share my thoughts and the first steps taken to enlist DGP to the aid of machine learning, and perhaps most excitingly, I will discuss my own and the graphics community's contributions to the computational fabrication field, as well as my vision for its future.

Bio: Dr. Amit H. Bermano is a postdoctoral researcher at the Princeton Graphics Group, hosted by Professor Szymon Rusinkiewicz and Professor Thomas Funkhouser. Previously, he was a postdoctoral researcher at Disney Research Zurich in the computational materials group, led by Dr. Bernhard Thomaszewski. He conducted his doctoral studies at ETH Zurich under the supervision of Prof. Dr. Markus Gross, in collaboration with Disney Research Zurich. His Master's and Bachelor's degrees were obtained at The Technion - Israel Institute of Technology under the supervision of Prof. Craig Gotsman. His research focuses on connecting the geometry processing field with other fields in computer graphics and vision, mainly by using geometric methods to facilitate other applications. His interests in this context include computational fabrication, animation, augmented reality, medical imaging and machine learning.

Thursday, Nov 23, 2017, 12:15
Vision and AI, Room 1
Speaker: Aviv Gabbay
Title: Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Abstract:

Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments, using a single microphone. For example, video conferences from home or office are disturbed by other voices, TV reporting from city streets is mixed with traffic noise, etc. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. Face motions captured in the video are used to estimate the speaker's voice, which is applied as a filter on the input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model.

In the first part of this talk, I will describe a few techniques to predict speech signals from a silent video of a speaking person. In the second part of the talk, I will propose a method to separate overlapping speech of several people speaking simultaneously (known as the cocktail-party problem), based on the speech predictions generated by a video-to-speech system.

Wednesday, Nov 01, 2017, 11:15
Vision and AI, Room 1
Speaker: Tal Hassner
Title: A Decade of Faces in the Wild
JOINT VISION AND MACHINE LEARNING SEMINAR
Abstract:
Faces are undoubtedly one of the most rigorously studied object classes in computer vision and recognizing faces from their pictures is one of the classic problems of the field. Fueled by applications ranging from biometrics and security to entertainment and commerce, massive research efforts were directed at this problem from both academia and industry. As a result, machine capabilities rose to the point where face recognition systems now claim to surpass even the human visual system. My own work on this problem began nearly a decade ago. At that time, the community shifted its interests from the (largely) solved problem of recognizing faces appearing in controlled, high quality images to images taken in the wild, where no control is assumed over how the faces are viewed. In this talk, I will provide my perspectives on this problem and the solutions proposed to solve it. I will discuss the rationale which drove the design of our methods, their limitations, and breakthroughs. In particular, I will show how classical computer vision methods and, surprisingly, elementary computer graphics, work together with modern deep learning in the design of our state of the art face recognition methods.

Monday, Sep 04, 2017, 14:00
Vision and AI, Room 1
Speaker: Ita Lifshitz
Title: Hand-object interaction: a step towards action recognition
NOTE THE UNUSUAL TIME AND DAY
Abstract:

When dealing with a highly variable problem such as action recognition, focusing on a small area, such as the hand's region, makes the problem more manageable, and enables us to invest the relatively high amount of resources needed for interpretation in a small but highly informative area of the image. In order to detect this region of interest in the image and properly analyze it, I have built a process that includes several steps, starting with a state-of-the-art hand detector, incorporating both detection of the hand by appearance and by estimation of human body pose. The hand detector is built upon a fully convolutional neural network, detecting hands efficiently and accurately. The human body pose estimation starts with a state-of-the-art head detector and continues with a novel approach where each location in the image votes for the position of each body keypoint, utilizing information from the whole image. Using dense, multi-target votes enables us to compute image-dependent joint keypoint probabilities by looking at consensus voting, and to accurately estimate the body pose. Once the detection of the hands is complete, an additional step of segmentation of the hand and fingers is made. In this step each hand pixel in the image is labeled using a dense fully convolutional network. Finally, an additional step is made to segment and identify the held object. Understanding the hand-object interaction is an important step toward understanding the action taking place in the image. These steps enable us to perform fine interpretation of hand-object interaction images as an essential step towards understanding the human-object interaction and recognizing human activities.

Thursday, Jul 06, 2017, 12:15
Vision and AI, Room 1
Speaker: Tammy Riklin-Raviv
Title: Big data - small training sets: biomedical image analysis bottlenecks, some strategies and applications
Abstract:

Recent progress in imaging technologies leads to a continuous growth in biomedical data, which can provide better insight into important clinical and biological questions. Advanced machine learning techniques, such as artificial neural networks, are brought to bear on addressing fundamental medical image computing challenges such as segmentation, classification and reconstruction, required for meaningful analysis of the data. Nevertheless, the main bottleneck, which is the lack of annotated examples or 'ground truth' to be used for training, still remains.

In my talk, I will give a brief overview of some biomedical image analysis problems we aim to address, and suggest how prior information about the problem at hand can be utilized to compensate for insufficient or even absent ground-truth data. I will then present a framework based on deep neural networks for the denoising of dynamic contrast-enhanced MRI (DCE-MRI) sequences of the brain. DCE-MRI is an imaging protocol in which MRI scans are acquired repetitively throughout the injection of a contrast agent, and it is mainly used for quantitative assessment of blood-brain barrier (BBB) permeability. BBB dysfunctionality is associated with numerous brain pathologies including stroke, tumor, traumatic brain injury, and epilepsy. Existing techniques for DCE-MRI analysis are error-prone as the dynamic scans are subject to non-white, spatially-dependent and anisotropic noise. To address DCE-MRI denoising challenges we use an ensemble of expert DNNs constructed as deep autoencoders, where each is trained on a specific subset of the input space to accommodate different noise characteristics and dynamic patterns. Since clean DCE-MRI sequences (ground truth) for training are not available, we present a sampling scheme for generating realistic training sets with nonlinear dynamics that faithfully model clean DCE-MRI data and account for spatial similarities. The proposed approach has been successfully applied to full and even temporally down-sampled DCE-MRI sequences, from two different databases, of stroke and brain tumor patients, and is shown to compare favorably to state-of-the-art denoising methods.

Thursday, Jun 29, 2017, 12:15
Vision and AI, Room 1
Speaker: Shai Avidan
Title: Co-occurrence Filter
Abstract:
Co-occurrence Filter (CoF) is a boundary preserving filter. It is based on the Bilateral Filter (BF) but instead of using a Gaussian on the range values to preserve edges it relies on a co-occurrence matrix. Pixel values that co-occur frequently in the image (i.e., inside textured regions) will have a high weight in the co-occurrence matrix. This, in turn, means that such pixel pairs will be averaged and hence smoothed, regardless of their intensity differences. On the other hand, pixel values that rarely co-occur (i.e., across texture boundaries) will have a low weight in the co-occurrence matrix. As a result, they will not be averaged and the boundary between them will be preserved. The CoF therefore extends the BF to deal with boundaries, not just edges. It learns co-occurrences directly from the image. We can achieve various filtering results by directing it to learn the co-occurrence matrix from a part of the image, or a different image. We give the definition of the filter, discuss how to use it with color images and show several use cases. Joint work with Roy Jevnisek
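
The weighting scheme described above can be sketched on a 1D signal of quantized values. This is a simplified illustration under several assumptions: a uniform spatial window replaces the Gaussian spatial weight of a bilateral-style filter, co-occurrence counts are normalized by the frequency of each value, and the names `cooccurrence_matrix` and `cooccurrence_filter` are hypothetical.

```python
from collections import Counter

def cooccurrence_matrix(signal, radius):
    """Weight for each value pair: how often the two values appear within
    `radius` samples of each other, divided by how common each value is."""
    pair_counts = Counter()
    value_counts = Counter(signal)
    n = len(signal)
    for p in range(n):
        for q in range(max(0, p - radius), min(n, p + radius + 1)):
            pair_counts[(signal[p], signal[q])] += 1
    return {
        (a, b): count / (value_counts[a] * value_counts[b])
        for (a, b), count in pair_counts.items()
    }

def cooccurrence_filter(signal, radius):
    """Replace each sample by an average of its neighbors, weighted by how
    frequently the neighbor's value co-occurs with the sample's own value."""
    weights = cooccurrence_matrix(signal, radius)
    n = len(signal)
    output = []
    for p in range(n):
        num = den = 0.0
        for q in range(max(0, p - radius), min(n, p + radius + 1)):
            w = weights[(signal[p], signal[q])]
            num += w * signal[q]
            den += w
        output.append(num / den)
    return output
```

On a signal made of two alternating "textures", for example `[0, 1] * 4 + [10, 11] * 4`, samples inside each texture are smoothed together, while a sample next to the texture boundary stays close to its own texture because values from the other side rarely co-occur with it.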

Thursday, Jun 22, 2017, 12:15
Vision and AI, Room 1
Speaker: Haggai Maron
Title: Convolutional Neural Networks on Surfaces via Seamless Toric Covers
Abstract:

The recent success of convolutional neural networks (CNNs) for image processing tasks is inspiring research efforts attempting to achieve similar success for geometric tasks. One of the main challenges in applying CNNs to surfaces is defining a natural convolution operator on surfaces. In this paper we present a method for applying deep learning to sphere-type shapes using a global seamless parameterization to a planar flat-torus, for which the convolution operator is well defined. As a result, the standard deep learning framework can be readily applied for learning semantic, high-level properties of the shape. An indication of our success in bridging the gap between images and surfaces is the fact that our algorithm succeeds in learning semantic information from an input of raw low-dimensional feature vectors. 

We demonstrate the usefulness of our approach by presenting two applications: human body segmentation, and automatic landmark detection on anatomical surfaces. We show that our algorithm compares favorably with competing geometric deep-learning algorithms for segmentation tasks, and is able to produce meaningful correspondences on anatomical surfaces where hand-crafted features are bound to fail.

Joint work with: Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G. Kim and Yaron Lipman.

Thursday, Jun 15, 2017, 12:15
Vision and AI, Room 1
Speaker: Ron Kimmel
Title: On Learning Invariants and Representation Spaces of Shapes and Forms
Abstract:
We study the power of the Laplace Beltrami Operator (LBO) in processing and analyzing geometric information. The decomposition of the LBO at one end, and the heat operator at the other end provide us with efficient tools for dealing with images and shapes. Denoising, segmenting, filtering, exaggerating are just a few of the problems for which the LBO provides an efficient solution. We review the optimality of a truncated basis provided by the LBO, and a selection of relevant metrics by which such optimal bases are constructed. A specific example is the scale invariant metric for surfaces that we argue to be a natural selection for the study of articulated shapes and forms. In contrast to geometry understanding there is a new emerging field of deep learning. Learning systems are rapidly dominating the areas of audio, textual, and visual analysis. Recent efforts to convert these successes over to geometry processing indicate that encoding geometric intuition into modeling, training, and testing is a non-trivial task. It appears as if approaches based on geometric understanding are orthogonal to those of data-heavy computational learning. We propose to unify these two methodologies by computationally learning geometric representations and invariants and thereby take a small step towards a new perspective on geometry processing. I will present examples of shape matching, facial surface reconstruction from a single image, reading facial expressions, shape representation, and finally definition and computation of invariant operators and signatures.

Thursday, Jun 08, 2017, 12:15
Vision and AI, Room 1
Speaker: Nadav Cohen
Title: Expressive Efficiency and Inductive Bias of Convolutional Networks: Analysis and Design through Hierarchical Tensor Decompositions
JOINT VISION AND MACHINE LEARNING SEMINAR
Abstract:
The driving force behind convolutional networks, the most successful deep learning architecture to date, is their expressive power. Although this belief is widely accepted and backed by vast empirical evidence, formal analyses supporting it are scarce. The primary notions for formally reasoning about expressiveness are efficiency and inductive bias. Efficiency refers to the ability of a network architecture to realize functions that require an alternative architecture to be much larger. Inductive bias refers to the prioritization of some functions over others given prior knowledge regarding a task at hand. Through an equivalence to hierarchical tensor decompositions, we study the expressive efficiency and inductive bias of various architectural features in convolutional networks (depth, width, pooling geometry and more). Our results shed light on the demonstrated effectiveness of convolutional networks, and in addition, provide new tools for network design. The talk is based on a series of works published in COLT, ICML, CVPR and ICLR (as well as several new preprints), with collaborators Or Sharir, Ronen Tamari, David Yakira, Yoav Levine and Amnon Shashua.
Thursday, Jun 01, 2017, 12:15
Vision and AI, Room 1
Speaker: Nir Sharon
Title: Synchronization over Cartan motion groups
Abstract:
The mathematical problem of group synchronization deals with the question of how to estimate unknown group elements from a set of their mutual relations. This problem appears as an important step in solving many real-world problems in vision, robotics, tomography, and more. In this talk, we present a novel solution for synchronization over the class of Cartan motion groups, which includes the important special case of rigid motions. Our method is based on the idea of group contraction, an algebraic notion originating in relativistic mechanics.
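To make the synchronization problem concrete, here is a minimal sketch for the simplest case: planar rotations, where each unknown is a unit complex number. It uses the well-known spectral (top-eigenvector) relaxation, not the contraction-based method of the talk, and the function name `synchronize_angles` and the noiseless test data are assumptions made for illustration.

```python
import numpy as np

def synchronize_angles(H):
    """Estimate unit phases z_i from pairwise ratios H[i, j] ~ z_i * conj(z_j)."""
    # H is Hermitian, so eigh applies; eigenvalues come back in ascending order.
    _, vecs = np.linalg.eigh(H)
    top = vecs[:, -1]
    # Project each entry of the top eigenvector back onto the unit circle.
    return top / np.abs(top)

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=8)
z = np.exp(1j * theta)              # unknown group elements
H = np.outer(z, z.conj())           # noiseless mutual relations z_i * conj(z_j)

z_hat = synchronize_angles(H)
# The estimate is defined only up to a global rotation, so align on element 0.
aligned = z_hat * (z[0] / z_hat[0])
```

With noisy or missing measurements the same eigenvector computation still gives an estimate, but recovery is then approximate rather than exact.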
Thursday, May 25, 2017, 12:15
Vision and AI, Room 1
Speaker: Rafi Malach
Title: Neuronal "Ignitions" underlying stable representations in a dynamic visual environment
Abstract:
The external world is in a constant state of flow, posing a major challenge to neuronal representations of the visual system, which require sufficient time for integration and perceptual decisions. In my talk I will discuss the hypothesis that one solution to this challenge is implemented by breaking the neuronal responses into a series of discrete and stable states. I will propose that these stable points are likely implemented through relatively long-lasting "ignitions" of recurrent neuronal activity. Such ignitions are a prerequisite for the emergence of a perceptual image in the mind of the observer. The self-sustained nature of the ignitions endows them with stability despite the dynamically changing inputs. Results from intracranial recordings in patients, conducted for clinical diagnostic purposes, during rapid stimulus presentations, ecological settings, blinks and saccadic eye movements will be presented in support of this hypothesis.
Thursday, May 18, 2017, 12:15
Vision and AI, Room 1
Speaker: Michael Elad
Title: Regularization by Denoising (RED)
Abstract:

Image denoising is the most fundamental problem in image enhancement, and it is largely solved: it has reached impressive heights in performance and quality -- almost as good as it can ever get. But interestingly, it turns out that we can solve many other problems using the image denoising "engine". I will describe the Regularization by Denoising (RED) framework: using the denoising engine in defining the regularization of any inverse problem. The idea is to define an explicit image-adaptive regularization functional directly using a high-performance denoiser. Surprisingly, the resulting regularizer is guaranteed to be convex, and the overall objective functional is explicit, clear and well-defined. With complete flexibility to choose the iterative optimization procedure for minimizing this functional, RED can incorporate any image denoising algorithm as a regularizer, treat general inverse problems very effectively, and is guaranteed to converge to the globally optimal result.
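As a rough illustration of the idea, the sketch below runs steepest descent on a RED-style objective for the simplest inverse problem, plain denoising of a 1D signal. It assumes the regularizer takes the form 0.5 * x^T (x - f(x)), whose gradient is x - f(x) under the conditions of the framework, and it substitutes a 3-tap moving average for the high-performance denoiser. The names `denoise` and `red_denoise`, the weight `lam` and the step size are all illustrative choices, not values from the talk.

```python
import numpy as np

def denoise(x):
    """Stand-in denoiser f(x): 3-tap moving average with periodic boundaries."""
    return (np.roll(x, 1) + x + np.roll(x, -1)) / 3.0

def red_denoise(y, lam=5.0, step=0.1, iters=300):
    """Steepest descent on 0.5*||x - y||^2 + 0.5*lam * x^T (x - f(x))."""
    x = y.copy()
    for _ in range(iters):
        # Data-term gradient plus the regularizer gradient lam * (x - f(x)).
        grad = (x - y) + lam * (x - denoise(x))
        x = x - step * grad
    return x

n = 200
t = np.arange(n)
clean = np.sin(2.0 * np.pi * 2.0 * t / n)       # two full periods
rng = np.random.default_rng(0)
noisy = clean + 0.3 * rng.standard_normal(n)
restored = red_denoise(noisy)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_restored = np.mean((restored - clean) ** 2)
```

For other inverse problems the data term would involve the degradation operator, and the moving average would be replaced by a stronger denoiser; the update keeps the same shape.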

* Joint work with Peyman Milanfar (Google Research) and Yaniv Romano (EE-Technion).

Thursday, Apr 27, 2017, 12:15
Vision and AI, Room 1
Speaker: Tamar Flash
Title: Motion compositionality and timing: combined geometrical and optimization approaches
Abstract:
In my talk I will discuss several recent research directions that we have taken to explore the different principles underlying the construction and control of complex human upper arm and gait movements. One important topic is motor compositionality, exploring the nature of the motor primitives underlying the construction of complex movements at different levels of the motor hierarchy. The second topic we focused on is motion timing, investigating what principles dictate the durations of complex sequential behaviors, both at the level of the internal timing of different motion segments and the total durations of different types of movement. Finally, I will discuss the topic of motor coordination and the mapping between end-effector and joint motions during both arm and leg movements, using various dimension reduction approaches. The mathematical models we have used to study the above topics combine geometrical approaches with optimization models to derive motion invariants, optimal control principles and different conservation laws.
Thursday, Apr 20, 2017, 12:15
Vision and AI, Room 1
Speaker: Lihi Zelnik-Manor
Title: Separating the Wheat from the Chaff in Visual Data
Abstract:
By far, most of the bits in the world are image and video data. YouTube alone gets 300 hours of video uploaded every minute. Adding to that personal pictures, videos, TV channels and the gazillions of security cameras shooting 24/7, one quickly sees that the amount of visual data being recorded is colossal. In the first part of this talk I will discuss the problem of "saliency prediction": separating the important parts of images/videos (the "wheat") from the less important ones (the "chaff"). I will review work done over the last decade and its achievements. In the second part of the talk I will discuss one particular application of saliency prediction that our lab is interested in: making images and videos accessible to the visually impaired. Our plan is to convert images and videos into tactile surfaces that can be "viewed" by touch. As it turns out, saliency estimation and manipulation both play a key role in this task.
Thursday, Apr 06, 2017, 12:15
Vision and AI, Room 1
Speaker: Simon Korman
Title: Occlusion-Aware Template Matching via Consensus Set Maximization
Abstract:

We present a novel approach to template matching that is efficient, can handle partial occlusions, and is equipped with provable performance guarantees. A key component of the method is a reduction that transforms the problem of searching a nearest neighbor among N high-dimensional vectors, to searching neighbors among two sets of order sqrt(N) vectors, which can be done efficiently using range search techniques. This allows for a quadratic improvement in search complexity, that makes the method scalable when large search spaces are involved. 
For handling partial occlusions, we develop a hashing scheme based on consensus set maximization within the range search component. The resulting scheme can be seen as a randomized hypothesize-and-test algorithm, that comes with guarantees regarding the number of iterations required for obtaining an optimal solution with high probability. 
The predicted matching rates are validated empirically and the proposed algorithm shows a significant improvement over the state-of-the-art in both speed and robustness to occlusions.
Joint work with Stefano Soatto.

Thursday, Mar 30, 2017, 12:15
Vision and AI, Room 1
Speaker: Lior Wolf
Title: Unsupervised Cross-Domain Image Generation
Abstract:

We study the ecological use of analogies in AI. Specifically, we address the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given representation function f, which accepts inputs in either domain, would remain unchanged. Other than f, the training data is unsupervised and consists of a set of samples from each domain, without any mapping between them. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f-preserving component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains including digits and face images and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity.

Joint work with Yaniv Taigman and Adam Polyak

Thursday, Feb 09, 2017, 12:15
Vision and AI, Room 1
Speaker: Tomer Michaeli
Title: Deformation-aware image processing
Abstract:

Image processing algorithms often involve a data fidelity penalty, which encourages the solution to comply with the input data. Existing fidelity measures (including perceptual ones) are very sensitive to slight misalignments in the locations and shapes of objects. This is in sharp contrast to the human visual system, which is typically indifferent to such variations. In this work, we propose a new error measure, which is insensitive to small smooth deformations and is very simple to incorporate into existing algorithms. We demonstrate our approach in lossy image compression. As we show, optimal encoding under our criterion boils down to determining how to best deform the input image so as to make it "more compressible". Surprisingly, it turns out that very minor deformations (almost imperceptible in some cases) suffice to make a huge visual difference in methods like JPEG and JPEG2000. Thus, by slightly sacrificing geometric integrity, we gain a significant improvement in preservation of visual information.

We also show how our approach can be used to visualize image priors. This is done by determining how images should be deformed so as to best conform to any given image model. By doing so, we highlight the elementary geometric structures to which the prior resonates. Using this method, we reveal interesting behaviors of popular priors, which were not noticed in the past.

Finally, we illustrate how deforming images to possess desired properties can be used for image "idealization" and for detecting deviations from perfect regularity.


Joint work with Tamar Rott Shaham, Tali Dekel, Michal Irani, and Bill Freeman.

Thursday, Jan 26, 2017, 12:15
Vision and AI, Room 1
Speaker: Vardan Papyan
Title: Signal Modeling: From Convolutional Sparse Coding to Convolutional Neural Networks
Abstract:

Within the wide field of sparse approximation, convolutional sparse coding (CSC) has gained increasing attention in recent years. This model assumes a structured-dictionary built as a union of banded Circulant matrices. Most attention has been devoted to the practical side of CSC, proposing efficient algorithms for the pursuit problem, and identifying applications that benefit from this model. Interestingly, a systematic theoretical understanding of CSC seems to have been left aside, with the assumption that the existing classical results are sufficient.
In this talk we start by presenting a novel analysis of the CSC model and its associated pursuit. Our study is based on the observation that while being global, this model can be characterized and analyzed locally. We show that uniqueness of the representation, its stability with respect to noise, and successful greedy or convex recovery are all guaranteed assuming that the underlying representation is locally sparse. These new results are much stronger and more informative than those obtained by deploying the classical sparse theory.
Armed with these new insights, we proceed by proposing a multi-layer extension of this model, ML-CSC, in which signals are assumed to emerge from a cascade of CSC layers. This, in turn, is shown to be tightly connected to Convolutional Neural Networks (CNN), so much so that the forward-pass of the CNN is in fact the Thresholding pursuit serving the ML-CSC model. This connection brings a fresh view to CNN, as we are able to attribute to this architecture theoretical claims such as uniqueness of the representations throughout the network, and their stable estimation, all guaranteed under simple local sparsity conditions. Lastly, identifying the weaknesses in the above scheme, we propose an alternative to the forward-pass algorithm, which is both tightly connected to deconvolutional and recurrent neural networks, and has better theoretical guarantees.
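The claimed equivalence between the CNN forward pass and layered thresholding can be checked numerically in a toy setting. The sketch below is an assumption-laden illustration: dense random matrices stand in for the convolutional dictionaries, the thresholds are arbitrary, and the function names are made up. It shows that one-sided (non-negative) soft thresholding with threshold beta is the same operation as a ReLU with bias -beta.

```python
import numpy as np

def soft_nonneg_threshold(z, beta):
    """Keep the part of z that exceeds the threshold beta; zero out the rest."""
    return np.maximum(z - beta, 0.0)

def relu(z):
    return np.maximum(z, 0.0)

def layered_thresholding(x, dictionaries, thresholds):
    """Estimate representations layer by layer: gamma_i = S+(D_i^T gamma_{i-1})."""
    gamma = x
    for D, beta in zip(dictionaries, thresholds):
        gamma = soft_nonneg_threshold(D.T @ gamma, beta)
    return gamma

def cnn_forward(x, weights, biases):
    """Plain feed-forward pass: a_i = ReLU(W_i^T a_{i-1} + b_i)."""
    a = x
    for W, b in zip(weights, biases):
        a = relu(W.T @ a + b)
    return a

rng = np.random.default_rng(0)
D1 = rng.standard_normal((16, 32))   # dense stand-in for a layer-1 dictionary
D2 = rng.standard_normal((32, 24))   # dense stand-in for a layer-2 dictionary
betas = [0.5, 0.2]
x = rng.standard_normal(16)

gamma = layered_thresholding(x, [D1, D2], betas)
activations = cnn_forward(x, [D1, D2], [-b for b in betas])
```

In this reading the network weights play the role of the dictionaries and the biases play the role of (negated) thresholds.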

Thursday, Jan 19, 2017, 12:15
Vision and AI, Room 1
Speaker: David Held
Title: Robots in Clutter: Learning to Understand Environmental Changes
Abstract:
Robots today are confined to operate in relatively simple, controlled environments. One reason for this is that current methods for processing visual data tend to break down when faced with occlusions, viewpoint changes, poor lighting, and other challenging but common situations that occur when robots are placed in the real world. I will show that we can train robots to handle these variations by modeling the causes behind visual appearance changes. If robots can learn how the world changes over time, they can be robust to the types of changes that objects often undergo. I demonstrate this idea in the context of autonomous driving, and I will show how we can use this idea to improve performance for every step of the robotic perception pipeline: object segmentation, tracking, velocity estimation, and classification. I will also present some preliminary work on learning to manipulate objects, using a similar framework of learning environmental changes. By learning how the environment can change over time, we can enable robots to operate in the complex, cluttered environments of our daily lives.
Thursday, Jan 05, 2017, 12:15
Vision and AI, Room 1
Speaker: Shai Avidan
Title: Taking Pictures in Scattering Media
Abstract:
Pictures taken under bad weather conditions or underwater often suffer from low contrast and limited visibility. Restoring colors of images taken in such conditions is extremely important for consumer applications, computer vision tasks, and marine research. The common physical phenomena in these scenarios are scattering and absorption - the imaging is done either under water, or in a medium that contains suspended particles, e.g. dust (haze) and water droplets (fog). As a result, the colors of captured objects are attenuated, as well as veiled by light scattered by the suspended particles. The amount of attenuation and scattering depends on the objects' distance from the camera and therefore the color distortion cannot be globally corrected. We propose a new prior, termed Haze-Line, and use it to correct these types of images. First, we show how it can be used to clean images taken under bad weather conditions such as haze or fog. Then we show how to use it to automatically estimate the air light. Finally, we extend it to deal with underwater images as well. The proposed algorithm is completely automatic and quite efficient in practice. Joint work with Dana Berman (TAU) and Tali Treibitz (U. of Haifa).
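The distance-dependent attenuation and veiling described above are commonly written as I = t*J + (1 - t)*A, with transmission t = exp(-beta * depth), scene radiance J and air light A. The sketch below implements only this standard image formation model and its inversion when t and A are known; it does not implement the Haze-Line prior, and the single scattering coefficient `beta`, the clamp `t_min` and the function names are assumptions for illustration.

```python
import numpy as np

def add_haze(scene, depth, airlight, beta=1.0):
    """Apply I = t*J + (1 - t)*A with transmission t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)[..., None]      # shape (H, W, 1), broadcast over color
    return t * scene + (1.0 - t) * airlight, t

def remove_haze(hazy, t, airlight, t_min=0.1):
    """Invert the model given t and A; clamp t so noise is not amplified."""
    t = np.maximum(t, t_min)
    return (hazy - airlight) / t + airlight

rng = np.random.default_rng(0)
scene = rng.uniform(0.0, 1.0, size=(4, 5, 3))     # toy RGB image J
depth = rng.uniform(0.2, 2.0, size=(4, 5))        # per-pixel distance
airlight = np.array([0.8, 0.85, 0.9])             # global air light A

hazy, t = add_haze(scene, depth, airlight)
recovered = remove_haze(hazy, t, airlight)
```

Because t varies per pixel with depth, no single global color correction can undo the distortion, which is why estimating t and A is the hard part in practice.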
Thursday, Dec 22, 2016, 12:15
Vision and AI, Room 1
Speaker: Greg Shakhnarovich
Title: Image colorization and its role in visual learning
Abstract:
I will present our recent and ongoing work on fully automatic image colorization. Our approach exploits both low-level and semantic representations during colorization. As many scene elements naturally appear according to multimodal color distributions, we train our model to predict per-pixel color histograms. This intermediate output can be used to automatically generate a color image, or further manipulated prior to image formation to "push" the image in a desired direction. Our system achieves state-of-the-art results under a variety of metrics. Moreover, it provides a vehicle to explore the role the colorization task can play as a proxy for visual understanding, providing a self-supervision mechanism for learning representations. I will describe the ability of our self-supervised network in several contexts, such as classification and semantic segmentation. On VOC segmentation and classification tasks, we present results that are state-of-the-art among methods not using ImageNet labels for pretraining. Joint work with Gustav Larsson and Michael Maire.
Thursday, Dec 15, 2016, 12:15
Vision and AI, Room 1
Speaker: Gil Ben-Artzi
Title: Calibration of Multi-Camera Systems by Global Constraints on the Motion of Silhouettes
Abstract:
Computing the epipolar geometry between cameras with very different viewpoints is often problematic, as matching points are hard to find. In these cases, it has been proposed to use information from dynamic objects in the scene for suggesting point and line correspondences. We introduce an approach that improves performance over state-of-the-art methods by two orders of magnitude, by significantly reducing the number of outliers in the putative matches. Our approach is based on (a) a new temporal signature, the motion barcode, which is used to recover corresponding epipolar lines across views, and (b) a formulation of the correspondence problem as constrained flow optimization, requiring small differences between the coordinates of corresponding points over consecutive frames. Our method was validated on four standard datasets, providing accurate calibrations across very different viewpoints.
Thursday, Dec 01, 2016, 12:15
Vision and AI, Room 1
Speaker: Michael (Miki) Lustig
Title: Applications of Subspace and Low-Rank Methods for Dynamic and Multi-Contrast Magnetic Resonance Imaging
Abstract:
There has been much work in recent years to develop methods for recovering signals from insufficient data. One very successful direction is subspace methods, which constrain the data to live in a lower-dimensional space. These approaches are motivated by theoretical results in recovering incomplete low-rank matrices as well as by exploiting the natural redundancy of multidimensional signals. In this talk I will present our research group's efforts in this area. I will start by describing a new decomposition that can represent dynamic images as a sum of multi-scale low-rank matrices, which can very efficiently capture spatial and temporal correlations in multiple scales. I will then describe and show results from applications using subspace and low-rank methods for highly accelerated multi-contrast MR imaging and for the purpose of motion correction.
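A minimal sketch of the basic building block, assuming a dynamic image series is arranged as a pixels-by-frames matrix: truncating its SVD constrains the data to a low-dimensional subspace. This is only the plain global low-rank step, not the multi-scale decomposition of the talk, and the sizes, noise level and function name are illustrative.

```python
import numpy as np

def low_rank_approx(X, rank):
    """Best rank-`rank` approximation of X in the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
n_pixels, n_frames, true_rank = 64, 20, 2
spatial = rng.standard_normal((n_pixels, true_rank))    # spatial basis images
temporal = rng.standard_normal((true_rank, n_frames))   # temporal weights per frame
series = spatial @ temporal                             # exactly rank-2 dynamics
noisy = series + 0.01 * rng.standard_normal(series.shape)

approx = low_rank_approx(noisy, true_rank)
err_noisy = np.linalg.norm(noisy - series)
err_approx = np.linalg.norm(approx - series)
```

Projecting onto the dominant subspace discards most of the noise, which lives in the remaining directions; the same redundancy is what makes recovery from undersampled data possible.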
Monday, Nov 21, 2016, 12:15
Vision and AI, Room 1
Speaker: Emanuele Rodola', Or Litany
Title: Spectral Approaches to Partial Shape Matching
Abstract:
In this talk we will present our recent line of work on (deformable) partial shape correspondence in the spectral domain. We will first introduce Partial Functional Maps (PFM), showing how to robustly formulate the shape correspondence problem under missing geometry with the language of functional maps. We use perturbation analysis to show how removal of shape parts changes the Laplace-Beltrami eigenfunctions, and exploit it as a prior on the spectral representation of the correspondence. We will show further extensions to deal with the presence of clutter (deformable object-in-clutter) and multiple pieces (non-rigid puzzles). In the second part of the talk, we will introduce a novel approach to the same problem which operates completely in the spectral domain, avoiding the cumbersome alternating optimization used in the previous approaches. This allows matching shapes with constant complexity independent of the number of shape vertices, and yields state-of-the-art results on challenging correspondence benchmarks in the presence of partiality and topological noise.
Thursday, Nov 10, 2016, 12:15
Vision and AI, Room 1
Speaker: Yedid Hoshen
Title: End-to-End Learning: Applications in Speech, Vision and Cognition
Abstract:

One of the most exciting possibilities opened by deep neural networks is end-to-end learning: the ability to learn tasks without the need for feature engineering or breaking down into sub-tasks. This talk will present three cases illustrating how end-to-end learning can operate in machine perception across the senses (Hearing, Vision) and even for the entire perception-cognition-action cycle.

The talk begins with speech recognition, showing how acoustic models can be learned end-to-end. This approach skips the feature extraction pipeline, carefully designed for speech recognition over decades.

Proceeding to vision, a novel application is described: identification of photographers of wearable video cameras. Such video was previously considered anonymous as it does not show the photographer.

The talk concludes by presenting a new task, encompassing the full perception-cognition-action cycle: visual learning of arithmetic operations using only pictures of numbers. This is done without using or learning the notions of numbers, digits, and operators.

The talk is based on the following papers:

Speech Acoustic Modeling From Raw Multichannel Waveforms, Y. Hoshen, R.J. Weiss, and K.W. Wilson, ICASSP'15

An Egocentric Look at Video Photographer Identity, Y. Hoshen, S. Peleg, CVPR'16

Visual Learning of Arithmetic Operations, Y. Hoshen, S. Peleg, AAAI'16

Monday, Sep 26, 2016, 14:00
Vision and AI, Room 1
Speaker: Achuta Kadambi
Title: From the Optics Lab to Computer Vision
NOTE UNUSUAL DAY AND TIME
Abstract:

Computer science and optics are usually studied separately -- separate people, in separate departments, meeting at separate conferences. This is changing. The exciting promise of technologies like virtual reality and self-driving cars demands solutions that draw from the best aspects of computer vision, computer graphics, and optics. Previously, it has proved difficult to bridge these communities. For instance, the laboratory setups in optics are often designed to image millimeter-size scenes in a vibration-free darkroom.

This talk is centered around time-of-flight imaging, a growing area of research in computational photography. A time-of-flight camera works by emitting amplitude-modulated (AM) light and performing correlations on the reflected light. The frequency of AM is in the radio frequency range (like a Doppler radar system), but the carrier signal is optical, overcoming diffraction-limited challenges of full RF systems while providing optical contrast. The obvious use of such cameras is to acquire 3D geometry. By spatially, temporally and spectrally coding light transport we show that it may be possible to go "beyond depth", demonstrating new forms of imaging like photography through scattering media, fast relighting of photographs, real-time tracking of occluded objects in the scene (like an object around a corner), and even the potential to distinguish between biological molecules using fluorescence. We discuss the broader impact of this design paradigm on the future of 3D depth sensors, interferometers, computational photography, medical imaging and many other applications.
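The depth-sensing principle described above can be sketched with the common four-sample scheme: the reflected AM signal is correlated against the reference at four phase shifts, the phase delay is recovered with an arctangent, and depth follows from d = c * phase / (4 * pi * f). The sign convention of the correlation samples, the 30 MHz modulation frequency and the function names below are assumptions for illustration, not details from the talk.

```python
import numpy as np

C = 299_792_458.0  # speed of light in m/s

def correlation_samples(depth, freq, amplitude=1.0, offset=0.5):
    """Simulate correlations at reference shifts of 0, 90, 180 and 270 degrees."""
    phase = 4.0 * np.pi * freq * depth / C      # round-trip phase delay
    shifts = np.arange(4) * np.pi / 2.0
    return amplitude * np.cos(phase + shifts) + offset

def depth_from_samples(c, freq):
    """Recover depth from four correlation samples (valid within one phase wrap)."""
    # c[3] - c[1] = 2*A*sin(phase) and c[0] - c[2] = 2*A*cos(phase); offset cancels.
    phase = np.arctan2(c[3] - c[1], c[0] - c[2])
    phase = np.mod(phase, 2.0 * np.pi)
    return C * phase / (4.0 * np.pi * freq)

freq = 30e6                                      # modulation frequency in Hz
samples = correlation_samples(1.5, freq)         # target at 1.5 m
estimate = depth_from_samples(samples, freq)
unambiguous_range = C / (2.0 * freq)             # depths beyond this wrap around
```

The differences cancel both the ambient offset and the unknown amplitude, which is why four samples are a popular choice.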

Thursday, Sep 08, 2016, 12:15
Vision and AI, Room 1
Speaker: Tali Dekel
Title: Exploring and Modifying Spatial Variations in a Single Image
Abstract:
Structures and objects captured in image data are often idealized by the viewer. For example, buildings may seem to be perfectly straight, or repeating structures such as corn kernels may seem almost identical. However, in reality, such flawless behavior hardly exists. The goal in this line of work is to detect the spatial imperfection, i.e., the departure of objects from their idealized models, given only a single image as input, and to render a new image in which the deviations from the model are either reduced or magnified. Reducing the imperfections allows us to idealize/beautify images, and can be used as a graphic tool for creating more visually pleasing images. Alternatively, increasing the spatial irregularities allows us to reveal useful and surprising information that is hard to perceive with the naked eye (such as the sagging of a house's roof). I will consider this problem under two distinct definitions of an idealized model: (i) ideal parametric geometries (e.g., line segments, circles), which can be automatically detected in the input image; (ii) perfect repetitions of structures, which rely on the redundancy of patches in a single image. Each of these models has led to a new algorithm with a wide range of applications in civil engineering, astronomy, design, and materials defects inspection.
Thursday, Aug 04, 2016, 11:30
Vision and AI, Room 1
Speaker: Michael Rabinovich
Title: Scalable Locally Injective Mappings
Abstract:
We present a scalable approach for the optimization of flip-preventing energies in the general context of simplicial mappings, and specifically for mesh parameterization. Our iterative minimization is based on the observation that many distortion energies can be optimized indirectly by minimizing a simpler proxy energy and compensating for the difference with a reweighting scheme. Our algorithm is simple to implement and scales to datasets with millions of faces. We demonstrate our approach for the computation of maps that minimize a conformal or isometric distortion energy, both in two and three dimensions. In addition to mesh parameterization, we show that our algorithm can be applied to mesh deformation and mesh quality improvement.
Thursday, Jul 21, 2016, 12:15
Vision and AI, Room 1
Speaker: Ethan Fetaya
Title: PhD Thesis Defense: Learning with limited supervision
Abstract:
The task of supervised learning, performing predictions based on a given labeled dataset, is well understood theoretically, and many practical algorithms exist for it. In general, the more complex the hypothesis space is, the larger the number of samples we need in order not to overfit. The main issue is that obtaining a large labeled dataset is a costly and tedious process. An interesting and important question is what can be done when only a small amount of labeled data, or none at all, is available. I will go over several approaches: learning with a single positive example, as well as unsupervised representation learning.
Monday, Jul 18, 2016, 11:30
Vision and AI, Room 290C
Speaker: Emanuel A. Lazar
Title: Voronoi topology analysis of structure in spatial point sets
Abstract:
Atomic systems are regularly studied as large sets of point-like particles, and so understanding how particles can be arranged in such systems is a very natural problem. However, aside from perfect crystals and ideal gases, describing this kind of "structure" in an insightful yet tractable manner can be challenging. Analysis of the configuration space of local arrangements of neighbors, with some help from the Borsuk-Ulam theorem, helps explain limitations of continuous metric approaches to this problem, and motivates the use of Voronoi cell topology. Several short examples from materials research help illustrate strengths of this approach.
Thursday, Jul 14, 2016, 12:15
Vision and AI, Room 1
Speaker: Netalee Efrat and Meirav Galun
Title: SIGGRAPH Dry-Runs
Abstract:

This Thursday we will have two SIGGRAPH rehearsal talks in the Vision Seminar, one by Netalee Efrat and one by Meirav Galun. Abstracts are below. Each talk will be about 15 minutes (with NO interruptions), followed by 10 minutes of feedback.

Talk 1 (Netalee Efrat): Cinema 3D: Large scale automultiscopic display

While 3D movies are gaining popularity, viewers in a 3D cinema still need to wear cumbersome glasses in order to enjoy them. Automultiscopic displays provide a better alternative to the display of 3D content, as they present multiple angular images of the same scene without the need for special eyewear. However, automultiscopic displays cannot be directly implemented in a wide cinema setting due to variants of two main problems: (i) The range of angles at which the screen is observed in a large cinema is usually very wide, and there is an unavoidable tradeoff between the range of angular images supported by the display and its spatial or angular resolutions. (ii) Parallax is usually observed only when a viewer is positioned at a limited range of distances from the screen. This work proposes a new display concept, which supports automultiscopic content in a wide cinema setting. It builds on the typical structure of cinemas, such as the fixed seat positions and the fact that different rows are located on a slope at different heights. Rather than attempting to display many angular images spanning the full range of viewing angles in a wide cinema, our design only displays the narrow angular range observed within the limited width of a single seat. The same narrow range content is then replicated to all rows and seats in the cinema. To achieve this, it uses an optical construction based on two sets of parallax barriers, or lenslets, placed in front of a standard screen. This paper derives the geometry of such a display, analyzes its limitations, and demonstrates a proof-of-concept prototype.

*Joint work with Piotr Didyk, Mike Foshey, Wojciech Matusik, Anat Levin

Talk 2 (Meirav Galun): Accelerated Quadratic Proxy for Geometric Optimization

We present the Accelerated Quadratic Proxy (AQP) - a simple first order algorithm for the optimization of geometric energies defined over triangular and tetrahedral meshes. The main pitfall encountered in the optimization of geometric energies is slow convergence. We observe that this slowness is in large part due to a Laplacian-like term existing in these energies. Consequently, we suggest exploiting the underlying structure of the energy by locally using a quadratic polynomial proxy, whose Hessian is taken to be the Laplacian. This improves stability and convergence, but more importantly allows incorporating acceleration in an almost universal way that is independent of mesh size and of the specific energy considered. Experiments with AQP show it is rather insensitive to mesh resolution and requires a nearly constant number of iterations to converge; this is in strong contrast to other popular optimization techniques used today, such as Accelerated Gradient Descent and Quasi-Newton methods, e.g., L-BFGS. We have tested AQP for mesh deformation in 2D and 3D as well as for surface parameterization, and found it to provide a considerable speedup over common baseline techniques.

*Joint work with Shahar Kovalsky and Yaron Lipman

Thursday, Jun 16, 2016, 12:15
Vision and AI, Room 1
Speaker: Yair Weiss
Title: Neural Networks, Graphical Models and Image Restoration
Abstract:
This is an invited talk I gave last year at a workshop on "Deep Learning for Vision". It discusses some of the history of graphical models and neural networks and speculates on the future of both fields with examples from the particular problem of image restoration.
Thursday, Jun 02, 2016, 12:15
Vision and AI, Room 1
Speaker: Omri Azencot
Title: Advection-based Function Matching on Surfaces
Abstract:
A tangent vector field on a surface is the generator of a smooth family of maps from the surface to itself, known as the flow. Given a scalar function on the surface, it can be transported, or advected, by composing it with a vector field's flow. Such transport is exhibited by many physical phenomena, e.g., in fluid dynamics. In this paper, we are interested in the inverse problem: given source and target functions, compute a vector field whose flow advects the source to the target. We propose a method for addressing this problem, by minimizing an energy given by the advection constraint together with a regularizing term for the vector field. Our approach is inspired by a similar method in computational anatomy, known as LDDMM, yet leverages the recent framework of functional vector fields for discretizing the advection and the flow as operators on scalar functions. The latter allows us to efficiently generalize LDDMM to curved surfaces, without explicitly computing the flow lines of the vector field we are optimizing for. We show two approaches for the solution: using linear advection with multiple vector fields, and using non-linear advection with a single vector field. We additionally derive an approximated gradient of the corresponding energy, which is based on a novel vector field transport operator. Finally, we demonstrate applications of our machinery to intrinsic symmetry analysis, function interpolation and map improvement.
Wednesday, May 25, 2016, 11:15
Vision and AI, Room 1
Speaker: Bill Freeman
Title: Visually Indicated Sounds
JOINT SEMINAR WITH MACHINE LEARNING & STATISTICS
Abstract:

Children may learn about the world by pushing, banging, and manipulating things, watching and listening as materials make their distinctive sounds: dirt makes a thud; ceramic makes a clink. These sounds reveal physical properties of the objects, as well as the force and motion of the physical interaction.

We've explored a toy version of that learning-through-interaction by recording audio and video while we hit many things with a drumstick. We developed an algorithm to predict sounds from silent videos of the drumstick interactions. The algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. We demonstrate that the sounds generated by our model are realistic enough to fool participants in a "real or fake" psychophysical experiment, and that the task of predicting sounds allows our system to learn about material properties in the scene.

Joint work with:
Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson

Monday, May 09, 2016, 14:00
Vision and AI, Room 1
Speaker: Nikos Paragios
Title: Visual Perception through Hyper Graphs
Note the unusual day & time
Abstract:
Computational vision, visual computing and biomedical image analysis have made tremendous progress in the past decade. This is mostly due to the development of efficient learning and inference algorithms, which allow better and richer modeling of visual perception tasks. Hyper-graph representations are among the most prominent tools for addressing such tasks, by casting perception as a graph optimization problem. In this talk, we briefly motivate such representations, discuss their strengths and limitations, provide appropriate strategies for their inference and learning, and present their application to a variety of problems in visual computing.
Thursday, Apr 14, 2016, 12:15
Vision and AI, Room 1
Speaker: Barak Zackay
Title: Proper astronomical image processing - Solving the problems of image co-addition and image subtraction
Abstract:

While co-addition and subtraction of astronomical images stand at the heart of observational astronomy, the existing solutions for them lack rigorous argumentation, do not achieve maximal sensitivity and are often slow. Moreover, there is no widespread agreement on how they should be done, and often different methods are used for different scientific applications. I am going to present rigorous solutions to these problems, deriving them from the most basic statistical principles. These solutions are proved optimal, under well-defined and practically acceptable assumptions, and in many cases substantially improve the performance of the most basic operations in astronomy.

For coaddition, we present a coadd image that:
a) is sufficient for any further statistical decision or measurement on the underlying constant sky, making the entire data set redundant.
b) improves both survey speed (by 5-20%) and effective spatial resolution of past and future astronomical surveys.
c) substantially improves imaging-through-turbulence applications.
d) is much faster to compute than many of the currently used coaddition solutions.

For subtraction, we present a subtraction image that is:
a) optimal for transient detection under the assumption of spatially uniform noise.
b) sufficient for any further statistical decision on the differences between the images, including the identification of cosmic rays and other image artifacts.
c) free of subtraction artifacts, allowing (for the first time) robust transient identification in real time, opening new avenues for scientific exploration.
d) orders of magnitude faster to compute than past subtraction methods.

Thursday, Apr 07, 2016, 12:15
Vision and AI, Room 1
Speaker: Yoni Wexler
Title: Fast Face Recognition with Multi-Batch
Abstract:

A common approach to face recognition relies on using deep learning for extracting a signature. All leading work on the subject uses stupendous amounts of processing power and data. In this work we present a method for efficient and compact learning of a metric embedding. The core idea allows a more accurate estimation of the global gradient and hence fast and robust convergence. In order to avoid the need for huge amounts of data we include an explicit alignment phase in the network, hence greatly reducing the number of parameters. These insights allow us to efficiently train a compact deep learning model for face recognition in only 12 hours on a single GPU, which can then fit on a mobile device.

Joint work with: Oren Tadmor, Tal Rosenwein, Shai Shalev-Shwartz, Amnon Shashua

Thursday, Mar 31, 2016, 12:15
Vision and AI, Room 1
Speaker: Yael Moses
Title: Dynamic Scene Analysis Using CrowdCam Data
Abstract:

Dynamic events such as family gatherings, concerts or sports events are often photographed by a group of people. The set of still images obtained this way is rich in dynamic content. We consider the question of whether such a set of still images, rather than traditional video sequences, can be used for analyzing the dynamic content of the scene. This talk will describe several instances of this problem, their solutions and directions for future studies.

In particular, we will present a method that extends epipolar geometry to predict the location of a moving feature in CrowdCam images. The method assumes that the temporal order of the set of images, namely photo-sequencing, is given. We will briefly describe our method to compute photo-sequencing using geometric considerations and rank aggregation. We will also present a method for identifying the moving regions in a scene, which is a basic component in dynamic scene analysis. Finally, we will consider a new vision of developing collaborative CrowdCam, and a first step toward this goal.

This talk will be based on joint work with Tali Dekel, Adi Dafni, Mor Dar, Lior Talker, Ilan Shimshoni, and Shai Avidan.

Monday, Mar 28, 2016, 11:00
Vision and AI, Room 141
Speaker: Dan Raviv
Title: Stretchable non-rigid structures
PLEASE NOTE UNUSUAL ROOM, DAY and TIME.
Abstract:
Geometrical understanding of bendable and stretchable structures is crucial for many applications where comparison, inference and reconstruction play an important role. Moreover, it is the first step in quantifying normal and abnormal phenomena in non-rigid domains. Moving from Euclidean (straight) distances towards intrinsic (geodesic) measures revolutionized the way we handle bendable structures, but did not take stretching into account. Human organs, such as the heart, lungs and kidneys, are great examples of such stretchable structures. In this lecture I will show that stretching can be accounted for at the atomic (local) level, in closed form, using higher derivatives of the data. I will further show that invariants can play a critical part in modern learning systems used for statistical analysis of non-rigid structures, and can assist in fabricating soft models. The lecture will be self-contained and no prior knowledge is needed.
Thursday, Jan 21, 2016, 12:15
Vision and AI, Room 1
Speaker: Yoav Schechner
Title: Clouds in 4D
Abstract:
The spatially varying and temporally dynamic atmosphere presents significant, exciting and fundamentally new problems for imaging and computer vision. Some problems must tackle the complexity of radiative transfer models in 3D multiply-scattering media, to achieve reconstruction based on the models. This aspect can also be used in other scattering media. Nevertheless, the huge scale of the atmosphere and its dynamics call for multiview imaging using unprecedented distributed camera systems, on the ground or in orbit. These new configurations require generalizations of traditional triangulation, radiometric calibration, background estimation, lens-flare and compression questions. This focus can narrow uncertainties in climate-change forecasts, as we explain.
Thursday, Jan 14, 2016, 12:15
Vision and AI, Room 1
Speaker: Oren Freifeld
Title: From representation to inference: respecting and exploiting mathematical structures in computer vision and machine learning
Abstract:

Stochastic analysis of real-world signals consists of 3 main parts: mathematical representation; probabilistic modeling; statistical inference. For it to be effective, we need mathematically-principled and practical computational tools that take into consideration not only each of these components by itself but also their interplay. This is especially true for a large class of computer-vision and machine-learning problems that involve certain mathematical structures; the latter may be a property of the data or encoded in the representation/model to ensure mathematically-desired properties and computational tractability. For concreteness, this talk will center on structures that are geometric, hierarchical, or topological.

Structures present challenges. For example, on nonlinear spaces, most statistical tools are not directly applicable, and, moreover, computations can be expensive. As another example, in mixture models, topological constraints break statistical independence. Once we overcome the difficulties, however, structures offer many benefits. For example, respecting and exploiting the structure of Riemannian manifolds and/or Lie groups yields better probabilistic models that also support consistent synthesis. The latter is crucial for the employment of analysis-by-synthesis inference methods used within, e.g., a generative Bayesian framework. Likewise, imposing a certain structure on velocity fields yields highly expressive diffeomorphisms that are also simple and computationally tractable; in particular, this facilitates MCMC inference, traditionally viewed as too expensive in this context.

Time permitting, throughout the talk I will also briefly touch upon related applications such as statistical shape models, transfer learning on manifolds, image warping/registration, time warping, superpixels, 3D-scene analysis, nonparametric Bayesian clustering of spherical data, multi-metric learning, and new machine-learning applications of diffeomorphisms. Lastly, we also applied the (largely model-based) ideas above to propose the first learned data augmentation scheme; as it turns out, when compared with the state-of-the-art schemes, this improves the performance of classifiers of the deep-net variety.

Thursday, Jan 07, 2016, 12:15
Vision and AI, Room 1
Speaker: Greg Shakhnarovich
Title: Rich Representations for Parsing Visual Scenes
Abstract:

I will describe recent work on building and using rich representations aimed at automatic analysis of visual scenes. In particular, I will describe methods for semantic segmentation (labeling each region of an image according to the category it belongs to), and for semantic boundary detection (recovering accurate boundaries of semantically meaningful regions, such as those corresponding to objects). We focus on feed-forward architectures for these tasks, leveraging recent advances in the art of training deep neural networks. Our approach aims to shift the burden of inducing desirable constraints from explicit structure in the model to implicit structure inherent in computing richer, context-aware representations. I will describe experiments on standard benchmark data sets that demonstrate the success of this approach.

Joint work with Mohammadreza Mostajabi, Payman Yadollahpour, and Harry Yang.

Wednesday, Jan 06, 2016, 11:15
Vision and AI, Room 1
Speaker: Karen Livescu
Title: Segmental Sequence Models in the Neural Age
Joint Vision and Machine Learning seminar; note unusual day/time
Abstract:

Many sequence prediction tasks, such as automatic speech recognition and video analysis, benefit from long-range temporal features. One way of utilizing long-range information is through segmental (semi-Markov) models such as segmental conditional random fields. Such models have had some success, but have been constrained by the computational needs of considering all possible segmentations. We have developed new segmental models with rich features based on neural segment embeddings, trained with discriminative large-margin criteria, that are efficient enough for first-pass decoding. In our initial work with these models, we have found that they can outperform frame-based HMM/deep network baselines on two disparate tasks, phonetic recognition and sign language recognition from video. I will present the models and their results on these tasks, as well as (time permitting) related recent work on neural segmental acoustic word embeddings.
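
As an illustration of why considering all possible segmentations is costly, and how dynamic programming keeps it manageable, here is a minimal sketch of semi-Markov (segmental) decoding. It is not the speaker's model: the function name `best_segmentation`, the `max_len` cap on segment length, and the toy scoring function are all assumptions made for this example. In the work described above, a learned model based on neural segment embeddings would supply the segment scores.

```python
def best_segmentation(n, max_len, segment_score):
    """Find the highest-scoring split of frames 0..n-1 into contiguous segments.

    segment_score(start, end) scores the segment covering frames start..end-1.
    Segments are capped at max_len frames (an assumption of this sketch), so the
    search makes O(n * max_len) score evaluations instead of enumerating every
    possible segmentation.
    """
    best = [float("-inf")] * (n + 1)  # best[e]: best total score for frames 0..e-1
    back = [0] * (n + 1)              # back[e]: start of the last segment ending at e
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + segment_score(start, end)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers to recover the segment boundaries.
    segments = []
    end = n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    segments.reverse()
    return best[n], segments


# Toy, hypothetical score: reward long segments whose frames share one label,
# penalize segments that mix labels.
frames = "aaabb"

def toy_score(start, end):
    length = end - start
    uniform = len(set(frames[start:end])) == 1
    return length ** 2 if uniform else -(length ** 2)

total, segments = best_segmentation(len(frames), 4, toy_score)
print(total, segments)
```

On this toy input the decoder splits the frames into the run of `a` labels and the run of `b` labels.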

This is joint work with Hao Tang, Weiran Wang, Herman Kamper, Taehwan Kim, and Kevin Gimpel.

Thursday, Dec 31, 2015, 12:15
Vision and AI, Room 1
Speaker: Shai Shalev-Shwartz
Title: Deep Learning: The theoretical-practical gap
Abstract:
I will describe two contradicting lines of work. On one hand, practical work on autonomous driving that I have been doing at Mobileye, in which deep learning is one of the key ingredients. On the other hand, recent theoretical works showing very strong hardness-of-learning results. Bridging this gap is a great challenge. I will describe some approaches toward a solution.
Thursday, Dec 24, 2015, 12:15
Vision and AI, Room 1
Speaker: Shai Avidan
Title: Best-Buddies Similarity for Robust Template Matching
Abstract:
We propose a novel method for template matching in unconstrained environments. Its essence is the Best-Buddies Similarity (BBS), a useful, robust, and parameter-free similarity measure between two sets of points. BBS is based on counting the number of Best-Buddies Pairs (BBPs): pairs of points in source and target sets, where each point is the nearest neighbor of the other. BBS has several key features that make it robust against complex geometric deformations and high levels of outliers, such as those arising from background clutter and occlusions. We study these properties, provide a statistical analysis that justifies them, and demonstrate the consistent success of BBS on a challenging real-world dataset. Joint work with Tali Dekel, Shaul Oron, Miki Rubinstein and Bill Freeman.
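
The counting rule in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function name, the use of squared Euclidean distance between plain coordinate tuples, and the normalization by the size of the smaller set are assumptions made here.

```python
def best_buddies_similarity(source, target):
    """Count Best-Buddies Pairs between two point sets and normalize the count.

    A pair (p, q) is a Best-Buddies Pair when q is the nearest neighbor of p in
    the target set and p is the nearest neighbor of q in the source set.
    Dividing by the size of the smaller set is an assumption of this sketch.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def nearest(p, candidates):
        # Index of the candidate closest to p (first one wins on ties).
        return min(range(len(candidates)), key=lambda j: sq_dist(p, candidates[j]))

    if not source or not target:
        return 0.0
    nn_in_target = [nearest(p, target) for p in source]
    nn_in_source = [nearest(q, source) for q in target]
    pairs = sum(1 for i, j in enumerate(nn_in_target) if nn_in_source[j] == i)
    return pairs / min(len(source), len(target))


# Two points match closely; the third point in each set is an outlier.
source = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
target = [(0.0, 0.1), (1.0, 0.1), (-50.0, -50.0)]
bbs = best_buddies_similarity(source, target)
print(bbs)
```

In this toy example the two outliers do not form a mutual nearest-neighbor pair, so they add nothing to the count. This is the robustness to outliers that the abstract refers to.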
Thursday, Dec 03, 2015, 12:15
Vision and AI, Room 1
Speaker: Ariel Shamir
Title: Creating Visual Stories
Abstract:
Similar to text, the amount of visual data in the form of videos and images is growing enormously. One of the key challenges is to understand this data, arrange it, and create content which is semantically meaningful. In this talk I will present several such efforts to "bridge the semantic gap" using humans as "agents": capturing and utilizing eye movements, body movement or gaze direction. This enables re-editing of existing videos, tracking of sports highlights, creating one coherent video from multiple sources, and more.
Thursday, Nov 26, 2015, 12:15
Vision and AI, Room 1
Speaker: Nadav Cohen
Title: On the Expressive Power of Deep Learning: A Tensor Analysis
Abstract:

It has long been conjectured that hypothesis spaces suitable for data that is compositional in nature, such as text or images, may be more efficiently represented with deep hierarchical architectures than with shallow ones.  Despite the vast empirical evidence, formal arguments to date are limited and do not capture the kind of networks used in practice. Using tensor factorization, we derive a universal hypothesis space implemented by an arithmetic circuit over functions applied to local data structures (e.g. image patches). The resulting networks first pass the input through a representation layer, and then proceed with a sequence of layers comprising sum followed by product-pooling, where sum corresponds to the widely used convolution operator. The hierarchical structure of networks is born from factorizations of tensors based on the linear weights of the arithmetic circuits. We show that a shallow network corresponds to a rank-1 decomposition, whereas a deep network corresponds to a Hierarchical Tucker (HT) decomposition. Log-space computation for numerical stability transforms the networks into SimNets.

In its basic form, our main theoretical result shows that the set of polynomially sized rank-1 decomposable tensors has measure zero in the parameter space of polynomially sized HT decomposable tensors. In deep learning terminology, this amounts to saying that besides a negligible set, all functions that can be implemented by a deep network of polynomial size, require an exponential size if one wishes to implement (or approximate) them with a shallow network. Our construction and theory shed new light on various practices and ideas employed by the deep learning community, and in that sense bear a paradigmatic contribution as well.

Joint work with Or Sharir and Amnon Shashua.

Thursday, Nov 19, 2015, 12:15
Vision and AI, Room 1
Speaker: Alex Bronstein
Title: Learning to hash
Abstract:

In view of the recent huge interest in image classification and object recognition problems and the spectacular success of deep learning and random forests in solving these tasks, it seems astonishing that much more modest efforts are being invested into related, and often more difficult, problems of image and multimodal content-based retrieval, and, more generally, similarity assessment in large-scale databases. These problems, arising as primitives in many computer vision tasks, are becoming increasingly important in the era of exponentially increasing information. Semantic and similarity-preserving hashing methods have recently received considerable attention to address such a need, in part due to their significant memory and computational advantage over other representations.

In this talk, I will give an overview of some of my recent attempts to construct efficient semantic hashing schemes based on deep neural networks and random forests.

Based on joint works with Qiang Qiu, Guillermo Sapiro, Michael Bronstein, and Jonathan Masci.

Thursday, Nov 12, 2015, 12:15
Vision and AI, Room 1
Speaker: Nathan Srebro
Title: Optimization, Regularization and Generalization in Multilayer Networks
Joint Machine Learning & Vision Seminar
Abstract:

What is it that enables learning with multi-layer networks? What causes the network to generalize well? What makes it possible to optimize the error, despite the problem being hard in the worst case? In this talk I will attempt to address these questions and relate them to one another, highlighting the important role of optimization in deep learning. I will then use the insight to suggest studying novel optimization methods, and will present Path-SGD, a novel optimization approach for multi-layer ReLU networks that yields better optimization and better generalization.

Joint work with Behnam Neyshabur, Ryota Tomioka and Russ Salakhutdinov.

Thursday, Oct 22, 2015, 12:15
Vision and AI, Room 1
Speaker: Michael Bronstein
Title: Deep learning on geometric data
Abstract:
The past decade in computer vision research has witnessed the re-emergence of "deep learning" and, in particular, convolutional neural network techniques, which make it possible to learn task-specific features from examples and have achieved a breakthrough in performance in a wide range of applications. However, in the geometry processing and computer graphics communities, these methods are practically unknown. One of the reasons stems from the fact that 3D shapes (typically modeled as Riemannian manifolds) are not shift-invariant spaces, hence the very notion of convolution is rather elusive. In this talk, I will show some recent works from our group trying to bridge this gap. Specifically, I will show the construction of intrinsic convolutional neural networks on meshes and point clouds, with applications such as finding dense correspondence between deformable shapes and shape retrieval.
Thursday, Jul 02, 2015, 12:15
Vision and AI, Room 1
Speaker: Kyros Kutulakos
Title: Transport-Aware Cameras
Abstract:

Conventional cameras record all light falling onto their sensor regardless of the path that light followed to get there. In this talk I will present an emerging family of video cameras that can be programmed to record just a fraction of the light coming from a controllable source, based on the actual 3D path followed. Live video from these cameras offers a very unconventional view of our everyday world in which refraction and scattering can be selectively blocked or enhanced, visual structures too subtle to notice with the naked eye can become apparent, and object appearance can depend on depth.
I will discuss the unique optical properties and power efficiency of these "transport-aware" cameras, as well as their use for 3D shape acquisition, robust time-of-flight imaging, material analysis, and scene understanding. Last but not least, I will discuss their potential to become our field's "outdoor Kinect" sensor, able to operate robustly even in direct sunlight with very low power.

Kyros Kutulakos is a Professor of Computer Science at the University of Toronto. He received his PhD degree from the University of Wisconsin-Madison in 1994 and his BS degree from the University of Crete in 1988, both in Computer Science. In addition to the University of Toronto, he has held appointments at the University of Rochester (1995-2001) and Microsoft Research Asia (2004-05 and 2011-12). He is the recipient of an Alfred P. Sloan Fellowship, an Ontario Premier's Research Excellence Award, a Marr Prize in 1999, a Marr Prize Honorable Mention in 2005, and three other paper awards (CVPR 1994, ECCV 2006, CVPR 2014). He also served as Program Co-Chair of CVPR 2003, ICCP 2010 and ICCV 2013.

Thursday, Jun 18, 2015, 12:15
Vision and AI, Room 1
Speaker: Marc Teboulle
Title: Elementary Algorithms for High Dimensional Structured Optimization
Abstract:

Many scientific and engineering problems are challenging because they involve functions of a very large number of variables. Such problems arise naturally in signal recovery, image processing, learning theory, etc. In addition to the numerical difficulties due to the so-called curse of dimensionality, the resulting optimization problems are often nonsmooth and nonconvex.

We shall survey some of our recent results, illustrating how these difficulties may be handled in the context of well-structured optimization models, highlighting the ways in which problem structures and data information can be beneficially exploited to devise and analyze simple and efficient algorithms.

Thursday, Jun 04, 2015, 12:15
Vision and AI, Room 1
Speaker: Rene Vidal
Title: Algebraic, Sparse and Low Rank Subspace Clustering
Abstract:
In the era of data deluge, the development of methods for discovering structure in high-dimensional data is becoming increasingly important. Traditional approaches such as PCA often assume that the data is sampled from a single low-dimensional manifold. However, in many applications in signal/image processing, machine learning and computer vision, data in multiple classes lie in multiple low-dimensional subspaces of a high-dimensional ambient space. In this talk, I will present methods from algebraic geometry, sparse representation theory and rank minimization for clustering and classification of data in multiple low-dimensional subspaces. I will show how these methods can be extended to handle noise, outliers as well as missing data. I will also present applications of these methods to video segmentation and face clustering.
Wednesday, May 20, 2015, 13:00
Vision and AI, Room 1
Speaker: Thomas Brox
Title: Will ConvNets render computer vision research obsolete?
Abstract:
Deep learning based on convolutional network architectures has revolutionized the field of visual recognition in the last two years. There is hardly a classification task left where ConvNets do not define the state of the art. Outside recognition, deep learning seems to be of lesser importance, yet this could be a fallacy. In this talk I will present our recent work on convolutional networks and show that they can learn to solve computer vision problems that are not typically assigned to the field of recognition. I will present a network that has learned to be good at descriptor matching, another that can create new images of chairs, and two networks that have learned to estimate optical flow. I will conclude with some arguments why, despite all this, computer vision will remain a serious research field.
Thursday, May 14, 2015, 12:15
Vision and AI, Room 141
Speaker: Guy Ben-Yosef
Title: Full interpretation of minimal images
Please note unusual location.
Abstract:

The goal in this work is to produce a 'full interpretation' of object images, namely to identify and localize all semantic features and parts that are recognized by human observers. We develop a novel approach and tools to study this challenging task, by dividing the interpretation of the complete object into the interpretation of so-called 'minimal recognizable configurations', namely severely reduced but recognizable local regions that are minimal in the sense that any further reduction would render them unrecognizable. We show that for the task of full interpretation, such minimal images have unique properties, which make them particularly useful.

For modeling interpretation, we identify primitive components and relations that play a useful role in the interpretation of minimal images by humans, and incorporate them in a structured prediction algorithm. The structure elements can be point, contour, or region primitives, while relations between them range from standard unary and binary potentials based on relative location, to more complex and high dimensional relations. We show experimental results and match them to human performance. We discuss implications of ‘full’ interpretation for difficult visual tasks, such as recognizing human activities or interactions.

Thursday, Apr 02, 2015, 12:15
Vision and AI, Room 1
Speaker: Yonatan Wexler
Title: Machine Learning In Your Pocket
Abstract:

The field of Machine Learning has been making huge strides recently. Problems such as visual recognition and classification, that were believed to be open only a few years ago, now seem solvable. The best performers use Artificial Neural Networks, in their reincarnation as "Deep Learning", where huge networks are trained over lots of data. One bottleneck in current schemes is the huge amount of required computation during both training and testing. This limits the usability of these methods when power is an issue, such as with wearable devices.

As a step towards a deeper understanding of deep learning mechanisms, I will show how correct conditioning of the back-propagation training iterations results in much improved convergence. This reduces training time while providing better results. It also allows us to train smaller models, which are harder to optimize.

In this talk I will also discuss the challenges - and describe some of the solutions - in applying Machine Learning on a mobile device that can fit your pocket. The OrCam is a wearable camera that speaks to you. It reads anything, learns and recognizes faces, and much more. It is ready to help through the day, all with a simple pointing gesture. It is already improving the lives of many blind and visually impaired people.

Thursday, Mar 26, 2015, 12:15
Vision and AI, Room 1
Speaker: Lior Wolf
Title: Image Annotation using Deep Learning and Fisher Vectors
Abstract:
We present a system for solving the holy grail of computer vision: matching images and text, and describing an image with automatically generated text. Our system is based on combining deep learning tools for images and text, namely Convolutional Neural Networks, word2vec, and Recurrent Neural Networks, with a classical computer vision tool, the Fisher Vector. The Fisher Vector is modified to support hybrid distributions that are a much better fit for the text data. Our method proves to be extremely potent, and we outperform all concurrent methods by a significant margin.
Thursday, Mar 19, 2015, 12:15
Vision and AI, Room 1
Speaker: Simon Korman
Title: Inverting RANSAC: Global Model Detection via Inlier Rate Estimation
Abstract:
This work presents a novel approach for detecting inliers in a given set of correspondences (matches). It does so without explicitly identifying any consensus set, based on a method for inlier rate estimation (IRE). Given such an estimator for the inlier rate, we also present an algorithm that detects a globally optimal transformation. We provide a theoretical analysis of the IRE method using a stochastic generative model on the continuous spaces of matches and transformations. This model allows rigorous investigation of the limits of our IRE method for the case of 2D translation, further giving bounds and insights for the more general case. Our theoretical analysis is validated empirically and is shown to hold in practice for the more general case of 2D affinities. In addition, we show that the combined framework works on challenging cases of 2D homography estimation, with very few and possibly noisy inliers, where RANSAC generally fails. Joint work with Roee Litman, Alex Bronstein and Shai Avidan.
Thursday, Jan 29, 2015, 12:15
Vision and AI, Room 1
Speaker: Avishay Gal-Yam and Barak Zackay
Title: New ways to look at the sky
Abstract:
We present a general review of astronomical observation, with emphasis on the ways it differs from conventional imaging or photography. We then describe emerging trends in this area driven mainly by advances in detector technology and computing power. Having set a broad context, we then describe the new multiplexed imaging technique we have developed. This method uses the sparseness of typical astronomical data in order to image large areas of target sky using a physically small detector.
Monday, Jan 26, 2015, 14:00
Vision and AI, Room 141
Speaker: Greg Shakhnarovich
Title: Feedforward semantic segmentation with zoom-out features
NOTE UNUSUAL DAY/TIME/ROOM
Abstract:
We introduce a purely feed-forward architecture for semantic segmentation. We map small image elements (superpixels) to rich feature representations extracted from a sequence of nested regions of increasing extent. These regions are obtained by "zooming out" from the superpixel all the way to scene-level resolution. This approach exploits statistical structure in the image and in the label space without setting up explicit structured prediction mechanisms, and thus avoids complex and expensive inference. Instead, superpixels are classified by a feedforward multilayer network. Our architecture achieves new state-of-the-art performance in semantic segmentation, obtaining 64.4% average accuracy on the PASCAL VOC 2012 test set. Joint work with Mohammadreza Mostajabi and Payman Yadollahpour.
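
To make the "zooming out" idea concrete, here is a deliberately simplified sketch. It is a toy stand-in, not the architecture from the talk: it works on single pixels rather than superpixels, uses square windows as the nested regions, and uses the mean intensity of each window in place of learned features. The function name and the `radii` parameter are hypothetical.

```python
def zoom_out_features(image, row, col, radii=(0, 1, 2)):
    """Describe one pixel by statistics of nested windows of increasing extent.

    For each radius r, take the (2r+1) x (2r+1) window centered at (row, col),
    clipped to the image border, and record its mean intensity. Concatenating
    the values gives a descriptor that mixes local and wider context.
    """
    height, width = len(image), len(image[0])
    features = []
    for r in radii:
        top, bottom = max(0, row - r), min(height, row + r + 1)
        left, right = max(0, col - r), min(width, col + r + 1)
        values = [image[i][j] for i in range(top, bottom) for j in range(left, right)]
        features.append(sum(values) / len(values))
    return features


# A 5x5 image that is dark everywhere except one bright center pixel.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 9
features = zoom_out_features(image, 2, 2)
print(features)
```

The descriptor for the center pixel starts at the pixel's own value and falls as the window grows, so one vector carries both local appearance and surrounding context. A classifier, such as the feedforward multilayer network mentioned above, would then take such a vector as input.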
Monday, Jan 12, 2015, 14:00
Vision and AI, Room 141
Speaker: Karen Livescu
Title: Multi-view representation learning: A tutorial introduction and applications to speech and language
NOTE UNUSUAL ROOM, DAY, TIME
Abstract:

Many types of multi-dimensional data have a natural division into two "views", such as audio and video or images and text. Multi-view learning includes a variety of techniques that use multiple views of data to learn improved models for each of the views. The views can be multiple measurement modalities (like the examples above) but can also be different types of information extracted from the same source (words + context, document text + links) or any division of the data dimensions into subsets satisfying certain learning assumptions. Theoretical and empirical results show that multi-view techniques can improve over single-view ones in certain settings. In many cases multiple views help by reducing noise in some sense (what is noise in one view is not in the other). In this talk, I will focus on multi-view learning of representations (features), especially using canonical correlation analysis (CCA) and related techniques. I will give a tutorial overview of CCA and its relationship with other techniques such as partial least squares (PLS) and linear discriminant analysis (LDA). I will also present extensions developed by ourselves and others, such as kernel, deep, and generalized ("many-view") CCA. Finally, I will give recent results on speech and language tasks, and demonstrate our publicly available code.
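For readers unfamiliar with CCA, the following is a minimal sketch of plain linear CCA on two synthetic views. It is not the speakers' publicly available code: the function names `inv_sqrt` and `linear_cca`, the ridge term `reg`, and the toy data are all assumptions made for illustration. Each view is whitened, and the SVD of the whitened cross-covariance gives the canonical directions and correlations.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def linear_cca(X, Y, k, reg=1e-6):
    """Linear CCA sketch. X: (n, dx), Y: (n, dy).

    Returns projection matrices A (dx, k), B (dy, k) and the top-k
    canonical correlations. `reg` is a small ridge term (an assumption,
    added only for numerical stability).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    return Kx @ U[:, :k], Ky @ Vt.T[:, :k], s[:k]

# Toy data: one shared latent signal z, hidden among view-specific noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(2000, 1)), rng.normal(size=(2000, 2))])
Y = np.hstack([rng.normal(size=(2000, 2)), -z + 0.1 * rng.normal(size=(2000, 1))])

A, B, corr = linear_cca(X, Y, k=1)
proj_x = (X - X.mean(axis=0)) @ A[:, 0]
proj_y = (Y - Y.mean(axis=0)) @ B[:, 0]
empirical = np.corrcoef(proj_x, proj_y)[0, 1]
```

In this toy setup the noise dimensions of one view are independent of the other view, so the leading canonical pair should pick out the shared signal, which is the "noise in one view is not noise in the other" intuition from the abstract.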

Based on joint work with Raman Arora, Weiran Wang, Jeff Bilmes, Galen Andrew, and others.

Thursday, Jan 08, 2015, 12:15
Vision and AI, Room 1
Speaker: Tomer Michaeli
Title: Blind deblurring and blind super-resolution using internal patch recurrence
Abstract:

Small image patches tend to recur at multiple scales within high-quality natural images. This fractal-like behavior has been used in the past for various tasks including image compression, super-resolution and denoising. In this talk, I will show that this phenomenon can also be harnessed for "blind deblurring" and for "blind super-resolution", that is, for removing blur or increasing resolution without a priori knowledge of the associated blur kernel. It turns out that the cross-scale patch recurrence property is strong only in images taken under ideal imaging conditions, but significantly diminishes when the imaging conditions deviate from ideal ones. Therefore, the deviations from ideal patch recurrence actually provide information on the unknown camera blur kernel.
More specifically, we show that the correct blur kernel is the one which maximizes the similarity between patches across scales of the image. Extensive experiments indicate that our approach leads to state-of-the-art results, both in deblurring and in super-resolution.

Joint work with Michal Irani.

Thursday, Jan 01, 2015, 12:15
Vision and AI, Room 1
Speaker: Tal Hassner
Title: Towards Dense Correspondences Between Any Two Images
Abstract:

We present a practical method for establishing dense correspondences between two images with similar content, but possibly different 3D scenes. One of the challenges in designing such a system is the local scale differences of objects appearing in the two images. Previous methods often considered only small subsets of image pixels, matching only pixels for which stable scales may be reliably estimated. More recently, others have considered dense correspondences, but with substantial costs associated with generating, storing and matching scale invariant descriptors.
Our work here is motivated by the observation that pixels in the image have contexts -- the pixels around them -- which may be exploited in order to estimate local scales reliably and repeatably. In practice, we demonstrate that scales estimated at sparse interest points may be propagated to neighboring pixels where this information cannot be reliably determined. Doing so allows scale invariant descriptors to be extracted anywhere in the image, not just at detected interest points. As a consequence, accurate dense correspondences are obtained even between very different images, with little computational cost beyond that required by existing methods.

This is joint work with Moria Tau from the Open University of Israel.

Thursday, Dec 25, 2014, 13:00
Vision and AI, Room 1
Speaker: Hadar Elor
Title: RingIt: Ring-ordering Casual Photos of a Dynamic Event
Abstract:
The multitude of cameras constantly present nowadays has redefined the meaning of capturing an event and the meaning of sharing this event with others. The images are frequently uploaded to a common platform, and the image-navigation challenge naturally arises. In this talk I will present RingIt, a novel technique to sort an unorganized set of casual photographs taken along a general ring, where the cameras capture a dynamic event in the center of the ring. We assume a nearly instantaneous event, e.g., an interesting moment in a performance captured by the digital cameras and smartphones of the surrounding crowd. The ordering method extracts the K-nearest neighbors (KNN) of each image from a rough all-pairs dissimilarity estimate. The KNN dissimilarities are refined to form a sparse weighted Laplacian, and a spectral analysis reveals the spatial ordering of the images, allowing for a sequential display of the captured object.
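The KNN-to-Laplacian-to-spectrum pipeline described above can be illustrated with a small, idealised sketch. This is not the RingIt implementation: the function `ring_order`, the Gaussian edge weights, and the use of angular distance between evenly spaced cameras as a stand-in for image dissimilarity are all assumptions made for illustration. The idea is that, for a ring-shaped neighbourhood graph, the two Laplacian eigenvectors with the smallest non-zero eigenvalues embed the images on a circle, so sorting by angle recovers the ring order up to rotation and direction.

```python
import numpy as np

def ring_order(dissim, k=4):
    """Order items lying on a ring, given an (n, n) dissimilarity matrix."""
    n = dissim.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dissim[i])
        nbrs = order[order != i][:k]          # k nearest neighbours, excluding self
        sigma = dissim[i, nbrs].mean() + 1e-12
        W[i, nbrs] = np.exp(-(dissim[i, nbrs] / sigma) ** 2)  # assumed weighting
    W = np.maximum(W, W.T)                    # symmetrise the sparse KNN graph
    L = np.diag(W.sum(axis=1)) - W            # weighted graph Laplacian
    vals, vecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    # vecs[:, 0] is the constant vector; the next two span a circular embedding.
    theta = np.arctan2(vecs[:, 2], vecs[:, 1])
    return np.argsort(theta)

# Synthetic ring: image i sits at ring slot true_pos[i] (shuffled input order).
rng = np.random.default_rng(1)
n = 16
true_pos = rng.permutation(n)
angles = 2 * np.pi * true_pos / n
diff = np.abs(angles[:, None] - angles[None, :])
dissim = np.minimum(diff, 2 * np.pi - diff)   # angular distance as a stand-in

order = ring_order(dissim, k=4)
slots = true_pos[order]
steps = np.diff(np.append(slots, slots[0])) % n
is_ring = bool(np.all(steps == 1) or np.all(steps == n - 1))
```

With real photographs the dissimilarities are noisy and the cameras unevenly spaced, which is presumably why the abstract mentions a refinement step before the spectral analysis; the sketch skips that step entirely.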
Thursday, Dec 11, 2014, 12:15
Vision and AI, Room 1
Speaker: Boaz Nadler
Title: Edge Detection under computational constraints: a sublinear approach
Abstract:
Edge detection is an important task in image analysis. Various applications require real-time detection of long edges in large noisy images. Motivated by such settings, in this talk we'll address the following question: How well can one detect long edges under severe computational constraints that allow only a fraction of all image pixels to be processed? We present fundamental lower bounds on edge detection in this setup, a sublinear algorithm for long edge detection and a theoretical analysis of the inevitable tradeoff between its detection performance and the allowed computational budget. The competitive performance of our algorithm will be illustrated on both simulated and real images. Joint work with Inbal Horev, Meirav Galun, Ronen Basri (Weizmann) and Ery Arias-Castro (UCSD).
Thursday, Dec 04, 2014, 12:00
Vision and AI, Room 1
Speaker: Shai Avidan
Title: Extended Lucas-Kanade Tracking
Abstract:
Lucas-Kanade (LK) is a classic tracking algorithm exploiting target structural constraints through template matching. Extended Lucas-Kanade, or ELK, casts the original LK algorithm as a maximum likelihood optimization and then extends it by considering pixel object / background likelihoods in the optimization. Template matching and pixel-based object / background segregation are tied together by a unified Bayesian framework. In this framework two log-likelihood terms related to pixel object / background affiliation are introduced in addition to the standard LK template matching term. Tracking is performed using an EM algorithm, in which the E-step corresponds to pixel object / background inference, and the M-step to parameter optimization. The final algorithm, implemented using a classifier for object / background modeling and equipped with simple template update and occlusion handling logic, is evaluated on two challenging datasets containing 50 sequences each. The first is a recently published benchmark where ELK ranks 3rd among 30 tracking methods evaluated. On the second dataset, of vehicles undergoing severe viewpoint changes, ELK ranks in 1st place, outperforming state-of-the-art methods. Joint work with Shaul Oron (Tel-Aviv University) and Aharon Bar-Hillel (Microsoft).
Thursday, Nov 27, 2014, 12:00
Vision and AI, Room 1
Speaker: Fred Hamprecht
Title: Joint segmentation and tracking, and new unsolved problems
Abstract:

On my last visit in 2012, I posed a number of open questions, including how to achieve joint segmentation and tracking, and how to obtain uncertainty estimates for a segmentation.

Some of these questions we have been able to solve [Schiegg ICCV 2013, Schiegg Bioinformatics 2014, Fiaschi CVPR 2014] and I would like to report on this progress.

Given that I will be at Weizmann for another four months, I will also pose new open questions on image processing problems that require a combination of combinatorial optimization and (structured) learning, as an invitation to work together.

Thursday, Nov 20, 2014, 12:00
Vision and AI, Room 1
Speaker: Marina Alterman
Title: Vision Through Random Refractive Distortions
Abstract:
Random dynamic distortions naturally affect images taken through atmospheric turbulence or wavy water. We show how computer vision can function under such effects, and even exploit them, relying on physical, geometric and statistical models of refractive disturbances. We make good use of distortions created by atmospheric turbulence: distorted multi-view videos lead to tomographic reconstruction of large-scale turbulence fields outdoors. We also demonstrate several approaches to a 'virtual periscope', to view airborne scenes from submerged cameras: (a) multiple submerged views enable stochastic localization of airborne objects in 3D; (b) the wavy water surface (and hence the distortion) can be passively estimated instantly, using a special sensor, analogous to modern astronomical telescopes; and (c) airborne moving objects can be automatically detected, despite dynamic distortions affecting the entire scene. In all these works, exploiting physical models in new ways leads to novel imaging tasks, while the approaches we take are demonstrated in field experiments.
Thursday, Nov 13, 2014, 12:00
Vision and AI, Room 1
Speaker: Barak Zackay
Title: Imaging through turbulence a long quest of innovative computational photography in astronomy
Abstract:

The astronomical community's largest technical challenge is coping with the Earth's atmosphere. In this talk, I will present the popular methods for performing scientific measurements from the ground while coping with the time-dependent distortions generated by the Earth's atmosphere. We will talk about the following topics:

1) Scientific motivation for eliminating the effect of the atmosphere.

2) The statistics of turbulence - the basis for all methods is a deep understanding of atmospheric turbulence.

3) Wave-front sensing + adaptive optics - a way to correct it in hardware.

4) Lucky imaging + speckle interferometry - ways to computationally extract scientifically valuable data despite the turbulent atmosphere.
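As a rough illustration of the lucky imaging idea in item 4, the sketch below keeps only the sharpest short exposures from a stack and averages them. It is a simplified toy, not the method presented in the talk: the function `lucky_stack`, the peak-pixel sharpness score, the `keep_fraction` parameter, and the synthetic Gaussian blurs standing in for atmospheric seeing are all assumptions, and the frame registration (shift-and-add) that practical pipelines need is omitted.

```python
import numpy as np

def lucky_stack(frames, keep_fraction=0.1):
    """Average the sharpest fraction of a stack of short exposures.

    frames: array of shape (n_frames, height, width).
    Sharpness is scored by the brightest pixel, a common simple proxy
    for a point source (an assumption of this sketch).
    """
    frames = np.asarray(frames, dtype=float)
    scores = frames.reshape(len(frames), -1).max(axis=1)
    n_keep = max(1, int(round(keep_fraction * len(frames))))
    best = np.argsort(scores)[::-1][:n_keep]   # indices of the sharpest frames
    return frames[best].mean(axis=0), best

# Synthetic data: a centred point source blurred by a random width per frame.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[-16:17, -16:17]
sigmas = rng.uniform(1.0, 5.0, size=200)       # hypothetical per-frame blur
frames = []
for s in sigmas:
    psf = np.exp(-(xx ** 2 + yy ** 2) / (2 * s ** 2))
    frames.append(psf / psf.sum())             # each frame carries equal flux
frames = np.array(frames)

lucky, kept = lucky_stack(frames, keep_fraction=0.1)
plain = frames.mean(axis=0)                    # ordinary long-exposure equivalent
```

Comparing `lucky.max()` with `plain.max()` shows the trade-off: the selected stack is sharper, at the price of discarding most of the collected light.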

Tuesday, Nov 04, 2014, 10:00
Vision and AI, Room 141
Speaker: Rob Fergus
Title: Learning to Discover Efficient Mathematical Identities
NOTE UNUSUAL TIME AND PLACE
Abstract:
In this talk, I will describe how machine learning techniques can be applied to the discovery of efficient mathematical identities. We introduce an attribute grammar framework for representing symbolic expressions. Given a set of grammar rules, we build trees that combine different rules, looking for branches which yield compositions that are analytically equivalent to a target expression but of lower computational complexity. However, as the size of the trees grows exponentially with the complexity of the target expression, brute-force search is impractical for all but the simplest of expressions. Consequently, we explore two learning approaches that are able to learn from simpler expressions to guide the tree search. The first of these is a simple n-gram model, the other being a recursive neural network. We show how these approaches enable us to derive complex identities, beyond the reach of brute-force search or human derivation. Joint work with Wojciech Zaremba and Karol Kurach.