Publications
2023
In modeling vision, there has been remarkable progress in recognizing a range of scene components, but the problem of analyzing full scenes, an ultimate goal of visual perception, is still largely open. To deal with complete scenes, recent work has focused on training models to extract the full graph-like structure of a scene. In contrast with scene graphs, human scene perception focuses on selected structures in the scene, starting with a limited interpretation and evolving sequentially in a goal-directed manner [G. L. Malcolm, I. I. A. Groen, C. I. Baker, Trends Cogn. Sci. 20, 843-856 (2016)]. Guidance is crucial throughout scene interpretation, since the extraction of a full scene representation is often infeasible. Here, we present a model that performs human-like guided scene interpretation, using iterative bottom-up, top-down processing in a "counterstream" structure motivated by cortical circuitry. The process proceeds by the sequential application of top-down instructions that guide the interpretation process. The results show how scene structures of interest to the viewer are extracted by an automatically selected sequence of top-down instructions. The model shows two further benefits. One is an inherent capability to deal well with the problem of combinatorial generalization - generalizing broadly to unseen scene configurations, which is limited in current network models [B. Lake, M. Baroni, 35th International Conference on Machine Learning, ICML 2018 (2018)]. The second is the ability to combine visual with nonvisual information at each cycle of the interpretation process, which is a key aspect for modeling human perception as well as advancing AI vision systems.
Diabetic Retinopathy (DR) is a common complication of diabetes that, in severe cases, can result in blindness. Accurate clinical treatment is imperative to prevent these cases and relies considerably on an exact diagnosis of the various symptoms of DR. We aim to advance DR diagnosis by providing a practical tool to automatically classify Optical Coherence Tomography (OCT) scans for DR and to identify and localize DR-related morphological features within the scans. Our system obtains raw OCT input and only sparse clinical annotations at the volume level, which can be obtained automatically from routine electronic medical records. We developed a novel neural network architecture, OCT-Transformer, that obtains state-of-the-art classification results compared to previous models and does so with limited training data. We base our architecture on an attention mechanism and show this to be the driving factor for the boost in performance. We additionally use our model to locate pixels within the input scans that explain its classification.
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC), which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities, both when training from scratch and when fine-tuning a pre-trained model. Our code and pretrained models are available at: https://github.com/SivanDoveh/TSVLC
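As a rough illustration of the general idea, and not the released TSVLC code, the snippet below shows one possible rule-based text manipulation: swapping a color attribute in an existing caption to create a hard negative caption for contrastive training. The word list and function name are hypothetical.

```python
# Hypothetical sketch: creating an SVLC-style negative caption by swapping one
# attribute word; real pipelines would cover relations, states, etc.
import random

COLOR_WORDS = ["red", "blue", "green", "yellow", "black", "white"]  # assumed list

def negate_caption(caption: str, rng: random.Random) -> str:
    """Swap one color attribute for a different one to create a hard negative text."""
    words = caption.split()
    color_positions = [i for i, w in enumerate(words) if w.lower() in COLOR_WORDS]
    if not color_positions:
        return caption  # no attribute word found; caller may skip this sample
    i = rng.choice(color_positions)
    alternatives = [c for c in COLOR_WORDS if c != words[i].lower()]
    words[i] = rng.choice(alternatives)
    return " ".join(words)

rng = random.Random(0)
print(negate_caption("a red car parked next to a white fence", rng))
```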
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias' - their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these 'compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for fine-tuning and pre-training the VL model: (i) the caption quality, or in other words 'image-alignment', of the texts; and (ii) the 'density' of the captions in the sense of mentioning all the details appearing in the image. We propose a fine-tuning approach for automatically treating these factors, leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate a significant compositional reasoning performance increase of up to ∼27% over the base model, up to ∼20% over the strongest baseline, and of 6.7% on average.
2022
Gaze understanding-a suggested precursor for understanding others' intentions-requires recovery of gaze direction from the observed person's head and eye position. This challenging computation is naturally acquired in infancy without explicit external guidance, but can it be learned later if vision is extremely poor throughout early childhood? We addressed this question by studying gaze following in Ethiopian patients with early bilateral congenital cataracts diagnosed and treated by us only at late childhood. This sight restoration provided a unique opportunity to directly address basic issues on the roles of "nature" and "nurture" in development, as it caused a selective perturbation to the natural process, eliminating some gaze-direction cues while leaving others still available. Following surgery, the patients' visual acuity typically improved substantially, allowing discrimination of pupil position in the eye. Yet, the patients failed to show eye gaze-following effects and fixated less than controls on the eyes-two spontaneous behaviors typically seen in controls. Our model for unsupervised learning of gaze direction explains how head-based gaze following can develop under severe image blur, resembling preoperative conditions. It also suggests why, despite acquiring sufficient resolution to extract eye position, automatic eye gaze following is not established after surgery due to lack of detailed early visual experience. We suggest that visual skills acquired in infancy in an unsupervised manner will be difficult or impossible to acquire when internal guidance is no longer available, even when sufficient image resolution for the task is restored. This creates fundamental barriers to spontaneous vision recovery following prolonged deprivation in early age.
2021
Natural vision is a dynamic and continuous process. Under natural conditions, visual object recognition typically involves continuous interactions between ocular motion and visual contrasts, resulting in dynamic retinal activations. In order to identify the dynamic variables that participate in this process and are relevant for image recognition, we used a set of images that are just above and below the human recognition threshold and whose recognition typically requires >2 s of viewing. We recorded eye movements of participants while attempting to recognize these images within trials lasting 3 s. We then assessed the activation dynamics of retinal ganglion cells resulting from ocular dynamics using a computational model. We found that while the saccadic rate was similar between recognized and unrecognized trials, the fixational ocular speed was significantly larger for unrecognized trials. Interestingly, however, retinal activation level was significantly lower during these unrecognized trials. We used retinal activation patterns and oculomotor parameters of each fixation to train a binary classifier, classifying recognized from unrecognized trials. Only retinal activation patterns could predict recognition, reaching 80% correct classifications on the fourth fixation (on average, ∼2.5 s from trial onset). We thus conclude that the information that is relevant for visual perception is embedded in the dynamic interactions between the oculomotor sequence and the image. Hence, our results suggest that ocular dynamics play an important role in recognition and that understanding the dynamics of retinal activation is crucial for understanding natural vision.
Humans recognize individual faces regardless of variation in the facial view. The view-tuned face neurons in the inferior temporal (IT) cortex are regarded as the neural substrate for view-invariant face recognition. This study approximated visual features encoded by these neurons as combinations of local orientations and colors, originating from natural image fragments. The resultant features reproduced the preference of these neurons for particular facial views. We also found that faces of one identity were separable from the faces of other identities in a space where each axis represented one of these features. These results suggested that view-invariant face representation was established by combining view-sensitive visual features. The face representation with these features suggested that, with respect to view-invariant face representation, the seemingly complex and deeply layered ventral visual pathway can be approximated by a shallow network, comprised of layers of low-level processing for local orientations and colors (V1/V2-level) and layers that detect particular sets of low-level elements derived from natural image fragments (IT-level).
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume an existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Our GbS shows an 8.5% accuracy improvement over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a complementary improvement (above 7%) over the detector-based approaches for WSG.
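The following toy sketch, with assumed details rather than the GbS implementation, illustrates the core data-synthesis step: alpha-blending two arbitrary images so that a segmentation network can later be trained to recover the alpha map from the blend, conditioned on each image's paired text.

```python
# Illustrative sketch (assumptions, not the GbS release): synthesizing a
# 'text -> image-region' training example by alpha-blending two images.
import numpy as np

def blend_pair(img_a: np.ndarray, img_b: np.ndarray, rng: np.random.Generator):
    """Blend two HxWx3 images with a random soft alpha map (toy: H, W divisible by 8).

    A segmentation network would be trained to recover `alpha` from `blended`
    conditioned on img_a's text, and (1 - alpha) conditioned on img_b's text.
    """
    h, w, _ = img_a.shape
    # low-resolution random mask upsampled to a blocky alpha map (a toy choice)
    coarse = rng.random((h // 8, w // 8))
    alpha = np.kron(coarse, np.ones((8, 8)))[:h, :w, None]
    blended = alpha * img_a + (1.0 - alpha) * img_b
    return blended, alpha[..., 0]

rng = np.random.default_rng(0)
a = rng.random((64, 64, 3))
b = rng.random((64, 64, 3))
blended, alpha = blend_pair(a, b, rng)
print(blended.shape, alpha.shape)
```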
2020
Objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. Human recognition in minimal videos is invariably accompanied by full interpretation of the internal components of the video. State-of-the-art deep convolutional networks for dynamic recognition cannot replicate human behavior in these configurations. The gap between human and machine vision demonstrated here is due to critical mechanisms for full spatiotemporal interpretation that are lacking in current computational models.
2019
Visual object recognition is performed effortlessly by humans notwithstanding the fact that it requires a series of complex computations, which are, as yet, not well understood. Here, we tested a novel account of the representations used for visual recognition and their neural correlates using fMRI. The rationale is based on previous research showing that a set of representations, termed "minimal recognizable configurations" (MIRCs), which are computationally derived and have unique psychophysical characteristics, serve as the building blocks of object recognition. We contrasted the BOLD responses elicited by MIRC images, derived from different categories (faces, objects, and places), sub-MIRCs, which are visually similar to MIRCs, but, instead, result in poor recognition and scrambled, unrecognizable images. Stimuli were presented in blocks, and participants indicated yes/no recognition for each image. We confirmed that MIRCs elicited higher recognition performance compared to sub-MIRCs for all three categories. Whereas fMRI activation in early visual cortex for both MIRCs and sub-MIRCs of each category did not differ from that elicited by scrambled images, high-level visual regions exhibited overall greater activation for MIRCs compared to sub-MIRCs or scrambled images. Moreover, MIRCs and sub-MIRCs from each category elicited enhanced activation in corresponding category-selective regions including fusiform face area and occipital face area (faces), lateral occipital cortex (objects), and parahippocampal place area and transverse occipital sulcus (places). These findings reveal the psychological and neural relevance of MIRCs and enable us to make progress in developing a more complete account of object recognition.
Patterns are broad phenomena that relate to biology, chemistry, and physics. The dendritic growth of crystals is the most well-known ice pattern formation process. Tyndall figures are water-melting patterns that occur when ice absorbs light and becomes superheated. Here, we report a previously undescribed ice and water pattern formation process induced by near-infrared irradiation that heats one phase more than the other in a two-phase system. The pattern formed during the irradiation of ice crystals tens of micrometers thick in solution near equilibrium. Dynamic holes and a microchannel labyrinth then formed in specific regions and were characterized by a typical distance between melted points. We concluded that the differential absorption of water and ice was the driving force for the pattern formation. Heating ice by laser absorption might be useful in applications such as the cryopreservation of biological samples.
2018
Rapid developments in the fields of learning and object recognition have been obtained by successfully developing and using methods for learning from a large number of labeled image examples. However, such current methods cannot explain infants' learning of new concepts based on their visual experience, in particular, the ability to learn complex concepts without external guidance, as well as the natural order in which related concepts are acquired. A remarkable example of early visual learning is the category of 'containers' and the notion of 'containment'. Surprisingly, this is one of the earliest spatial relations to be learned, starting already around 3 months of age, and preceding other common relations (e.g., 'support', 'in-between'). In this work we present a model which explains infants' capacity to learn 'containment' and related concepts by 'just looking', together with their empirical developmental trajectory. Learning occurs in the model quickly and without external guidance, relying only on perceptual processes that are present in the first months of life. Instead of labeled training examples, the system provides its own internal supervision to guide the learning process. We show how the detection of so-called 'paradoxical occlusion' provides natural internal supervision, which guides the system to gradually acquire a range of useful containment-related concepts. Similar mechanisms of using implicit internal supervision can have broad application in other cognitive domains as well as artificial intelligence systems, because they alleviate the need for supplying extensive external supervision, and because they can guide the learning process to extract concepts that are meaningful to the observer, even if they are not by themselves obvious or salient in the input.
Despite a large body of research on response properties of neurons in the inferior temporal (IT) cortex, studies to date have not yet produced quantitative feature descriptions that can predict responses to arbitrary objects. This deficit prevents a thorough understanding of object representation in the IT cortex. Here we propose a fragment-based approach for finding quantitative feature descriptions of face neurons in the IT cortex. The development of the proposed method was driven by the assumption that it is possible to recover features from a set of natural image fragments if the set is sufficiently large. To find the feature from the set, we compared object responses predicted from each fragment with the responses of neurons to these objects, and searched for the fragment that revealed the highest correlation with the neural object responses. The prediction of object responses for each fragment was made by normalizing the Euclidean distance between the fragment and each object to the range 0 to 1, such that smaller distances give higher values. The distance was calculated in a space where images were transformed to a local orientation space by a Gabor filter and a local max operation. The method allowed us to find features with a correlation coefficient between predicted and neural responses of 0.68 on average (number of object stimuli, 104) from among 560,000 feature candidates, reliably explaining differential responses among faces as well as a general preference for faces over non-face objects. Furthermore, predicted responses of the resulting features to novel object images were significantly correlated with neural responses to these images. Identification of features comprising specific, moderately complex combinations of local orientations and colors enabled us to predict responses to upright and inverted faces, which provided a possible mechanism of face inversion effects.
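A minimal sketch of the scoring procedure as described, with random placeholder vectors standing in for the Gabor-plus-local-max representation (not the authors' code):

```python
# Sketch under assumptions: predicting object responses from one fragment as a
# normalized distance in feature space, then scoring the fragment by its
# correlation with measured neural responses.
import numpy as np

def predicted_responses(fragment_vec: np.ndarray, object_vecs: np.ndarray) -> np.ndarray:
    """Map Euclidean distances to [0, 1] so that a smaller distance gives a higher value."""
    d = np.linalg.norm(object_vecs - fragment_vec, axis=1)
    return 1.0 - (d - d.min()) / (d.max() - d.min() + 1e-12)

def fragment_score(fragment_vec, object_vecs, neural_responses) -> float:
    """Correlation between predicted and measured responses over the object set."""
    pred = predicted_responses(fragment_vec, object_vecs)
    return float(np.corrcoef(pred, neural_responses)[0, 1])

# Toy data: in the paper the vectors come from Gabor filtering and a local max
# operation; here random vectors stand in for that representation.
rng = np.random.default_rng(0)
object_vecs = rng.random((104, 256))   # 104 object stimuli, 256-dim descriptors
neural = rng.random(104)               # measured responses of one neuron
candidates = rng.random((50, 256))     # candidate fragment descriptors
best = max(range(len(candidates)),
           key=lambda i: fragment_score(candidates[i], object_vecs, neural))
print("best candidate fragment index:", best)
```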
Computational models of vision have advanced in recent years at a rapid rate, rivalling in some areas human-level performance. Much of the progress to date has focused on analysing the visual scene at the object level - the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper level, both 'below' the object level, such as identifying and localizing object parts and sub-parts, and 'above' the object level, such as identifying object relations, and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, their components, properties and inter-relations, a process referred to here as 'image interpretation'. In this paper, we describe recent directions, based on human and computer vision studies, towards human-like image interpretation beyond the reach of current schemes: both below the object level, and at the level of meaningful configurations beyond the recognition of individual objects, in particular interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called 'minimal images', and at both levels recognition depends on combining 'bottom-up' processing, proceeding from low to higher levels of a processing hierarchy, with 'top-down' processing, proceeding from higher to lower stages of visual analysis.
The goal in this work is to model the process of 'full interpretation' of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of 'minimal configurations': these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
Classes in natural images tend to follow long-tail distributions. This is problematic when there are insufficient training examples for rare classes. The effect is emphasized in compound classes, involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts which are readily available. We detect the presence of prominent concepts in images and use them to infer the target labels instead of using visual features directly, combining tools from vision and natural-language processing. We validate our method on the recently introduced HICO dataset, reaching a mAP of 31.54%, and on the Stanford-40 Actions dataset, where the proposed method outperforms the results obtained using direct visual features, reaching an accuracy of 83.12%. Moreover, the method provides for each class a semantically meaningful list of keywords and relevant image regions relating it to its constituent concepts.
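A rough illustration, under assumed details, of the classification step described above: inferring the target label from detected-concept scores through a learned concept-to-class weight matrix, instead of from raw visual features.

```python
# Hypothetical sketch, not the paper's model: concept detections -> class score.
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_classes = 300, 40                 # e.g., a Stanford-40-sized label set
W = rng.normal(size=(n_classes, n_concepts))    # stands in for learned weights

def classify_from_concepts(concept_scores: np.ndarray) -> int:
    """concept_scores: detector confidences for common visual concepts in the image."""
    return int(np.argmax(W @ concept_scores))

scores = rng.random(n_concepts)                 # placeholder detector outputs
print("predicted class:", classify_from_concepts(scores))
```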
2017
Convolutional neural networks have been shown to develop internal representations which correspond closely to semantically meaningful objects and parts, although trained solely on class labels. Class Activation Mapping (CAM) is a recent method that makes it possible to easily highlight the image regions contributing to a network's classification decision. We build upon these two developments to enable a network to re-examine informative image regions, which we term introspection. We propose a weakly-supervised iterative scheme, which shifts its center of attention to increasingly discriminative regions as it progresses, by alternating stages of classification and introspection. We evaluate our method and show its effectiveness over several datasets, where we obtain competitive or state-of-the-art results: on Stanford-40 Actions, we set a new state-of-the-art of 81.74%. On FGVC-Aircraft and the Stanford Dogs dataset, we show consistent improvements over baselines, some of which include significantly more supervision.
2016
Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor pose is similar, and recognition therefore depends on details of the grasp and the object, which may be largely occluded. In this paper we demonstrate how recognition is improved by obtaining precise localization of the action-object and consequently extracting details of the object shape together with the actor-object interaction. To obtain exact localization of the action object and its interaction with the actor, we employ a coarse-to-fine approach which combines semantic segmentation and contextual features in successive stages. We focus on (but are not limited to) face-related actions, a set of actions that includes several currently challenging categories. We present an average relative improvement of 35% over the state-of-the-art and validate through experimentation the effectiveness of our approach.
Discovering the visual features and representations used by the brain to recognize objects is a central problem in the study of vision. Recently, neural network models of visual object recognition, including biological and deep network models, have shown remarkable progress and have begun to rival human performance in some challenging tasks. These models are trained on image examples and learn to extract features and representations and to use them for categorization. It remains unclear, however, whether the representations and learning processes discovered by current models are similar to those used by the human visual system. Here we show, by introducing and using minimal recognizable images, that the human visual system uses features and processes that are not used by current models and that are critical for recognition. We found by psychophysical studies that at the level of minimal recognizable images a minute change in the image can have a drastic effect on recognition, thus identifying features that are critical for the task. Simulations then showed that current models cannot explain this sensitivity to precise feature configurations and, more generally, do not learn to recognize minimal images at a human level. The role of the features shown here is revealed uniquely at the minimal level, where the contribution of each feature is essential. A full understanding of the learning and use of such features will extend our understanding of visual recognition and its cortical mechanisms and will enhance the capacity of computational models to learn from visual experience and to deal with recognition and detailed image interpretation.
2015
Prominent theories of action recognition suggest that during the recognition of actions the physical pattern of the action is associated with only one action interpretation (e.g., a person waving his arm is recognized as waving). In contrast to this view, studies examining the visual categorization of objects show that objects are recognized in multiple ways (e.g., a VW Beetle can be recognized as a car or a beetle) and that categorization performance is based on the visual and motor movement similarity between objects. Here, we studied whether we find evidence for multiple levels of categorization for social interactions (physical interactions with another person, e.g., handshakes). To do so, we compared visual categorization of objects and social interactions (Experiments 1 and 2) in a grouping task and assessed the usefulness of motor and visual cues (Experiments 3, 4, and 5) for object and social interaction categorization. Additionally, we measured recognition performance associated with recognizing objects and social interactions at different categorization levels (Experiment 6). We found that basic-level object categories were associated with a clear recognition advantage compared to subordinate recognition, but basic-level social interaction categories provided only a small recognition advantage. Moreover, basic-level object categories were more strongly associated with similar visual and motor cues than basic-level social interaction categories. The results suggest that the cognitive categories underlying the recognition of objects and social interactions are associated with different performances. These results are in line with the idea that the same action can be associated with several action interpretations (e.g., a person waving his arm can be recognized as waving or greeting).
2014
The laminar location of the cell bodies and terminals of interareal connections determines the hierarchical structural organization of the cortex and has been intensively studied. However, we still have only a rudimentary understanding of the connectional principles of feedforward (FF) and feedback (FB) pathways. Quantitative analysis of retrograde tracers was used to extend the notion that the laminar distribution of neurons interconnecting visual areas provides an index of hierarchical distance (percentage of supragranular labeled neurons [SLN]). We show that: 1) SLN values constrain models of cortical hierarchy, revealing previously unsuspected areal relations; 2) SLN reflects the operation of a combinatorial distance rule acting differentially on sets of connections between areas; 3) Supragranular layers contain highly segregated bottom-up and top-down streams, both of which exhibit point-to-point connectivity. This contrasts with the infragranular layers, which contain diffuse bottom-up and top-down streams; 4) Cell filling of the parent neurons of FF and FB pathways provides further evidence of compartmentalization; 5) FF pathways have higher weights, cross fewer hierarchical levels, and are less numerous than FB pathways. Taken together, the present results suggest that cortical hierarchies are built from supra- and infragranular counterstreams. This compartmentalized dual counterstream organization allows point-to-point connectivity in both bottom-up and top-down directions. J. Comp. Neurol. 522:225-259, 2014.
2013
Object recognition has been a central yet elusive goal of computational vision. For many years, computer performance seemed highly deficient and unable to emulate the basic capabilities of the human recognition system. Over the past decade or so, computer scientists and neuroscientists have developed algorithms and systems-and models of visual cortex-that have come much closer to human performance in visual identification and categorization. In this personal perspective, we discuss the ongoing struggle of visual models to catch up with the visual cortex, identify key reasons for the relatively rapid improvement of artificial systems and models, and identify open problems for computational vision in this domain.
*Winner of the 2013 Marr Prize
2012
Early in development, infants learn to solve visual problems that are highly challenging for current computational methods. We present a model that deals with two fundamental problems in which the gap between computational difficulty and infant learning is particularly striking: learning to recognize hands and learning to recognize gaze direction. The model is shown a stream of natural videos and learns without any supervision to detect human hands by appearance and by context, as well as direction of gaze, in complex natural scenes. The algorithm is guided by an empirically motivated innate mechanism - the detection of "mover" events in dynamic images, which are the events of a moving image region causing a stationary region to move or change after contact. Mover events provide an internal teaching signal, which is shown to be more effective than alternative cues and sufficient for the efficient acquisition of hand and gaze representations. The implications go beyond the specific tasks, by showing how domain-specific "proto concepts" can guide the system to acquire meaningful concepts, which are significant to the observer but statistically inconspicuous in the sensory input.
We present an approach to the detection of parts of highly deformable objects, such as the human body. Instead of using kinematic constraints on relative angles, used by most existing approaches for modeling part-to-part relations, we learn and use special observed 'linking' features that support particular pairwise part configurations. In addition to modeling the appearance of individual parts, the current approach adds modeling of the appearance of part-linking, which is shown to provide useful information. For example, configurations of the lower and upper arms are supported by observing corresponding appearances of the elbow or other relevant features. The proposed model combines the support from all the linking features observed in a test image to infer the most likely joint configuration of all the parts of interest. The approach is trained using images with annotated parts, but no a priori known part connections or connection parameters are assumed, and the linking features are discovered automatically during training. We evaluate the performance of the proposed approach on two challenging human body part detection datasets, and obtain performance comparable, and in some cases superior, to the state-of-the-art. In addition, the generality of the approach is shown by applying it without modification to part detection on datasets of animal parts and of facial fiducial points.
2011
Visual expertise is usually defined as the superior ability to distinguish between exemplars of a homogeneous category. Here, we ask how real-world expertise manifests at basic-level categorization and assess the contribution of stimulus-driven and top-down knowledge-based factors to this manifestation. Car experts and novices categorized computer-selected image fragments of cars, airplanes, and faces. Within each category, the fragments varied in their mutual information (MI), an objective quantifiable measure of feature diagnosticity. Categorization of face and airplane fragments was similar within and between groups, showing better performance with increasing MI levels. Novices categorized car fragments more slowly than face and airplane fragments, while experts categorized car fragments as fast as face and airplane fragments. The experts' advantage with car fragments was similar across MI levels, with similar functions relating RT with MI level for both groups. Accuracy was equal between groups for cars as well as faces and airplanes, but experts' response criteria were biased toward cars. These findings suggest that expertise does not entail only specific perceptual strategies. Rather, at the basic level, expertise manifests as a general processing advantage arguably involving application of top-down mechanisms, such as knowledge and attention, which helps experts to distinguish between object categories.
2010
Existing classification algorithms use a set of training examples to select classification features, which are then used for all future applications of the classifier. A major problem with this approach is the selection of a training set: a small set will result in reduced performance, and a large set will require extensive training. In addition, class appearance may change over time, requiring an adaptive classification system. In this paper, we propose a solution to these basic problems by developing an on-line feature selection method, which continuously modifies and improves the features used for classification based on the examples provided so far. The method is used for learning a new class, and to continuously improve classification performance as new data becomes available. In ongoing learning, examples are continuously presented to the system, and new features arise from these examples. The method continuously measures the value of the selected features using mutual information, and uses these values to efficiently update the set of selected features when new training information becomes available. The problem is challenging because at each stage the training process uses a small subset of the training data. Surprisingly, with sufficient training data the on-line process reaches the same performance as a scheme that has complete access to the entire training data.
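The sketch below, which assumes binary feature detections and a simple count-based estimator (details not taken from the paper), illustrates the kind of online update described: class-conditional counts are accumulated per candidate feature, and the top features by mutual information are re-selected as new labeled examples arrive.

```python
# Minimal sketch of online, mutual-information-based feature selection.
import numpy as np

class OnlineMIFeatureSelector:
    def __init__(self, n_candidates: int, k: int):
        self.k = k
        # counts[f, y, x]: feature f, class label y in {0,1}, feature value x in {0,1}
        self.counts = np.ones((n_candidates, 2, 2))  # Laplace smoothing

    def update(self, feature_values: np.ndarray, label: int):
        """feature_values: binary vector, one detection result per candidate feature."""
        idx = np.arange(len(feature_values))
        self.counts[idx, label, feature_values.astype(int)] += 1

    def mutual_information(self) -> np.ndarray:
        p = self.counts / self.counts.sum(axis=(1, 2), keepdims=True)
        py = p.sum(axis=2, keepdims=True)   # class marginal
        px = p.sum(axis=1, keepdims=True)   # feature marginal
        return (p * np.log(p / (py * px))).sum(axis=(1, 2))

    def selected(self) -> np.ndarray:
        return np.argsort(self.mutual_information())[-self.k:]

sel = OnlineMIFeatureSelector(n_candidates=100, k=5)
rng = np.random.default_rng(0)
for _ in range(500):
    y = int(rng.integers(2))
    # toy stream: only the first 5 candidate features are informative about y
    x = (rng.random(100) < (0.3 + 0.4 * y * (np.arange(100) < 5))).astype(int)
    sel.update(x, y)
print(sel.selected())
```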
This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance.
Detecting an object part relies on two sources of information - the appearance of the part itself, and the context supplied by surrounding parts. In this paper we consider problems in which a target part cannot be recognized reliably using its own appearance, such as detecting low-resolution hands, and must be recognized using the context of surrounding parts. We develop the 'chains model' which can locate parts of interest in a robust and precise manner, even when the surrounding context is highly variable and deformable. In the proposed model, the relation between context features and the target part is modeled in a non-parametric manner using an ensemble of feature chains leading from parts in the context to the detection target. The method uses the configuration of the features in the image directly rather than through fitting an articulated 3-D model of the object. In addition, the chains are composable, meaning that new chains observed in the test image can be composed of sub-chains seen during training. Consequently, the model is capable of handling object poses which are infrequent, even non-existent, during training. We test the approach in different settings, including object parts detection, as well as complete object detection. The results show the advantages of the chains model for detecting and localizing parts of complex deformable objects.
2009
In this letter, we develop and simulate a large-scale network of spiking neurons that approximates the inference computations performed by graphical models. Unlike previous related schemes, which used sum and product operations in either the log or linear domains, the current model uses an inference scheme based on the sum and maximization operations in the log domain. Simulations show that using these operations, a large-scale circuit, which combines populations of spiking neurons as basic building blocks, is capable of finding close approximations to the full mathematical computations performed by graphical models within a few hundred milliseconds. The circuit is general in the sense that it can be wired for any graph structure, it supports multistate variables, and it uses standard leaky integrate-and-fire neuronal units. Following previous work, which proposed relations between graphical models and the large-scale cortical anatomy, we focus on the cortical microcircuitry and propose how anatomical and physiological aspects of the local circuitry may map onto elements of the graphical model implementation. We discuss in particular the roles of three major types of inhibitory neurons (small fast-spiking basket cells, large layer 2/3 basket cells, and double bouquet neurons), subpopulations of strongly interconnected neurons with their unique connectivity patterns in different cortical layers, and the possible role of minicolumns in the realization of the population based maximum operation.
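For reference, the computation that the spiking circuit is described as approximating, max-sum message passing in the log domain, can be written compactly for a small chain-structured model. The sketch below is illustrative only and says nothing about the neuronal implementation.

```python
# Max-sum (Viterbi-style) inference in the log domain on a small chain model.
import numpy as np

def max_sum_chain(log_unary, log_pairwise):
    """MAP assignment for a chain: log_unary is a list of (K,) arrays, one per node;
    log_pairwise is a list of (K, K) arrays between consecutive nodes (assumes >= 2 nodes)."""
    n = len(log_unary)
    msgs, m = [], np.zeros_like(log_unary[0])
    for i in range(n - 1):
        # message into node i+1: maximize over the states of node i, in the log domain
        m = np.max(log_unary[i][:, None] + m[:, None] + log_pairwise[i], axis=0)
        msgs.append(m)
    # backtrack the maximizing assignment from the last node to the first
    assignment = [int(np.argmax(log_unary[-1] + msgs[-1]))]
    for i in range(n - 2, -1, -1):
        prev = msgs[i - 1] if i > 0 else 0.0
        assignment.insert(0, int(np.argmax(log_unary[i] + prev + log_pairwise[i][:, assignment[0]])))
    return assignment

rng = np.random.default_rng(0)
unary = [np.log(rng.random(4)) for _ in range(3)]         # 3 variables, 4 states each
pairwise = [np.log(rng.random((4, 4))) for _ in range(2)]
print(max_sum_chain(unary, pairwise))
```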
In this paper we introduce the concept and method for adaptively tuning the model complexity in an online manner as more examples become available. Challenging classification problems in the visual domain (such as recognizing handwriting, faces and human-body images) often require a large number of training examples, which may become available over a long training period. This motivates the development of scalable and adaptive systems which are able to continue learning at any stage and which can efficiently learn from large amounts of data, in an on-line manner. Previous approaches to on-line learning in visual classification have used a fixed parametric model, and focused on continuously improving the model parameters as more data becomes available. Here we propose a new framework which enables online learning algorithms to adjust the complexity of the learned model to the amount of training data as more examples become available. Since in online learning the training set expands over time, it is natural to allow the learned model to become more complex during the course of learning instead of confining the model to a fixed family of bounded complexity. Formally, we use a set of parametric classifiers y = h_{a,θ}(x), where y is the class and x the observed data. The parameter a controls the complexity of the model family. For a fixed a, the training examples are used for the optimal setting of θ. When the amount of data becomes sufficiently large, the value of a is increased, and a more complex model family is used. For evaluation of the proposed approach, we implement an online Support Vector Machine with increasing complexity, and evaluate it in a task of handwritten character recognition on the MNIST database.
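A toy numerical sketch of the framework, with an assumed model family (polynomial feature maps fit by least squares) standing in for the paper's online SVM: the complexity parameter a grows once the accumulated data can support more parameters, and θ is refit on the data seen so far.

```python
# Toy sketch of growing model complexity online; not the paper's implementation.
import numpy as np

def poly_features(x: np.ndarray, degree: int) -> np.ndarray:
    return np.column_stack([x ** d for d in range(degree + 1)])   # 1, x, x^2, ...

class GrowingComplexityClassifier:
    """Toy h_{a,theta}: 'a' is the polynomial degree, theta the linear weights."""
    def __init__(self, examples_per_parameter: int = 20):
        self.degree, self.theta = 1, None
        self.examples_per_parameter = examples_per_parameter
        self.X, self.y = [], []

    def partial_fit(self, x: float, label: int):
        self.X.append(x)
        self.y.append(1.0 if label == 1 else -1.0)
        # increase the complexity parameter once enough data has accumulated
        if len(self.X) >= self.examples_per_parameter * (self.degree + 2):
            self.degree += 1
        phi = poly_features(np.array(self.X), self.degree)
        self.theta, *_ = np.linalg.lstsq(phi, np.array(self.y), rcond=None)

    def predict(self, x: float) -> int:
        return int((poly_features(np.array([x]), self.degree) @ self.theta)[0] > 0)

clf = GrowingComplexityClassifier()
rng = np.random.default_rng(0)
for _ in range(200):
    x = float(rng.uniform(-1, 1))
    clf.partial_fit(x, int(abs(x) > 0.5))     # a target that no linear rule can fit
print("final degree:", clf.degree, clf.predict(0.9), clf.predict(0.1))
```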
We present a novel approach for measuring image similarity based on the composition of parts. The measure identifies common sub-regions between the images at multiple sizes, and evaluates the amount of deformation required to align the common regions. The scheme allows complex, non-rigid deformation of the images, and penalizes irregular deformations more than coherent shifts of larger sub-parts. The measure is implemented by an algorithm which is a variant of dynamic programming, extended to multiple dimensions, and uses scores measured on a relative scale. The similarity measure is shown to be robust to non-rigid deformations of parts at various positions and scales, and to capture basic characteristics of human similarity judgments.
Class learning, both supervised and unsupervised, requires feature selection, which includes two main components. The first is the selection of a discriminative subset of features from a larger pool. The second is the selection of detection parameters for each feature to optimize classification performance. In this paper we present a method for the discovery of multiple classification features, their detection parameters and their consistent configurations, in the fully unsupervised setting. This is achieved by a global optimization of joint consistency between the features as a function of the detection parameters, without assuming any prior parametric model. We demonstrate how the proposed framework can be applied for learning different types of feature parameters, such as detection thresholds and geometric relations, resulting in the unsupervised discovery of informative configurations of object parts. We test our approach on a wide range of classes and show good results. We also demonstrate how the approach can be used to separate and learn, without supervision, visually similar subclasses of a single category, such as facial views or hand poses. We use the approach to compare various criteria for feature consistency, including Mutual Information, Suspicious Coincidence, L2 and the Jaccard index. Finally, we compare our approach to a parametric consistency optimization technique such as pLSA and show significantly better performance.
2008
We consider the problem of extracting features for multi-class recognition problems. The features are required to make fine distinctions between similar classes, combined with tolerance for distortions and missing information. We define and compare two general approaches, both based on maximizing the delivered information for recognition: one divides the problem into multiple binary classification tasks, while the other uses a single multi-class scheme. The two strategies result in markedly different sets of features, which we apply to face identification and detection. We show that the first produces a sparse set of distinctive features that are specific to an individual face, and are highly tolerant to distortions and missing input. The second produces compact features, each shared by about half of the faces, which perform better in general face detection. The results show the advantage of distinctive features for making fine distinctions in a robust manner. They also show that different features are optimal for recognition tasks at different levels of specificity.
We develop a novel method for class-based feature matching across large changes in viewing conditions. The method is based on the property that when objects share a similar part, the similarity is preserved across viewing conditions. Given a feature and a training set of object images, we first identify the subset of objects that share this feature. The transformation of the feature's appearance across viewing conditions is determined mainly by properties of the feature, rather than of the object in which it is embedded. Therefore, the transformed feature will be shared by approximately the same set of objects. Based on this consistency requirement, corresponding features can be reliably identified from a set of candidate matches. Unlike previous approaches, the proposed scheme compares feature appearances only in similar viewing conditions, rather than across different viewing conditions. As a result, the scheme is not restricted to locally planar objects or affine transformations. The approach also does not require examples of correct matches. We show that by using the proposed method, a dense set of accurate correspondences can be obtained. Experimental comparisons demonstrate that matching accuracy is significantly improved over previous schemes. Finally, we show that the scheme can be successfully used for invariant object recognition.
Recently, we proposed a fundamental subdivision of the human cortex into two complementary networks-an "extrinsic" one which deals with the external environment, and an "intrinsic" one which largely overlaps with the "default mode" system, and deals with internally oriented and endogenous mental processes. Here we tested this hypothesis by contrasting decision making under external and internally-derived conditions. Subjects were presented with an external cue, and were required to either follow an external instruction ("determined" condition) or to ignore it and follow a voluntary decision process ("free-will" condition). Our results show that a well defined component of the intrinsic system-the right inferior parietal cortex-was preferentially activated during the "free-will" condition. Importantly, this activity was significantly higher than the base-line resting state. The results support a self-related role for the intrinsic system and provide clear evidence for both hemispheric and regional specialization in the human intrinsic system.
The human visual system recognizes objects and their constituent parts rapidly and with high accuracy. Standard models of recognition by the visual cortex use feed-forward processing, in which an object's parts are detected before the complete object. However, parts are often ambiguous on their own and require the prior detection and localization of the entire object. We show how a cortical-like hierarchy obtains recognition and localization of objects and parts at multiple levels nearly simultaneously by a single feed-forward sweep from low to high levels of the hierarchy, followed by a feedback sweep from high- to low-level areas.
Object-related areas in the ventral visual system in humans are known from imaging studies to be preferentially activated by object images compared with noise or texture patterns. It is unknown, however, which features of the object images are extracted and represented in these areas. Here we tested the extent to which the representation of visual classes used object fragments selected by maximizing the information delivered about the class. We tested functional magnetic resonance imaging blood oxygenation level-dependent activation of highly informative object features in low- and high-level visual areas, compared with noninformative object features matched for low-level image properties. Activation in V1 was similar, but in the lateral occipital area and in the posterior fusiform gyrus, activation by "informative" fragments was significantly higher for three object classes. Behavioral studies also revealed high correlation between performance and fragments information. The results show that an objective class-information measure can predict classification performance and activation in human object-related areas.
The ablation of afferent input results in the reorganization of sensory and motor cortices. In the primary visual cortex (V1), binocular retinal lesions deprive a corresponding cortical region [lesion projection zone (LPZ)] of visual input. Nevertheless, neurons in the LPZ regain responsiveness by shifting their receptive fields (RFs) outside the retinal lesions; this re-emergence of neural activity is paralleled by the perceptual completion of disrupted visual input in human subjects with retinal damage. To determine whether V1 reorganization can account for perceptual fill-in, we developed a neural network model that simulates the cortical remapping in V1. The model shows that RF shifts mediated by the plexus of spatial- and orientation-dependent horizontal connections in V1 can engender filling-in that is both robust and consistent with psychophysical reports of perceptual completion. Our model suggests that V1 reorganization may underlie perceptual fill-in, and it predicts spatial relationships between the original and remapped RFs that can be tested experimentally. More generally, it provides a general explanation for adaptive functional changes following CNS lesions, based on the recruitment of existing cortical connections that are involved in normal integrative mechanisms.
Current object recognition systems aim at recognizing numerous object classes under limited supervision conditions. This paper provides a benchmark for evaluating progress on this fundamental task. Several methods have recently been proposed to utilize the commonalities between object classes in order to improve generalization accuracy. Such methods can be termed interclass transfer techniques. However, it is currently difficult to assess which of the proposed methods maximally utilizes the shared structure of related classes. In order to facilitate the development, as well as the assessment, of methods for dealing with multiple related classes, a new dataset including images of several hundred mammal classes is provided, together with preliminary results of its use. The images in this dataset are organized into five levels of variability, and their labels include information on the objects' identity, location and pose. From this dataset, a classification benchmark has been derived, requiring fine distinctions between 72 mammal classes. It is then demonstrated that a recognition method which is highly successful on the Caltech101 attains limited accuracy on the current benchmark (36.5%). Since this method does not utilize the shared structure between classes, the question remains as to whether interclass transfer methods can increase accuracy to the level of human performance (90%). We suggest that a labeled benchmark of the type provided, containing a large number of related classes, is crucial for the development and evaluation of classification methods which make efficient use of interclass transfer.
We present a novel method for unsupervised classification, including the discovery of a new category and precise object and part localization. Given a set of unlabelled images, some of which contain an object of an unknown category, with unknown location and unknown size relative to the background, the method automatically identifies the images that contain the objects, localizes them and their parts, and reliably learns their appearance and geometry for subsequent classification. Current unsupervised methods construct classifiers based on a fixed set of initial features. Instead, we propose a new approach which iteratively extracts new features and re-learns the induced classifier, improving class vs. non-class separation at each iteration. We develop two main tools that allow this iterative combined search. The first is a novel star-like model capable of learning a geometric class representation in the unsupervised setting. The second is learning of "part specific features" that are optimized for part detection, and which optimally combine different part appearances discovered in the training examples. These novel aspects lead to precise part localization and to improvement in overall classification performance compared with previous methods. We applied our method to multiple object classes from Caltech-101, UIUC and a sub-classification problem from PASCAL. The obtained results are comparable to state-of-the-art supervised classification techniques and superior to state-of-the-art unsupervised approaches previously applied to the same image sets.
We construct a segmentation scheme that combines top-down with bottom-up processing. In the proposed scheme, segmentation and recognition are intertwined rather than proceeding in a serial manner. The top-down part applies stored knowledge about object shapes acquired through learning, whereas the bottom-up part creates a hierarchy of segmented regions based on uniformity criteria. Beginning with unsegmented training examples of class and non-class images, the algorithm constructs a bank of class-specific fragments and determines their figure-ground segmentation. This bank is then used to segment novel images in a top-down manner: the fragments are first used to recognize images containing class objects, and then to create a complete cover that best approximates these objects. The resulting segmentation is then integrated with bottom-up multi-scale grouping to better delineate the object boundaries. Our experiments, applied to a large set of four classes (horses, pedestrians, cars, faces), demonstrate segmentation results that surpass those achieved by previous top-down or bottom-up schemes. The main novel aspects of this work are the fragment learning phase, which efficiently learns the figure-ground labeling of segmentation fragments, even in training sets with high object and background variability; combining the top-down segmentation with bottom-up criteria to draw on their relative merits; and the use of segmentation to improve recognition.
2007
Computational models suggest that features of intermediate complexity (IC) play a central role in object categorization [Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5, 682-687.]. The critical aspect of these features is the amount of mutual information (MI) they deliver. We examined the relation between MI, human categorization and an electrophysiological response to IC features. Categorization performance correlated with MI level as well as with the amplitude of a posterior temporal potential, peaking around 270 ms. Hence, an objective MI measure predicts human object categorization performance and its underlying neural activity. These results demonstrate that informative IC features serve as categorization features in human vision.
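For reference, the mutual information delivered by a binary fragment-detection feature F about class membership C, the quantity used above to grade IC features, is the standard

$$ I(C;F) = \sum_{c\in\{0,1\}} \sum_{f\in\{0,1\}} P(c,f)\,\log\frac{P(c,f)}{P(c)\,P(f)}, $$

where F indicates whether the fragment is detected in the image and C whether the image contains the category.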
How do we learn to recognize visual categories, such as dogs and cats? Somehow, the brain uses limited variable examples to extract the essential characteristics of new visual categories. Here, I describe an approach to category learning and recognition that is based on recent computational advances. In this approach, objects are represented by a hierarchy of fragments that are extracted during learning from observed examples. The fragments are class-specific features and are selected to deliver a high amount of information for categorization. The same fragments hierarchy is then used for general categorization, individual object recognition and object-parts identification. Recognition is also combined with object segmentation, using stored fragments, to provide a top-down process that delineates object boundaries in complex cluttered scenes. The approach is computationally effective and provides a possible framework for categorization, recognition and segmentation in human vision.
This paper describes the construction and use of a novel representation for the recognition of objects and their parts, the semantic hierarchy. Its advantages include improved classification performance, accurate detection and localization of object parts and sub-parts, and explicitly identifying the different appearances of each object part. The semantic hierarchy algorithm starts by constructing a minimal feature hierarchy and proceeds by adding semantically equivalent representatives to each node, using the entire hierarchy as a context for determining the identity and locations of added features. Part detection is obtained by a bottom-up top-down cycle. Unlike previous approaches, the semantic hierarchy learns to represent the set of possible appearances of object parts at all levels, and their statistical dependencies. The algorithm is fully automatic and is shown experimentally to substantially improve the recognition of objects and their parts.
2006
The chapter describes visual classification by a hierarchy of semantic fragments. In fragment-based classification, objects within a class are represented by common sub-structures selected during training. The chapter describes two extensions to the basic fragment-based scheme. The first extension is the extraction and use of feature hierarchies. We describe a method that automatically constructs complete feature hierarchies from image examples, and show that features constructed hierarchically are significantly more informative and better for classification compared with similar non-hierarchical features. The second extension is the use of so-called semantic fragments to represent object parts. The goal of a semantic fragment is to represent the different possible appearances of a given object part. The visual appearance of such object parts can differ substantially, and therefore traditional image similarity-based methods are inappropriate for the task. We show how the method can automatically learn the part structure of a new domain, identify the main parts, and how their appearance changes across objects in the class. We discuss the implications of these extensions to object classification and recognition.
We describe a general framework for online multiclass learning based on the notion of hypothesis sharing. In our framework sets of classes are associated with hypotheses. Thus, all classes within a given set share the same hypothesis. This framework includes as special cases commonly used constructions for multiclass categorization such as allocating a unique hypothesis for each class and allocating a single common hypothesis for all classes. We generalize the multiclass Perceptron to our framework and derive a unifying mistake bound analysis. Our construction naturally extends to settings where the number of classes is not known in advance but, rather, is revealed along the online learning process. We demonstrate the merits of our approach by comparing it to previous methods on both synthetic and natural datasets.
2005
Cortical maps and feedback connections are ubiquitous features of the visual cerebral cortex. The role of the feedback connections, however, is unclear. This study was aimed at revealing possible organizational relationships between the feedback projections from area V2 and the functional maps of orientation and retinotopy in area V1. Optical imaging of intrinsic signals was combined with cytochrome oxidase histochemistry and connectional anatomy in owl monkeys. Tracer injections were administered at orientation-selective domains in regions of pale and thick cytochrome oxidase stripes adjacent to the border between these stripes. The feedback projections from V2 were found to be more diffuse than the intrinsic horizontal connections within V1, but they nevertheless demonstrated clustering. The clusters of feedback axons projected preferentially to interblob cytochrome oxidase regions. The distribution of preferred orientations of the recipient domains in V1 was broad but appeared biased toward values similar to the preferred orientation of the projecting cells in V2. The global spatial distribution of the feedback projections in V1 was anisotropic. The major axis of anisotropy was systematically parallel to a retinotopic axis in V1 corresponding to the preferred orientation of the cells of origin in V2. We conclude that the feedback connections from V2 to V1 might play a role in enhancing the response in V1 to collinear contour elements.
We develop an object classification method that can learn a novel class from a single training example. In this method, experience with already learned classes is used to facilitate the learning of novel classes. Our classification scheme employs features that discriminate between class and non-class images. For a novel class, new features are derived by selecting features that proved useful for already learned classification tasks, and adapting these features to the new classification task. This adaptation is performed by replacing the features from already learned classes with similar features taken from the novel class. A single example of a novel class is sufficient to perform feature adaptation and achieve useful classification performance. Experiments demonstrate that the proposed algorithm can learn a novel class from a single training example, using 10 additional familiar classes. The performance is significantly improved compared to using no feature adaptation. The robustness of the proposed feature adaptation concept is demonstrated by similar performance gains across 107 widely varying object categories.
We describe a novel technique for identifying semantically equivalent parts in images belonging to the same object class (e.g. eyes, license plates, aircraft wings, etc.). The visual appearance of such object parts can differ substantially, and therefore traditional image similarity-based methods are inappropriate for this task. The technique we propose is based on the use of common context. We first retrieve context fragments, which consistently appear together with a given input fragment in a stable geometric relation. We then use the context fragments in new images to infer the most likely position of equivalent parts. Given a set of image examples of objects in a class, the method can automatically learn the part structure of the domain - identify the main parts, and how their appearance changes across objects in the class. Two applications of the proposed algorithm are shown: the detection and identification of object parts and object recognition.
The paper describes a method for automatically extracting informative feature hierarchies for object classification, and shows the advantage of the features constructed hierarchically over previous methods. The extraction process proceeds in a top-down manner: informative top-level fragments are extracted first, and by a repeated application of the same feature extraction process the classification fragments are broken down successively into their own optimal components. The hierarchical decomposition terminates with atomic features that cannot be usefully decomposed into simpler features. The entire hierarchy, the different features and sub-features, and their optimal parameters, are learned during a training phase using training examples. Experimental comparisons show that these feature hierarchies are significantly more informative and better for classification compared with similar non-hierarchical features as well as previous methods for using feature hierarchies.
2004
BACKGROUND AND OBJECTIVE: To examine a new high-resolution kinetic mapping method for scotoma in age-related macular degeneration. PATIENTS AND METHODS: A computer-based program for kinetic visual field mapping was tested in 10 healthy subjects and 14 patients with age-related macular degeneration and fixed preferred retinal locus. The stimulus was presented using a back projector on a screen located 40 cm from the subject. The findings were then compared with static results. RESULTS: Control group mapping revealed good congruency with the anatomic blind spot. Mapping of the 14 patients with age-related macular degeneration was rapid and revealed good accuracy. The average deviation of the mapping border from the anatomic scotoma border was no more than 3.1% of the scotoma radius. Static mapping of 7 of the patients with age-related macular degeneration was longer and revealed lower accuracy. CONCLUSIONS: The proposed method is more rapid, accurate, and consistent than static mapping. It allows accurate mapping of central scotoma with suprathreshold stimulus, and may be used in the future for detecting the early stages of age-related macular degeneration using subthreshold stimulus.
In performing recognition, the visual system shows a remarkable capacity to distinguish between significant and immaterial image changes, to learn from examples to recognize new classes of objects, and to generalize from known to novel objects. Here we focus on one aspect of this problem, the ability to recognize novel objects from different viewing directions. This problem of view-invariant recognition is difficult because the image of an object seen from a novel viewing direction can be substantially different from all previously seen images of the same object. We describe an approach to view-invariant recognition that uses extended features to generalize across changes in viewing directions. Extended features are equivalence classes of informative image fragments, which represent object parts under different viewing conditions. This representation is extracted during learning from images of moving objects, and it allows the visual system to generalize from a single view of a novel object, and to compensate for large changes in the viewing direction, without using three-dimensional information. We describe the model, its implementation and performance on natural face images, compare it to alternative approaches, discuss its biological plausibility, and its extension to other aspects of visual recognition. The results of the study suggest that the capacity of the recognition system to generalize to novel conditions in an efficient and flexible manner depends on the ongoing extraction of different families of informative features, acquired for different tasks and different object classes.
We describe a new approach for learning to perform class-based segmentation using only unsegmented training examples. As in previous methods, we first use training images to extract fragments that contain common object parts. We then show how these parts can be segmented into their figure and ground regions in an automatic learning process. This is in contrast with previous approaches, which required complete manual segmentation of the objects in the training examples. The figure-ground learning combines top-down and bottom-up processes and proceeds in two stages, an initial approximation followed by iterative refinement. The initial approximation produces figure-ground labeling of individual image fragments using the unsegmented training images. It is based on the fact that on average, points inside the object are covered by more fragments than points outside it. The initial labeling is then improved by an iterative refinement process, which converges in up to three steps. At each step, the figure-ground labeling of individual fragments produces a segmentation of complete objects in the training images, which in turn induces a refined figure-ground labeling of the individual fragments. In this manner, we obtain a scheme that starts from unsegmented training images, learns the figure-ground labeling of image fragments, and then uses this labeling to segment novel images. Our experiments demonstrate that the learned segmentation achieves the same level of accuracy as methods using manual segmentation of training images, producing an automatic and robust top-down segmentation.
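The initial approximation step can be illustrated with a minimal numpy sketch: pixels covered by many detected fragments are labeled figure, the rest ground. The input format and the average-coverage threshold are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def initial_figure_ground(fragment_masks, image_shape):
    """Sketch of the initial labeling: pixels covered by many detected fragments
    are taken as figure, the rest as ground. `fragment_masks` is assumed to be a
    list of boolean arrays of shape `image_shape`, one per fragment detection."""
    coverage = np.zeros(image_shape, dtype=float)
    for mask in fragment_masks:
        coverage += mask
    return coverage > coverage.mean()   # True = figure, False = ground (illustrative threshold)
```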
We develop a novel approach to view-invariant recognition and apply it to the task of recognizing face images under widely separated viewing directions. Our main contribution is a novel object representation scheme using 'extended fragments' that enables us to achieve a high level of recognition performance and generalization across a wide range of viewing conditions. Extended fragments are equivalence classes of image fragments that represent informative object parts under different viewing conditions. They are extracted automatically from short video sequences during learning. Using this representation, the scheme is unique in its ability to generalize from a single view of a novel object and compensate for a significant change in viewing direction without using 3D information. As a result, novel objects can be recognized from viewing directions from which they were not seen in the past. Experiments demonstrate that the scheme achieves significantly better generalization and recognition performance than previously used methods.
2003
In this study we examined the perception of one- and two-dimensional patterns across central retinal scotomas, caused by age-related macular degeneration. In contrast with previous studies of disrupted visual input that used the blind spot and artificial scotomas, the current study used large central scotomas caused by physical retinal damage. Such damage is associated with atrophy and long-term cortical reorganization, and it was therefore unclear whether perceptual completion in the damaged system will be similar to that reported for artificial scotomas and the blind spot. In addition, the scotomas under study were much larger and more central than artificial scotomas for which perceptual completion has been reported. For 1-D line and grating patterns, we found perceptual completion across large central scotomas (up to a radius of 7°), which is significantly beyond the range of perceptual completion in artificial scotomas. Grating completion was better than that of a single line, and increased with bar density. The use of central scotomas allowed us to test the completion of 2-D patterns that are difficult to study in peripheral vision. We found completion of two-dimensional dot arrays over large regions that improved with pattern density and regularity. The results show that in the physically damaged system the range of perceptual completion is increased compared with artificial scotomas, they strongly support the view of an active filling-in process rather than simply ignoring the damaged location, and they show that perceptual completion of physical scotomas is likely to involve cortical processing at multiple levels. We finally discuss implications of the results for the possible use of image enhancement techniques to facilitate the perception of low-vision individuals.
In this paper we show that efficient object recognition can be obtained by combining informative features with linear classification. The results demonstrate the superiority of informative class-specific features, as compared with generic type features such as wavelets, for the task of object recognition. We show that information rich features can reach optimal performance with simple linear separation rules, while generic, feature based classifiers require more complex classification schemes. This is significant because efficient and optimal methods have been developed for spaces that allow linear separation. To compare different strategies for feature extraction, we trained and compared classifiers working in feature spaces of the same low dimensionality, using two feature types (image fragments vs. wavelets) and two classification rules (linear hyperplane and a Bayesian Network). The results show that by maximizing the individual information of the features, it is possible to obtain efficient classification by a simple linear separating rule, as well as more efficient learning.
2002
The human visual system analyzes shapes and objects in a series of stages in which stimulus features of increasing complexity are extracted and analyzed. The first stages use simple local features, and the image is subsequently represented in terms of larger and more complex features. These include features of intermediate complexity and partial object views. The nature and use of these higher order representations remains an open question in the study of visual processing by the primate cortex. Here we show that intermediate complexity (IC) features are optimal for the basic visual task of classification. Moderately complex features are more informative for classification than very simple or very complex ones, and so they emerge naturally by the simple coding principle of information maximization with respect to a class of images. Our findings suggest a specific role for IC features in visual processing and a principle for their extraction.
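The information-maximization criterion can be sketched in a few lines of Python, assuming binary fragment-detection features and binary class labels; the empirical plug-in estimator and the 0/1 encoding are illustrative, not the paper's exact procedure.

```python
import numpy as np

def fragment_class_mutual_information(detected, label):
    """Estimate the mutual information (in bits) between a binary fragment-detection
    feature and a binary class label from empirical co-occurrence frequencies.
    Candidate fragments would be ranked by this score; intermediate-complexity
    fragments tend to score highest."""
    detected = np.asarray(detected, dtype=int)
    label = np.asarray(label, dtype=int)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            p_fc = np.mean((detected == f) & (label == c))
            if p_fc > 0:
                p_f = np.mean(detected == f)
                p_c = np.mean(label == c)
                mi += p_fc * np.log2(p_fc / (p_f * p_c))
    return mi
```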
In this paper we present a novel class-based segmentation method, which is guided by a stored representation of the shape of objects within a general class (such as horse images). The approach is different from bottom-up segmentation methods that primarily use the continuity of grey-level, texture, and bounding contours. We show that the method leads to markedly improved segmentation results and can deal with significant variation in shape and varying backgrounds. We discuss the relative merits of class-specific and general image-based segmentation methods and suggest how they can be usefully combined.
Object related areas in the human ventral stream were previously shown to be activated, in a shape-selective manner, by luminance, motion, and texture cues. We report on the preferential activation of these areas by stereo cues defining shape. To assess the relationship of this activation to object recognition, we employed a perceptual stereo effect, which profoundly affects object recognition. The stimuli consisted of stereo-defined line drawings of objects that either protruded in front of a flat background ("front"), or were sunk into the background ("back"). Despite the similarity in the local feature structure of the two conditions, object recognition was superior in the "front" compared to the "back" configuration. We measured both recognition rates and fMRI signal from the human visual cortex while subjects viewed these stimuli. The results reveal shape selective activation from images of objects defined purely by stereoscopic cues in the human ventral stream. Furthermore, they show a significant correlation between recognition and fMRI signal in the object-related occipito-temporal cortex (lateral occipital complex).
2001
2000
The tasks of visual object recognition and classification are natural and effortless for biological visual systems, but exceedingly difficult to replicate in computer vision systems. This difficulty arises from the large variability in images of different objects within a class, and variability in viewing conditions. In this paper we describe a fragment-based method for object classification. In this approach objects within a class are represented in terms of common image fragments that are used as building blocks for representing a large variety of different objects that belong to a common class, such as a face or a car. Optimal fragments are selected from a training set of images based on a criterion of maximizing the mutual information of the fragments and the class they represent. For the purpose of classification the fragments are also organized into types, where each type is a collection of alternative fragments, such as different hairline or eye regions for face classification. During classification, the algorithm detects fragments of the different types, and then combines the evidence for the detected fragments to reach a final decision. The algorithm verifies the proper arrangement of the fragments and the consistency of the viewing conditions primarily by the conjunction of overlapping fragments. The method is different from previous part-based methods in using class-specific overlapping object fragments of varying complexity, and in verifying the consistent arrangement of the fragments primarily by the conjunction of overlapping detected fragments. Experimental results on the detection of face and car views show that the fragment-based approach can generalize well to completely novel image views within a class while maintaining low mis-classification error rates. We briefly discuss relationships between the proposed method and properties of parts of the primate visual system involved in object perception.
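The decision stage can be illustrated with a simple weighted-evidence rule; the per-type weights and the threshold are assumptions standing in for the paper's evidence-combination scheme, and fragment detection and consistency checks are not modeled.

```python
def classify_from_fragment_types(type_detected, type_weights, threshold):
    """Illustrative decision rule: each fragment type (e.g. a hairline or eye region)
    contributes its weight when at least one fragment of that type is detected;
    the image is accepted as a class member when the summed evidence exceeds a
    threshold."""
    score = sum(w for hit, w in zip(type_detected, type_weights) if hit)
    return score > threshold
```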
1999
A fundamental capacity of the perceptual systems and the brain in general is to deal with the novel and the unexpected. In vision, we can effortlessly recognize a familiar object under novel viewing conditions, or recognize a new object as a member of a familiar class, such as a house, a face, or a car. This ability to generalize and deal efficiently with novel stimuli has long been considered a challenging example of brain-like computation that proved extremely difficult to replicate in artificial systems. In this paper we present an approach to generalization and invariant recognition. We focus our discussion on the problem of invariance to position in the visual field, but also sketch how similar principles could apply to other domains. The approach is based on the use of a large repertoire of partial generalizations that are built upon past experience. In the case of shift invariance, visual patterns are described as the conjunction of multiple overlapping image fragments. The invariance to the more primitive fragments is built into the system by past experience. Shift invariance of complex shapes is obtained from the invariance of their constituent fragments. We study by simulations aspects of this shift invariance method and then consider its extensions to invariant perception and classification by brain-like structures.
1998
Visual object recognition is complicated by the fact that the same 3D object can give rise to a large variety of projected images that depend on the viewing conditions, such as viewing direction, distance, and illumination. This paper describes a computational approach that uses combinations of a small number of object views to deal with the effects of viewing direction. The first part of the paper is an overview of the approach based on previous work. It is then shown that, in agreement with psychophysical evidence, the view-combinations approach can use views of different class members rather than multiple views of a single object, to obtain class-based generalization. A number of extensions to the basic scheme are considered, including the use of non-linear combinations, using 3D versus 2D information, and the role of coarse classification on the way to precise identification. Finally, psychophysical and biological aspects of the view-combination approach are discussed. Compared with approaches that treat object recognition as a symbolic high-level activity, in the view-combination approach the emphasis is on processes that are simpler and pictorial in nature.
A major problem in object recognition is that a novel image of a given object can be different from all previously seen images. Images can vary considerably due to changes in viewing conditions such as viewing position and illumination. In this paper we distinguish between three types of recognition schemes by the level at which generalization to novel images takes place: universal, class, and model-based. The first is applicable equally to all objects, the second to a class of objects, and the third uses known properties of individual objects. We derive theoretical limitations on each of the three generalization levels. For the universal level, previous results have shown that no invariance can be obtained. Here we show that this limitation holds even when the assumptions made on the objects and the recognition functions are relaxed. We also extend the results to changes of illumination direction. For the class level, previous studies presented specific examples of classes of objects for which functions invariant to viewpoint exist. Here, we distinguish between classes that admit such invariance and classes that do not. We demonstrate that there is a tradeoff between the set of objects that can be discriminated by a given recognition function and the set of images from which the recognition function can recognize these objects. Furthermore, we demonstrate that although functions that are invariant to illumination direction do not exist at the universal level, when the objects are restricted to belong to a given class, an invariant function to illumination direction can be defined. A general conclusion of this study is that class-based processing, that has not been used extensively in the past, is often advantageous for dealing with variations due to viewpoint and illuminant changes.
A method is presented for class-based recognition using a small number of example views taken under several different viewing conditions. The main emphasis is on using a small number of examples. Previous work assumed that the set of examples is sufficient to span the entire space of possible objects, and that in generalizing to new viewing conditions a sufficient number of previous examples under the new conditions will be available to the recognition system. Here we have considerably relaxed these assumptions and consequently obtained good class-based generalization from a small number of examples, even a single example view, for both viewing position and illumination changes. In addition, previous class-based approaches only focused on viewing position changes and did not deal with illumination changes. Here we used a class-based approach that can generalize for both illumination and viewing position changes. The method was applied to face and car model images. New views under viewing position and illumination changes were synthesized from a small number of examples.
1997
A face recognition system must recognize a face from a novel image despite the variations between images of the same face. A common approach to overcoming image variations because of changes in the illumination conditions is to use image representations that are relatively insensitive to these variations. Examples of such representations are edge maps, image intensity derivatives, and images convolved with 2D Gabor-like filters. Here we present an empirical study that evaluates the sensitivity of these representations to changes in illumination, as well as viewpoint and facial expression. Our findings indicated that none of the representations considered is sufficient by itself to overcome image variations because of a change in the direction of illumination. Similar results were obtained for changes due to viewpoint and expression. Image representations that emphasized the horizontal features were found to be less sensitive to changes in the direction of illumination. However, systems based only on such representations failed to recognize up to 20 percent of the faces in our database. Humans performed considerably better under the same conditions. We discuss possible reasons for this superiority and alternative methods for overcoming illumination effects in recognition.
We describe an approach to object recognition in which the image-to-model match is based on stochastic optimization. During the recognition process, an internal model is matched with a novel object view. To compensate for changes in viewing conditions (such as illumination and viewing direction), the model is controlled by a number of parameters. The matching is obtained by seeking a setting of the parameters that minimizes the discrepancy between the image and the model. The search is performed in our examples in a six-dimensional space with multiple local minima. We developed an efficient minimization method based on the stochastic optimization approach (Mockus 1989). The search is bidirectional (applied to both the model and the image) and avoids the difficult problem of establishing image-to-model correspondence. It proceeds by evolving a population of candidate solutions using simple generation rules, based on the autocorrelation of the search space. We describe the method, its application to objects in several domains (cars, faces, printed symbols), and experimental comparisons with alternative methods, such as simulated annealing.
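The shape of the search problem can be conveyed with a deliberately simplified stand-in: plain random search over the six model parameters, rather than the population-based Bayesian stochastic optimization of Mockus (1989) used in the paper. The parameter bounds and the discrepancy callable are assumptions.

```python
import numpy as np

def fit_model_by_random_search(discrepancy, bounds, n_samples=2000, seed=0):
    """Much-simplified sketch: sample candidate settings of the six model parameters
    uniformly within bounds and keep the setting with the lowest image-model
    discrepancy. This only illustrates the search problem, not the paper's method."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)          # shape (6, 2): [low, high] per parameter
    best_theta, best_val = None, np.inf
    for _ in range(n_samples):
        theta = rng.uniform(bounds[:, 0], bounds[:, 1])
        val = discrepancy(theta)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```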
1996
In recognizing objects and scenes, partial recognition of objects or their parts can be used to guide the recognition of other objects. Here, the role of individual objects in the recognition of complete figures and the influence of contextual information on the identification of ambiguous objects were investigated. Configurations of objects that were placed in either proper or improper spatial relations were used, and response times and error rates in a recognition task were measured. Two main results were obtained. First, proper spatial relations among the objects of a scene decrease response times and error rates in the recognition of individual objects. Second, the presence of objects that have a unique interpretation improves the identification of ambiguous objects in the scene. Ambiguous objects were recognized faster and with fewer errors in the presence of clearly recognized objects compared with the same objects in isolation or in improper spatial relations. The implications of these findings for the organization of recognition memory are discussed.
An image of a face depends not only on its shape, but also on the viewpoint, illumination conditions, and facial expression. A face recognition system must overcome the changes in face appearance induced by these factors. Two related questions were investigated: the capacity of the human visual system to generalize the recognition of faces to novel images, and the level at which this generalization occurs. This problem was approached by comparing the identification and generalization capacity for upright and inverted faces. For upright faces, remarkably good generalization to novel conditions was found. For inverted faces, the generalization to novel views was significantly worse for both new illumination and viewpoint, although the performance on the training images was similar to that on the upright condition. The results indicate that at least some of the processes that support generalization across viewpoint and illumination are neither universal (because subjects did not generalize as easily for inverted faces as for upright ones) nor strictly object specific (because in upright faces nearly perfect generalization was possible from a single view, by itself insufficient for building a complete object-specific model). It is proposed that generalization in face recognition occurs at an intermediate level that is applicable to a class of objects, and that at this level upright and inverted faces initially constitute distinct object classes.
1995
A computational model is proposed for some general aspects of information flow in the visual cortex. The basic process, called "sequence seeking," is a search for a sequence of mappings, or transformations, linking source and target patterns. The process has two main characteristics: it is bidirectional, bottom-up as well as top-down, and it explores in parallel a large number of alternative sequences. This operation is performed in a "counter streams" structure, in which multiple sequences are explored along two complementary pathways, an ascending and a descending one, seeking to meet. A biological embodiment of this model in cortical circuitry is proposed. The model serves to account for known aspects of cortical interconnections and to derive new predictions.
1993
This paper examines the recognition of rigid objects bounded by smooth surfaces, using an alignment approach. The projected image of such an object changes during rotation in a manner that is generally difficult to predict. An approach to this problem is suggested, using the 3D surface curvature at the points along the silhouette. The curvature information requires a single number for each point along the object's silhouette, the radial curvature at the point. We have implemented this method and tested it on images of complex 3D objects. Models of the viewed objects were acquired using three images of each object. The implemented scheme was found to give accurate predictions of the objects' appearances for large transformations. Using this method, a small number of (viewer-centered) models can be used to predict the new appearance of an object from any given viewpoint.
1992
Limitations of Non-Model-Based Recognition Schemes
Approaches to visual object recognition can be divided into model-based and non-model-based schemes. In this paper we establish some limitations on non-model-based recognition schemes. We show that a consistent non-model-based recognition scheme for general objects cannot discriminate between objects. The same result holds even if the recognition function is imperfect, and is allowed to mis-identify each object from a substantial fraction of the viewing directions. We then consider recognition schemes restricted to classes of objects. We define the notion of the discrimination power of a consistent recognition function for a class of objects. The function's discrimination power determines the set of objects that can be discriminated by the recognition function. We show how the properties of a class of objects determine an upper bound on the discrimination power of any consistent recognition function for that class.
This paper discusses two problems related to three-dimensional object recognition. The first is segmentation and the selection of a candidate object in the image, the second is the recognition of a three-dimensional object from different viewing positions. Regarding segmentation, it is shown how globally salient structures can be extracted from a contour image based on geometrical attributes, including smoothness and contour length. This computation is performed by a parallel network of locally connected neuron-like elements. With respect to the effect of viewing, it is shown how the problem can be overcome by using the linear combinations of a small number of two-dimensional object views. In both problems the emphasis is on methods that are relatively low level in nature. Segmentation is performed using a bottom-up process, driven by the geometry of image contours. Recognition is performed without using explicit three-dimensional models, but by the direct manipulation of two-dimensional images.
1991
Visual object recognition requires the matching of an image with a set of models stored in memory. In this paper, we propose an approach to recognition in which a 3-D object is represented by the linear combination of 2-D images of the object. If M = {M_1, ..., M_k} is the set of pictures representing a given object and P is the 2-D image of an object to be recognized, then P is considered to be an instance of M if P = Σ_{i=1}^{k} α_i M_i for some constants α_i. We show that this approach handles correctly rigid 3-D transformations of objects with sharp as well as smooth boundaries and can also handle nonrigid transformations. The paper is divided into two parts. In the first part, we show that the variety of views depicting the same object under different transformations can often be expressed as the linear combinations of a small number of views. In the second part, we suggest how this linear combination property may be used in the recognition process.
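To make the linear-combination test concrete, the following numpy sketch fits the coefficients α by least squares and reports the residual; a small residual supports P being an instance of the model. The assumption that the views are given as corresponding point or feature vectors (already in correspondence) is illustrative.

```python
import numpy as np

def linear_combination_match(model_views, novel_view):
    """Sketch of the recognition test: stack the stored 2-D views as columns of a
    matrix M and check how well the novel view P is explained as M @ alpha.
    The coefficients are found by least squares."""
    M = np.column_stack([np.ravel(v) for v in model_views])
    p = np.ravel(novel_view)
    alpha, *_ = np.linalg.lstsq(M, p, rcond=None)
    residual = float(np.linalg.norm(M @ alpha - p))
    return alpha, residual
```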
Human ability to detect 3-D structure in an array of 2-D moving dots was tested. Under limited exposure time, we found high detection rates only when the 2-D motion was restricted to the spatio-temporal region of short-range motion. Long-range moving dots failed to produce a strong impression of 3-D structure and yielded only weak detection rates. This result is consistent with the view that the processing of long-range motion is more serial than that of short-range motion.
1990
We describe a new approach to the visual recognition of cursive handwriting. An effort is made to attain human-like performance by using a method based on pictorial alignment and on a model of the process of handwriting. The alignment approach permits recognition of character instances that appear embedded in connected strings. A system embodying this approach has been implemented and tested on five different word sets. The performance was stable both across words and across writers. The system exhibited a substantial ability to interpret cursive connected strings without recourse to lexical knowledge.
The human visual system can make remarkably precise spatial judgements. There are reasons to believe that this accuracy is achieved and maintained by using processes that calibrate and correct errors in the system. This work investigates this problem of self-calibration and describes an adaptive system for detecting the collinearity of points and the straightness of lines. The system is initially inaccurate, but, by using an error correction mechanism, it eventually becomes highly accurate. The error correction is performed by a simple self-calibration process named proportional multi-gain adjustment. The calibration process adjusts the gain values of the system's input units. The process utilizes statistical regularities in the input stimuli. It compensates for errors due to noise in the input units' receptive field locations and response functions by ensuring that the average deviation from collinearity detected by the system is zero. As a by-product of the error correction, the system also exhibits adaptation and aftereffect phenomena, similar to those observed in the human visual system.
1989
This paper examines the problem of shape-based object recognition, and proposes a new approach, the alignment of pictorial descriptions. The first part of the paper reviews general approaches to visual object recognition, and divides these approaches into three broad classes: invariant properties methods, object decomposition methods, and alignment methods. The second part presents the alignment method. In this approach the recognition process is divided into two stages. The first determines the transformation in space that is necessary to bring the viewed object into alignment with possible object models. This stage can proceed on the basis of minimal information, such as the object's dominant orientation, or a small number of corresponding feature points in the object and model. The second stage determines the model that best matches the viewed object. At this stage, the search is over all the possible object models, but not over their possible views, since the transformation has already been determined uniquely in the alignment stage. The proposed alignment method also uses abstract descriptions, but, unlike structural description methods, it uses them pictorially rather than in symbolic structural descriptions.
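The alignment stage can be illustrated with a small least-squares fit, assuming a few corresponding feature points between model and image are available; the restriction to a 2-D affine transform is an illustrative simplification of the transformations considered in the paper.

```python
import numpy as np

def estimate_alignment(model_points, image_points):
    """Sketch of the alignment stage: fit a 2-D affine transform x -> A @ x + t by
    least squares from corresponding feature points. The aligned model can then be
    compared with the viewed object in the second (model-selection) stage."""
    X = np.asarray(model_points, dtype=float)         # shape (n, 2)
    Y = np.asarray(image_points, dtype=float)         # shape (n, 2)
    H = np.hstack([X, np.ones((len(X), 1))])          # homogeneous model coordinates
    params, *_ = np.linalg.lstsq(H, Y, rcond=None)    # (3, 2) block: [A^T; t]
    A, t = params[:2].T, params[2]
    return A, t
```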
1988
1987
Apparent motion was used to explore humans' ability to perceive the direction of motion in the visual field. A marked qualitative difference in this ability was found between short- and long-range motion. For short-range motion, the detection of the direction of motion is characterized by parallel operation over a wide visual field (that is, detection performance is independent of the number of objects in an array). When the positional displacement is large relative to an object's size, the direction of motion is detected in a serial manner. The process of detection is limited in this case by the ability to detect other events, such as appearance and disappearance of an object, and the ability to compute their spatio-temporal relations. The results are consistent with a previously suggested division of the motion detection system into short- and long-range processes. The direction of short-range motion can be perceived in parallel (preattentively), whereas long-range motion is attentive and requires more complicated computations. It seems that the detection of long-range motion is a conjunction task, combining the detection of disappearance and appearance.
1986
Experiments by Schiller et al. have suggested that non-directional edge-specific simple cells are constructed from two directionally selective subunits with opposite preferred direction. This hierarchical notion was based on the fact that the responses of such units to edges moving in opposite directions are spatially displaced with respect to each other. An alternative explanation of the observed response separation is the delay between the responses of the center and surround mechanisms at the retinal level. Measurements of the response separation as a function of stimulus speed support this explanation and argue against the hierarchical notion of Schiller et al.
A theory of early visual information processing proposed by Marr and co-workers suggests that simple cortical cells may be involved in the detection of zero crossings in the retinal output. We have tested this theory by using pairs of adjacent edges (staircase stimuli) and recording from edge-specific simple cells in cat striate cortex. For such stimuli, the zero-crossing hypothesis gives rise to non-obvious predictions that were generally confirmed by the experiment.
1985
Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicuous location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.
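The main selection rule can be sketched as a winner-take-all over a conspicuity map followed by inhibition of the selected location, which produces the automatic shift to the next most conspicuous location; the map, inhibition radius, and number of shifts are illustrative, and the proximity and similarity preferences are not modeled.

```python
import numpy as np

def shift_attention(conspicuity_map, n_shifts=3, inhibition_radius=2):
    """Sketch of the winner-take-all selection with inhibition of return: pick the
    most conspicuous location, suppress a small neighborhood around it, and repeat,
    yielding a sequence of attended locations."""
    salience = np.array(conspicuity_map, dtype=float)
    rows, cols = np.ogrid[:salience.shape[0], :salience.shape[1]]
    visited = []
    for _ in range(n_shifts):
        r0, c0 = np.unravel_index(np.argmax(salience), salience.shape)
        visited.append((r0, c0))
        inhibit = (rows - r0) ** 2 + (cols - c0) ** 2 <= inhibition_radius ** 2
        salience[inhibit] = -np.inf                    # inhibition of the selected location
    return visited
```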