Publications
2023
In modeling vision, there has been remarkable progress in recognizing a range of scene components, but the problem of analyzing full scenes, an ultimate goal of visual perception, is still largely open. To deal with complete scenes, recent work has focused on training models to extract the full graph-like structure of a scene. In contrast with scene graphs, human scene perception focuses on selected structures in the scene, starting with a limited interpretation and evolving sequentially in a goal-directed manner [G. L. Malcolm, I. I. A. Groen, C. I. Baker, Trends Cogn. Sci. 20, 843-856 (2016)]. Guidance is crucial throughout scene interpretation, since the extraction of a full scene representation is often infeasible. Here, we present a model that performs human-like guided scene interpretation, using iterative bottom-up, top-down processing in a "counterstream" structure motivated by cortical circuitry. The process proceeds by the sequential application of top-down instructions that guide the interpretation process. The results show how scene structures of interest to the viewer are extracted by an automatically selected sequence of top-down instructions. The model shows two further benefits. One is an inherent capability to deal well with the problem of combinatorial generalization - generalizing broadly to unseen scene configurations, which is limited in current network models [B. Lake, M. Baroni, 35th International Conference on Machine Learning, ICML 2018 (2018)]. The second is the ability to combine visual with nonvisual information at each cycle of the interpretation process, which is a key aspect for modeling human perception as well as advancing AI vision systems.
Diabetic Retinopathy (DR) is a common complication of diabetes that, in severe cases, can result in blindness. Accurate clinical treatment is imperative to prevent these cases and relies considerably on an exact diagnosis of the various symptoms of DR. We aim to advance DR diagnosis by providing a practical tool to automatically classify Optical Coherence Tomography (OCT) scans for DR and to identify and localize DR-related morphological features within the scans. Our system obtains raw OCT input and only sparse clinical annotations at the volume level, which can be obtained automatically from routine electronic medical records. We developed a novel neural network architecture, OCT-Transformer, that obtains state-of-the-art classification results compared to previous models and does so with limited training data. We base our architecture on an attention mechanism and show this to be the driving factor for the boost in performance. We additionally use our model to locate pixels within the input scans that explain its classification.
Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC), which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities, both when training from scratch and when fine-tuning a pre-trained model. Our code and pretrained models are available at: https://github.com/SivanDoveh/TSVLC
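As a rough illustration of the general idea, and not the released TSVLC code, the snippet below shows one possible rule-based text manipulation: swapping a color attribute in an existing caption to create a hard negative caption for contrastive training. The word list and function name are hypothetical.

```python
# Hypothetical sketch: creating an SVLC-style negative caption by swapping one
# attribute word; real pipelines would cover relations, states, etc.
import random

COLOR_WORDS = ["red", "blue", "green", "yellow", "black", "white"]  # assumed list

def negate_caption(caption: str, rng: random.Random) -> str:
    """Swap one color attribute for a different one to create a hard negative text."""
    words = caption.split()
    color_positions = [i for i, w in enumerate(words) if w.lower() in COLOR_WORDS]
    if not color_positions:
        return caption  # no attribute word found; caller may skip this sample
    i = rng.choice(color_positions)
    alternatives = [c for c in COLOR_WORDS if c != words[i].lower()]
    words[i] = rng.choice(alternatives)
    return " ".join(words)

rng = random.Random(0)
print(negate_caption("a red car parked next to a white fence", rng))
```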
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text, leading to numerous applications such as cross-modal retrieval, visual question answering, captioning, and more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias' - their representations behave as 'bags of nouns', mostly ignoring or downsizing the attributes, relations, and states of objects described/appearing in texts/images. Although some great attempts at fixing these 'compositional reasoning' issues were proposed in the recent literature, the problem is still far from being solved. In this paper, we uncover two factors limiting the VL models' compositional reasoning performance. These two factors are properties of the paired VL dataset used for fine-tuning and pre-training the VL model: (i) the caption quality, or in other words 'image-alignment', of the texts; and (ii) the 'density' of the captions in the sense of mentioning all the details appearing in the image. We propose a fine-tuning approach for automatically treating these factors, leveraging a standard VL dataset (CC3M). Applied to CLIP, we demonstrate a significant compositional reasoning performance increase of up to ∼27% over the base model, up to ∼20% over the strongest baseline, and of 6.7% on average.
2022
Gaze understanding-a suggested precursor for understanding others' intentions-requires recovery of gaze direction from the observed person's head and eye position. This challenging computation is naturally acquired in infancy without explicit external guidance, but can it be learned later if vision is extremely poor throughout early childhood? We addressed this question by studying gaze following in Ethiopian patients with early bilateral congenital cataracts diagnosed and treated by us only at late childhood. This sight restoration provided a unique opportunity to directly address basic issues on the roles of "nature" and "nurture" in development, as it caused a selective perturbation to the natural process, eliminating some gaze-direction cues while leaving others still available. Following surgery, the patients' visual acuity typically improved substantially, allowing discrimination of pupil position in the eye. Yet, the patients failed to show eye gaze-following effects and fixated less than controls on the eyes-two spontaneous behaviors typically seen in controls. Our model for unsupervised learning of gaze direction explains how head-based gaze following can develop under severe image blur, resembling preoperative conditions. It also suggests why, despite acquiring sufficient resolution to extract eye position, automatic eye gaze following is not established after surgery due to lack of detailed early visual experience. We suggest that visual skills acquired in infancy in an unsupervised manner will be difficult or impossible to acquire when internal guidance is no longer available, even when sufficient image resolution for the task is restored. This creates fundamental barriers to spontaneous vision recovery following prolonged deprivation in early age.
2021
Natural vision is a dynamic and continuous process. Under natural conditions, visual object recognition typically involves continuous interactions between ocular motion and visual contrasts, resulting in dynamic retinal activations. In order to identify the dynamic variables that participate in this process and are relevant for image recognition, we used a set of images that are just above and below the human recognition threshold and whose recognition typically requires >2 s of viewing. We recorded eye movements of participants while attempting to recognize these images within trials lasting 3 s. We then assessed the activation dynamics of retinal ganglion cells resulting from ocular dynamics using a computational model. We found that while the saccadic rate was similar between recognized and unrecognized trials, the fixational ocular speed was significantly larger for unrecognized trials. Interestingly, however, retinal activation level was significantly lower during these unrecognized trials. We used retinal activation patterns and oculomotor parameters of each fixation to train a binary classifier, classifying recognized from unrecognized trials. Only retinal activation patterns could predict recognition, reaching 80% correct classifications on the fourth fixation (on average, ∼2.5 s from trial onset). We thus conclude that the information that is relevant for visual perception is embedded in the dynamic interactions between the oculomotor sequence and the image. Hence, our results suggest that ocular dynamics play an important role in recognition and that understanding the dynamics of retinal activation is crucial for understanding natural vision.
Humans recognize individual faces regardless of variation in the facial view. The view-tuned face neurons in the inferior temporal (IT) cortex are regarded as the neural substrate for view-invariant face recognition. This study approximated visual features encoded by these neurons as combinations of local orientations and colors, originating from natural image fragments. The resultant features reproduced the preference of these neurons for particular facial views. We also found that faces of one identity were separable from the faces of other identities in a space where each axis represented one of these features. These results suggested that view-invariant face representation was established by combining view-sensitive visual features. The face representation with these features suggested that, with respect to view-invariant face representation, the seemingly complex and deeply layered ventral visual pathway can be approximated by a shallow network, comprised of layers of low-level processing for local orientations and colors (V1/V2-level) and layers that detect particular sets of low-level elements derived from natural image fragments (IT-level).
Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume an existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing 'text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Our GbS shows an 8.5% accuracy improvement over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a complementary improvement (above 7%) over the detector-based approaches for WSG.
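The following toy sketch, with assumed details rather than the GbS implementation, illustrates the core data-synthesis step: alpha-blending two arbitrary images so that a segmentation network can later be trained to recover the alpha map from the blend, conditioned on each image's paired text.

```python
# Illustrative sketch (assumptions, not the GbS release): synthesizing a
# 'text -> image-region' training example by alpha-blending two images.
import numpy as np

def blend_pair(img_a: np.ndarray, img_b: np.ndarray, rng: np.random.Generator):
    """Blend two HxWx3 images with a random soft alpha map (toy: H, W divisible by 8).

    A segmentation network would be trained to recover `alpha` from `blended`
    conditioned on img_a's text, and (1 - alpha) conditioned on img_b's text.
    """
    h, w, _ = img_a.shape
    # low-resolution random mask upsampled to a blocky alpha map (a toy choice)
    coarse = rng.random((h // 8, w // 8))
    alpha = np.kron(coarse, np.ones((8, 8)))[:h, :w, None]
    blended = alpha * img_a + (1.0 - alpha) * img_b
    return blended, alpha[..., 0]

rng = np.random.default_rng(0)
a = rng.random((64, 64, 3))
b = rng.random((64, 64, 3))
blended, alpha = blend_pair(a, b, rng)
print(blended.shape, alpha.shape)
```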
2020
Objects and their parts can be visually recognized from purely spatial or purely temporal information but the mechanisms integrating space and time are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: these are short and tiny video clips in which objects, parts, and actions can be reliably recognized, but any reduction in either space or time makes them unrecognizable. Human recognition in minimal videos is invariably accompanied by full interpretation of the internal components of the video. State-of-the-art deep convolutional networks for dynamic recognition cannot replicate human behavior in these configurations. The gap between human and machine vision demonstrated here is due to critical mechanisms for full spatiotemporal interpretation that are lacking in current computational models.
2019
Visual object recognition is performed effortlessly by humans notwithstanding the fact that it requires a series of complex computations, which are, as yet, not well understood. Here, we tested a novel account of the representations used for visual recognition and their neural correlates using fMRI. The rationale is based on previous research showing that a set of representations, termed "minimal recognizable configurations" (MIRCs), which are computationally derived and have unique psychophysical characteristics, serve as the building blocks of object recognition. We contrasted the BOLD responses elicited by MIRC images, derived from different categories (faces, objects, and places), sub-MIRCs, which are visually similar to MIRCs, but, instead, result in poor recognition and scrambled, unrecognizable images. Stimuli were presented in blocks, and participants indicated yes/no recognition for each image. We confirmed that MIRCs elicited higher recognition performance compared to sub-MIRCs for all three categories. Whereas fMRI activation in early visual cortex for both MIRCs and sub-MIRCs of each category did not differ from that elicited by scrambled images, high-level visual regions exhibited overall greater activation for MIRCs compared to sub-MIRCs or scrambled images. Moreover, MIRCs and sub-MIRCs from each category elicited enhanced activation in corresponding category-selective regions including fusiform face area and occipital face area (faces), lateral occipital cortex (objects), and parahippocampal place area and transverse occipital sulcus (places). These findings reveal the psychological and neural relevance of MIRCs and enable us to make progress in developing a more complete account of object recognition.
Patterns are broad phenomena that relate to biology, chemistry, and physics. The dendritic growth of crystals is the most well-known ice pattern formation process. Tyndall figures are water-melting patterns that occur when ice absorbs light and becomes superheated. Here, we report a previously undescribed ice and water pattern formation process induced by near-infrared irradiation that heats one phase more than the other in a two-phase system. The pattern formed during the irradiation of ice crystals tens of micrometers thick in solution near equilibrium. Dynamic holes and a microchannel labyrinth then formed in specific regions and were characterized by a typical distance between melted points. We concluded that the differential absorption of water and ice was the driving force for the pattern formation. Heating ice by laser absorption might be useful in applications such as the cryopreservation of biological samples.
2018
Rapid developments in the fields of learning and object recognition have been obtained by successfully developing and using methods for learning from a large number of labeled image examples. However, such current methods cannot explain infants' learning of new concepts based on their visual experience, in particular, the ability to learn complex concepts without external guidance, as well as the natural order in which related concepts are acquired. A remarkable example of early visual learning is the category of 'containers' and the notion of 'containment'. Surprisingly, this is one of the earliest spatial relations to be learned, starting already around 3 months of age, and preceding other common relations (e.g., 'support', 'in-between'). In this work we present a model which explains infants' capacity to learn 'containment' and related concepts by 'just looking', together with their empirical developmental trajectory. Learning occurs in the model quickly and without external guidance, relying only on perceptual processes that are present in the first months of life. Instead of labeled training examples, the system provides its own internal supervision to guide the learning process. We show how the detection of so-called 'paradoxical occlusion' provides natural internal supervision, which guides the system to gradually acquire a range of useful containment-related concepts. Similar mechanisms of using implicit internal supervision can have broad application in other cognitive domains as well as artificial intelligence systems, because they alleviate the need for supplying extensive external supervision, and because they can guide the learning process to extract concepts that are meaningful to the observer, even if they are not by themselves obvious or salient in the input.
Despite a large body of research on response properties of neurons in the inferior temporal (IT) cortex, studies to date have not yet produced quantitative feature descriptions that can predict responses to arbitrary objects. This deficit prevents a thorough understanding of object representation in the IT cortex. Here we propose a fragment-based approach for finding quantitative feature descriptions of face neurons in the IT cortex. The development of the proposed method was driven by the assumption that it is possible to recover features from a set of natural image fragments if the set is sufficiently large. To find the feature from the set, we compared object responses predicted from each fragment with the responses of neurons to these objects, and searched for the fragment that revealed the highest correlation with the neural object responses. The prediction of object responses for each fragment was made by normalizing the Euclidean distance between the fragment and each object to the range 0 to 1, such that smaller distances give higher values. The distance was calculated in a space where images were transformed to a local orientation space by a Gabor filter and a local max operation. The method allowed us to find features with a correlation coefficient between predicted and neural responses of 0.68 on average (number of object stimuli, 104) from among 560,000 feature candidates, reliably explaining differential responses among faces as well as a general preference for faces over non-face objects. Furthermore, predicted responses of the resulting features to novel object images were significantly correlated with neural responses to these images. Identification of features comprising specific, moderately complex combinations of local orientations and colors enabled us to predict responses to upright and inverted faces, which provided a possible mechanism of face inversion effects.
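A minimal sketch of the scoring procedure as described, with random placeholder vectors standing in for the Gabor-plus-local-max representation (not the authors' code):

```python
# Sketch under assumptions: predicting object responses from one fragment as a
# normalized distance in feature space, then scoring the fragment by its
# correlation with measured neural responses.
import numpy as np

def predicted_responses(fragment_vec: np.ndarray, object_vecs: np.ndarray) -> np.ndarray:
    """Map Euclidean distances to [0, 1] so that a smaller distance gives a higher value."""
    d = np.linalg.norm(object_vecs - fragment_vec, axis=1)
    return 1.0 - (d - d.min()) / (d.max() - d.min() + 1e-12)

def fragment_score(fragment_vec, object_vecs, neural_responses) -> float:
    """Correlation between predicted and measured responses over the object set."""
    pred = predicted_responses(fragment_vec, object_vecs)
    return float(np.corrcoef(pred, neural_responses)[0, 1])

# Toy data: in the paper the vectors come from Gabor filtering and a local max
# operation; here random vectors stand in for that representation.
rng = np.random.default_rng(0)
object_vecs = rng.random((104, 256))   # 104 object stimuli, 256-dim descriptors
neural = rng.random(104)               # measured responses of one neuron
candidates = rng.random((50, 256))     # candidate fragment descriptors
best = max(range(len(candidates)),
           key=lambda i: fragment_score(candidates[i], object_vecs, neural))
print("best candidate fragment index:", best)
```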
Computational models of vision have advanced in recent years at a rapid rate, rivalling in some areas human-level performance. Much of the progress to date has focused on analysing the visual scene at the object level - the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper level, both 'below' the object level, such as identifying and localizing object parts and sub-parts, and 'above' the object level, such as identifying object relations, and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, their components, properties and inter-relations, a process referred to here as 'image interpretation'. In this paper, we describe recent directions, based on human and computer vision studies, towards human-like image interpretation beyond the reach of current schemes: both below the object level, and at the level of meaningful configurations beyond the recognition of individual objects, in particular interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called 'minimal images', and at both levels recognition depends on combining 'bottom-up' processing, proceeding from low to higher levels of a processing hierarchy, with 'top-down' processing, proceeding from higher to lower stages of visual analysis.
The goal in this work is to model the process of 'full interpretation' of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small and the variability of possible configurations is low. We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of 'minimal configurations': these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation to difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
Classes in natural images tend to follow long-tail distributions. This is problematic when there are insufficient training examples for rare classes. The effect is emphasized in compound classes, involving the conjunction of several concepts, such as those appearing in action-recognition datasets. In this paper, we propose to address this issue by learning how to utilize common visual concepts which are readily available. We detect the presence of prominent concepts in images and use them to infer the target labels instead of using visual features directly, combining tools from vision and natural-language processing. We validate our method on the recently introduced HICO dataset, reaching a mAP of 31.54%, and on the Stanford-40 Actions dataset, where the proposed method outperforms the results obtained using direct visual features, reaching an accuracy of 83.12%. Moreover, the method provides for each class a semantically meaningful list of keywords and relevant image regions relating it to its constituent concepts.
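A rough illustration, under assumed details, of the classification step described above: inferring the target label from detected-concept scores through a learned concept-to-class weight matrix, instead of from raw visual features.

```python
# Hypothetical sketch, not the paper's model: concept detections -> class score.
import numpy as np

rng = np.random.default_rng(0)
n_concepts, n_classes = 300, 40                 # e.g., a Stanford-40-sized label set
W = rng.normal(size=(n_classes, n_concepts))    # stands in for learned weights

def classify_from_concepts(concept_scores: np.ndarray) -> int:
    """concept_scores: detector confidences for common visual concepts in the image."""
    return int(np.argmax(W @ concept_scores))

scores = rng.random(n_concepts)                 # placeholder detector outputs
print("predicted class:", classify_from_concepts(scores))
```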
2017
Convolutional neural networks have been shown to develop internal representations which correspond closely to semantically meaningful objects and parts, although trained solely on class labels. Class Activation Mapping (CAM) is a recent method that makes it possible to easily highlight the image regions contributing to a network's classification decision. We build upon these two developments to enable a network to re-examine informative image regions, which we term introspection. We propose a weakly-supervised iterative scheme, which shifts its center of attention to increasingly discriminative regions as it progresses, by alternating stages of classification and introspection. We evaluate our method and show its effectiveness over several datasets, where we obtain competitive or state-of-the-art results: on Stanford-40 Actions, we set a new state-of-the-art of 81.74%. On FGVC-Aircraft and the Stanford Dogs dataset, we show consistent improvements over baselines, some of which include significantly more supervision.
2016
Action recognition in still images has seen major improvement in recent years due to advances in human pose estimation, object recognition and stronger feature representations produced by deep neural networks. However, there are still many cases in which performance remains far from that of humans. A major difficulty arises in distinguishing between transitive actions in which the overall actor pose is similar, and recognition therefore depends on details of the grasp and the object, which may be largely occluded. In this paper we demonstrate how recognition is improved by obtaining precise localization of the action-object and consequently extracting details of the object shape together with the actor-object interaction. To obtain exact localization of the action object and its interaction with the actor, we employ a coarse-to-fine approach which combines semantic segmentation and contextual features in successive stages. We focus on (but are not limited to) face-related actions, a set of actions that includes several currently challenging categories. We present an average relative improvement of 35% over the state-of-the-art and validate through experimentation the effectiveness of our approach.
Discovering the visual features and representations used by the brain to recognize objects is a central problem in the study of vision. Recently, neural network models of visual object recognition, including biological and deep network models, have shown remarkable progress and have begun to rival human performance in some challenging tasks. These models are trained on image examples and learn to extract features and representations and to use them for categorization. It remains unclear, however, whether the representations and learning processes discovered by current models are similar to those used by the human visual system. Here we show, by introducing and using minimal recognizable images, that the human visual system uses features and processes that are not used by current models and that are critical for recognition. We found by psychophysical studies that at the level of minimal recognizable images a minute change in the image can have a drastic effect on recognition, thus identifying features that are critical for the task. Simulations then showed that current models cannot explain this sensitivity to precise feature configurations and, more generally, do not learn to recognize minimal images at a human level. The role of the features shown here is revealed uniquely at the minimal level, where the contribution of each feature is essential. A full understanding of the learning and use of such features will extend our understanding of visual recognition and its cortical mechanisms and will enhance the capacity of computational models to learn from visual experience and to deal with recognition and detailed image interpretation.
2015
Prominent theories of action recognition suggest that during the recognition of actions the physical pattern of the action is associated with only one action interpretation (e.g., a person waving his arm is recognized as waving). In contrast to this view, studies examining the visual categorization of objects show that objects are recognized in multiple ways (e.g., a VW Beetle can be recognized as a car or a beetle) and that categorization performance is based on the visual and motor movement similarity between objects. Here, we studied whether we find evidence for multiple levels of categorization for social interactions (physical interactions with another person, e.g., handshakes). To do so, we compared visual categorization of objects and social interactions (Experiments 1 and 2) in a grouping task and assessed the usefulness of motor and visual cues (Experiments 3, 4, and 5) for object and social interaction categorization. Additionally, we measured recognition performance associated with recognizing objects and social interactions at different categorization levels (Experiment 6). We found that basic-level object categories were associated with a clear recognition advantage compared to subordinate recognition, but basic-level social interaction categories provided only a small recognition advantage. Moreover, basic-level object categories were more strongly associated with similar visual and motor cues than basic-level social interaction categories. The results suggest that the cognitive categories underlying the recognition of objects and social interactions are associated with different performances. These results are in line with the idea that the same action can be associated with several action interpretations (e.g., a person waving his arm can be recognized as waving or greeting).
2014
The laminar location of the cell bodies and terminals of interareal connections determines the hierarchical structural organization of the cortex and has been intensively studied. However, we still have only a rudimentary understanding of the connectional principles of feedforward (FF) and feedback (FB) pathways. Quantitative analysis of retrograde tracers was used to extend the notion that the laminar distribution of neurons interconnecting visual areas provides an index of hierarchical distance (percentage of supragranular labeled neurons [SLN]). We show that: 1) SLN values constrain models of cortical hierarchy, revealing previously unsuspected areal relations; 2) SLN reflects the operation of a combinatorial distance rule acting differentially on sets of connections between areas; 3) Supragranular layers contain highly segregated bottom-up and top-down streams, both of which exhibit point-to-point connectivity. This contrasts with the infragranular layers, which contain diffuse bottom-up and top-down streams; 4) Cell filling of the parent neurons of FF and FB pathways provides further evidence of compartmentalization; 5) FF pathways have higher weights, cross fewer hierarchical levels, and are less numerous than FB pathways. Taken together, the present results suggest that cortical hierarchies are built from supra- and infragranular counterstreams. This compartmentalized dual counterstream organization allows point-to-point connectivity in both bottom-up and top-down directions. J. Comp. Neurol. 522:225-259, 2014.
2013
Object recognition has been a central yet elusive goal of computational vision. For many years, computer performance seemed highly deficient and unable to emulate the basic capabilities of the human recognition system. Over the past decade or so, computer scientists and neuroscientists have developed algorithms and systems-and models of visual cortex-that have come much closer to human performance in visual identification and categorization. In this personal perspective, we discuss the ongoing struggle of visual models to catch up with the visual cortex, identify key reasons for the relatively rapid improvement of artificial systems and models, and identify open problems for computational vision in this domain.
*Winner of the 2013 Marr Prize
2012
Early in development, infants learn to solve visual problems that are highly challenging for current computational methods. We present a model that deals with two fundamental problems in which the gap between computational difficulty and infant learning is particularly striking: learning to recognize hands and learning to recognize gaze direction. The model is shown a stream of natural videos and learns without any supervision to detect human hands by appearance and by context, as well as direction of gaze, in complex natural scenes. The algorithm is guided by an empirically motivated innate mechanism - the detection of "mover" events in dynamic images, which are the events of a moving image region causing a stationary region to move or change after contact. Mover events provide an internal teaching signal, which is shown to be more effective than alternative cues and sufficient for the efficient acquisition of hand and gaze representations. The implications go beyond the specific tasks, by showing how domain-specific "proto concepts" can guide the system to acquire meaningful concepts, which are significant to the observer but statistically inconspicuous in the sensory input.
We present an approach to the detection of parts of highly deformable objects, such as the human body. Instead of using kinematic constraints on relative angles, used by most existing approaches for modeling part-to-part relations, we learn and use special observed 'linking' features that support particular pairwise part configurations. In addition to modeling the appearance of individual parts, the current approach adds modeling of the appearance of part-linking, which is shown to provide useful information. For example, configurations of the lower and upper arms are supported by observing corresponding appearances of the elbow or other relevant features. The proposed model combines the support from all the linking features observed in a test image to infer the most likely joint configuration of all the parts of interest. The approach is trained using images with annotated parts, but no a priori known part connections or connection parameters are assumed, and the linking features are discovered automatically during training. We evaluate the performance of the proposed approach on two challenging human body part detection datasets, and obtain performance comparable, and in some cases superior, to the state-of-the-art. In addition, the generality of the approach is shown by applying it without modification to part detection on datasets of animal parts and of facial fiducial points.
2011
Visual expertise is usually defined as the superior ability to distinguish between exemplars of a homogeneous category. Here, we ask how real-world expertise manifests at basic-level categorization and assess the contribution of stimulus-driven and top-down knowledge-based factors to this manifestation. Car experts and novices categorized computer-selected image fragments of cars, airplanes, and faces. Within each category, the fragments varied in their mutual information (MI), an objective quantifiable measure of feature diagnosticity. Categorization of face and airplane fragments was similar within and between groups, showing better performance with increasing MI levels. Novices categorized car fragments more slowly than face and airplane fragments, while experts categorized car fragments as fast as face and airplane fragments. The experts' advantage with car fragments was similar across MI levels, with similar functions relating RT with MI level for both groups. Accuracy was equal between groups for cars as well as faces and airplanes, but experts' response criteria were biased toward cars. These findings suggest that expertise does not entail only specific perceptual strategies. Rather, at the basic level, expertise manifests as a general processing advantage arguably involving application of top-down mechanisms, such as knowledge and attention, which helps experts to distinguish between object categories.
2010
Existing classification algorithms use a set of training examples to select classification features, which are then used for all future applications of the classifier. A major problem with this approach is the selection of a training set: a small set will result in reduced performance, and a large set will require extensive training. In addition, class appearance may change over time, requiring an adaptive classification system. In this paper, we propose a solution to these basic problems by developing an on-line feature selection method, which continuously modifies and improves the features used for classification based on the examples provided so far. The method is used for learning a new class, and to continuously improve classification performance as new data becomes available. In ongoing learning, examples are continuously presented to the system, and new features arise from these examples. The method continuously measures the value of the selected features using mutual information, and uses these values to efficiently update the set of selected features when new training information becomes available. The problem is challenging because at each stage the training process uses a small subset of the training data. Surprisingly, with sufficient training data the on-line process reaches the same performance as a scheme that has complete access to the entire training data.
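The sketch below, which assumes binary feature detections and a simple count-based estimator (details not taken from the paper), illustrates the kind of online update described: class-conditional counts are accumulated per candidate feature, and the top features by mutual information are re-selected as new labeled examples arrive.

```python
# Minimal sketch of online, mutual-information-based feature selection.
import numpy as np

class OnlineMIFeatureSelector:
    def __init__(self, n_candidates: int, k: int):
        self.k = k
        # counts[f, y, x]: feature f, class label y in {0,1}, feature value x in {0,1}
        self.counts = np.ones((n_candidates, 2, 2))  # Laplace smoothing

    def update(self, feature_values: np.ndarray, label: int):
        """feature_values: binary vector, one detection result per candidate feature."""
        idx = np.arange(len(feature_values))
        self.counts[idx, label, feature_values.astype(int)] += 1

    def mutual_information(self) -> np.ndarray:
        p = self.counts / self.counts.sum(axis=(1, 2), keepdims=True)
        py = p.sum(axis=2, keepdims=True)   # class marginal
        px = p.sum(axis=1, keepdims=True)   # feature marginal
        return (p * np.log(p / (py * px))).sum(axis=(1, 2))

    def selected(self) -> np.ndarray:
        return np.argsort(self.mutual_information())[-self.k:]

sel = OnlineMIFeatureSelector(n_candidates=100, k=5)
rng = np.random.default_rng(0)
for _ in range(500):
    y = int(rng.integers(2))
    # toy stream: only the first 5 candidate features are informative about y
    x = (rng.random(100) < (0.3 + 0.4 * y * (np.arange(100) < 5))).astype(int)
    sel.update(x, y)
print(sel.selected())
```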
This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance.
Detecting an object part relies on two sources of information - the appearance of the part itself, and the context supplied by surrounding parts. In this paper we consider problems in which a target part cannot be recognized reliably using its own appearance, such as detecting low-resolution hands, and must be recognized using the context of surrounding parts. We develop the 'chains model' which can locate parts of interest in a robust and precise manner, even when the surrounding context is highly variable and deformable. In the proposed model, the relation between context features and the target part is modeled in a non-parametric manner using an ensemble of feature chains leading from parts in the context to the detection target. The method uses the configuration of the features in the image directly rather than through fitting an articulated 3-D model of the object. In addition, the chains are composable, meaning that new chains observed in the test image can be composed of sub-chains seen during training. Consequently, the model is capable of handling object poses which are infrequent, even non-existent, during training. We test the approach in different settings, including object parts detection, as well as complete object detection. The results show the advantages of the chains model for detecting and localizing parts of complex deformable objects.
2009
In this letter, we develop and simulate a large-scale network of spiking neurons that approximates the inference computations performed by graphical models. Unlike previous related schemes, which used sum and product operations in either the log or linear domains, the current model uses an inference scheme based on the sum and maximization operations in the log domain. Simulations show that using these operations, a large-scale circuit, which combines populations of spiking neurons as basic building blocks, is capable of finding close approximations to the full mathematical computations performed by graphical models within a few hundred milliseconds. The circuit is general in the sense that it can be wired for any graph structure, it supports multistate variables, and it uses standard leaky integrate-and-fire neuronal units. Following previous work, which proposed relations between graphical models and the large-scale cortical anatomy, we focus on the cortical microcircuitry and propose how anatomical and physiological aspects of the local circuitry may map onto elements of the graphical model implementation. We discuss in particular the roles of three major types of inhibitory neurons (small fast-spiking basket cells, large layer 2/3 basket cells, and double bouquet neurons), subpopulations of strongly interconnected neurons with their unique connectivity patterns in different cortical layers, and the possible role of minicolumns in the realization of the population based maximum operation.
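For reference, the computation that the spiking circuit is described as approximating, max-sum message passing in the log domain, can be written compactly for a small chain-structured model. The sketch below is illustrative only and says nothing about the neuronal implementation.

```python
# Max-sum (Viterbi-style) inference in the log domain on a small chain model.
import numpy as np

def max_sum_chain(log_unary, log_pairwise):
    """MAP assignment for a chain: log_unary is a list of (K,) arrays, one per node;
    log_pairwise is a list of (K, K) arrays between consecutive nodes (assumes >= 2 nodes)."""
    n = len(log_unary)
    msgs, m = [], np.zeros_like(log_unary[0])
    for i in range(n - 1):
        # message into node i+1: maximize over the states of node i, in the log domain
        m = np.max(log_unary[i][:, None] + m[:, None] + log_pairwise[i], axis=0)
        msgs.append(m)
    # backtrack the maximizing assignment from the last node to the first
    assignment = [int(np.argmax(log_unary[-1] + msgs[-1]))]
    for i in range(n - 2, -1, -1):
        prev = msgs[i - 1] if i > 0 else 0.0
        assignment.insert(0, int(np.argmax(log_unary[i] + prev + log_pairwise[i][:, assignment[0]])))
    return assignment

rng = np.random.default_rng(0)
unary = [np.log(rng.random(4)) for _ in range(3)]         # 3 variables, 4 states each
pairwise = [np.log(rng.random((4, 4))) for _ in range(2)]
print(max_sum_chain(unary, pairwise))
```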
In this paper we introduce the concept and method for adaptively tuning the model complexity in an online manner as more examples become available. Challenging classification problems in the visual domain (such as recognizing handwriting, faces and human-body images) often require a large number of training examples, which may become available over a long training period. This motivates the development of scalable and adaptive systems which are able to continue learning at any stage and which can efficiently learn from large amounts of data, in an on-line manner. Previous approaches to on-line learning in visual classification have used a fixed parametric model, and focused on continuously improving the model parameters as more data becomes available. Here we propose a new framework which enables online learning algorithms to adjust the complexity of the learned model to the amount of training data as more examples become available. Since in online learning the training set expands over time, it is natural to allow the learned model to become more complex during the course of learning instead of confining the model to a fixed family of bounded complexity. Formally, we use a set of parametric classifiers y = h_{a,θ}(x), where y is the class and x the observed data. The parameter a controls the complexity of the model family. For a fixed a, the training examples are used for the optimal setting of θ. When the amount of data becomes sufficiently large, the value of a is increased, and a more complex model family is used. For evaluation of the proposed approach, we implement an online Support Vector Machine with increasing complexity, and evaluate it in a task of handwritten character recognition on the MNIST database.
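A toy numerical sketch of the framework, with an assumed model family (polynomial feature maps fit by least squares) standing in for the paper's online SVM: the complexity parameter a grows once the accumulated data can support more parameters, and θ is refit on the data seen so far.

```python
# Toy sketch of growing model complexity online; not the paper's implementation.
import numpy as np

def poly_features(x: np.ndarray, degree: int) -> np.ndarray:
    return np.column_stack([x ** d for d in range(degree + 1)])   # 1, x, x^2, ...

class GrowingComplexityClassifier:
    """Toy h_{a,theta}: 'a' is the polynomial degree, theta the linear weights."""
    def __init__(self, examples_per_parameter: int = 20):
        self.degree, self.theta = 1, None
        self.examples_per_parameter = examples_per_parameter
        self.X, self.y = [], []

    def partial_fit(self, x: float, label: int):
        self.X.append(x)
        self.y.append(1.0 if label == 1 else -1.0)
        # increase the complexity parameter once enough data has accumulated
        if len(self.X) >= self.examples_per_parameter * (self.degree + 2):
            self.degree += 1
        phi = poly_features(np.array(self.X), self.degree)
        self.theta, *_ = np.linalg.lstsq(phi, np.array(self.y), rcond=None)

    def predict(self, x: float) -> int:
        return int((poly_features(np.array([x]), self.degree) @ self.theta)[0] > 0)

clf = GrowingComplexityClassifier()
rng = np.random.default_rng(0)
for _ in range(200):
    x = float(rng.uniform(-1, 1))
    clf.partial_fit(x, int(abs(x) > 0.5))     # a target that no linear rule can fit
print("final degree:", clf.degree, clf.predict(0.9), clf.predict(0.1))
```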
We present a novel approach for measuring image similarity based on the composition of parts. The measure identifies common sub-regions between the images at multiple sizes, and evaluates the amount of deformation required to align the common regions. The scheme allows complex, non-rigid deformation of the images, and penalizes irregular deformations more than coherent shifts of larger sub-parts. The measure is implemented by an algorithm which is a variant of dynamic programming, extended to multiple dimensions, and uses scores measured on a relative scale. The similarity measure is shown to be robust to non-rigid deformations of parts at various positions and scales, and to capture basic characteristics of human similarity judgments.
Class learning, both supervised and unsupervised, requires feature selection, which includes two main components. The first is the selection of a discriminative subset of features from a larger pool. The second is the selection of detection parameters for each feature to optimize classification performance. In this paper we present a method for the discovery of multiple classification features, their detection parameters and their consistent configurations, in the fully unsupervised setting. This is achieved by a global optimization of joint consistency between the features as a function of the detection parameters, without assuming any prior parametric model. We demonstrate how the proposed framework can be applied for learning different types of feature parameters, such as detection thresholds and geometric relations, resulting in the unsupervised discovery of informative configurations of object parts. We test our approach on a wide range of classes and show good results. We also demonstrate how the approach can be used to separate and learn, without supervision, visually similar subclasses of a single category, such as facial views or hand poses. We use the approach to compare various criteria for feature consistency, including Mutual Information, Suspicious Coincidence, L2 and the Jaccard index. Finally, we compare our approach to a parametric consistency optimization technique such as pLSA and show significantly better performance.
2008
We consider the problem of extracting features for multi-class recognition problems. The features are required to make fine distinctions between similar classes, combined with tolerance for distortions and missing information. We define and compare two general approaches, both based on maximizing the delivered information for recognition: one divides the problem into multiple binary classification tasks, while the other uses a single multi-class scheme. The two strategies result in markedly different sets of features, which we apply to face identification and detection. We show that the first produces a sparse set of distinctive features that are specific to an individual face, and are highly tolerant to distortions and missing input. The second produces compact features, each shared by about half of the faces, which perform better in general face detection. The results show the advantage of distinctive features for making fine distinctions in a robust manner. They also show that different features are optimal for recognition tasks at different levels of specificity.
We develop a novel method for class-based feature matching across large changes in viewing conditions. The method is based on the property that when objects share a similar part, the similarity is preserved across viewing conditions. Given a feature and a training set of object images, we first identify the subset of objects that share this feature. The transformation of the feature's appearance across viewing conditions is determined mainly by properties of the feature, rather than of the object in which it is embedded. Therefore, the transformed feature will be shared by approximately the same set of objects. Based on this consistency requirement, corresponding features can be reliably identified from a set of candidate matches. Unlike previous approaches, the proposed scheme compares feature appearances only in similar viewing conditions, rather than across different viewing conditions. As a result, the scheme is not restricted to locally planar objects or affine transformations. The approach also does not require examples of correct matches. We show that by using the proposed method, a dense set of accurate correspondences can be obtained. Experimental comparisons demonstrate that matching accuracy is significantly improved over previous schemes. Finally, we show that the scheme can be successfully used for invariant object recognition.
Recently, we proposed a fundamental subdivision of the human cortex into two complementary networks-an "extrinsic" one which deals with the external environment, and an "intrinsic" one which largely overlaps with the "default mode" system, and deals with internally oriented and endogenous mental processes. Here we tested this hypothesis by contrasting decision making under external and internally-derived conditions. Subjects were presented with an external cue, and were required to either follow an external instruction ("determined" condition) or to ignore it and follow a voluntary decision process ("free-will" condition). Our results show that a well defined component of the intrinsic system-the right inferior parietal cortex-was preferentially activated during the "free-will" condition. Importantly, this activity was significantly higher than the base-line resting state. The results support a self-related role for the intrinsic system and provide clear evidence for both hemispheric and regional specialization in the human intrinsic system.
The human visual system recognizes objects and their constituent parts rapidly and with high accuracy. Standard models of recognition by the visual cortex use feed-forward processing, in which an object's parts are detected before the complete object. However, parts are often ambiguous on their own and require the prior detection and localization of the entire object. We show how a cortical-like hierarchy obtains recognition and localization of objects and parts at multiple levels nearly simultaneously by a single feed-forward sweep from low to high levels of the hierarchy, followed by a feedback sweep from high- to low-level areas.
Object-related areas in the ventral visual system in humans are known from imaging studies to be preferentially activated by object images compared with noise or texture patterns. It is unknown, however, which features of the object images are extracted and represented in these areas. Here we tested the extent to which the representation of visual classes used object fragments selected by maximizing the information delivered about the class. We tested functional magnetic resonance imaging blood oxygenation level-dependent activation of highly informative object features in low- and high-level visual areas, compared with noninformative object features matched for low-level image properties. Activation in V1 was similar, but in the lateral occipital area and in the posterior fusiform gyrus, activation by "informative" fragments was significantly higher for three object classes. Behavioral studies also revealed high correlation between performance and fragments information. The results show that an objective class-information measure can predict classification performance and activation in human object-related areas.
The ablation of afferent input results in the reorganization of sensory and motor cortices. In the primary visual cortex (V1), binocular retinal lesions deprive a corresponding cortical region [lesion projection zone (LPZ)] of visual input. Nevertheless, neurons in the LPZ regain responsiveness by shifting their receptive fields (RFs) outside the retinal lesions; this re-emergence of neural activity is paralleled by the perceptual completion of disrupted visual input in human subjects with retinal damage. To determine whether V1 reorganization can account for perceptual fill-in, we developed a neural network model that simulates the cortical remapping in V1. The model shows that RF shifts mediated by the plexus of spatial- and orientation-dependent horizontal connections in V1 can engender filling-in that is both robust and consistent with psychophysical reports of perceptual completion. Our model suggests that V1 reorganization may underlie perceptual fill-in, and it predicts spatial relationships between the original and remapped RFs that can be tested experimentally. More generally, it provides a general explanation for adaptive functional changes following CNS lesions, based on the recruitment of existing cortical connections that are involved in normal integrative mechanisms.
Current object recognition systems aim at recognizing numerous object classes under limited supervision conditions. This paper provides a benchmark for evaluating progress on this fundamental task. Several methods have recently been proposed to utilize the commonalities between object classes in order to improve generalization accuracy. Such methods can be termed interclass transfer techniques. However, it is currently difficult to assess which of the proposed methods maximally utilizes the shared structure of related classes. In order to facilitate the development, as well as the assessment, of methods for dealing with multiple related classes, a new dataset including images of several hundred mammal classes is provided, together with preliminary results of its use. The images in this dataset are organized into five levels of variability, and their labels include information on the objects' identity, location and pose. From this dataset, a classification benchmark has been derived, requiring fine distinctions between 72 mammal classes. It is then demonstrated that a recognition method which is highly successful on the Caltech101 attains limited accuracy on the current benchmark (36.5%). Since this method does not utilize the shared structure between classes, the question remains as to whether interclass transfer methods can increase accuracy to the level of human performance (90%). We suggest that a labeled benchmark of the type provided, containing a large number of related classes, is crucial for the development and evaluation of classification methods which make efficient use of interclass transfer.
We present a novel method for unsupervised classification, including the discovery of a new category and precise object and part localization. Given a set of unlabelled images, some of which contain an object of an unknown category, with unknown location and unknown size relative to the background, the method automatically identifies the images that contain the objects, localizes them and their parts, and reliably learns their appearance and geometry for subsequent classification. Current unsupervised methods construct classifiers based on a fixed set of initial features. Instead, we propose a new approach which iteratively extracts new features and re-learns the induced classifier, improving class vs. non-class separation at each iteration. We develop two main tools that allow this iterative combined search. The first is a novel star-like model capable of learning a geometric class representation in the unsupervised setting. The second is learning of "part specific features" that are optimized for part detection, and which optimally combine different part appearances discovered in the training examples. These novel aspects lead to precise part localization and to improvement in overall classification performance compared with previous methods. We applied our method to multiple object classes from Caltech-101, UIUC and a sub-classification problem from PASCAL. The obtained results are comparable to state-of-the-art supervised classification techniques and superior to state-of-the-art unsupervised approaches previously applied to the same image sets.
We construct a segmentation scheme that combines top-down with bottom-up processing. In the proposed scheme, segmentation and recognition are intertwined rather than proceeding in a serial manner. The top-down part applies stored knowledge about object shapes acquired through learning, whereas the bottom-up part creates a hierarchy of segmented regions based on uniformity criteria. Beginning with unsegmented training examples of class and non-class images, the algorithm constructs a bank of class-specific fragments and determines their figure-ground segmentation. This bank is then used to segment novel images in a top-down manner: the fragments are first used to recognize images containing class objects, and then to create a complete cover that best approximates these objects. The resulting segmentation is then integrated with bottom-up multi-scale grouping to better delineate the object boundaries. Our experiments, applied to a large set of four classes (horses, pedestrians, cars, faces), demonstrate segmentation results that surpass those achieved by previous top-down or bottom-up schemes. The main novel aspects of this work are the fragment learning phase, which efficiently learns the figure-ground labeling of segmentation fragments, even in training sets with high object and background variability; combining the top-down segmentation with bottom-up criteria to draw on their relative merits; and the use of segmentation to improve recognition.
2007
Computational models suggest that features of intermediate complexity (IC) play a central role in object categorization [Ullman, S., Vidal-Naquet, M., & Sali, E. (2002). Visual features of intermediate complexity and their use in classification. Nature Neuroscience, 5, 682-687.]. The critical aspect of these features is the amount of mutual information (MI) they deliver. We examined the relation between MI, human categorization and an electrophysiological response to IC features. Categorization performance correlated with MI level as well as with the amplitude of a posterior temporal potential, peaking around 270 ms. Hence, an objective MI measure predicts human object categorization performance and its underlying neural activity. These results demonstrate that informative IC features serve as categorization features in human vision.
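For reference, the mutual information delivered by a binary fragment-detection feature F about class membership C, the quantity used above to grade IC features, is the standard

$$ I(C;F) = \sum_{c\in\{0,1\}} \sum_{f\in\{0,1\}} P(c,f)\,\log\frac{P(c,f)}{P(c)\,P(f)}, $$

where F indicates whether the fragment is detected in the image and C whether the image contains the category.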
How do we learn to recognize visual categories, such as dogs and cats? Somehow, the brain uses limited variable examples to extract the essential characteristics of new visual categories. Here, I describe an approach to category learning and recognition that is based on recent computational advances. In this approach, objects are represented by a hierarchy of fragments that are extracted during learning from observed examples. The fragments are class-specific features and are selected to deliver a high amount of information for categorization. The same fragments hierarchy is then used for general categorization, individual object recognition and object-parts identification. Recognition is also combined with object segmentation, using stored fragments, to provide a top-down process that delineates object boundaries in complex cluttered scenes. The approach is computationally effective and provides a possible framework for categorization, recognition and segmentation in human vision.
This paper describes the construction and use of a novel representation for the recognition of objects and their parts, the semantic hierarchy. Its advantages include improved classification performance, accurate detection and localization of object parts and sub-parts, and explicitly identifying the different appearances of each object part. The semantic hierarchy algorithm starts by constructing a minimal feature hierarchy and proceeds by adding semantically equivalent representatives to each node, using the entire hierarchy as a context for determining the identity and locations of added features. Part detection is obtained by a bottom-up top-down cycle. Unlike previous approaches, the semantic hierarchy learns to represent the set of possible appearances of object parts at all levels, and their statistical dependencies. The algorithm is fully automatic and is shown experimentally to substantially improve the recognition of objects and their parts.
2006
The chapter describes visual classification by a hierarchy of semantic fragments. In fragment-based classification, objects within a class are represented by common sub-structures selected during training. The chapter describes two extensions to the basic fragment-based scheme. The first extension is the extraction and use of feature hierarchies. We describe a method that automatically constructs complete feature hierarchies from image examples, and show that features constructed hierarchically are significantly more informative and better for classification compared with similar non-hierarchical features. The second extension is the use of so-called semantic fragments to represent object parts. The goal of a semantic fragment is to represent the different possible appearances of a given object part. The visual appearance of such object parts can differ substantially, and therefore traditional image similarity-based methods are inappropriate for the task. We show how the method can automatically learn the part structure of a new domain, identify the main parts, and how their appearance changes across objects in the class. We discuss the implications of these extensions to object classification and recognition.
We describe a general framework for online multiclass learning based on the notion of hypothesis sharing. In our framework sets of classes are associated with hypotheses. Thus, all classes within a given set share the same hypothesis. This framework includes as special cases commonly used constructions for multiclass categorization such as allocating a unique hypothesis for each class and allocating a single common hypothesis for all classes. We generalize the multiclass Perceptron to our framework and derive a unifying mistake bound analysis. Our construction naturally extends to settings where the number of classes is not known in advance but, rather, is revealed along the online learning process. We demonstrate the merits of our approach by comparing it to previous methods on both synthetic and natural datasets.
2005
Cortical maps and feedback connections are ubiquitous features of the visual cerebral cortex. The role of the feedback connections, however, is unclear. This study was aimed at revealing possible organizational relationships between the feedback projections from area V2 and the functional maps of orientation and retinotopy in area V1. Optical imaging of intrinsic signals was combined with cytochrome oxidase histochemistry and connectional anatomy in owl monkeys. Tracer injections were administered at orientation-selective domains in regions of pale and thick cytochrome oxidase stripes adjacent to the border between these stripes. The feedback projections from V2 were found to be more diffuse than the intrinsic horizontal connections within V1, but they nevertheless demonstrated clustering. The clusters of feedback axons projected preferentially to interblob cytochrome oxidase regions. The distribution of preferred orientations of the recipient domains in V1 was broad but appeared biased toward values similar to the preferred orientation of the projecting cells in V2. The global spatial distribution of the feedback projections in V1 was anisotropic. The major axis of anisotropy was systematically parallel to a retinotopic axis in V1 corresponding to the preferred orientation of the cells of origin in V2. We conclude that the feedback connections from V2 to V1 might play a role in enhancing the response in V1 to collinear contour elements.
We develop an object classification method that can learn a novel class from a single training example. In this method, experience with already learned classes is used to facilitate the learning of novel classes. Our classification scheme employs features that discriminate between class and non-class images. For a novel class, new features are derived by selecting features that proved useful for already learned classification tasks, and adapting these features to the new classification task. This adaptation is performed by replacing the features from already learned classes with similar features taken from the novel class. A single example of a novel class is sufficient to perform feature adaptation and achieve useful classification performance. Experiments demonstrate that the proposed algorithm can learn a novel class from a single training example, using 10 additional familiar classes. The performance is significantly improved compared to using no feature adaptation. The robustness of the proposed feature adaptation concept is demonstrated by similar performance gains across 107 widely varying object categories.
We describe a novel technique for identifying semantically equivalent parts in images belonging to the same object class (e.g. eyes, license plates, aircraft wings, etc.). The visual appearance of such object parts can differ substantially, and therefore traditional image similarity-based methods are inappropriate for this task. The technique we propose is based on the use of common context. We first retrieve context fragments, which consistently appear together with a given input fragment in a stable geometric relation. We then use the context fragments in new images to infer the most likely position of equivalent parts. Given a set of image examples of objects in a class, the method can automatically learn the part structure of the domain - identify the main parts, and how their appearance changes across objects in the class. Two applications of the proposed algorithm are shown: the detection and identification of object parts and object recognition.
The paper describes a method for automatically extracting informative feature hierarchies for object classification, and shows the advantage of the features constructed hierarchically over previous methods. The extraction process proceeds in a top-down manner: informative top-level fragments are extracted first, and by a repeated application of the same feature extraction process the classification fragments are broken down successively into their own optimal components. The hierarchical decomposition terminates with atomic features that cannot be usefully decomposed into simpler features. The entire hierarchy, the different features and sub-features, and their optimal parameters, are learned during a training phase using training examples. Experimental comparisons show that these feature hierarchies are significantly more informative and better for classification compared with similar non-hierarchical features as well as previous methods for using feature hierarchies.
2004
BACKGROUND AND OBJECTIVE: To examine a new high-resolution kinetic mapping method for scotoma in age-related macular degeneration. PATIENTS AND METHODS: A computer-based program for kinetic visual field mapping was tested in 10 healthy subjects and 14 patients with age-related macular degeneration and fixed preferred retinal locus. The stimulus was presented using a back projector on a screen located 40 cm from the subject. The findings were then compared with static results. RESULTS: Control group mapping revealed good congruency with the anatomic blind spot. Mapping of the 14 patients with age-related macular degeneration was rapid and revealed good accuracy. The average deviation of the mapping border from the anatomic scotoma border was no more than 3.1% of the scotoma radius. Static mapping of 7 of the patients with age-related macular degeneration was longer and revealed lower accuracy. CONCLUSIONS: The proposed method is more rapid, accurate, and consistent than static mapping. It allows accurate mapping of central scotoma with suprathreshold stimulus, and may be used in the future for detecting the early stages of age-related macular degeneration using subthreshold stimulus.
In performing recognition, the visual system shows a remarkable capacity to distinguish between significant and immaterial image changes, to learn from examples to recognize new classes of objects, and to generalize from known to novel objects. Here we focus on one aspect of this problem, the ability to recognize novel objects from different viewing directions. This problem of view-invariant recognition is difficult because the image of an object seen from a novel viewing direction can be substantially different from all previously seen images of the same object. We describe an approach to view-invariant recognition that uses extended features to generalize across changes in viewing directions. Extended features are equivalence classes of informative image fragments, which represent object parts under different viewing conditions. This representation is extracted during learning from images of moving objects, and it allows the visual system to generalize from a single view of a novel object, and to compensate for large changes in the viewing direction, without using three-dimensional information. We describe the model, its implementation and performance on natural face images, compare it to alternative approaches, discuss its biological plausibility, and its extension to other aspects of visual recognition. The results of the study suggest that the capacity of the recognition system to generalize to novel conditions in an efficient and flexible manner depends on the ongoing extraction of different families of informative features, acquired for different tasks and different object classes.
We describe a new approach for learning to perform class-based segmentation using only unsegmented training examples. As in previous methods, we first use training images to extract fragments that contain common object parts. We then show how these parts can be segmented into their figure and ground regions in an automatic learning process. This is in contrast with previous approaches, which required complete manual segmentation of the objects in the training examples. The figure-ground learning combines top-down and bottom-up processes and proceeds in two stages, an initial approximation followed by iterative refinement. The initial approximation produces figure-ground labeling of individual image fragments using the unsegmented training images. It is based on the fact that on average, points inside the object are covered by more fragments than points outside it. The initial labeling is then improved by an iterative refinement process, which converges in up to three steps. At each step, the figure-ground labeling of individual fragments produces a segmentation of complete objects in the training images, which in turn induces a refined figure-ground labeling of the individual fragments. In this manner, we obtain a scheme that starts from unsegmented training images, learns the figure-ground labeling of image fragments, and then uses this labeling to segment novel images. Our experiments demonstrate that the learned segmentation achieves the same level of accuracy as methods using manual segmentation of training images, producing an automatic and robust top-down segmentation.
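The initial approximation step can be illustrated with a minimal numpy sketch: pixels covered by many detected fragments are labeled figure, the rest ground. The input format and the average-coverage threshold are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def initial_figure_ground(fragment_masks, image_shape):
    """Sketch of the initial labeling: pixels covered by many detected fragments
    are taken as figure, the rest as ground. `fragment_masks` is assumed to be a
    list of boolean arrays of shape `image_shape`, one per fragment detection."""
    coverage = np.zeros(image_shape, dtype=float)
    for mask in fragment_masks:
        coverage += mask
    return coverage > coverage.mean()   # True = figure, False = ground (illustrative threshold)
```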
We develop a novel approach to view-invariant recognition and apply it to the task of recognizing face images under widely separated viewing directions. Our main contribution is a novel object representation scheme using 'extended fragments' that enables us to achieve a high level of recognition performance and generalization across a wide range of viewing conditions. Extended fragments are equivalence classes of image fragments that represent informative object parts under different viewing conditions. They are extracted automatically from short video sequences during learning. Using this representation, the scheme is unique in its ability to generalize from a single view of a novel object and compensate for a significant change in viewing direction without using 3D information. As a result, novel objects can be recognized from viewing directions from which they were not seen in the past. Experiments demonstrate that the scheme achieves significantly better generalization and recognition performance than previously used methods.
2003
In this study we examined the perception of one- and two-dimensional patterns across central retinal scotomas, caused by age-related macular degeneration. In contrast with previous studies of disrupted visual input that used the blind spot and artificial scotomas, the current study used large central scotomas caused by physical retinal damage. Such damage is associated with atrophy and long-term cortical reorganization, and it was therefore unclear whether perceptual completion in the damaged system will be similar to that reported for artificial scotomas and the blind spot. In addition, the scotomas under study were much larger and more central than artificial scotomas for which perceptual completion has been reported. For 1-D line and grating patterns, we found perceptual completion across large central scotomas (up to a radius of 7°), which is significantly beyond the range of perceptual completion in artificial scotomas. Grating completion was better than that of a single line, and increased with bar density. The use of central scotomas allowed us to test the completion of 2-D patterns that are difficult to study in peripheral vision. We found completion of two-dimensional dot arrays over large regions that improved with pattern density and regularity. The results show that in the physically damaged system the range of perceptual completion is increased compared with artificial scotomas, they strongly support the view of an active filling-in process rather than simply ignoring the damaged location, and they show that perceptual completion of physical scotomas is likely to involve cortical processing at multiple levels. We finally discuss implications of the results for the possible use of image enhancement techniques to facilitate the perception of low-vision individuals.
In this paper we show that efficient object recognition can be obtained by combining informative features with linear classification. The results demonstrate the superiority of informative class-specific features, as compared with generic type features such as wavelets, for the task of object recognition. We show that information rich features can reach optimal performance with simple linear separation rules, while generic, feature based classifiers require more complex classification schemes. This is significant because efficient and optimal methods have been developed for spaces that allow linear separation. To compare different strategies for feature extraction, we trained and compared classifiers working in feature spaces of the same low dimensionality, using two feature types (image fragments vs. wavelets) and two classification rules (linear hyperplane and a Bayesian Network). The results show that by maximizing the individual information of the features, it is possible to obtain efficient classification by a simple linear separating rule, as well as more efficient learning.
2002
The human visual system analyzes shapes and objects in a series of stages in which stimulus features of increasing complexity are extracted and analyzed. The first stages use simple local features, and the image is subsequently represented in terms of larger and more complex features. These include features of intermediate complexity and partial object views. The nature and use of these higher order representations remains an open question in the study of visual processing by the primate cortex. Here we show that intermediate complexity (IC) features are optimal for the basic visual task of classification. Moderately complex features are more informative for classification than very simple or very complex ones, and so they emerge naturally by the simple coding principle of information maximization with respect to a class of images. Our findings suggest a specific role for IC features in visual processing and a principle for their extraction.
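The information-maximization criterion can be sketched in a few lines of Python, assuming binary fragment-detection features and binary class labels; the empirical plug-in estimator and the 0/1 encoding are illustrative, not the paper's exact procedure.

```python
import numpy as np

def fragment_class_mutual_information(detected, label):
    """Estimate the mutual information (in bits) between a binary fragment-detection
    feature and a binary class label from empirical co-occurrence frequencies.
    Candidate fragments would be ranked by this score; intermediate-complexity
    fragments tend to score highest."""
    detected = np.asarray(detected, dtype=int)
    label = np.asarray(label, dtype=int)
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            p_fc = np.mean((detected == f) & (label == c))
            if p_fc > 0:
                p_f = np.mean(detected == f)
                p_c = np.mean(label == c)
                mi += p_fc * np.log2(p_fc / (p_f * p_c))
    return mi
```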
In this paper we present a novel class-based segmentation method, which is guided by a stored representation of the shape of objects within a general class (such as horse images). The approach is different from bottom-up segmentation methods that primarily use the continuity of grey-level, texture, and bounding contours. We show that the method leads to markedly improved segmentation results and can deal with significant variation in shape and varying backgrounds. We discuss the relative merits of class-specific and general image-based segmentation methods and suggest how they can be usefully combined.
Object related areas in the human ventral stream were previously shown to be activated, in a shape-selective manner, by luminance, motion, and texture cues. We report on the preferential activation of these areas by stereo cues defining shape. To assess the relationship of this activation to object recognition, we employed a perceptual stereo effect, which profoundly affects object recognition. The stimuli consisted of stereo-defined line drawings of objects that either protruded in front of a flat background ("front"), or were sunk into the background ("back"). Despite the similarity in the local feature structure of the two conditions, object recognition was superior in the "front" compared to the "back" configuration. We measured both recognition rates and fMRI signal from the human visual cortex while subjects viewed these stimuli. The results reveal shape selective activation from images of objects defined purely by stereoscopic cues in the human ventral stream. Furthermore, they show a significant correlation between recognition and fMRI signal in the object-related occipito-temporal cortex (lateral occipital complex).
2001
2000
The tasks of visual object recognition and classification are natural and effortless for biological visual systems, but exceedingly difficult to replicate in computer vision systems. This difficulty arises from the large variability in images of different objects within a class, and variability in viewing conditions. In this paper we describe a fragment-based method for object classification. In this approach objects within a class are represented in terms of common image fragments that are used as building blocks for representing a large variety of different objects that belong to a common class, such as a face or a car. Optimal fragments are selected from a training set of images based on a criterion of maximizing the mutual information of the fragments and the class they represent. For the purpose of classification the fragments are also organized into types, where each type is a collection of alternative fragments, such as different hairline or eye regions for face classification. During classification, the algorithm detects fragments of the different types, and then combines the evidence for the detected fragments to reach a final decision. The algorithm verifies the proper arrangement of the fragments and the consistency of the viewing conditions primarily by the conjunction of overlapping fragments. The method is different from previous part-based methods in using class-specific overlapping object fragments of varying complexity, and in verifying the consistent arrangement of the fragments primarily by the conjunction of overlapping detected fragments. Experimental results on the detection of face and car views show that the fragment-based approach can generalize well to completely novel image views within a class while maintaining low mis-classification error rates. We briefly discuss relationships between the proposed method and properties of parts of the primate visual system involved in object perception.
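The decision stage can be illustrated with a simple weighted-evidence rule; the per-type weights and the threshold are assumptions standing in for the paper's evidence-combination scheme, and fragment detection and consistency checks are not modeled.

```python
def classify_from_fragment_types(type_detected, type_weights, threshold):
    """Illustrative decision rule: each fragment type (e.g. a hairline or eye region)
    contributes its weight when at least one fragment of that type is detected;
    the image is accepted as a class member when the summed evidence exceeds a
    threshold."""
    score = sum(w for hit, w in zip(type_detected, type_weights) if hit)
    return score > threshold
```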
1999
A fundamental capacity of the perceptual systems and the brain in general is to deal with the novel and the unexpected. In vision, we can effortlessly recognize a familiar object under novel viewing conditions, or recognize a new object as a member of a familiar class, such as a house, a face, or a car. This ability to generalize and deal efficiently with novel stimuli has long been considered a challenging example of brain-like computation that proved extremely difficult to replicate in artificial systems. In this paper we present an approach to generalization and invariant recognition. We focus our discussion on the problem of invariance to position in the visual field, but also sketch how similar principles could apply to other domains. The approach is based on the use of a large repertoire of partial generalizations that are built upon past experience. In the case of shift invariance, visual patterns are described as the conjunction of multiple overlapping image fragments. The invariance to the more primitive fragments is built into the system by past experience. Shift invariance of complex shapes is obtained from the invariance of their constituent fragments. We study by simulations aspects of this shift invariance method and then consider its extensions to invariant perception and classification by brain-like structures.
1998
Visual object recognition is complicated by the fact that the same 3D object can give rise to a large variety of projected images that depend on the viewing conditions, such as viewing direction, distance, and illumination. This paper describes a computational approach that uses combinations of a small number of object views to deal with the effects of viewing direction. The first part of the paper is an overview of the approach based on previous work. It is then shown that, in agreement with psychophysical evidence, the view-combinations approach can use views of different class members rather than multiple views of a single object, to obtain class-based generalization. A number of extensions to the basic scheme are considered, including the use of non-linear combinations, using 3D versus 2D information, and the role of coarse classification on the way to precise identification. Finally, psychophysical and biological aspects of the view-combination approach are discussed. Compared with approaches that treat object recognition as a symbolic high-level activity, in the view-combination approach the emphasis is on processes that are simpler and pictorial in nature.
A major problem in object recognition is that a novel image of a given object can be different from all previously seen images. Images can vary considerably due to changes in viewing conditions such as viewing position and illumination. In this paper we distinguish between three types of recognition schemes by the level at which generalization to novel images takes place: universal, class, and model-based. The first is applicable equally to all objects, the second to a class of objects, and the third uses known properties of individual objects. We derive theoretical limitations on each of the three generalization levels. For the universal level, previous results have shown that no invariance can be obtained. Here we show that this limitation holds even when the assumptions made on the objects and the recognition functions are relaxed. We also extend the results to changes of illumination direction. For the class level, previous studies presented specific examples of classes of objects for which functions invariant to viewpoint exist. Here, we distinguish between classes that admit such invariance and classes that do not. We demonstrate that there is a tradeoff between the set of objects that can be discriminated by a given recognition function and the set of images from which the recognition function can recognize these objects. Furthermore, we demonstrate that although functions that are invariant to illumination direction do not exist at the universal level, when the objects are restricted to belong to a given class, an invariant function to illumination direction can be defined. A general conclusion of this study is that class-based processing, that has not been used extensively in the past, is often advantageous for dealing with variations due to viewpoint and illuminant changes.
A method is presented for class-based recognition using a small number of example views taken under several different viewing conditions. The main emphasis is on using a small number of examples. Previous work assumed that the set of examples is sufficient to span the entire space of possible objects, and that in generalizing to new viewing conditions a sufficient number of previous examples under the new conditions will be available to the recognition system. Here we have considerably relaxed these assumptions and consequently obtained good class-based generalization from a small number of examples, even a single example view, for both viewing position and illumination changes. In addition, previous class-based approaches only focused on viewing position changes and did not deal with illumination changes. Here we used a class-based approach that can generalize for both illumination and viewing position changes. The method was applied to face and car model images. New views under viewing position and illumination changes were synthesized from a small number of examples.
1997
A face recognition system must recognize a face from a novel image despite the variations between images of the same face. A common approach to overcoming image variations because of changes in the illumination conditions is to use image representations that are relatively insensitive to these variations. Examples of such representations are edge maps, image intensity derivatives, and images convolved with 2D Gabor-like filters. Here we present an empirical study that evaluates the sensitivity of these representations to changes in illumination, as well as viewpoint and facial expression. Our findings indicated that none of the representations considered is sufficient by itself to overcome image variations because of a change in the direction of illumination. Similar results were obtained for changes due to viewpoint and expression. Image representations that emphasized the horizontal features were found to be less sensitive to changes in the direction of illumination. However, systems based only on such representations failed to recognize up to 20 percent of the faces in our database. Humans performed considerably better under the same conditions. We discuss possible reasons for this superiority and alternative methods for overcoming illumination effects in recognition.
We describe an approach to object recognition in which the image-to-model match is based on stochastic optimization. During the recognition process, an internal model is matched with a novel object view. To compensate for changes in viewing conditions (such as illumination and viewing direction), the model is controlled by a number of parameters. The matching is obtained by seeking a setting of the parameters that minimizes the discrepancy between the image and the model. The search is performed in our examples in a six-dimensional space with multiple local minima. We developed an efficient minimization method based on the stochastic optimization approach (Mockus 1989). The search is bidirectional (applied to both the model and the image) and avoids the difficult problem of establishing image-to-model correspondence. It proceeds by evolving a population of candidate solutions using simple generation rules, based on the autocorrelation of the search space. We describe the method, its application to objects in several domains (cars, faces, printed symbols), and experimental comparisons with alternative methods, such as simulated annealing.
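The shape of the search problem can be conveyed with a deliberately simplified stand-in: plain random search over the six model parameters, rather than the population-based Bayesian stochastic optimization of Mockus (1989) used in the paper. The parameter bounds and the discrepancy callable are assumptions.

```python
import numpy as np

def fit_model_by_random_search(discrepancy, bounds, n_samples=2000, seed=0):
    """Much-simplified sketch: sample candidate settings of the six model parameters
    uniformly within bounds and keep the setting with the lowest image-model
    discrepancy. This only illustrates the search problem, not the paper's method."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)          # shape (6, 2): [low, high] per parameter
    best_theta, best_val = None, np.inf
    for _ in range(n_samples):
        theta = rng.uniform(bounds[:, 0], bounds[:, 1])
        val = discrepancy(theta)
        if val < best_val:
            best_theta, best_val = theta, val
    return best_theta, best_val
```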
1996
In recognizing objects and scenes, partial recognition of objects or their parts can be used to guide the recognition of other objects. Here, the role of individual objects in the recognition of complete figures and the influence of contextual information on the identification of ambiguous objects were investigated. Configurations of objects that were placed in either proper or improper spatial relations were used, and response times and error rates in a recognition task were measured. Two main results were obtained. First, proper spatial relations among the objects of a scene decrease response times and error rates in the recognition of individual objects. Second, the presence of objects that have a unique interpretation improves the identification of ambiguous objects in the scene. Ambiguous objects were recognized faster and with fewer errors in the presence of clearly recognized objects compared with the same objects in isolation or in improper spatial relations. The implications of these findings for the organization of recognition memory are discussed.
An image of a face depends not only on its shape, but also on the viewpoint, illumination conditions, and facial expression. A face recognition system must overcome the changes in face appearance induced by these factors. Two related questions were investigated: the capacity of the human visual system to generalize the recognition of faces to novel images, and the level at which this generalization occurs. This problem was approached by comparing the identification and generalization capacity for upright and inverted faces. For upright faces, remarkably good generalization to novel conditions was found. For inverted faces, the generalization to novel views was significantly worse for both new illumination and viewpoint, although the performance on the training images was similar to that on the upright condition. The results indicate that at least some of the processes that support generalization across viewpoint and illumination are neither universal (because subjects did not generalize as easily for inverted faces as for upright ones) nor strictly object specific (because in upright faces nearly perfect generalization was possible from a single view, by itself insufficient for building a complete object-specific model). It is proposed that generalization in face recognition occurs at an intermediate level that is applicable to a class of objects, and that at this level upright and inverted faces initially constitute distinct object classes.
1995
A computational model is proposed for some general aspects of information flow in the visual cortex. The basic process, called "sequence seeking," is a search for a sequence of mappings, or transformations, linking source and target patterns. The process has two main characteristics: it is bidirectional, bottom-up as well as top-down, and it explores in parallel a large number of alternative sequences. This operation is performed in a "counter streams" structure, in which multiple sequences are explored along two complementary pathways, an ascending and a descending one, seeking to meet. A biological embodiment of this model in cortical circuitry is proposed. The model serves to account for known aspects of cortical interconnections and to derive new predictions.
1993
This paper examines the recognition of rigid objects bounded by smooth surfaces, using an alignment approach. The projected image of such an object changes during rotation in a manner that is generally difficult to predict. An approach to this problem is suggested, using the 3D surface curvature at the points along the silhouette. The curvature information requires a single number for each point along the object's silhouette, the radial curvature at the point. We have implemented this method and tested it on images of complex 3D objects. Models of the viewed objects were acquired using three images of each object. The implemented scheme was found to give accurate predictions of the objects' appearances for large transformations. Using this method, a small number of (viewer-centered) models can be used to predict the new appearance of an object from any given viewpoint.
1992
Limitations of Non-Model-Based Recognition Schemes
Approaches to visual object recognition can be divided into model-based and non-model-based schemes. In this paper we establish some limitations on non-model-based recognition schemes. We show that a consistent non-model-based recognition scheme for general objects cannot discriminate between objects. The same result holds even if the recognition function is imperfect, and is allowed to mis-identify each object from a substantial fraction of the viewing directions. We then consider recognition schemes restricted to classes of objects. We define the notion of the discrimination power of a consistent recognition function for a class of objects. The function's discrimination power determines the set of objects that can be discriminated by the recognition function. We show how the properties of a class of objects determine an upper bound on the discrimination power of any consistent recognition function for that class.
This paper discusses two problems related to three-dimensional object recognition. The first is segmentation and the selection of a candidate object in the image, the second is the recognition of a three-dimensional object from different viewing positions. Regarding segmentation, it is shown how globally salient structures can be extracted from a contour image based on geometrical attributes, including smoothness and contour length. This computation is performed by a parallel network of locally connected neuron-like elements. With respect to the effect of viewing, it is shown how the problem can be overcome by using the linear combinations of a small number of two-dimensional object views. In both problems the emphasis is on methods that are relatively low level in nature. Segmentation is performed using a bottom-up process, driven by the geometry of image contours. Recognition is performed without using explicit three-dimensional models, but by the direct manipulation of two-dimensional images.
1991
Visual object recognition requires the matching of an image with a set of models stored in memory. In this paper, we propose an approach to recognition in which a 3-D object is represented by the linear combination of 2-D images of the object. If M = {M_1, ..., M_k} is the set of pictures representing a given object and P is the 2-D image of an object to be recognized, then P is considered to be an instance of M if P = Σ_{i=1}^{k} α_i M_i for some constants α_i. We show that this approach handles correctly rigid 3-D transformations of objects with sharp as well as smooth boundaries and can also handle nonrigid transformations. The paper is divided into two parts. In the first part, we show that the variety of views depicting the same object under different transformations can often be expressed as the linear combinations of a small number of views. In the second part, we suggest how this linear combination property may be used in the recognition process.
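To make the linear-combination test concrete, the following numpy sketch fits the coefficients α by least squares and reports the residual; a small residual supports P being an instance of the model. The assumption that the views are given as corresponding point or feature vectors (already in correspondence) is illustrative.

```python
import numpy as np

def linear_combination_match(model_views, novel_view):
    """Sketch of the recognition test: stack the stored 2-D views as columns of a
    matrix M and check how well the novel view P is explained as M @ alpha.
    The coefficients are found by least squares."""
    M = np.column_stack([np.ravel(v) for v in model_views])
    p = np.ravel(novel_view)
    alpha, *_ = np.linalg.lstsq(M, p, rcond=None)
    residual = float(np.linalg.norm(M @ alpha - p))
    return alpha, residual
```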
Human ability to detect 3-D structure in an array of 2-D moving dots was tested. Under limited exposure time, we found high detection rates only when the 2-D motion was restricted to the spatio-temporal region of short-range motion. Long-range moving dots failed to produce a strong impression of 3-D structure and yielded only weak detection rates. This result is consistent with the view that the processing of long-range motion is more serial than that of short-range motion.
1990
We describe a new approach to the visual recognition of cursive handwriting. An effort is made to attain human-like performance by using a method based on pictorial alignment and on a model of the process of handwriting. The alignment approach permits recognition of character instances that appear embedded in connected strings. A system embodying this approach has been implemented and tested on five different word sets. The performance was stable both across words and across writers. The system exhibited a substantial ability to interpret cursive connected strings without recourse to lexical knowledge.
The human visual system can make remarkably precise spatial judgements. There are reasons to believe that this accuracy is achieved and maintained by using processes that calibrate and correct errors in the system. This work investigates this problem of self-calibration and describes an adaptive system for detecting the collinearity of points and the straightness of lines. The system is initially inaccurate, but, by using an error correction mechanism, it eventually becomes highly accurate. The error correction is performed by a simple self-calibration process named proportional multi-gain adjustment. The calibration process adjusts the gain values of the system's input units. The process utilizes statistical regularities in the input stimuli. It compensates for errors due to noise in the input units' receptive field locations and response functions by ensuring that the average deviation from collinearity detected by the system is zero. As a by-product of the error correction, the system also exhibits adaptation and aftereffect phenomena, similar to those observed in the human visual system.
1989
This paper examines the problem of shape-based object recognition, and proposes a new approach, the alignment of pictorial descriptions. The first part of the paper reviews general approaches to visual object recognition, and divides these approaches into three broad classes: invariant properties methods, object decomposition methods, and alignment methods. The second part presents the alignment method. In this approach the recognition process is divided into two stages. The first determines the transformation in space that is necessary to bring the viewed object into alignment with possible object models. This stage can proceed on the basis of minimal information, such as the object's dominant orientation, or a small number of corresponding feature points in the object and model. The second stage determines the model that best matches the viewed object. At this stage, the search is over all the possible object models, but not over their possible views, since the transformation has already been determined uniquely in the alignment stage. The proposed alignment method also uses abstract descriptions, but, unlike structural description methods, it uses them pictorially rather than in symbolic structural descriptions.
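The alignment stage can be illustrated with a small least-squares fit, assuming a few corresponding feature points between model and image are available; the restriction to a 2-D affine transform is an illustrative simplification of the transformations considered in the paper.

```python
import numpy as np

def estimate_alignment(model_points, image_points):
    """Sketch of the alignment stage: fit a 2-D affine transform x -> A @ x + t by
    least squares from corresponding feature points. The aligned model can then be
    compared with the viewed object in the second (model-selection) stage."""
    X = np.asarray(model_points, dtype=float)         # shape (n, 2)
    Y = np.asarray(image_points, dtype=float)         # shape (n, 2)
    H = np.hstack([X, np.ones((len(X), 1))])          # homogeneous model coordinates
    params, *_ = np.linalg.lstsq(H, Y, rcond=None)    # (3, 2) block: [A^T; t]
    A, t = params[:2].T, params[2]
    return A, t
```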
1988
1987
Apparent motion was used to explore humans' ability to perceive the direction of motion in the visual field. A marked qualitative difference in this ability was found between short- and long-range motion. For short-range motion, the detection of the direction of motion is characterized by parallel operation over a wide visual field (that is, detection performance is independent of the number of objects in an array). When the positional displacement is large relative to an object's size, the direction of motion is detected in a serial manner. The process of detection is limited in this case by the ability to detect other events, such as appearance and disappearance of an object, and the ability to compute their spatio-temporal relations. The results are consistent with a previously suggested division of the motion detection system into short- and long-range processes. The direction of short-range motion can be perceived in parallel (preattentively), whereas long-range motion is attentive and requires more complicated computations. It seems that the detection of long-range motion is a conjunction task, combining the detection of disappearance and appearance.
1986
Experiments by Schiller et al. have suggested that non-directional edge-specific simple cells are constructed from two directionally selective subunits with opposite preferred direction. This hierarchical notion was based on the fact that the responses of such units to edges moving in opposite directions are spatially displaced with respect to each other. An alternative explanation of the observed response separation is the delay between the responses of the center and surround mechanisms at the retinal level. Measurements of the response separation as a function of stimulus speed support this explanation and argue against the hierarchical notion of Schiller et al.
A theory of early visual information processing proposed by Marr and co-workers suggests that simple cortical cells may be involved in the detection of zero crossings in the retinal output. We have tested this theory by using pairs of adjacent edges (staircase stimuli) and recording from edge-specific simple cells in cat striate cortex. For such stimuli, the zero-crossing hypothesis gives rise to non-obvious predictions that were generally confirmed by the experiment.
1985
Psychophysical and physiological evidence indicates that the visual system of primates and humans has evolved a specialized processing focus moving across the visual scene. This study addresses the question of how simple networks of neuron-like elements can account for a variety of phenomena associated with this shift of selective visual attention. Specifically, we propose the following: (1) A number of elementary features, such as color, orientation, direction of movement, disparity etc. are represented in parallel in different topographical maps, called the early representation. (2) There exists a selective mapping from the early topographic representation into a more central non-topographic representation, such that at any instant the central representation contains the properties of only a single location in the visual scene, the selected location. We suggest that this mapping is the principal expression of early selective visual attention. One function of selective attention is to fuse information from different maps into one coherent whole. (3) Certain selection rules determine which locations will be mapped into the central representation. The major rule, using the conspicuity of locations in the early representation, is implemented using a so-called Winner-Take-All network. Inhibiting the selected location in this network causes an automatic shift towards the next most conspicuous location. Additional rules are proximity and similarity preferences. We discuss how these rules can be implemented in neuron-like networks and suggest a possible role for the extensive back-projection from the visual cortex to the LGN.
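The main selection rule can be sketched as a winner-take-all over a conspicuity map followed by inhibition of the selected location, which produces the automatic shift to the next most conspicuous location; the map, inhibition radius, and number of shifts are illustrative, and the proximity and similarity preferences are not modeled.

```python
import numpy as np

def shift_attention(conspicuity_map, n_shifts=3, inhibition_radius=2):
    """Sketch of the winner-take-all selection with inhibition of return: pick the
    most conspicuous location, suppress a small neighborhood around it, and repeat,
    yielding a sequence of attended locations."""
    salience = np.array(conspicuity_map, dtype=float)
    rows, cols = np.ogrid[:salience.shape[0], :salience.shape[1]]
    visited = []
    for _ in range(n_shifts):
        r0, c0 = np.unravel_index(np.argmax(salience), salience.shape)
        visited.append((r0, c0))
        inhibit = (rows - r0) ** 2 + (cols - c0) ** 2 <= inhibition_radius ** 2
        salience[inhibit] = -np.inf                    # inhibition of the selected location
    return visited
```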