
Vision and AI

Thursday, Oct 12, 2023, 12:15
Room 1
Speaker: Yossi Gandelsman
Title: Interpreting Intermediate Representations in Vision Models
Abstract:
In this talk, I present recent progress in interpreting intermediate representations in vision models. First, I demonstrate the existence of common intermediate representations (neurons) across a wide range of vision models with different architectures, different tasks (generative and discriminative), and different types of supervision (class-supervised, text-supervised, self-supervised). I present an algorithm for finding these universal neurons and show that they can be used for model-to-model translation, enabling various zero-shot inversion-based image manipulations (e.g., shifting, zooming). Second, I analyze the intermediate representations in CLIP by investigating how they affect the final representation. I show that CLIP's image representation can be decomposed into a sum across individual image patches, model layers, and attention heads, and that CLIP's text representation can be used to interpret the summands. This decomposition enables an automatic characterization of attention head roles and reveals that some heads capture specific image properties (e.g., location or shape). It also uncovers emergent spatial localization within CLIP. Finally, this understanding helps to remove spurious features from CLIP and to create a strong zero-shot image segmenter. This talk is based on two papers: "Rosetta Neurons: Mining the Common Units in a Model Zoo" and "Interpreting CLIP's Image Representation via Text-Based Decomposition".
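
Schematically, the decomposition described in the abstract has the following form (a rough sketch with illustrative notation; the exact formulation in the paper includes additional terms not shown here):

\[
\mathrm{CLIP}_{\text{image}}(I) \;\approx\; \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=1}^{N} c_{l,h,i}(I),
\]

where \(c_{l,h,i}(I)\) denotes the contribution of attention head \(h\) in layer \(l\), restricted to image patch \(i\). Because each summand lies in the joint image-text embedding space, it can be scored directly against CLIP text embeddings, which is what enables the text-based characterization of individual heads and the patch-level (spatial) analysis mentioned above.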