
  • Date: Sunday, October 6, 2024

    Vision and AI

    More information
    Time
    12:00 - 13:15
    Title
    Reverse Engineering CLIP
    Location
    Jacob Ziskind Building
    Room 1
    Lecturer
    Yossi Gandelsman
    UC Berkeley
    Organizer
    Department of Computer Science and Applied Mathematics
    Seminar
    Details
    *** Please note the unusual day and time***
    Abstract
    In this talk, I reverse engineer CLIP, one of the most commonly used computer vision backbones. I analyze how individual model components affect the final CLIP representation. I show that the image representation can be decomposed as a sum across individual image patches, model layers, neurons, and attention heads, and use CLIP’s text representation to interpret the summands.
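
    The additive decomposition described above can be illustrated with a toy sketch. It is not the speaker's code and does not load CLIP: the sizes, the random vectors, and the names `contributions`, `image_rep`, and `text_rep` are all assumptions made for illustration. It only shows that, if the image representation is a sum of per-layer, per-head, per-patch terms, its similarity to a text embedding splits into per-component scores.

```python
import random

random.seed(0)

DIM, N_LAYERS, N_HEADS, N_PATCHES = 8, 2, 3, 4  # toy sizes, not CLIP's


def rand_vec(dim):
    """Random vector standing in for a real model quantity."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# contributions[l][h][p]: hypothetical direct contribution of head h in
# layer l, coming from image patch p, to the final image representation.
contributions = [
    [[rand_vec(DIM) for _ in range(N_PATCHES)] for _ in range(N_HEADS)]
    for _ in range(N_LAYERS)
]

# The image representation is assumed to be the plain sum of all summands.
image_rep = [0.0] * DIM
for layer in contributions:
    for head in layer:
        for summand in head:
            image_rep = [a + b for a, b in zip(image_rep, summand)]

text_rep = rand_vec(DIM)  # stand-in for a text embedding
total_score = dot(image_rep, text_rep)

# Because the dot product is linear, the score splits along any axis.
head_scores = [
    [sum(dot(s, text_rep) for s in head) for head in layer]
    for layer in contributions
]
patch_scores = [
    sum(dot(contributions[l][h][p], text_rep)
        for l in range(N_LAYERS) for h in range(N_HEADS))
    for p in range(N_PATCHES)
]
```

    Summing `head_scores` or `patch_scores` recovers `total_score` up to floating-point error. Under this assumption, a per-patch score map of this kind is what one would read as a localization heatmap for the text.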

    When interpreting the attention heads, each head's role can be characterized by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches uncovers an emergent spatial localization within CLIP. Finally, the automatic description of the contributions of individual neurons shows polysemantic behavior: each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars).
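
    One way to picture "finding text representations that span a head's output space" is a greedy selection loop. The sketch below is an assumption for illustration, not necessarily the procedure presented in the talk: the function `greedy_text_basis`, the three-entry `text_bank`, and the synthetic `head_outputs` are all hypothetical. At each step it picks the text direction that captures the most energy (sum of squared projections) in the head's outputs, then removes that direction before picking the next.

```python
import random

random.seed(0)
DIM = 6  # toy embedding size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def remove_direction(v, unit):
    """Subtract from v its component along the unit vector `unit`."""
    scale = dot(v, unit)
    return [x - scale * u for x, u in zip(v, unit)]


def greedy_text_basis(head_outputs, text_bank, k):
    """Greedily pick up to k text labels that capture the most output energy."""
    outputs = [list(v) for v in head_outputs]
    texts = {name: list(v) for name, v in text_bank.items()}
    chosen = []
    for _ in range(k):
        best_name, best_energy = None, -1.0
        for name, t in texts.items():
            length = dot(t, t) ** 0.5
            if name in chosen or length < 1e-9:
                continue  # already used, or fully explained by earlier picks
            energy = sum((dot(o, t) / length) ** 2 for o in outputs)
            if energy > best_energy:
                best_name, best_energy = name, energy
        if best_name is None:
            break
        chosen.append(best_name)
        length = dot(texts[best_name], texts[best_name]) ** 0.5
        unit = [x / length for x in texts[best_name]]
        # Project the chosen direction out of both the outputs and the texts.
        outputs = [remove_direction(o, unit) for o in outputs]
        texts = {n: remove_direction(t, unit) for n, t in texts.items()}
    return chosen


def basis_vec(i):
    return [1.0 if j == i else 0.0 for j in range(DIM)]


# Hypothetical text embeddings, one axis per concept, for readability only.
text_bank = {"location": basis_vec(0), "shape": basis_vec(1), "color": basis_vec(2)}

# Synthetic head outputs that vary mostly along the "location" axis.
head_outputs = [
    [5.0 * random.gauss(0, 1), random.gauss(0, 1), 0.01 * random.gauss(0, 1), 0.0, 0.0, 0.0]
    for _ in range(50)
]

picked = greedy_text_basis(head_outputs, text_bank, k=2)
```

    In this toy setup the outputs are dominated by the first axis, so the loop selects "location" before "shape". With real embeddings the candidate texts would not be orthogonal, which is why the sketch projects each chosen direction out of the remaining candidates as well.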

    This understanding of the different components enables three main applications. First, the discovered head roles enable the removal of spurious features from CLIP. Second, the emergent localization yields a strong zero-shot image segmenter. Finally, the extracted neuron polysemy allows the mass production of “semantic” adversarial examples by generating images with concepts spuriously correlated with the incorrect class. The results indicate that a scalable understanding of transformer models is attainable and can be used to detect model bugs, repair them, and improve the models.

    BIO:

    Yossi is a computer science PhD student at UC Berkeley, advised by Alexei Efros, and a visiting researcher at Meta. Before that, he was a member of the perception team at Google Research (now Google DeepMind). He completed his M.Sc. at the Weizmann Institute, advised by Prof. Michal Irani. His research centers on deep learning, computer vision, and mechanistic interpretability.
    Lecture