Large-scale text-to-image generative models can synthesize diverse images that convey highly complex visual concepts. However, it remains a challenge to provide users with control over the generated content. In this project, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation. Given a guidance image and a target text prompt, our method generates a new image that complies with the target text while preserving the semantic layout of the guidance image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model, requiring no training or fine-tuning.
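To make the idea of manipulating self-attention more concrete, here is a toy sketch in plain Python. It is not the project's implementation: it assumes a single attention layer over random token features, and it assumes that "manipulating" means computing the attention map from the guidance image's features and applying it to the features of the image being generated. All names (`attention_map`, `self_attention`, `injected_map`, the weight matrices) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_map(feats, w_q, w_k):
    """Self-attention weights: softmax(Q K^T / sqrt(d))."""
    q = matmul(feats, w_q)
    k = matmul(feats, w_k)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(w_q[0]))
    scores = matmul(q, k_t)
    return softmax_rows([[s / scale for s in row] for row in scores])

def self_attention(feats, w_q, w_k, w_v, injected_map=None):
    """Attend over feats; if injected_map is given, use it instead of
    the map computed from feats (assumption: this is the 'injection')."""
    attn = injected_map if injected_map is not None else attention_map(feats, w_q, w_k)
    return matmul(attn, matmul(feats, w_v))

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, dim = 6, 4  # toy sizes; real models use far more spatial tokens

# Frozen layer weights: nothing is trained or fine-tuned.
w_q = random_matrix(dim, dim, rng)
w_k = random_matrix(dim, dim, rng)
w_v = random_matrix(dim, dim, rng)

guidance_feats = random_matrix(n_tokens, dim, rng)   # stand-in for guidance-image features
generated_feats = random_matrix(n_tokens, dim, rng)  # stand-in for text-driven features

# The layout signal comes from the guidance image's attention map.
guidance_map = attention_map(guidance_feats, w_q, w_k)

plain = self_attention(generated_feats, w_q, w_k, w_v)
injected = self_attention(generated_feats, w_q, w_k, w_v, injected_map=guidance_map)
```

In this sketch, `injected` mixes the generated image's values according to token affinities taken from the guidance image, which is one way the layout could be carried over without touching the model weights.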
Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel. "TokenFlow: Consistent Diffusion Features for Consistent Video Editing." 2023. [project page]
Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel. "Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation." CVPR 2023. [project page]