Text-to-image (T2I) diffusion and flow models achieve state-of-the-art results in image synthesis. Many works leverage these models for real-image editing, where a predominant approach is to invert the image into its corresponding Gaussian-like noise map. However, inversion by itself is often insufficient for structure-preserving edits. In the first work covered in this talk, ‘An Edit Friendly DDPM Noise Space’ [1], we present alternative latent noise maps for denoising diffusion probabilistic models (DDPMs) that do not follow a standard normal distribution. These noise maps allow perfect reconstruction of any real image and lead to structure-preserving edits, as we demonstrate in our experiments.
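To make the idea concrete, the following is a minimal NumPy sketch of the principle behind such edit-friendly noise maps: draw an independent noise sample per timestep, then solve the sampling recursion for the per-step noise z_t, so that re-running the sampler with these z_t reconstructs the input exactly. The `mu` function here is a toy stand-in for the DDPM posterior mean (which in practice involves the trained denoiser network), and the schedules are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10
alphas = np.linspace(0.99, 0.9, T)        # toy noise schedule (assumption)
alpha_bars = np.cumprod(alphas)

def mu(x_t, t):
    # Toy stand-in for the DDPM posterior mean; a real sampler would
    # use the denoiser network's noise prediction here.
    return x_t / np.sqrt(alphas[t])

x0 = rng.standard_normal(4)               # "image" to invert (toy vector)

# 1) Sample x_1..x_T with *independent* noise at every timestep,
#    rather than one consistent diffusion trajectory.
xs = [x0]
for t in range(T):
    eps = rng.standard_normal(4)
    xs.append(np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps)

# 2) Solve the sampling recursion x_t = mu(x_{t+1}) + sigma_t * z_t
#    for the noise maps z_t (simplified variance schedule, assumption).
sigmas = np.sqrt(1 - alphas)
zs = [(xs[t] - mu(xs[t + 1], t)) / sigmas[t] for t in range(T)]

# 3) Re-running the sampler with these fixed z_t reconstructs x0 exactly.
x = xs[T]
for t in reversed(range(T)):
    x = mu(x, t) + sigmas[t] * zs[t]
assert np.allclose(x, x0)
```

The key point this illustrates is that the extracted z_t are determined by the (independently sampled) trajectory rather than drawn from a standard normal, which is what makes perfect reconstruction, and hence structure-preserving editing, possible.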
In our second work, we tackle text-based video editing using T2I diffusion models. Here, the main challenge lies in maintaining the temporal consistency of the original video during the edit. Many methods rely on explicit correspondence mechanisms, which struggle with strong nonrigid motion. In contrast, our method, termed ‘Slicedit’ [2], introduces a fundamentally different approach, based on the observation that spatiotemporal slices of natural videos exhibit characteristics similar to those of natural images. Thus, the same T2I diffusion model that is normally used only as a prior on video frames can also serve as a strong prior for temporal consistency when applied to spatiotemporal slices. As we show, Slicedit generates videos that retain the structure and motion of the original video while adhering to the target text, without relying on explicit correspondence matching.

Finally, in our most recent work, we will discuss ‘FlowEdit’ [3], a novel text-based image editing method that leverages the increasingly popular flow models without relying on inversion. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX.
[1] An Edit Friendly DDPM Noise Space: Inversion and Manipulations, CVPR 2024. https://arxiv.org/abs/2304.06140
[2] Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices, ICML 2024. https://arxiv.org/abs/2405.12211
[3] FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models, under review. https://arxiv.org/abs/2412.08629
Bio: Vladimir Kulikov is a PhD student at the Technion, under the supervision of Prof. Tomer Michaeli. His research focuses on deep generative models, with an emphasis on computer vision.