TurboEdit: Instant text-based image editing

Zongze Wu, Nicholas Kolkin, Jonathan Brandt, Richard Zhang, Eli Shechtman

TL;DR

TurboEdit tackles real-time, disentangled real-image editing with few-step diffusion by introducing an encoder-based iterative inversion that reconstructs the input in $8$ NFEs and enables edits in $4$ NFEs. It combines a multi-step inversion framework with long detailed text prompts, local masks, and instruction-based editing driven by LLMs to achieve high fidelity and precise attribute changes. The approach outperforms state-of-the-art multi-step editing methods on both descriptive and instructive prompts, while maintaining background fidelity and identity preservation, and it supports interactive editing speeds suitable for practical use. While powerful, the method relies on a captioning model for prompts and uses rough masks, highlighting societal considerations around image manipulation and the need for safeguards against misuse.

Abstract

We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing it to correct the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. The method can further control the editing strength and accept instructive text prompts. Our approach facilitates realistic text-guided image edits in real time, requiring only 8 function evaluations (NFEs) for inversion (a one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.
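
As a rough illustration of the control flow described in the abstract, the sketch below runs an iterative inversion loop and then replays the frozen noise maps with a different prompt. Everything here is a toy assumption rather than the authors' implementation: `toy_text_effect`, `toy_inversion_net`, and `toy_generator` are hypothetical stand-ins for the text encoder, the inversion network, and the frozen few-step generator, and the NFE count assumes one inversion-network call plus one generator call per step over four steps.

```python
import numpy as np


def toy_text_effect(caption):
    # Hypothetical stand-in for a text encoder: caption -> scalar offset.
    return 0.01 * len(caption)


def toy_inversion_net(x0, prev_recon, caption, t):
    # Hypothetical stand-in for the inversion network. It sees the input
    # image and the previous reconstruction, and predicts a noise map that
    # moves the next reconstruction halfway toward the input.
    return 0.5 * (x0 - prev_recon) - toy_text_effect(caption)


def toy_generator(noise, prev_recon, caption, t):
    # Hypothetical stand-in for the frozen few-step generator.
    return prev_recon + noise + toy_text_effect(caption)


def invert(x0, caption, timesteps):
    """Iterative inversion: returns one noise map per step, the final
    reconstruction, and the number of function evaluations used."""
    prev_recon = np.zeros_like(x0)  # previous reconstruction starts at zero
    noises, nfe = [], 0
    for t in timesteps:
        noise = toy_inversion_net(x0, prev_recon, caption, t)
        nfe += 1
        prev_recon = toy_generator(noise, prev_recon, caption, t)
        nfe += 1
        noises.append(noise)
    return noises, prev_recon, nfe


def edit(noises, caption, timesteps):
    """Editing: replay the frozen noise maps with a (possibly modified)
    caption. Only the generator is called, once per step."""
    image = np.zeros_like(noises[0])
    nfe = 0
    for noise, t in zip(noises, timesteps):
        image = toy_generator(noise, image, caption, t)
        nfe += 1
    return image, nfe
```

With four time steps, `invert` makes eight calls and `edit` makes four, matching the 8 and 4 NFEs quoted above under the stated assumption. Replaying the noise maps with the original caption reproduces the reconstruction, while a modified caption changes only what the caption-dependent term contributes.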

Paper Structure (19 sections, 8 equations, 15 figures, 3 tables, 2 algorithms)

This paper contains 19 sections, 8 equations, 15 figures, 3 tables, 2 algorithms.

Figures (15)

  • Figure 1: We present a novel real-time text-based disentangled real image editing method built upon 4-step SDXL Turbo. Our method can handle both realistic and artistic images, supports manual or instruction-based prompt manipulation, and allows users to control the editing strength. We further show multi-attribute editing and continuous editing in Supplementary Fig. \ref{fig:multi_attribute}.
  • Figure 2: Given an input real image $x_0$, we use LLaVA to generate a detailed caption $c$. Users can modify $c$ to create a new text prompt $c'$. The inversion process begins by feeding $x_0$, $c$, the current time step $t$, and a previously reconstructed image $x_{0,t+1}$ (initialized as a zero matrix) into the inversion network. This network then predicts the noise $\epsilon_t$, which is subsequently input into a frozen SDXL-Turbo model to generate the new reconstruction image $x_{0,t}$. Given the final inverted noise $\epsilon_t$, along with $c$, we can use SDXL-Turbo to create an inversion trajectory and reconstruct $x_{0,0}$, which is very similar to $x_0$. Using the same noises $\epsilon_t$ and a slightly different text prompt $c'$, starting from $t=T$ to smaller $t$, the editing trajectory will be very similar to the inversion trajectory, and the generated image will closely resemble the input image, differing only in the specified attribute in $c'$.
  • Figure 3: When presented with a concise source text prompt, minor edits in the text space can lead to substantial layout and structural changes in the image space. Conversely, making small text edits in a detailed text prompt tends to cause more disentangled changes in the image space. The results are from single-step image generation with the same random seed. The captions and color-coded modification areas are provided below.
  • Figure 4: Given a detailed source text and a corresponding target text, we can interpolate the text embeddings and generate a smooth interpolation in image space, even for large structural changes.
  • Figure 5: We compare methods using descriptive text prompts as guidance. Despite requiring only four steps, our method outperforms multi-step methods, particularly in scenarios requiring significant structural changes for attributes such as adding a hat or transforming a man into a woman. In contrast, InfEdit and Pix2PixZero struggle with background and identity preservation. Similarly, Ledits and Ledits++ are unable to effectively handle large structural changes, as evidenced by their failure in adding a top hat or transforming a man into a woman.
  • ...and 10 more figures
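
The text-embedding interpolation mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration that assumes plain linear interpolation between a source and a target embedding; the caption does not state the interpolation scheme, and `interpolate_embeddings` and its parameters are hypothetical names. Each interpolated embedding would then be passed to the generator together with the frozen noise maps to produce one frame of the transition.

```python
import numpy as np


def interpolate_embeddings(src_emb, tgt_emb, num_steps):
    """Return num_steps embeddings moving linearly from src_emb to tgt_emb.

    Assumption: linear blending. A weight of 0 keeps the source prompt,
    a weight of 1 gives the full edit, and values in between act as an
    editing-strength control.
    """
    weights = np.linspace(0.0, 1.0, num_steps)
    return [(1.0 - w) * src_emb + w * tgt_emb for w in weights]
```

For example, `interpolate_embeddings(src, tgt, 5)` yields the source embedding, three intermediate blends, and the target embedding.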