Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

Ruibin Li, Ruihuang Li, Song Guo, Lei Zhang

TL;DR

This work tackles the problem that inverted latent noise codes in diffusion‑based image editing remain biased by the source prompt, hindering edits guided by a new target prompt. It analyzes DDIM inversion and formalizes a fixed‑point constraint, then proposes Source Prompt Disentangled Inversion (SPDInv), which turns each inversion step into a fixed‑point search by minimizing $L = \lVert f_{\theta}(z_t) - z_t \rVert_2$ with a pre‑trained diffusion model to obtain a near‑ideal noise $z_T^*$. SPDInv significantly reduces the noise gap $D_{noi}$, improves editing fidelity across multiple engines (P2P, MasaCtrl, PNP), and extends to customized image generation by enabling localized edits with methods like ELITE. The approach yields substantial practical benefits in text‑driven and localized editing scenarios, with known limitations in portrait edits and reliance on existing editing pipelines, suggesting directions for further stability and robustness improvements.

Abstract

Text-driven diffusion models have significantly advanced image editing performance by using text prompts as inputs. One crucial step in text-driven image editing is to invert the original image into a latent noise code conditioned on the source prompt. While previous methods have achieved promising results by refactoring the image synthesis process, the inverted latent noise code remains tightly coupled with the source prompt, which limits how far the target text prompt can edit the image. To address this issue, we propose a novel method called Source Prompt Disentangled Inversion (SPDInv), which aims to reduce the impact of the source prompt and thereby enhance text-driven image editing with diffusion models. To make the inverted noise code as independent of the given source prompt as possible, we point out that the iterative inversion process should satisfy a fixed-point constraint. Consequently, we recast the inversion problem as a search for the fixed-point solution and use pre-trained diffusion models to facilitate the search. Experimental results show that SPDInv effectively mitigates the conflicts between the target editing prompt and the source prompt, leading to a significant decrease in editing artifacts. Beyond text-driven image editing, SPDInv makes it easy to adapt customized image generation models to localized editing tasks with promising performance. The source code is available at https://github.com/leeruibin/SPDInv.
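
The fixed-point search described above can be illustrated with a toy sketch. It is not the authors' implementation: `toy_eps` is a hypothetical stand-in for the pre-trained noise predictor, the affine form of `f` with coefficients `a` and `b` is an assumed shape for a single inversion update, and plain gradient descent stands in for whatever optimizer the released code uses. The sketch minimizes the squared residual $\lVert f_{\theta}(z_t) - z_t \rVert_2^2$, which has the same minimizer as the unsquared norm but a smoother gradient.

```python
import numpy as np

# Hypothetical stand-in for the noise predictor eps_theta(z_t, t, c).
# A real pipeline would call a pre-trained diffusion model here.
W = 0.25 * np.array([[1.0, -1.0, 0.0, 0.0],
                     [0.0, 1.0, -1.0, 0.0],
                     [0.0, 0.0, 1.0, -1.0],
                     [-1.0, 0.0, 0.0, 1.0]])


def toy_eps(z):
    return np.tanh(W @ z)


def f(z, z_prev, a, b):
    """One inversion update written as a map of z_t itself (assumed form)."""
    return a * z_prev + b * toy_eps(z)


def residual_loss(z, z_prev, a, b):
    """Squared fixed-point residual ||f(z) - z||^2."""
    r = f(z, z_prev, a, b) - z
    return float(r @ r)


def ddim_style_guess(z_prev, a, b):
    """Initial guess that evaluates the noise predictor at z_{t-1} instead of z_t."""
    return a * z_prev + b * toy_eps(z_prev)


def fixed_point_search(z_prev, a=1.02, b=0.3, lr=0.2, steps=300):
    """Gradient descent on the squared residual, starting from the DDIM-style guess."""
    z = ddim_style_guess(z_prev, a, b)
    identity = np.eye(z.size)
    for _ in range(steps):
        r = f(z, z_prev, a, b) - z
        # Jacobian of f w.r.t. z: b * diag(1 - tanh^2(W z)) @ W
        jac = b * (1.0 - np.tanh(W @ z) ** 2)[:, None] * W
        grad = 2.0 * (jac - identity).T @ r
        z = z - lr * grad
    return z


z_prev = np.array([0.5, -1.0, 0.25, 2.0])
z_init = ddim_style_guess(z_prev, 1.02, 0.3)
z_star = fixed_point_search(z_prev)
loss_init = residual_loss(z_init, z_prev, 1.02, 0.3)
loss_star = residual_loss(z_star, z_prev, 1.02, 0.3)
print(f"residual before search: {loss_init:.3e}, after search: {loss_star:.3e}")
```

In this toy setting the initial guess leaves a nonzero residual because the predictor was evaluated at the previous latent, and the search drives that residual toward zero. With a real diffusion model the gradient would come from automatic differentiation rather than the hand-written Jacobian used here.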

Paper Structure

This paper contains 23 sections, 6 equations, 14 figures, 5 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Illustration of text-driven image editing pipeline.
  • Figure 1: Visualization of inverted noise codes.
  • Figure 2: Pipelines of different inversion methods in text-driven editing. (a) DDIM inversion inverts a real image to a latent noise code, but the inverted noise code often yields a large reconstruction gap $D_{Rec}$ at higher CFG values. (b) NTI optimizes the null-text embedding to narrow the reconstruction gap $D_{Rec}$, while NPI speeds up NTI. (c) DirectInv records the differences between the inversion features and the reconstruction features, then merges them back to achieve high-quality reconstruction. (d) Our SPDInv minimizes the noise gap $D_{Noi}$ instead of $D_{Rec}$, which reduces the impact of the source prompt on the editing process and thus the artifacts and inconsistent details encountered by previous methods.
  • Figure 2: The noise gap $D_{noi}$ of inverted noise codes by DDIM inversion and our SPDInv using 100 generated images with captions extracted from COCO2017.
  • Figure 3: An example of image editing with ideal noise code (left) and inverted noise code (right). Source prompt: A spiderman in the city. Target prompt: A spiderman in the city with his left hand up.
  • ...and 9 more figures