Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

Yangyang Xu, Wenqi Shao, Yong Du, Haiming Zhu, Yang Zhou, Ping Luo, Shengfeng He

TL;DR

TODInv introduces Task-Oriented Diffusion Inversion, a framework that inverts real images and edits them by optimizing prompt embeddings in the extended space $\mathcal{P}^*$ across U-Net layers and timesteps. By categorizing edits into structure, appearance, and global, TODInv updates only embeddings irrelevant to the current edit, balancing high reconstruction fidelity with precise editability. Empirical results on PIE-Bench and experiments with few-step diffusion models demonstrate superior reconstruction quality and editing performance over state-of-the-art inversion methods, while maintaining efficiency. The approach provides a principled pathway to reliable, controllable text-based editing of real images, with practical applicability across diverse editing tools and diffusion backbones.

Abstract

Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended \(\mathcal{P}^*\) space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive experiments on a benchmark dataset show that TODInv outperforms existing methods both quantitatively and qualitatively, and that it extends to few-step diffusion models.
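The selection rule described in the abstract (optimize only the embeddings that the current edit does not affect) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the layer names, the resolution threshold that separates structure layers from appearance layers, and the treatment of global edits as affecting both groups are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerEmbedding:
    """One prompt embedding slot, tied to a U-Net layer (hypothetical naming)."""
    name: str
    resolution: int  # spatial resolution of the layer's feature map


def layer_role(layer, structure_max_resolution=16):
    # ASSUMPTION: layers at or below the threshold are "structure" layers,
    # the rest are "appearance" layers. The source only says the split is
    # made by resolution; the threshold and its direction are hypothetical.
    if layer.resolution <= structure_max_resolution:
        return "structure"
    return "appearance"


def embeddings_to_optimize(layers, edit_class, structure_max_resolution=16):
    """Return the names of embeddings that the given edit class leaves untouched."""
    # ASSUMPTION: a global edit is taken to affect both groups, so the rule
    # "optimize only unaffected embeddings" leaves nothing to optimize here.
    affected = {
        "structure": {"structure"},
        "appearance": {"appearance"},
        "global": {"structure", "appearance"},
    }
    if edit_class not in affected:
        raise ValueError(f"unknown edit class: {edit_class!r}")
    return [
        layer.name
        for layer in layers
        if layer_role(layer, structure_max_resolution) not in affected[edit_class]
    ]


# Hypothetical layer layout, for illustration only.
LAYERS = [
    LayerEmbedding("down_64", 64),
    LayerEmbedding("down_32", 32),
    LayerEmbedding("mid_8", 8),
    LayerEmbedding("up_16", 16),
    LayerEmbedding("up_64", 64),
]

if __name__ == "__main__":
    for edit in ("structure", "appearance", "global"):
        print(edit, "->", embeddings_to_optimize(LAYERS, edit))
```

Since the extended space also uses distinct embeddings per timestep, a fuller sketch would key each slot by a (timestep, layer) pair and apply the same filter at every step.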

Paper Structure (26 sections, 10 equations, 12 figures, 6 tables, 1 algorithm)


Figures (12)

  • Figure 1: Our TODInv framework seamlessly integrates the inversion process with editing tasks, enabling diverse high-fidelity text-guided edits such as object replacement, object removal, and stylization. The edited images not only retain the original background but also perfectly align with the target prompts.
  • Figure 2: Illustration of original and extended prompt spaces.
  • Figure 3: Overview of our TODInv. Given a real image, we first encode it into the initial latent code $z_0$ using the encoder of Stable Diffusion. At timestep $t$, we obtain the latent code $z_{t}$ from the latent code $z_{t-1}$ and the fixed source prompt embedding $p$ using Eq. \ref{eq.ddim_inversion2}, which introduces an approximation error. We then use $z_{t}$ to predict the latent code $z^{\prime}_{t}$ and minimize the distance between them by optimizing specific prompt embeddings according to the edit class. The final latent code $z_T$ can be combined with various editing methods, with the target prompts renewed using Eq. \ref{eq.ddim_inversion9} (the blue arrows). Note that only the structure of the "cake" is edited in this example, which makes it a structure edit, so we optimize only the appearance-related prompt embeddings (denoted by the colorful boxes without grids). For a more detailed illustration of how the optimization layers are selected, see Fig. \ref{fig:select}.
  • Figure 4: We categorize all kinds of editing tasks into three classes and divide different layers of U-Net into structure and appearance layers according to their resolutions. For each kind of editing, we only optimize the prompt embeddings that are irrelevant to this editing.
  • Figure 5: Qualitative comparison with various inversion methods using the P2P editing method.
  • ...and 7 more figures