Exploring Iterative Manifold Constraint for Zero-shot Image Editing

Maomao Li, Yu Li, Yunfei Liu, Dong Xu

TL;DR

This work tackles the trade-off between editability and fidelity in text-guided image editing by challenging the prevailing inversion-then-editing approach. It introduces ZZEdit, which locates a pivot latent $z_p$ along the inversion trajectory and employs a ZigZag process to gradually inject target guidance while preserving the pivot’s structure, framed as an iterative manifold constraint between ${\mathcal{M}}_{p}$ and ${\mathcal{M}}_{p-1}$. The method integrates with existing editing pipelines (e.g., P2P and PnP) and outperforms the inversion-then-editing pipeline on PIE-Bench in both quantitative metrics and qualitative visual results, reducing fidelity errors and improving editing consistency. The practical impact is a more reliable, plug-in editing paradigm for diffusion models that enables finer control without extensive fine-tuning or masking requirements.

Abstract

Editability and fidelity are two essential demands for text-driven image editing: the edited region should align with the target prompt, while the rest of the image should remain unchanged. Current cutting-edge editing methods usually follow an "inversion-then-editing" pipeline, in which the input image is inverted to an approximately Gaussian noise ${z}_T$, from which a sampling process is run using the target prompt. Nevertheless, we argue that near-Gaussian noise is a poor pivot for further editing, since it introduces many fidelity errors. We verify this with a pilot analysis, which shows that intermediate-inverted latents can achieve a better trade-off between editability and fidelity than the fully inverted ${z}_T$. Based on this, we propose a novel zero-shot editing paradigm dubbed ZZEdit, which first locates a qualified intermediate-inverted latent, denoted ${z}_p$, as a better editing pivot that is sufficient for editing while preserving structure. Then, a ZigZag process alternately executes denoising and inversion, progressively injecting target guidance into ${z}_p$ while preserving the structure information of step $p$. Afterwards, to equalize the number of inversion and denoising steps, we execute a pure sampling process under the target prompt. Essentially, ZZEdit performs an iterative manifold constraint between the manifolds ${\mathcal{M}}_{p}$ and ${\mathcal{M}}_{p-1}$, leading to fewer fidelity errors. Extensive experiments highlight the effectiveness of ZZEdit in diverse image editing scenarios compared with the "inversion-then-editing" pipeline.
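
The three phases described in the abstract (inversion to a pivot, a ZigZag loop, then pure denoising) can be sketched as a step schedule. This is a minimal illustration of the control flow only, not the authors' implementation. The names `zzedit_schedule`, `invert_step`, and `denoise_step` are hypothetical, and the latent is a toy scalar rather than a diffusion latent. The pivot index `p` is passed in directly because the abstract does not state the criterion for locating it. Using the target prompt for the re-inversion step inside the ZigZag loop is likewise an assumption.

```python
from typing import Callable, List, Tuple

Latent = float  # toy stand-in for a latent tensor (assumption for illustration)
Step = Callable[[Latent, int, str], Latent]  # (latent, timestep, prompt) -> latent


def zzedit_schedule(
    z0: Latent,
    p: int,
    K: int,
    invert_step: Step,
    denoise_step: Step,
    src_prompt: str,
    tgt_prompt: str,
) -> Tuple[Latent, List[str]]:
    """Run the three phases; return the final latent and a log of operations."""
    if p < 1 or K < 0:
        raise ValueError("need p >= 1 and K >= 0")
    log: List[str] = []
    z = z0

    # Phase 1: invert z_0 -> z_p under the source prompt (p inversion steps).
    for t in range(p):
        z = invert_step(z, t, src_prompt)
        log.append(f"invert {t}->{t + 1}")

    # Phase 2: ZigZag. K rounds of one denoising step (p -> p-1) under the
    # target prompt, each followed by one inversion step (p-1 -> p).
    for _ in range(K):
        z = denoise_step(z, p, tgt_prompt)
        log.append(f"denoise {p}->{p - 1}")
        # Assumption: the abstract does not say which prompt conditions
        # the re-inversion step; the target prompt is used here.
        z = invert_step(z, p - 1, tgt_prompt)
        log.append(f"invert {p - 1}->{p}")

    # Phase 3: pure denoising z_p -> z_0 under the target prompt (p steps).
    for t in range(p, 0, -1):
        z = denoise_step(z, t, tgt_prompt)
        log.append(f"denoise {t}->{t - 1}")

    return z, log


# Toy step functions: inversion adds one unit of "noise", denoising removes it.
def toy_invert(z: Latent, t: int, prompt: str) -> Latent:
    return z + 1.0


def toy_denoise(z: Latent, t: int, prompt: str) -> Latent:
    return z - 1.0


final, log = zzedit_schedule(0.0, p=3, K=2,
                             invert_step=toy_invert, denoise_step=toy_denoise,
                             src_prompt="a cat", tgt_prompt="a dog")
n_invert = sum(1 for s in log if s.startswith("invert"))
n_denoise = sum(1 for s in log if s.startswith("denoise"))
print(n_invert, n_denoise)
```

In this sketch the schedule uses `p + K` inversion steps and `K + p` denoising steps, which matches the abstract's statement that the final pure sampling phase equalizes the two counts.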

Paper Structure (19 sections, 9 equations, 13 figures, 3 tables, 1 algorithm)

Figures (13)

  • Figure 1: We propose a novel zero-shot editing paradigm dubbed ZZEdit, which demonstrates a finer trade-off between editability and fidelity than the commonly employed "inversion-then-editing" pipeline. Moreover, it seamlessly integrates with contemporary text-driven image editing methods, such as P2P (with DDIM inversion or Null-text inversion) and PnP (with DDIM inversion), enhancing their capabilities.
  • Figure 2: Left: The trajectories of the "inversion-then-editing" pipeline and our ZZEdit. (a) The former inverts ${\bm z}_0$ to ${\bm z}_T$ using $\mathcal{P}_{src}$, and then carries out denoising under $\mathcal{P}_{tgt}$. (b) The latter first locates a qualified intermediate-inverted latent, denoted ${\bm z}_p$, as a better editing pivot that is sufficient for editing while preserving structure. Then, a ZigZag process mildly applies target guidance by alternately executing one-step denoising and inversion $K$ times. Afterwards, a pure denoising process is run so that the numbers of inversion and denoising steps are equal. Right: Manifold illustration of the "inversion-then-editing" pipeline and our ZZEdit at steps $p$ and $p-1$. (c) The former shows noticeable fidelity loss between the denoised latent $\tilde{{\bm z}}_p$ and the ideal one ${{\bm z}}_p^*$ when reconstructing semantics from a noisy manifold ${\mathcal{M}}_T$. (d) The latter conducts an iterative manifold constraint on ${{\bm z}}_p$, into which target guidance is progressively injected without ruining the structure information of ${{\bm z}}_p$. The corresponding ${{\bm z}}_p^K$ is closer to the optimal point $\tilde{{\bm z}}_p^*$ for the next pure denoising process.
  • Figure 3: The cross-attention maps between different inverted latents ${\bm z}_t$ and the target prompt $\mathcal{P}_{tgt}$.
  • Figure 4: Ablation study of ZZEdit on P2P w/ DDIM inversion. The first row displays the results of using different inverted ${\bm z}_t$ as the editing pivot without the ZigZag process. The second row shows the performance when the ZigZag process is additionally used. Our method first locates a suitable pivot ${\bm z}_{p}$ (marked in purple) and then mildly applies target guidance, yielding the most elegant results.
  • Figure 5: Qualitative ablation of our ZigZag process with P2P and PnP, which mildly enhances the guidance at a suitable pivot ${\bm z}_{p}$.
  • ...and 8 more figures