Exploring Iterative Manifold Constraint for Zero-shot Image Editing
Maomao Li, Yu Li, Yunfei Liu, Dong Xu
TL;DR
This work tackles the trade-off between editability and fidelity in text-guided image editing by challenging the prevailing inversion-then-editing approach. It introduces ZZEdit, which locates a pivot latent $z_p$ along the inversion trajectory and employs a ZigZag process to gradually inject target guidance while preserving the pivot’s structure, framed as an iterative manifold constraint between ${\mathcal{M}}_{p}$ and ${\mathcal{M}}_{p-1}$. The method integrates with existing editing pipelines (e.g., P2P and PnP) and outperforms them on PIE-Bench in both quantitative metrics and qualitative visual results, reducing fidelity errors and improving editing consistency. The practical impact is a more reliable, plug-in editing paradigm for diffusion models that enables finer control without extensive fine-tuning or masking requirements.
Abstract
Editability and fidelity are two essential demands for text-driven image editing: the edited region should align with the target prompt, while the rest of the image should remain unchanged. Current state-of-the-art editing methods usually follow an "inversion-then-editing" pipeline, where the input image is inverted to an approximately Gaussian noise ${z}_T$, from which a sampling process is conducted using the target prompt. However, we argue that near-Gaussian noise is a poor pivot for further editing, since it introduces many fidelity errors. We verify this with a pilot analysis, which shows that intermediate-inverted latents can achieve a better trade-off between editability and fidelity than the fully-inverted ${z}_T$. Based on this, we propose a novel zero-shot editing paradigm dubbed ZZEdit, which first locates a qualified intermediate-inverted latent, denoted ${z}_p$, as a better editing pivot that is sufficient for editing while preserving structure. Then, a ZigZag process executes denoising and inversion alternately, progressively injecting target guidance into ${z}_p$ while preserving the structure information of step $p$. Afterwards, so that the numbers of inversion and denoising steps match, we execute a pure sampling process under the target prompt. Essentially, ZZEdit performs an iterative manifold constraint between the manifolds ${\mathcal{M}}_{p}$ and ${\mathcal{M}}_{p-1}$, leading to fewer fidelity errors. Extensive experiments highlight the effectiveness of ZZEdit in diverse image editing scenarios compared with the "inversion-then-editing" pipeline.
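
The three stages described in the abstract (invert to the pivot, ZigZag, pure sampling) can be sketched as a control-flow skeleton. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `zzedit_sketch`, the parameter `num_zigzag`, and the single-step callables `invert_step` and `denoise_step` are hypothetical placeholders, the latent is modelled as a plain float, and the pivot index `p` is taken as given because the abstract does not say how it is located.

```python
from typing import Callable, List, Tuple

Latent = float  # stand-in for a latent tensor (assumption for illustration)
Step = Callable[[Latent, int], Latent]


def zzedit_sketch(
    z0: Latent,
    p: int,
    num_zigzag: int,
    invert_step: Step,
    denoise_step: Step,
) -> Tuple[Latent, List[Tuple[str, int, int]]]:
    """Schematic control flow only; all names here are hypothetical.

    invert_step(z, t)  : one inversion step, moving z from step t to t + 1
    denoise_step(z, t) : one denoising step under the target prompt,
                         moving z from step t to t - 1
    """
    if p < 1:
        raise ValueError("pivot step p must be at least 1")

    trace: List[Tuple[str, int, int]] = []
    z = z0

    # Stage 1: invert the input latent up to the pivot step p
    # (instead of all the way to the near-Gaussian z_T).
    for t in range(p):
        z = invert_step(z, t)
        trace.append(("invert", t, t + 1))

    # Stage 2: ZigZag. Alternate one denoising step (p -> p-1) with one
    # inversion step (p-1 -> p), so the latent keeps returning to step p
    # while target guidance is injected on each denoising step.
    for _ in range(num_zigzag):
        z = denoise_step(z, p)
        trace.append(("denoise", p, p - 1))
        z = invert_step(z, p - 1)
        trace.append(("invert", p - 1, p))

    # Stage 3: pure sampling from step p down to step 0, which makes the
    # total number of denoising steps equal the number of inversion steps.
    for t in range(p, 0, -1):
        z = denoise_step(z, t)
        trace.append(("denoise", t, t - 1))

    return z, trace
```

With `p` inversion steps to reach the pivot and `num_zigzag` alternations, the sketch performs `p + num_zigzag` inversion steps and `p + num_zigzag` denoising steps in total, which is one way to read the abstract's requirement that the two step counts match. In a real pipeline the two callables would wrap a diffusion model's inversion and sampling updates.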
