Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model

Qi Mao, Lan Chen, Yuchao Gu, Mike Zheng Shou, Ming-Hsuan Yang

TL;DR

This work tackles the challenge of balancing fidelity and editability in text-based image editing using diffusion models. It introduces UnifyEdit, a tuning-free diffusion latent optimization framework that replaces attention injections with two constraints—Self-Attention Preservation for structure and Cross-Attention Alignment for editability—guided by an adaptive time-step scheduler. The method is validated on the Unify-Bench dataset, showing improved trade-offs across diverse editing tasks and outperforming state-of-the-art tuning-free and gradient-based baselines. The approach enables explicit, configurable control over edits without retraining, providing practical impact for robust, user-tunable image editing workflows. Limitations regarding highly non-rigid transformations are acknowledged, with future work aimed at extending the constraints to non-rigid self-attention dynamics.

Abstract

Balancing fidelity and editability is essential in text-based image editing (TIE), where failures commonly lead to over- or under-editing issues. Existing methods typically rely on attention injections for structure preservation and leverage the inherent text alignment capabilities of pre-trained text-to-image (T2I) models for editability, but they lack explicit and unified mechanisms to properly balance these two objectives. In this work, we introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization to enable a balanced integration of fidelity and editability within a unified framework. Unlike direct attention injections, we develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment for improved editability. However, simultaneously applying both constraints can lead to gradient conflicts, where the dominance of one constraint results in over- or under-editing. To address this challenge, we introduce an adaptive time-step scheduler that dynamically adjusts the influence of these constraints, guiding the diffusion latent toward an optimal balance. Extensive quantitative and qualitative experiments validate the effectiveness of our approach, demonstrating its superiority in achieving a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods. The source code will be available at https://github.com/CUC-MIPG/UnifyEdit.

Paper Structure

This paper contains 19 sections, 16 equations, 14 figures, 3 tables, 1 algorithm.

Figures (14)

  • Figure 1: Illustration of balancing fidelity and editability. We demonstrate examples of over-, balanced, and under-editing across six types of edits: (a) color change, (b) texture modification, (c) object replacement, (d) background editing, (e) global style transfer, and (f) human face attribute editing. Over-editing occurs when excessive changes distort the original image, while under-editing results in changes too subtle to meet the text prompt's requirements. In contrast, our UnifyEdit balances fidelity and editability within a unified framework, ensuring edits align with the text prompt while preserving the essential integrity of the original image.
  • Figure 2: UnifyEdit vs. dual-branch editing paradigm. (a) The typical dual-branch editing paradigm consists of source and target branches, using attention injection to maintain fidelity while relying on the text prompt to achieve editability. (b) In contrast, our method explicitly models the fidelity and editability using two attention-based constraints and performs latent optimization within a unified framework, facilitating an adaptive balance across various editing types.
  • Figure 3: Experiments with self-attention and cross-attention. (a) Compared to SA injection, the SA constraint offers greater flexibility in editing. (b) When the CA map accurately focuses on the target region with a strong response, the resulting edits align effectively with the text prompt. However, attention leakage or low attention values can lead to misalignment or ineffective editing outcomes.
  • Figure 4: Illustration of UnifyEdit. UnifyEdit is applied to the diffusion latent feature $z_t^\ast$ in the target branch, involving two key steps: 1) calculating $\mathcal{L}_{\rm{SAP}}$ and $\mathcal{L}_{\rm{CAA}}$ for fidelity and editability, and 2) applying an adaptive time-step scheduler for latent optimization.
  • Figure 5: Editing and visualization results of different gradients. (a) Using Eq. \ref{eq:a2} alone results in a significantly stronger influence of $\mathcal{L}_{\rm{CAA}}$, disabling $\mathcal{L}_{\rm{SAP}}$ and causing unbalanced guidance on $z_t$. (b) Although calculating their norms as in Eq. \ref{eq:a3} brings the magnitudes of the constraints closer, the irregular dynamics lead to either under-editing or over-editing failures. (c) In contrast, applying the adaptive time-step scheduler in Eq. \ref{eq:a4} shapes the gradient trends in Eq. \ref{eq:a5} such that $\nabla_{z_t^*}\mathcal{L}_{\rm{SAP}}$ starts small and gradually increases, whereas $\nabla_{z_t^*}\mathcal{L}_{\rm{CAA}}$ exhibits the opposite trend, facilitating fidelity-editability balance.
  • ...and 9 more figures
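
The gradient behavior described in the Figure 5 caption can be illustrated with a toy sketch. This is not the paper's implementation: the equations it refers to are not reproduced on this page, so the linear schedule, the function names (`scheduler_weights`, `normalize`, `guided_update`), the step size, and the use of a plain denoising-step index are all assumptions made for illustration. The sketch only shows the qualitative idea: normalize both constraint gradients so neither dominates by magnitude, then weight them so that the SAP term starts small and grows while the CAA term does the opposite.

```python
import numpy as np


def scheduler_weights(step, num_steps):
    """Hypothetical linear schedule (an assumption, not the paper's scheduler).

    Returns (w_sap, w_caa): the SAP weight rises from 0 to 1 over the
    denoising loop while the CAA weight falls from 1 to 0.
    """
    progress = step / max(num_steps - 1, 1)
    return progress, 1.0 - progress


def normalize(grad, eps=1e-8):
    """Scale a gradient to (approximately) unit L2 norm."""
    return grad / (np.linalg.norm(grad) + eps)


def guided_update(z, grad_sap, grad_caa, step, num_steps, step_size=0.1):
    """One latent update combining both normalized gradients.

    grad_sap and grad_caa stand in for the gradients of the SAP and CAA
    constraints with respect to the latent; here they are plain arrays.
    """
    w_sap, w_caa = scheduler_weights(step, num_steps)
    direction = w_sap * normalize(grad_sap) + w_caa * normalize(grad_caa)
    return z - step_size * direction


if __name__ == "__main__":
    z = np.zeros(2)
    grad_sap = np.array([1.0, 0.0])    # toy stand-in, small magnitude
    grad_caa = np.array([0.0, 10.0])   # toy stand-in, much larger magnitude
    for step in range(5):
        z = guided_update(z, grad_sap, grad_caa, step, num_steps=5)
        print(step, scheduler_weights(step, 5), z)
```

In this toy setup the early updates move the latent almost entirely along the CAA direction and the late updates along the SAP direction, regardless of the raw gradient magnitudes. A real scheduler would be driven by the model's attention maps and the paper's own equations rather than a fixed linear ramp.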