Improving Tuning-Free Real Image Editing with Proximal Guidance

Ligong Han; Song Wen; Qi Chen; Zhixing Zhang; Kunpeng Song; Mengwei Ren; Ruijiang Gao; Anastasis Stathopoulos; Xiaoxiao He; Yuxiao Chen; Di Liu; Qilong Zhangli; Jindong Jiang; Zhaoyang Xia; Akash Srivastava; Dimitris Metaxas

TL;DR

The paper tackles real image editing with diffusion models, where large classifier-free guidance improves edits but degrades DDIM reconstructions. It introduces proximal guidance integrated into Negative-Prompt Inversion (ProxNPI) and extends it to Mutual Self-Attention Control (ProxMasaCtrl), adding regularization and inversion- and reconstruction-guidance to reduce artifacts without requiring test-time optimization. The approach preserves source content while enabling cross-attention and geometry/layout edits, achieving efficient tuning-free editing with minimal overhead. Comprehensive ablations demonstrate how proximal guidance, thresholding strategies, and guidance steps affect reconstruction fidelity and editing quality, highlighting practical gains for tuning-free diffusion-based editing.

Abstract

DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales are used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories at larger CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free closed-form solution of NTI. However, it may introduce artifacts and is still constrained by DDIM reconstruction quality. To overcome these limitations, we propose proximal guidance and incorporate it into NPI with cross-attention control. We enhance NPI with a regularization term and reconstruction guidance, which reduces artifacts while retaining its training-free nature. Additionally, we extend these concepts to mutual self-attention control, enabling geometry and layout alterations in the editing process. Our method provides an efficient and straightforward approach that addresses real image editing tasks with minimal computational overhead.
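To make the role of NPI concrete, the following minimal sketch shows why replacing the null-prompt branch with the source-prompt branch keeps reconstruction independent of the CFG scale. It is an illustration only: the arrays are random stand-ins for noise predictions, and the names `cfg_noise`, `eps_null`, and `eps_src` are hypothetical, not taken from the authors' code.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
# Random stand-ins for model outputs (assumption: a real pipeline would call a
# denoising network here).
eps_null = rng.standard_normal((4, 8, 8))  # prediction with the null prompt
eps_src = rng.standard_normal((4, 8, 8))   # prediction with the source prompt

w = 7.5
# Plain CFG reconstruction with the source prompt: the result depends on w,
# so it drifts from the w=1 inversion trajectory.
standard = cfg_noise(eps_null, eps_src, w)
# NPI-style reconstruction: the unconditional branch is the source prompt,
# so the guidance term vanishes and the result equals eps_src for any w.
npi = cfg_noise(eps_src, eps_src, w)
```

With the source prompt in both branches, the guided prediction collapses to `eps_src`, which matches the prediction used during inversion with $w=1$; this is the property that lets NPI skip test-time optimization.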

Paper Structure (13 sections, 15 equations, 13 figures, 2 algorithms)

Introduction
Related Work
Method
Background
Proximal Negative-Prompt Inversion
Proximal Mutual Self-Attention Control
Experiment
Cross-Attention Control
Mutual Self-Attention Control
Ablations
Discussion and Conclusion
Proof of Remark 3.1
Reconstruction Guidance

Figures (13)

Figure 1: Proximal Negative-Prompt Inversion. A comparison of editing quality between Null-text inversion (NTI), Negative-prompt inversion (NPI), and our proposed method (ProxNPI). The bottom row represents the time required for inversion. Our approach incorporates the fast inversion capability of NPI without the need for test-time optimization, thereby incurring only minimal additional cost during inference.
Figure 2: Negative-Prompt Inversion ("NPI") is the exact closed-form solution if we solve Null-text inversion ("NTI") on the DDIM reconstruction sequence $\{\hat{z}_t\}$.
Figure 3: Illustration of a single inference step using classifier-free guidance (CFG) with a scale $w=2$. All methods initially use DDIM inversion (Song et al., 2021) with the source prompt (and $w=1$). During inference: (a) direct sampling is performed using the target prompt; (b) the null embedding is replaced with the source prompt embedding; (c) a proximal gradient step is applied to the scaled noise difference $(\epsilon_{tar}-\epsilon_{src})$ following step (b). Here, we visualize soft-thresholding with a threshold $\lambda$, which corresponds to L1 regularization on $\tilde{\epsilon}$. If all values are clamped to zero, ProxNPI reduces to DDIM reconstruction. Conversely, when all values are retained after thresholding, ProxNPI reduces to NPI.
Figure 4: Applying Negative-Prompt Inversion (NPI) to Mutual Self-Attention Control (MasaCtrl; Cao et al., 2023). Directly applying NPI to MasaCtrl by substituting the null embedding with the source prompt embedding leads to the presence of strange artifacts (labeled as "NPI w/ MasaCtrl"). In our approach, we solely replace the null embedding with the source prompt in the DDIM reconstruction branch.
Figure 5: Qualitative comparisons of inversion methods. The figure showcases qualitative comparisons among Null-text inversion (NTI; Mokady et al., 2022), Negative-prompt inversion (NPI; Miyake et al., 2023), and our proposed method (ProxNPI). Each row demonstrates the reconstruction results (columns 2-4) and editing results (columns 5-7) for the respective methods. Inversion guidance is employed to address minor errors in DDIM reconstruction. Errors or artifacts are marked using red circles or boxes. The comparisons highlight instances where NPI fails to retain specific image details (a), both NTI and NPI introduce undesired changes (b), the inversion guidance aids in recovering missing details (c), our method exhibits better background preservation (d), and NTI/NPI exhibit reconstruction errors (e).
...and 8 more figures
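
The proximal step described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a fixed scalar threshold `lam` and plain soft-thresholding (the proximal operator of the L1 norm), whereas the paper's ablations mention several thresholding strategies. The function names and the random stand-in arrays are hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||x||_1: shrink every entry toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_npi_noise(eps_src, eps_tar, w, lam):
    """NPI-style guidance with the scaled noise difference passed through soft-thresholding."""
    scaled_diff = w * (eps_tar - eps_src)
    return eps_src + soft_threshold(scaled_diff, lam)

rng = np.random.default_rng(0)
# Random stand-ins for the source- and target-prompt noise predictions (assumption).
eps_src = rng.standard_normal((4, 8, 8))
eps_tar = rng.standard_normal((4, 8, 8))
w = 2.0

# lam = 0 leaves the difference untouched: plain NPI guidance.
as_npi = prox_npi_noise(eps_src, eps_tar, w, lam=0.0)
# A threshold above every |scaled difference| zeroes the edit signal:
# the step falls back to DDIM reconstruction with the source prompt.
as_recon = prox_npi_noise(eps_src, eps_tar, w, lam=1e6)
# An intermediate threshold keeps only the larger differences.
partial = prox_npi_noise(eps_src, eps_tar, w, lam=1.0)
```

The two extreme thresholds reproduce the limiting cases named in the caption, and intermediate values trade editing strength against preservation of the source content.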

Theorems & Definitions (3)

Remark 3.1
Remark A.1
proof
