SimInversion: A Simple Framework for Inversion-Based Text-to-Image Editing

Qi Qian; Haiyang Xu; Ming Yan; Juhua Hu

TL;DR

This work investigates the approximation error in DDIM inversion and proposes disentangling the guidance scale for the source and target branches, which reduces the error while keeping the original framework. The change improves the performance of DDIM inversion dramatically without sacrificing efficiency.

Abstract

Diffusion models demonstrate impressive image generation performance with text guidance. Inspired by the learning process of diffusion, existing images can be edited according to text by DDIM inversion. However, vanilla DDIM inversion is not optimized for classifier-free guidance, and the accumulated error leads to undesirable editing results. While many algorithms have been developed to improve the DDIM inversion framework for editing, in this work we investigate the approximation error in DDIM inversion and propose to disentangle the guidance scale for the source and target branches, which reduces the error while keeping the original framework. Moreover, a better guidance scale (i.e., 0.5) than the default settings can be derived theoretically. Experiments on PIE-Bench show that our proposal improves the performance of DDIM inversion dramatically without sacrificing efficiency.
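
To make the idea of disentangled guidance scales concrete, the following is a minimal toy sketch rather than the authors' implementation. It assumes the classifier-free guidance convention $(1-w)\,\epsilon_\varnothing + w\,\epsilon_c$, the standard deterministic DDIM update, and random arrays standing in for a real noise predictor. The function names, the cumulative alpha values, and the target scale of 7.5 (a commonly used default) are illustrative assumptions; only the value 0.5 comes from the abstract, and applying it to the source branch is our reading of the section titles.

```python
import numpy as np


def cfg_noise(eps_uncond, eps_cond, w):
    """Classifier-free guidance, assumed form: (1 - w) * eps_uncond + w * eps_cond."""
    return (1.0 - w) * eps_uncond + w * eps_cond


def ddim_step(z_t, eps, alpha_t, alpha_prev):
    """Deterministic DDIM denoising step z_t -> z_{t-1} (eta = 0)."""
    z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * z0_pred + np.sqrt(1.0 - alpha_prev) * eps


def ddim_invert_step(z_prev, eps, alpha_t, alpha_prev):
    """DDIM inversion step z_{t-1} -> z_t for a given noise estimate eps."""
    z0_pred = (z_prev - np.sqrt(1.0 - alpha_prev) * eps) / np.sqrt(alpha_prev)
    return np.sqrt(alpha_t) * z0_pred + np.sqrt(1.0 - alpha_t) * eps


# Toy stand-ins for a real model's predictions (assumption: random arrays).
rng = np.random.default_rng(0)
z_prev = rng.standard_normal((4, 4))
eps_uncond = rng.standard_normal((4, 4))
eps_cond = rng.standard_normal((4, 4))
alpha_t, alpha_prev = 0.80, 0.90  # hypothetical cumulative alphas

# Disentangled scales: one for the source branch, one for the target branch.
w_s, w_t = 0.5, 7.5  # 0.5 from the abstract; 7.5 is an assumed default
eps_src = cfg_noise(eps_uncond, eps_cond, w_s)
eps_tgt = cfg_noise(eps_uncond, eps_cond, w_t)

# Invert one step with the source-branch noise estimate.
z_t = ddim_invert_step(z_prev, eps_src, alpha_t, alpha_prev)

# Denoising with the same estimate undoes the inversion step exactly.
z_back = ddim_step(z_t, eps_src, alpha_t, alpha_prev)
roundtrip_error = float(np.max(np.abs(z_back - z_prev)))

# Denoising with a different scale does not reproduce the source latent.
z_other = ddim_step(z_t, eps_tgt, alpha_t, alpha_prev)
mismatch_error = float(np.max(np.abs(z_other - z_prev)))

print(roundtrip_error, mismatch_error)
```

In this toy setting the round trip is exact only because the same noise estimate is reused in both directions. With a real model, the estimate used during inversion is evaluated at $z_{t-1}$ rather than $z_t$, which is the source of the approximation error the paper analyzes.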

Paper Structure (25 sections, 6 theorems, 16 equations, 13 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Simple Inversion
DDIM Inversion for Image Editing
Approximation Error in Dual Branch Image Editing
Simple Inversion with Symmetric Guidance Scale
Experiments
Quantitative Comparison on PIE-Bench
Qualitative Comparison
Ablation Study
Effect of $w_s$
Comparison of Running Time
Conclusion
Limitations
Broader Impacts
...and 10 more sections

Key Result

Proposition 1

Assuming that the gradient of $\epsilon$ with respect to $z_{t-1}$ is bounded as $\|J_\epsilon(z_{t-1})\|_F\leq c$, we have

Figures (13)

Figure 1: Illustration of image editing by DDIM inversion and ours. $z_0$ denotes the source image. $z_0^s$ and $z_0^t$ are generated images from the source and target branches, respectively.
Figure 2: Illustration of image editing for random editing. The difference is highlighted by red bounding boxes.
Figure 3: Illustration of image editing for changing object.
Figure 4: Illustration of image editing for adding object.
Figure 5: Illustration of image editing for deleting object. The difference is highlighted by red bounding boxes.
...and 8 more figures

Theorems & Definitions (12)

Proposition 1
Proposition 2
Proposition 3
Corollary 1
Proposition 4
proof
Corollary 2
proof
proof
proof
...and 2 more
