Exploring Low-Dimensional Subspaces in Diffusion Models for Controllable Image Editing

Siyi Chen; Huijie Zhang; Minzhe Guo; Yifu Lu; Peng Wang; Qing Qu

TL;DR

This work gives a theoretical justification for the local linearity and low-rankness of the posterior mean predictor (PMP) in diffusion models, and proposes LOCO Edit (LOw-rank COntrollable image editing), an unsupervised, single-step, training-free method for precise local editing.

Abstract

Recently, diffusion models have emerged as a powerful class of generative models. Despite their success, there is still limited understanding of their semantic spaces. This makes it challenging to achieve precise and disentangled image generation without additional training, especially in an unsupervised way. In this work, we improve the understanding of their semantic spaces from two intriguing observations: within a certain range of noise levels, (1) the learned posterior mean predictor (PMP) in the diffusion model is locally linear, and (2) the singular vectors of its Jacobian lie in low-dimensional semantic subspaces. We provide a solid theoretical basis to justify the linearity and low-rankness of the PMP. These insights allow us to propose an unsupervised, single-step, training-free LOw-rank COntrollable image editing (LOCO Edit) method for precise local editing in diffusion models. LOCO Edit identifies editing directions with desirable properties: homogeneity, transferability, composability, and linearity. These properties benefit greatly from the low-dimensional semantic subspace. Our method can further be extended to unsupervised or text-supervised editing in various text-to-image diffusion models (T-LOCO Edit). Finally, extensive experiments demonstrate the effectiveness and efficiency of LOCO Edit. The code will be released at https://github.com/ChicyChen/LOCO-Edit.
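The two observations in the abstract can be probed numerically on a toy model. The sketch below is only an illustration, not the authors' code: it assumes data drawn from a uniform mixture of zero-mean Gaussians supported on random low-dimensional subspaces, a forward process $\bm x_t = s\,\bm x_0 + \gamma\,\bm \epsilon$, and the closed-form posterior mean of that toy model as a stand-in for a trained PMP. The names `make_bases`, `toy_pmp`, `jacobian_fd`, `numerical_rank`, and `linearity_metrics`, the values of `s` and `gamma`, and the rank threshold are all hypothetical choices. The linearization is assumed to be the first-order Taylor expansion $\bm f(\bm x_t) + \lambda \bm J \Delta \bm x$.

```python
import numpy as np

def make_bases(d, r, K, rng):
    """K random orthonormal bases (each d x r), one per mixture component."""
    return [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]

def toy_pmp(x, bases, s, gamma):
    """E[x0 | x_t = x] for x0 ~ uniform mixture of N(0, U_k U_k^T) and
    x_t = s * x0 + gamma * noise. Toy stand-in for a learned PMP."""
    shrink = s / (s**2 + gamma**2)
    a = s**2 / (2 * gamma**2 * (s**2 + gamma**2))
    # Posterior component weights: softmax of a * ||U_k^T x||^2.
    logits = np.array([a * np.sum((U.T @ x) ** 2) for U in bases])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return shrink * sum(wk * (U @ (U.T @ x)) for wk, U in zip(w, bases))

def jacobian_fd(f, x, eps=1e-5):
    """Central finite-difference Jacobian of f at x (d x d)."""
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

def numerical_rank(J, rel_tol=1e-6):
    """Number of singular values above rel_tol times the largest one."""
    sv = np.linalg.svd(J, compute_uv=False)
    return int(np.sum(sv > rel_tol * sv[0]))

def linearity_metrics(f, J, x, dx, lam):
    """Norm ratio and cosine similarity between f(x + lam*dx) and its
    first-order approximation f(x) + lam * J @ dx."""
    exact = f(x + lam * dx)
    linear = f(x) + lam * (J @ dx)
    n_exact, n_linear = np.linalg.norm(exact), np.linalg.norm(linear)
    return n_exact / n_linear, float(exact @ linear) / (n_exact * n_linear)

rng = np.random.default_rng(0)
d, r, K = 64, 3, 3          # ambient dim, subspace dim, number of components
s, gamma = 0.6, 0.8         # hypothetical signal and noise scales
bases = make_bases(d, r, K, rng)
x_t = s * (bases[0] @ rng.standard_normal(r)) + gamma * rng.standard_normal(d)
f = lambda x: toy_pmp(x, bases, s, gamma)

J = jacobian_fd(f, x_t)
print("rank ratio:", numerical_rank(J) / d)

dx = rng.standard_normal(d)
dx /= np.linalg.norm(dx)
for lam in (0.1, 1.0, 5.0):
    print("lambda =", lam, linearity_metrics(f, J, x_t, dx, lam))
```

In this toy setting the output of the PMP always lies in the span of the component subspaces, so the Jacobian rank is at most `K * r` regardless of the ambient dimension `d`. A real network would need automatic differentiation rather than finite differences.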

Paper Structure

This paper contains 57 sections, 3 theorems, 32 equations, 11 figures, 1 table, 4 algorithms.

Introduction
Benefits of LOCO Edit.
Notations.
Preliminaries on Diffusion Models
Basics of diffusion models.
Learning the denoiser and estimation of the posterior mean.
DDIM and DDIM inversion.
Text-to-image (T2I) diffusion models and classifier-free guidance.
Exploring Linearity & Low-Dimensionality for Image Editing
Local Linearity and Intrinsic Low-Dimensionality in PMP
Key intuitions for precise image editing.
Low-rank Controllable Image Editing Method with Null-space Projection
Undirected LOCO Edit.
Text-directed LOCO Edit.
Justification of Local Linearity, Low-rankness, & Semantic Direction
...and 42 more sections

Key Result

Lemma 1

Under the data-distribution assumption, for $t \in (0, 1]$, the posterior mean is

Figures (11)

Figure 1: LOCO Edit Result. (a) The proposed method can perform precise localized editing in the region of interest. The editing direction is (b) homogeneous, (c) composable, and (d) linear.
Figure 2: Low-rankness of the Jacobian $\bm J_{\bm \theta, t}(\bm x_t)$ and local linearity of the PMP $\bm f_{\bm \theta, t}(\bm x_t)$. We evaluate DDPM (U-Net architecture) on the CIFAR-10 dataset, U-ViT (a Transformer-based network) on the CelebA and ImageNet datasets, and DeepFloyd IF trained on the LAION-5B dataset. (a) The rank ratio of $\bm J_{\bm \theta, t}(\bm x_t)$ against timestep $t$. (b) The $\ell_2$-norm ratio (top) and cosine similarity (bottom) between $\bm{f}_{\bm \theta,t}(\bm x_t+ \lambda \Delta \bm x)$ and $\bm l_{\bm \theta}(\bm x_t; \lambda \Delta \bm x)$ against step size $\lambda$ at timestep $t = 0.7$. The detailed experiment settings are provided in the appendix.
Figure 3: Illustration of the undirected LOCO Edit for unconditional diffusion models. Given an image $\bm x_0$, we perform DDIM-Inv until time $t$ and estimate $\bm{\hat{x}_{0,t}}$ from $\bm x_t$. After masking to get the region of interest (ROI) $\bm{\Tilde{x}_{0,t}}$ and its counterpart $\bm{\bar{x}_{0,t}}$, we find the edit direction $\bm v_p$ via SVD and null space projection using Jacobians. By denoising $\bm x_t + \textcolor{red}{\lambda \bm v_p}$, an image $\bm x'_0$ with a localized edit is generated. In this paper, the variables and notation related to the ROI, the null space, and the final direction are highlighted in green, blue, and red, respectively.
Figure 4: LOCO Edit on T2I diffusion models. (a) The undirected editing direction is found using only the given mask, without an editing prompt. (b) The text-directed editing direction is found with both a mask and an editing prompt such as "with glasses". Experiment details can be found in the appendix.
Figure 5: Unconditional editing of the proposed method on various datasets. For each group of three images, the original image is in the center, and the images edited along the negative and positive directions are on the left and right, respectively.
...and 6 more figures
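
The direction-finding step named in the Figure 3 caption (SVD followed by null space projection using Jacobians) can be sketched as follows. This is one plausible reading of the caption under stated assumptions, not the paper's algorithm: random low-rank matrices stand in for the Jacobians of the masked PMP inside the ROI (`J_roi`) and outside it (`J_bar`), the candidate direction is taken to be the top right singular vector of `J_roi`, and the projection removes its components along the top `r_null` right singular vectors of `J_bar`. The function names, matrix sizes, and ranks are hypothetical.

```python
import numpy as np

def low_rank(rows, d, r, rng):
    """Random rows x d matrix of rank r (stand-in for a Jacobian)."""
    return rng.standard_normal((rows, r)) @ rng.standard_normal((r, d))

def edit_direction(J_roi, J_bar, r_null):
    """Top right singular vector of J_roi, projected onto the orthogonal
    complement of the top r_null right singular vectors of J_bar."""
    _, _, Vt_roi = np.linalg.svd(J_roi, full_matrices=False)
    v = Vt_roi[0]                      # direction that changes the ROI most
    _, _, Vt_bar = np.linalg.svd(J_bar, full_matrices=False)
    V_bar = Vt_bar[:r_null].T          # d x r_null basis outside the ROI
    v_p = v - V_bar @ (V_bar.T @ v)    # null space projection
    return v_p / np.linalg.norm(v_p)

rng = np.random.default_rng(0)
d = 48                                  # dimension of x_t (toy value)
J_roi = low_rank(16, d, 4, rng)         # ROI pixels: 16 outputs, rank 4
J_bar = low_rank(32, d, 5, rng)         # remaining pixels: 32 outputs, rank 5

v_p = edit_direction(J_roi, J_bar, r_null=5)
print("change inside ROI :", np.linalg.norm(J_roi @ v_p))
print("change outside ROI:", np.linalg.norm(J_bar @ v_p))
```

Under local linearity, moving $\bm x_t$ by $\lambda \bm v_p$ changes the predicted image by roughly $\lambda \bm J \bm v_p$, so a direction in the null space of `J_bar` leaves the region outside the ROI approximately untouched to first order. Here `r_null` equals the exact rank of `J_bar`; with a real model it would be a truncation level chosen by the user.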

Theorems & Definitions (6)

Lemma 1
Theorem 1
Proof of Lemma 1 (posterior mean)
Lemma 2
Proof of Lemma 2 (Jacobian of the posterior mean)
Proof of Theorem 1
