Exploring the latent space of diffusion models directly through singular value decomposition

Li Wang, Boyan Gao, Yanran Li, Zhao Wang, Xiaosong Yang, David A. Clifton, Jun Xiao

TL;DR

The paper tackles the challenge of interpreting and editing the latent space of diffusion models by performing Singular Value Decomposition directly on latent codes across diffusion time steps. It reveals three key properties of latent subspaces and introduces an Attribute Vector Integration framework that learns to embed new attributes from paired prompts without data collection or auxiliary spaces, using a learned singular-value predictor and targeted loss terms. Extensive experiments across vision datasets and text-to-image pipelines demonstrate improved attribute control with preserved identity fidelity, supported by theoretical analysis of edit fidelity. The approach offers a data-free, theoretically grounded pathway to flexible image editing within the diffusion latent space, with potential for broad impact on controllable image synthesis and manipulation.

Abstract

Despite the groundbreaking success of diffusion models in generating high-fidelity images, their latent space remains relatively under-explored, even though it holds significant promise for versatile and interpretable image editing. The complicated denoising trajectory and the high dimensionality of the latent space make it extremely challenging to interpret. Existing methods mainly explore the feature space of the U-Net in Diffusion Models (DMs) rather than the latent space itself. In contrast, we directly investigate the latent space via Singular Value Decomposition (SVD) and discover three useful properties that can be used to control generation results without requiring data collection, while maintaining the identity fidelity of the generated images. Based on these properties, we propose a novel image editing framework that is capable of learning arbitrary attributes from one pair of latent codes determined by text prompts in Stable Diffusion Models. To validate our approach, we conduct extensive experiments that demonstrate its effectiveness and flexibility in image editing. We will release our code soon to foster further research and applications in this area.

Paper Structure

This paper contains 14 sections, 1 theorem, 13 equations, 8 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Given $x, z \in \mathcal{X}$ and their corresponding SVDs, $U_x, S_x, V_x = \text{SVD}(x)$ and $U_z, S_z, V_z = \text{SVD}(z)$, where $U_x, U_z \in \mathbb{R}^{M\times N}$. Let $k = \frac{N}{2}$; then the distance between the attribute vectors $\hat{U}$ and their source singular vectors $U_x$ and $U_z$ …, where $\hat{U} = [U_{x[:,:k]}, U^{'}_{z[:,:k]}]$.
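The construction of $\hat{U}$ in the theorem can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `build_attribute_basis` and the random $8 \times 6$ inputs are hypothetical, and because the excerpt does not define the primed matrix $U^{'}_z$, the sketch assumes it can be stood in for by the plain $U_z$.

```python
import numpy as np

def build_attribute_basis(x, z):
    """Concatenate the first k = N/2 left singular vectors of x and of z.

    Assumption: U'_z from the theorem is approximated by the plain U_z,
    since the excerpt does not say how the primed matrix is obtained.
    """
    # Reduced SVD: for an M x N input with M >= N, U has shape M x N.
    U_x, S_x, Vt_x = np.linalg.svd(x, full_matrices=False)
    U_z, S_z, Vt_z = np.linalg.svd(z, full_matrices=False)
    k = U_x.shape[1] // 2
    # Leading k columns from each source, side by side.
    U_hat = np.concatenate([U_x[:, :k], U_z[:, :k]], axis=1)
    return U_hat, U_x, U_z

# Hypothetical inputs standing in for two latent matrices x and z.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))
z = rng.standard_normal((8, 6))
U_hat, U_x, U_z = build_attribute_basis(x, z)
```

Here `U_hat` has the same shape as `U_x`, but its columns are generally no longer mutually orthogonal, because the two halves come from different decompositions.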

Figures (8)

  • Figure 1: Overview of our image editing framework. (1) During the denoising process, we select one time step $T_{x}$ at which to introduce new attributes. (2) Two latent codes, $x_{T_{x}}$ and $z_{T_x + \Delta \tau}$, guided by a pair of text prompts, are fed into our proposed AVI algorithm. (3) AVI outputs a latent code $y_{pred}$ that replaces $x_{T_x}$, and the denoising process continues under the guidance of the original text prompt. Note that the SVD is performed channel-wise.
  • Figure 2: Impact of singular values on their main singular vectors on Unconditional Diffusion Models (CelebA-HQ dataset on the left and LSUN-Cat and LSUN-Church datasets on the right). Fine-grained attributes, such as colours and texture appearances, are changing with respect to their singular values at earlier diffusion time steps.
  • Figure 3: Representative examples of the attributes that a single singular vector affects across time steps in Stable Diffusion Models (ver. 2.1). Starting from the second column, the rows show the impact of singular values, and the columns present the attributes that different singular vectors may affect. Text on the left side of the images denotes the attribute that a singular vector affects. Notably, attribute vectors (e.g., belt details) that are ranked lower at later time steps (e.g., the last row under 0.7$T$) ascend to higher ranks at earlier time steps (e.g., 0.6$T$).
  • Figure 4: Impact of singular values on their main singular vectors on Stable Diffusion Models (ver 2.1).
  • Figure 5: Geodesic distance across subspaces constructed by singular vectors of different datasets at various diffusion time steps. The variance for Stable Diffusion is noticeably lower than for diffusion models trained on the CelebA-HQ dataset; we therefore take the geodesic distance to indicate that the subspaces remain semantically similar across all time steps.
  • ...and 3 more figures
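
The captions above describe a channel-wise SVD of the latent code and the effect of changing a singular value. A minimal NumPy sketch of that operation follows. It is an illustration under stated assumptions, not the paper's code: the helper `scale_singular_value`, the latent shape `(4, 64, 64)`, and the scaling factor are all hypothetical choices.

```python
import numpy as np

def scale_singular_value(latent, index, factor):
    """Scale one singular value in every channel of a (C, H, W) latent.

    The SVD is batched over the leading (channel) axis, so each channel
    is decomposed independently.
    """
    U, S, Vt = np.linalg.svd(latent, full_matrices=False)
    S = S.copy()
    S[:, index] *= factor
    # Rebuild each channel as U @ diag(S) @ Vt.
    edited = (U * S[:, None, :]) @ Vt
    return edited

# Hypothetical latent code: 4 channels of 64 x 64 values.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 64, 64))

# Double the largest singular value of every channel.
edited = scale_singular_value(latent, index=0, factor=2.0)
```

With `factor=1.0` the function returns the input up to floating-point error, which gives a quick sanity check before trying other factors or indices.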

Theorems & Definitions (1)

  • Theorem 3.1