Latent Space Disentanglement in Diffusion Transformers Enables Precise Zero-shot Semantic Editing

Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

TL;DR

This paper investigates the latent space of DiT models and finds that it is inherently semantically disentangled: different semantic attributes can be controlled by specific editing directions. Building on this finding, it proposes a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing.

Abstract

Diffusion Transformers (DiTs) have recently achieved remarkable success in text-guided image generation. In image editing, DiTs project text and image inputs to a joint latent space, from which they decode and synthesize new images. However, it remains largely unexplored how multimodal information collectively forms this joint space and how it guides the semantics of the synthesized images. In this paper, we investigate the latent space of DiT models and uncover two key properties. First, DiT's latent space is inherently semantically disentangled: different semantic attributes can be controlled by specific editing directions. Second, consistent semantic editing requires utilizing the entire joint latent space, as neither the encoded image nor the encoded text alone contains enough semantic information. We show that these editing directions can be obtained directly from text prompts, enabling precise semantic control without additional training or mask annotations. Based on these insights, we propose a simple yet effective Encode-Identify-Manipulate (EIM) framework for zero-shot fine-grained image editing. Specifically, we first encode both the given source image and the text prompt that describes the image to obtain the joint latent embedding. Then, using our proposed Hessian Score Distillation Sampling (HSDS) method, we identify editing directions that control specific target attributes while preserving other image features. These directions are guided by text prompts and used to manipulate the latent embeddings. Moreover, we propose a new metric to quantify the degree of disentanglement of the latent space of diffusion models. Extensive experimental results and analysis on our newly curated benchmark dataset demonstrate DiT's disentanglement properties and the effectiveness of the EIM framework.

Paper Structure

This paper contains 40 sections, 2 theorems, 27 equations, 22 figures, 5 tables.

Key Result

Proposition 1

Let the editing directions $\mathbf{n}_1, \mathbf{n}_2, \dots, \mathbf{n}_m \in \mathbb{R}^d$ be unit vectors, i.e., $\|\mathbf{n}_i\| = 1$ for all $i = 1, \dots, m$. Define $\mathbf{n}_i^{\text{ext}} \in \mathbb{R}^{md}$ as the extension of $\mathbf{n}_i$ into $\mathbb{R}^{md}$ by placing [...] for any $\alpha \geq 1$ and $d \geq 4$. Here, $\mathbb{P}(\cdot)$ stands for probability and $c$ is [...]

Figures (22)

  • Figure 1: (a) UNet-based models align text embeddings with image embeddings via cross-attention layers. In contrast, DiT creates a joint latent space by combining the text embedding and image embedding, then feeds them into the denoising block. (b) DiT has a semantically disentangled latent space, where the intensities of image semantics in generated images are controlled by separate directions, which can be easily identified.
  • Figure 2: The disentanglement properties of the joint latent space of DiT enable the following: (a) Given a target semantic to be edited, we can identify a direction that allows fine-grained control of the intensity of this semantic; (b) With a reference image that differs only in the target semantic, we can obtain such a direction in the image embedding space and achieve similarly precise control.
  • Figure 3: To effectively modify semantics, we must manipulate the entire joint latent space. This is due to two observed challenges: (a) semantic loss, where certain semantics are randomly recovered in the denoising process if they are not conditioned; (b) manipulating only text embeddings is insufficient for modifying certain semantics effectively.
  • Figure 4: (a) Image-text cross-attention mechanism in UNet-based diffusion models. (b) Self-attention mechanism in the diffusion transformer, where image and text embeddings are concatenated and fed jointly into the attention block. In the self-attention mechanism, the attention map of a specific semantic contains less category information about other semantics, as shown in Sec. \ref{sec: probing}.
  • Figure 5: (a) Encode: Given a source image, we first use a multi-modal LLM to get a source prompt that describes the semantics of the image, and a target prompt for the desired image. We encode the source prompt and the given image into a joint latent embedding. (b) Identify: We obtain the editing direction in the text subspace by taking the difference between the embeddings of the source prompt and the target prompt. We propose a Hessian-Score-Distillation-Sampling method to identify the editing direction in the image subspace. (c) Manipulate: Finally, we linearly combine the joint latent embedding with the identified directions.
  • ...and 17 more figures
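
The text-subspace part of the Identify and Manipulate steps described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `strength` parameter, the sign convention (target minus source), and the toy 4-dimensional vectors are all assumptions made for illustration, and the image-subspace direction found by HSDS is not covered here.

```python
# Illustrative sketch only: toy vectors stand in for real DiT text embeddings.
# All names (text_direction, manipulate, strength) are hypothetical.

def text_direction(source_emb, target_emb):
    """Editing direction in the text subspace (assumed: target minus source)."""
    return [t - s for s, t in zip(source_emb, target_emb)]

def manipulate(embedding, direction, strength):
    """Linearly move an embedding along a direction, scaled by strength."""
    return [e + strength * d for e, d in zip(embedding, direction)]

# Toy "prompt embeddings" (assumed values, not outputs of any model).
source_prompt_emb = [0.5, 1.0, -2.0, 0.0]
target_prompt_emb = [0.5, 3.0, -2.0, 1.0]

direction = text_direction(source_prompt_emb, target_prompt_emb)

# strength = 0 leaves the embedding unchanged; strength = 1 reaches the target.
half_edit = manipulate(source_prompt_emb, direction, 0.5)
full_edit = manipulate(source_prompt_emb, direction, 1.0)
```

In this toy setting, varying `strength` between 0 and 1 interpolates between the source and target embeddings, which mirrors the idea of controlling the intensity of a semantic along a single direction.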

Theorems & Definitions (5)

  • Remark 1
  • Proposition 1
  • Proposition 2
  • proof
  • proof