Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

Zitao Shuai, Chenwei Wu, Zhengxu Tang, Bowen Song, Liyue Shen

TL;DR

This work investigates how diffusion-transformer latent spaces encode semantics and reveals a disentangled joint representation formed by text and image latents. It introduces the EMS framework, combining Extract, Manipulate, and Sample steps to achieve zero-shot, fine-grained editing by linearly adjusting text and image embeddings and applying constrained score-distillation sampling. A novel semantic disentanglement metric (SDE) and the ZOFIE benchmark quantify editing precision and disentanglement, with experiments showing superior performance of diffusion transformers over UNet-based models in maintaining non-target semantics. The findings offer a practical, training-free approach for controllable image editing and establish resources for reproducible evaluation in semantic editing research.

Abstract

Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation. However, how text and image latents individually and jointly contribute to the semantics of generated images remains largely unexplored. Through our investigation of the DiT latent space, we have uncovered key findings that unlock the potential for zero-shot fine-grained semantic editing: (1) Both the text and image spaces in DiTs are inherently decomposable. (2) These spaces collectively form a disentangled semantic representation space, enabling precise and fine-grained semantic control. (3) Effective image editing requires the combined use of both text and image latent spaces. Leveraging these insights, we propose a simple and effective Extract-Manipulate-Sample (EMS) framework for zero-shot fine-grained image editing. Our approach first uses a multi-modal Large Language Model to convert input images and editing targets into text descriptions. We then linearly manipulate text embeddings based on the desired editing degree and employ constrained score distillation sampling to manipulate image embeddings. We propose a new metric to quantify the degree of disentanglement of the latent space of diffusion models. To evaluate fine-grained editing performance, we introduce a comprehensive benchmark incorporating human annotations, manual evaluation, and automatic metrics. We conduct extensive experiments and in-depth analysis to uncover the semantic disentanglement properties of the diffusion transformer and to demonstrate the effectiveness of our proposed method. Our annotated benchmark dataset is publicly available at https://anonymous.com/anonymous/EMS-Benchmark, facilitating reproducible research in this domain.
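
The linear manipulation of text embeddings mentioned in the abstract can be illustrated with a minimal sketch. It assumes the edit is a straight-line interpolation between the embedding of a source prompt and the embedding of a target prompt, scaled by an editing degree; the abstract does not give the exact formula, so the function name `edit_text_embedding`, the parameter `degree`, and the toy vectors are all hypothetical.

```python
from typing import List

Vector = List[float]


def edit_text_embedding(source: Vector, target: Vector, degree: float) -> Vector:
    """Move `source` toward `target` along the straight line between them.

    Assumption (not taken from the paper): degree = 0 keeps the source
    embedding, degree = 1 reaches the target embedding, and values in
    between give intermediate edits.
    """
    if len(source) != len(target):
        raise ValueError("embeddings must have the same dimension")
    return [s + degree * (t - s) for s, t in zip(source, target)]


# Toy 4-dimensional vectors standing in for encoded prompts (hypothetical values).
src = [0.0, 1.0, 2.0, 3.0]  # stands in for the source-image description
tgt = [1.0, 1.0, 0.0, 5.0]  # stands in for the desired-image description

half = edit_text_embedding(src, tgt, 0.5)
print(half)  # [0.5, 1.0, 1.0, 4.0]
```

In this sketch only the coordinates where the two prompts differ change, which is one simple way to picture an edit that leaves non-target semantics alone.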

Paper Structure (25 sections, 2 theorems, 5 equations, 17 figures, 4 tables)

This paper contains 25 sections, 2 theorems, 5 equations, 17 figures, 4 tables.

Key Result

Proposition 1

Let the editing directions $\mathbf{n}_1, \mathbf{n}_2, \dots, \mathbf{n}_m \in \mathbb{R}^d$ be unit vectors, i.e., $\|\mathbf{n}_i\| = 1$ for all $i = 1, \dots, m$. Define $\mathbf{n}_i^{\text{ext}} \in \mathbb{R}^{md}$ as the extension of $\mathbf{n}_i$ into $\mathbb{R}^{md}$ by placing [...] for any $\alpha \geq 1$ and $d \geq 4$. Here, $\mathbb{P}(\cdot)$ stands for probability and $c$ is [...]

Figures (17)

  • Figure 1: (a). Comparison of the latent spaces of GAN-based models, UNet-based diffusion models, and diffusion transformers. (b). In classic modeling of visual generation and feature representation wang2021self, images are generated from unseen semantics through an implicit mapping. We aim to find a semantic representation space learned by neural networks in which visual semantics can be manipulated. (c). The text-to-image diffusion transformer possesses a disentangled semantic representation space, which facilitates precise and fine-grained editing of target semantics. We achieve this by identifying disentangled editing directions and modifying the original representations within the semantic representation space.
  • Figure 2: (a). We observe a semantic loss phenomenon: certain semantics are lost during the forward process and are assigned random values in the reverse process if they are not conditioned on. (b). The text latent space is decomposable and provides explicit directions for moving semantics toward the desired values in the edited images. For fine-grained and precise editing, we first obtain a text prompt describing the source image and a text prompt describing the desired image. We then manipulate the representations encoded from these two prompts. (c). Conditioning only on the editing targets results in inaccurate editing, where some task-irrelevant semantics are incorrectly modified.
  • Figure 3: (a). We observe that only manipulating text embeddings cannot effectively modify certain semantics. On the other hand, modifying the entire latent space, including score-distillation-sampling-based (SDS) manipulation on image embeddings, can help address this issue. (b). Given a reference image, image embeddings can also be linearly manipulated to achieve fine-grained editing. However, in real-world scenarios, an ideal reference image is often unavailable.
  • Figure 4: (a) Image-text cross-attention mechanism in UNet-based diffusion models. (b) Self-attention mechanism in the diffusion transformer, where image and text embeddings are concatenated and fed jointly into the attention block.
  • Figure 5: Results of the probing analysis. We report the ratios of attention maps from different models that are classified into each color category when tested against the corresponding color classifiers.
  • ...and 12 more figures
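
The contrast described in the Figure 4 caption can be sketched in a few lines. This is a simplified illustration only: it assumes a single attention head with identity query/key/value projections, and the token order in the concatenation is an arbitrary choice. The function names `cross_attention` and `joint_self_attention` and the toy token values are hypothetical and do not come from the paper.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention with identity projections (a simplification)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def cross_attention(image_tokens: Matrix, text_tokens: Matrix) -> Matrix:
    """UNet-style: image tokens are queries; text tokens supply keys and values.

    Only the image tokens are updated.
    """
    return attention(image_tokens, text_tokens, text_tokens)


def joint_self_attention(image_tokens: Matrix, text_tokens: Matrix) -> Matrix:
    """DiT-style: text and image tokens are concatenated and attend to each other.

    Both modalities are updated in the same attention block.
    """
    tokens = text_tokens + image_tokens  # concatenation order is an assumption
    return attention(tokens, tokens, tokens)


# Toy 2-dimensional tokens (hypothetical values).
img = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txt = [[0.5, 0.5], [1.0, -1.0]]

cross = cross_attention(img, txt)
joint = joint_self_attention(img, txt)
print(len(cross), len(joint))  # 3 5
```

The output sizes show the structural difference: cross-attention returns one updated vector per image token, while the joint block returns updated vectors for every text and image token.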

Theorems & Definitions (3)

  • Proposition 1
  • Remark 1
  • Proposition 2