A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Trung X. Pham, Kang Zhang, Ji Woo Hong, Chang D. Yoo

TL;DR

This work presents the first systematic study of class-conditioned embeddings in Transformer-based diffusion models and uncovers a notable redundancy, revealing a semantic bottleneck in these models.

Abstract

Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions (removing up to two-thirds of the embedding space), we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
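
The angular-similarity measurement described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the helper names `cosine_similarity` and `min_pairwise_similarity` are hypothetical, and the vectors are synthetic stand-ins built to be nearly parallel, not embeddings extracted from any trained model.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def min_pairwise_similarity(vectors):
    """Smallest cosine similarity over all distinct pairs of vectors."""
    return min(
        cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )


# Synthetic stand-in for conditional vectors (NOT the paper's data):
# one shared vector plus a small per-class perturbation, chosen only
# so that the resulting vectors are nearly parallel.
rng = random.Random(0)
dim, num_classes = 1152, 10
shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
embeddings = [
    [s + 0.05 * rng.gauss(0.0, 1.0) for s in shared]
    for _ in range(num_classes)
]

# Worst-case (lowest) similarity across all class pairs.
lowest = min_pairwise_similarity(embeddings)
```

Reporting the minimum over all pairs, rather than the mean, is one conservative way to check a claim of the form "similarity exceeds a threshold for nearly all class pairs"; the paper may summarize its similarity matrices differently.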

Paper Structure

This paper contains 44 sections, 16 equations, 43 figures, 7 tables.

Figures (43)

  • Figure 1: Hidden Semantic Bottleneck: Extreme Alignment and Dimensional Sparsity. Conditional vectors $\vec{c}$ in state-of-the-art Transformer diffusion models on ImageNet-1K exhibit very high pairwise cosine similarity (mostly 90–99%) while concentrating semantic information in only a few of 1,152 dimensions.
  • Figure 2: Transformer-based diffusion models inject conditions as globally compact vectors $\vec{v}_i$ via AdaLN for outputs such as images or mel-spectrograms.
  • Figure 3: Cosine similarity of conditional vectors $\vec{c}=y+t$ across 1000 ImageNet classes using REPA-XL [yu2025representation]. Despite distinct semantics, embeddings show over 99% similarity for nearly all class pairs. Left: full $1000\!\times\!1000$ matrix showing global alignment. Right: zoomed $10\!\times\!10$ subset for randomly chosen classes. Additional results for other SOTA methods appear in the Appendix.
  • Figure 4: Sparsity and alignment of conditional embeddings in X-MDPT [pham2024crossview]. (a), (b): With $\tau=0.1$, over 51% of components in the conditional vectors have magnitudes below the threshold, highlighting significant sparsity. Remarkably, pruning these dimensions has minimal effect on generation quality. (c): Cosine similarity between random test samples in DeepFashion exceeds 99.9%, confirming extreme alignment across conditional embeddings.
  • Figure 5: Histogram of component magnitudes of the learned conditional embedding $\vec{c}\in\mathbb{R}^{1152}$. Most dimensions have near-zero values ($<0.01$), with only roughly 5 to 20 dimensions showing dominant magnitudes. This sparsity holds across multiple models, including DiT, MDT, LightningDiT, MG, SiT, and REPA. Best viewed at 300% zoom.
  • ...and 38 more figures
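
The magnitude thresholding mentioned in the captions of Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name `prune_low_magnitude` is hypothetical, pruning is assumed to mean setting sub-threshold components to zero, and the toy vector is made up.

```python
def prune_low_magnitude(vector, tau=0.1):
    """Zero every component whose absolute value is below tau.

    Returns the pruned vector and the fraction of components removed.
    Assumption: "pruning" is modeled here as zeroing; the paper may
    implement it differently.
    """
    pruned = [x if abs(x) >= tau else 0.0 for x in vector]
    removed = sum(1 for x in vector if abs(x) < tau)
    return pruned, removed / len(vector)


# Hypothetical toy vector: two dominant "head" values, the rest near zero.
toy = [2.5, -1.8, 0.03, -0.07, 0.009, 0.0]
pruned, fraction = prune_low_magnitude(toy, tau=0.1)
```

In this toy case the two large components survive and the remaining four are zeroed, so the removed fraction is 4/6. The threshold $\tau=0.1$ matches the value quoted in the Figure 4 caption; any other value is a free choice.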