Rare Text Semantics Were Always There in Your Diffusion Transformer

Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

TL;DR

Rare prompts challenge diffusion-transformer outputs because scarce concepts struggle to imprint during pretraining. ToRA provides a training-free intervention that surfaces latent semantics by splitting text embeddings into a principal space for Token Spacing and a residual space for Residual Alignment, with per-block PCA and Givens rotations to steer semantics toward desired directions; this yields improved rare-semantic emergence without external modules and generalizes to text-to-image, text-to-video, and image editing. Variance scale-up enhances local isotropy, while global anisotropy remains largely harmless to semantic emergence; however, alignment can fail for some seeds, motivating the Residual Alignment step. Across RareBench, T2I-CompBench, GenEval, and editing benchmarks, ToRA improves semantic coherence and visual fidelity for imaginative prompts while preserving performance on common prompts, offering a practical, plug-in approach to elevate internal semantic alignment in MM-DiTs.
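
The split into a principal space and a residual space, and the use of Givens rotations, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pca_split` and `givens_rotation`, the number of principal components `k`, the rotation plane `(0, 1)`, and the angle are all hypothetical placeholders, since this summary does not state how ToRA selects them.

```python
import numpy as np

def pca_split(text_emb, k):
    """Split token embeddings (n_tokens, dim) into a rank-k principal part
    and the remaining residual part, both measured around the token mean."""
    mean = text_emb.mean(axis=0, keepdims=True)
    centered = text_emb - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                              # (k, dim)
    principal = centered @ basis.T @ basis      # projection onto top-k directions
    residual = centered - principal             # everything the projection misses
    return mean, principal, residual, basis

def givens_rotation(dim, i, j, theta):
    """Return a dim x dim rotation by angle theta in the (i, j) coordinate plane."""
    g = np.eye(dim)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i], g[j, j] = c, c
    g[i, j], g[j, i] = -s, s
    return g

# Toy stand-in for the text embeddings of one transformer block (assumed shape).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))

mean, principal, residual, basis = pca_split(tokens, k=3)

# Placeholder plane and angle; a real method would choose these deliberately.
G = givens_rotation(16, 0, 1, np.pi / 6)
rotated = residual @ G.T
```

A Givens rotation is orthogonal, so it changes the direction of the residual component without changing its norm, which is why it is a natural tool for steering a representation without rescaling it.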

Abstract

Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, which advanced models still falter in generating, since their concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models). In particular, the joint-attention mechanism intrinsic to MM-DiT sequentially updates text embeddings alongside image embeddings throughout transformer blocks. We find that by mathematically expanding representational basins around text token embeddings via variance scale-up before the joint-attention blocks, rare semantics clearly emerge in MM-DiT's outputs. Furthermore, our results generalize effectively across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. Our work invites generative models to reveal the semantics that users intend, once hidden yet ready to surface.
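
As a rough illustration of what "variance scale-up" of text token embeddings could look like, the sketch below stretches each token's deviation from the token mean by a constant factor. It is a sketch under stated assumptions, not the paper's procedure: the function name `scale_up_variance`, the factor `gamma`, and the choice to scale around the per-prompt token mean are all assumptions made for illustration.

```python
import numpy as np

def scale_up_variance(text_emb, gamma=1.5):
    """Stretch token embeddings (n_tokens, dim) away from their mean by `gamma`.

    The mean is left unchanged; only the spread around it grows.
    """
    mean = text_emb.mean(axis=0, keepdims=True)
    return mean + gamma * (text_emb - mean)

# Toy stand-in for text embeddings entering a joint-attention block (assumed shape).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))

scaled = scale_up_variance(tokens, gamma=1.5)

# Total variance equals the sum of the covariance eigenvalues (its trace).
total_var_before = tokens.var(axis=0).sum()
total_var_after = scaled.var(axis=0).sum()
```

Under this construction the total variance, and therefore the sum of the covariance eigenvalues, grows by a factor of `gamma ** 2`, while the mean embedding stays fixed.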

Paper Structure

This paper contains 44 sections, 13 equations, 42 figures, 5 tables.

Figures (42)

  • Figure 1: Our method, ToRA, achieves superior semantic alignment in text-to-vision outputs for rare prompts while requiring no finetuning, optimization, or additional modules. Misfired phrases in the outputs of the baseline and existing methods are highlighted in red.
  • Figure 2: Comparison of existing diffusion-transformer methods: (a) Finetuning, (b) Optimization-based, (c) LLM-grounded guidance, and (d) Ours. (e) Contrastive results between rare and common prompts, comparing baseline and ours. The rotating arrow in (b) shows the latent-vector update loop at each timestep.
  • Figure 2: Performance comparison between baseline and ours in Image Editing.
  • Figure 3: Effects of variance scaling. (a) Sum of eigenvalues of text embeddings across joint-attention blocks, (b) Generated images illustrating visual outcomes, (c) Local isotropy scores across joint-attention blocks, and (d) Visualization of self-attention maps for text embeddings. Results are shown for variance scale-down, original, or scale-up.
  • Figure 3: Comparison results on broader applicability in Text-to-Image generation using GenEval (Ghosh et al., 2023) and T2I-CompBench (Huang et al., 2023). Darker cells indicate the best scores; lighter cells indicate the second-best.
  • ...and 37 more figures