Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin

TL;DR

This work investigates Massive Activations (MAs) in Diffusion Transformers (DiTs) and reveals that MAs occur across all layers and tokens, with magnitudes shaped primarily by timestep embeddings rather than text prompts. The authors demonstrate that MAs are crucial for fine-grained local detail synthesis while barely affecting global semantic content. They propose Detail Guidance (DG), a training-free, MA-driven self-guidance strategy that degrades MA-driven detail generation in a controlled way and uses this degraded model to steer the original DiT toward higher-detail outputs; DG can be combined with Classifier-Free Guidance (CFG) for simultaneous improvements in detail fidelity and semantic alignment. Extensive experiments across SD3, SD3.5, and Flux show that DG consistently enhances local detail quality and integrates smoothly with CFG, often achieving state-of-the-art detail realism on challenging benchmarks.

Abstract

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that massive activations occur across all spatial tokens and that their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that massive activations play a key role in local detail synthesis while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy that explicitly enhances local detail fidelity for DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. DG integrates seamlessly with Classifier-Free Guidance (CFG), enabling further refinement of fine-grained details. Extensive experiments demonstrate that DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
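
The abstract describes DG as guiding the original network with a degraded, detail-deficient copy of itself. The sketch below shows one common way such self-guidance is written: extrapolating from the degraded prediction past the original one. The abstract does not give the exact update rule, so the formula, the function names (`guided_prediction`, `combine_with_cfg`), the scale parameters, and the way the DG and CFG terms are summed are all assumptions made for illustration. Plain Python lists stand in for model output tensors.

```python
from typing import List, Sequence


def guided_prediction(
    pred_orig: Sequence[float],
    pred_degraded: Sequence[float],
    scale: float,
) -> List[float]:
    """Push the original prediction away from the degraded one.

    Assumed form: pred_orig + scale * (pred_orig - pred_degraded).
    With scale == 0 this returns the original prediction unchanged.
    """
    if len(pred_orig) != len(pred_degraded):
        raise ValueError("predictions must have the same length")
    return [o + scale * (o - d) for o, d in zip(pred_orig, pred_degraded)]


def combine_with_cfg(
    pred_cond: Sequence[float],
    pred_uncond: Sequence[float],
    pred_degraded: Sequence[float],
    cfg_scale: float,
    dg_scale: float,
) -> List[float]:
    """Hypothetical combination of CFG with a detail-guidance term.

    The first part is the standard CFG form,
    pred_uncond + cfg_scale * (pred_cond - pred_uncond).
    The added dg_scale * (pred_cond - pred_degraded) term is an assumption.
    """
    if not (len(pred_cond) == len(pred_uncond) == len(pred_degraded)):
        raise ValueError("predictions must have the same length")
    return [
        u + cfg_scale * (c - u) + dg_scale * (c - d)
        for c, u, d in zip(pred_cond, pred_uncond, pred_degraded)
    ]
```

With `cfg_scale=1.0`, `combine_with_cfg` reduces to `guided_prediction` applied to the conditional prediction, so the two terms can be tuned independently in this sketch.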

Paper Structure

This paper contains 34 sections, 10 equations, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Visual results of our Detail Guidance (DG). Left: DG explicitly enhances fine-grained visual details, yielding high-quality outputs. Right: DG integrates seamlessly with Classifier-Free Guidance (CFG), allowing for further refinement of details.
  • Figure 2: Massive Activations in DiTs. The activation magnitudes of internal hidden states. We present the average magnitudes over 1,000 text prompts. Massive activations are consistently concentrated in a few fixed dimensions across all image patch tokens.
  • Figure 3: Illustration of several properties of massive activations in DiT-XL. (a) Activation distribution of the hidden states along DiT layers. (b) Activation distribution of the hidden states along training iterations. (c) Activation distribution of the hidden states across different model sizes. Massive activations occur throughout all layers and persist across different model sizes.
  • Figure 4: Impact of the input timestep and text on Massive Activations (MAs) in SD3. (a) Comparison of the distributions of hidden-state $z_t^k$ activations and their corresponding residual scaling factor $\alpha_k$. (b) Respective impact of input timestep and text embeddings on the magnitude distribution of MAs, where we compare the MAs of 1,000 different text inputs. The massive activations are governed by the residual scaling factor; their magnitude is primarily shaped by the input timestep embedding $t$, while text embeddings $c$ have negligible effect.
  • Figure 5: Comparison of the original and MAs-disrupted models. (a) Sampling results of the original and MAs-disrupted models for SD3. (b) Win-probability comparison between the models, evaluated from two perspectives: Prompt Alignment (BLIPScore and CLIPScore) and Local Detail Quality (HPSv2.1 and LAION-Aesthetics). Disrupting massive activations markedly degrades the fidelity of fine-grained details in the generated images while exerting minimal impact on semantic content.
  • ...and 18 more figures
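
The Figure 2 caption notes that massive activations concentrate in a few fixed dimensions across all image patch tokens. A minimal sketch of how such dimensions might be located is given below: average the absolute activation per channel over tokens, then flag channels far above the typical channel. The function names and the `ratio` threshold relative to the median are hypothetical choices for illustration, not the paper's criterion. Nested Python lists stand in for a hidden-state tensor of shape (tokens, channels).

```python
from statistics import median
from typing import List, Sequence


def mean_abs_per_channel(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Average |activation| over tokens for each channel."""
    if not hidden or not hidden[0]:
        raise ValueError("hidden must be a non-empty (tokens, channels) array")
    num_tokens = len(hidden)
    num_channels = len(hidden[0])
    return [
        sum(abs(token[c]) for token in hidden) / num_tokens
        for c in range(num_channels)
    ]


def massive_channels(
    hidden: Sequence[Sequence[float]], ratio: float = 100.0
) -> List[int]:
    """Return channel indices whose mean magnitude exceeds ratio * median.

    The median-based threshold is an assumed heuristic.
    """
    magnitudes = mean_abs_per_channel(hidden)
    typical = median(magnitudes)
    return [c for c, m in enumerate(magnitudes) if m > ratio * typical]


# Toy hidden states: 2 tokens, 4 channels, with channel 2 much larger.
toy_hidden = [
    [0.1, -0.2, 50.0, 0.1],
    [-0.1, 0.2, 60.0, -0.1],
]
flagged = massive_channels(toy_hidden)
```

Averaging over tokens before thresholding reflects the caption's observation that the same dimensions are large for every patch token; a per-token check would be needed if that did not hold.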