Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers
Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin
TL;DR
This work investigates Massive Activations (MAs) in Diffusion Transformers (DiTs) and reveals that MAs occur across all layers and tokens, with magnitudes shaped primarily by timestep embeddings rather than text prompts. The authors demonstrate that MAs are crucial for fine-grained local detail synthesis while barely affecting global semantic content. They propose Detail Guidance (DG), a training-free, MA-driven self-guidance strategy. DG disrupts MAs in a controlled way to obtain a degraded, detail-deficient model, then uses that model to steer the original DiT toward higher-detail outputs. DG can be combined with Classifier-Free Guidance (CFG) to improve detail fidelity and semantic alignment together. Extensive experiments across SD3, SD3.5, and Flux show that DG consistently enhances local detail quality and integrates smoothly with CFG.
Abstract
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal Massive Activations (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We find that massive activations occur across all spatial tokens and that their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis while having minimal impact on the overall semantic content of the output. Building on these insights, we propose Detail Guidance (DG), an MA-driven, training-free self-guidance strategy that explicitly enhances local detail fidelity in DiTs. Specifically, DG constructs a degraded "detail-deficient" model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. DG integrates seamlessly with Classifier-Free Guidance (CFG), enabling further refinement of fine-grained details. Extensive experiments demonstrate that DG consistently improves fine-grained detail quality across various pre-trained DiTs (e.g., SD3, SD3.5, and Flux).
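The self-guidance idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: "disrupting MAs" is modeled as zeroing the channels with the largest mean absolute activation, guidance is modeled as the usual extrapolation away from the weaker prediction, and the combination with CFG is modeled as a simple sum of the two guidance terms. All function names and the parameters `top_k`, `dg_scale`, and `cfg_scale` are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]  # feature map laid out as rows=tokens, columns=channels


def disrupt_massive_activations(features: Matrix, top_k: int = 1) -> Tuple[Matrix, List[int]]:
    """Return a copy of `features` with the `top_k` largest-magnitude channels zeroed.

    Assumption: zeroing stands in for whatever disruption the paper applies.
    """
    num_tokens = len(features)
    num_channels = len(features[0])
    # Mean absolute activation per channel, averaged over tokens.
    magnitude = [
        sum(abs(row[c]) for row in features) / num_tokens for c in range(num_channels)
    ]
    massive = sorted(range(num_channels), key=lambda c: magnitude[c], reverse=True)[:top_k]
    degraded = [
        [0.0 if c in massive else value for c, value in enumerate(row)] for row in features
    ]
    return degraded, sorted(massive)


def detail_guidance(
    pred_original: Sequence[float], pred_degraded: Sequence[float], dg_scale: float
) -> List[float]:
    """Push the original prediction away from the detail-deficient one."""
    return [o + dg_scale * (o - d) for o, d in zip(pred_original, pred_degraded)]


def cfg_with_detail_guidance(
    pred_cond: Sequence[float],
    pred_uncond: Sequence[float],
    pred_degraded: Sequence[float],
    cfg_scale: float,
    dg_scale: float,
) -> List[float]:
    """Assumed combination: standard CFG output plus an additive detail-guidance term."""
    return [
        u + cfg_scale * (c - u) + dg_scale * (c - d)
        for c, u, d in zip(pred_cond, pred_uncond, pred_degraded)
    ]
```

In this sketch, setting `dg_scale` to 0 recovers the original prediction (or plain CFG in the combined form), so the detail term acts as an optional add-on to an existing sampler.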
