Steering Video Diffusion Transformers with Massive Activations

Xianhang Cheng, Yujian Zheng, Zhenyu Xie, Tingting Liao, Hao Li

Abstract

Despite rapid progress in video diffusion transformers, how their internal signals can be leveraged with minimal overhead to enhance video generation quality remains underexplored. In this work, we study the role of Massive Activations (MAs), which are rare, high-magnitude hidden-state spikes in video diffusion transformers. We observe that MAs emerge consistently across all visual tokens, with a clear magnitude hierarchy: first-frame tokens exhibit the largest MA magnitudes, latent-frame boundary tokens (the head and tail portions of each temporal chunk in the latent space) show elevated but slightly lower MA magnitudes than the first frame, and interior tokens within each latent frame remain elevated, yet are comparatively moderate in magnitude. This structured pattern suggests that the model implicitly prioritizes token positions aligned with the temporal chunking in the latent space. Based on this observation, we propose Structured Activation Steering (STAS), a training-free, self-guidance-like method that steers MA values at first-frame and boundary tokens toward a scaled global maximum reference magnitude. STAS consistently improves video quality and temporal coherence across different text-to-video models while introducing negligible computational overhead.
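
The steering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single MA dimension whose per-token values are given as a plain list, that the indices of first-frame and boundary tokens are already known, and that "toward" means a linear blend. The names `steer_massive_activations`, `scale`, and `strength` are hypothetical and do not come from the paper.

```python
import math


def steer_massive_activations(values, steer_idx, scale=1.0, strength=1.0):
    """Move selected tokens toward scale * max|value| along one MA dimension.

    values:    per-token activations on a single MA dimension (list of floats).
    steer_idx: indices of first-frame and boundary tokens (assumed to be given).
    scale:     hypothetical factor applied to the global maximum magnitude.
    strength:  hypothetical blend weight in [0, 1]; 1.0 snaps to the reference.
    """
    # Reference magnitude: the scaled global maximum over all tokens.
    reference = scale * max(abs(v) for v in values)
    steered = list(values)
    for i in steer_idx:
        # Keep the original sign and only change the magnitude.
        target = math.copysign(reference, values[i])
        steered[i] = (1.0 - strength) * values[i] + strength * target
    return steered


# Toy example: tokens 0-1 play the role of the first frame,
# tokens 4, 5, and 7 the role of latent-frame boundary tokens.
values = [40.0, 38.0, 10.0, 9.0, 20.0, 22.0, 8.0, 21.0]
steered = steer_massive_activations(values, [0, 1, 4, 5, 7], scale=1.0, strength=0.5)
print(steered)
```

Interior tokens (indices 2, 3, and 6 in the toy example) are left untouched, which mirrors the abstract's statement that only first-frame and boundary tokens are steered.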

Paper Structure

This paper contains 44 sections, 5 equations, 18 figures, 9 tables, 1 algorithm.

Figures (18)

  • Figure 1: Activation magnitudes across different DiTs (averaged over 100 text prompts). (a–c) 3D bar charts of hidden-state activations over the patch-token and feature-dimension axes for three video DiTs; (d) the same visualization for an image DiT (FLUX). The image DiT (d) exhibits near-uniform token magnitudes along MA dimensions, whereas the video DiTs (a–c) display pronounced structure: the highest responses occur in the first latent frame, with recurring spikes at latent-frame boundaries. Panel (e) further confirms that boundary tokens consistently receive higher MA values than interior tokens, with the first latent frame attaining the largest values overall.
  • Figure 2: Temporal analysis of MA dimension 1188 in the middle block (block 15) of Wan2.1-1.3B, averaged over 100 prompts. (a) First-frame tokens exhibit the highest MA values, followed by boundary and then interior tokens, and all three decrease monotonically throughout the sampling process. (b) The boundary-to-interior ratio also declines, revealing that the latent-frame boundary signal is most pronounced in the early denoising stages.
  • Figure 3: Impact of MA manipulation on the quality of the first generated frame in Wan2.1-1.3B. Disrupting MAs degrades first-frame quality (b–c), and amplifying MAs at all tokens also leads to inferior results (d). In contrast, amplifying MAs only at the first-frame tokens improves the first frame (e–f). (g) Aesthetic-quality scores.
  • Figure 4: Impact of amplifying MAs at latent-frame boundaries on temporal consistency (Wan2.1-1.3B, boundary = head 8% + tail 8% of each latent frame). Top: DINO and CLIP similarity between consecutive frame pairs. We highlight the middle within-chunk transition (green) and the cross-chunk transition (yellow). Cross-chunk transitions exhibit lower similarity, and amplifying MA dimensions at boundary positions (red curve) substantially reduces these dips. Bottom: Frame samples at cross-chunk boundaries (frames 28–29 and 32–33). The original video shows abrupt appearance changes (e.g., umbrella color), while the boundary-MA-amplified video maintains visual consistency.
  • Figure 5: Qualitative comparison with and without STAS on Wan2.1-1.3B.
  • ...and 13 more figures
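
The token grouping used in the captions of Figures 2 and 4 can be sketched as follows. This is only an illustration under stated assumptions: tokens are taken to be flattened frame by frame with a fixed number of tokens per latent frame, the boundary is the head 8% plus tail 8% of each latent frame as stated for Figure 4, and the boundary-to-interior ratio is taken to be a ratio of mean absolute values. The function names and the rounding rule are hypothetical and not taken from the paper.

```python
def label_tokens(num_frames, tokens_per_frame, boundary_frac=0.08):
    """Label each token as 'first', 'boundary', or 'interior'.

    Assumes a frame-major token layout. boundary_frac follows the
    'head 8% + tail 8%' setting in the Figure 4 caption; rounding to the
    nearest whole token is an assumption of this sketch.
    """
    k = round(boundary_frac * tokens_per_frame)  # boundary tokens per side
    labels = []
    for frame in range(num_frames):
        for pos in range(tokens_per_frame):
            if frame == 0:
                labels.append("first")
            elif pos < k or pos >= tokens_per_frame - k:
                labels.append("boundary")
            else:
                labels.append("interior")
    return labels


def boundary_to_interior_ratio(values, labels):
    """Ratio of mean |value| over boundary tokens to that over interior tokens."""
    def mean_abs(kind):
        picked = [abs(v) for v, lab in zip(values, labels) if lab == kind]
        return sum(picked) / len(picked)

    return mean_abs("boundary") / mean_abs("interior")


# Toy example: 2 latent frames of 25 tokens each, so 2 boundary tokens per side.
labels = label_tokens(num_frames=2, tokens_per_frame=25)
values = [40.0] * 25 + [20.0] * 2 + [10.0] * 21 + [20.0] * 2
print(boundary_to_interior_ratio(values, labels))
```

Tracking this ratio at each sampling step would give a curve comparable in spirit to Figure 2(b), although the exact statistic used by the authors is not specified in the captions.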