Analysis of Attention in Video Diffusion Transformers

Yuxin Wen, Jim Wu, Ajay Jain, Tom Goldstein, Ashwinee Panda

TL;DR

This work presents a systematic analysis of attention in Video Diffusion Transformers (VDiTs), identifying three core properties—Structure, Sparsity, and Sinks—that shape efficiency and controllability. It demonstrates that attention is highly structured by spatiotemporal locality, enabling zero-shot video editing through self-attention transfer, and reveals a text-token dominance where the first token largely governs generation. The sparsity study shows that only a few late-layer heads are highly sensitive to pruning, while sparsity can be pushed further in other layers, especially when combined with temperature control and retraining strategies. Attention sinks are characterized as largely uninformative but consistent in late layers, and retraining the final blocks can remove sinks and restore sparsifiability, improving the efficiency-quality trade-off. These findings offer practical guidelines for designing more efficient, editable, and controllable VDiT systems and point to promising directions for future research and model optimization.
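To make the sparsity and sink terminology concrete, here is a minimal sketch, not the authors' method. It assumes top-k selection as the sparsification rule and per-key average attention as a sink indicator; both choices, and the helper names `topk_sparsify` and `sink_scores`, are hypothetical and serve only as illustration. Each row of the attention map is assumed to already be a probability distribution.

```python
def topk_sparsify(row, k):
    """Keep the k largest weights of one attention row and renormalize.

    Returns the sparsified row and the attention mass that was kept.
    Top-k is an assumed sparsification rule, chosen for simplicity.
    """
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    keep = set(order[:k])
    kept_mass = sum(row[i] for i in keep)
    sparse = [row[i] / kept_mass if i in keep else 0.0 for i in range(len(row))]
    return sparse, kept_mass


def sink_scores(attn):
    """Average attention each key position receives across all queries.

    A key whose score is far above the others is a candidate sink
    (an assumed heuristic, not a definition from the paper).
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    return [sum(row[j] for row in attn) / n_queries for j in range(n_keys)]


if __name__ == "__main__":
    attn = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05]]
    print(topk_sparsify(attn[0], k=2))
    print(sink_scores(attn))
```

A head whose kept mass stays close to 1 for small `k` looks sparse by this measure. The abstract cautions that layers which look sparse in this sense cannot always be sparsified without hurting quality.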

Abstract

We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns in different VDiTs exhibit similar structure across different prompts, and that this similarity can be exploited to enable video editing via self-attention map transfer. Sparsity: We study attention sparsity in VDiTs and find that previously proposed sparsity methods do not work for all VDiTs, because some layers that appear sparse cannot be sparsified. Sinks: We present the first study of attention sinks in VDiTs, comparing and contrasting them with attention sinks in language models. We propose a number of future directions that build on these insights to improve the efficiency-quality Pareto frontier for VDiTs.
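The idea of self-attention map transfer can be sketched for a single attention head as follows. This is an illustrative assumption about the mechanics, not the paper's implementation: the attention map is computed from the queries and keys of a source generation and then applied to the values of a target generation. The function names and the single-head, list-based layout are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Row-stochastic map softmax(Q K^T / sqrt(d)) for one head."""
    d = len(queries[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]


def apply_attention(attn, values):
    """Mix the value vectors using the rows of an attention map."""
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in attn
    ]


def transfer_attention(source_q, source_k, target_v):
    """Assumed transfer rule: source attention map, target values."""
    return apply_attention(attention_map(source_q, source_k), target_v)


if __name__ == "__main__":
    src_q = [[1.0, 0.0], [0.0, 1.0]]
    src_k = [[1.0, 0.0], [0.0, 1.0]]
    tgt_v = [[2.0, 0.0], [0.0, 4.0]]
    print(transfer_attention(src_q, src_k, tgt_v))
```

In this sketch the source map fixes which tokens mix with which, while the target values supply the content being mixed, which matches the intuition of keeping structure while changing the content described by the prompt.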

Paper Structure

This paper contains 18 sections, 16 figures.

Figures (16)

  • Figure 1: Attention Maps. Different models have the same structured attention patterns.
  • Figure 2: Attention Map Transfer to a Different Prompt.
  • Figure 3: Attention Map Transfer to a Close Prompt.
  • Figure 4: Attention Map Transfer to a Close Prompt. The source prompt is "A car is driving on the highway." from \ref{fig:attention-transfer-red-car} (a).
  • Figure 5: Attention Map Transfer to a Close Prompt only with One Layer. The source prompt is "A car is driving on the highway." from \ref{fig:attention-transfer-red-car} (a).
  • ...and 11 more figures