Understanding Attention Mechanism in Video Diffusion Models

Bingyan Liu, Chengyu Wang, Tongtong Su, Huan Ten, Jun Huang, Kailing Guo, Kui Jia

TL;DR

This paper analyzes how spatial and temporal attention in video diffusion models (VDMs) influences frame quality, motion, and structure. By replacing attention maps with identity ($I$) and uniform ($U$) matrices and measuring the entropy $\mathcal{H}$ and energy $\mathcal{E}$ of each map, it shows that high-entropy attention maps correlate with better image quality, while low-entropy maps carry structural information. It introduces information entropy-driven adaptation (IE-Adapt), a lightweight, training-free approach that uses entropy cues to (i) enhance video synthesis during denoising and (ii) enable text-guided video editing through entropy-aware layer intervention. The methods are validated across multiple datasets and VDMs, showing improvements in video quality metrics and editing fidelity, with practical guidance on which layers to perturb. The work offers a principled lens on attention in VDMs and a foundation for entropy-guided manipulation in both generation and editing.
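The two perturbations mentioned above sit at the extremes of attention entropy, which a small NumPy sketch can make concrete. It assumes a square, row-stochastic self-attention map and uses the mean row-wise Shannon entropy as the entropy measure; the paper's exact definitions of $\mathcal{H}$ and $\mathcal{E}$ are not given in this summary, and the helper names `row_entropy` and `perturb` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over the rows of an attention map.

    Assumption: each row of `attn` is a probability distribution.
    """
    p = np.clip(attn, eps, 1.0)  # avoid log(0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def perturb(attn, kind):
    """Replace a square attention map with an identity or uniform matrix."""
    n = attn.shape[-1]
    if kind == "identity":
        return np.eye(n)               # each token attends only to itself
    if kind == "uniform":
        return np.full((n, n), 1.0 / n)  # each token attends equally to all
    raise ValueError(f"unknown perturbation: {kind}")

rng = np.random.default_rng(0)
n = 16
attn = softmax(rng.normal(size=(n, n)))  # stand-in for a real attention map

h_orig = row_entropy(attn)
h_identity = row_entropy(perturb(attn, "identity"))  # close to 0
h_uniform = row_entropy(perturb(attn, "uniform"))    # close to log(n)
print(h_identity, h_orig, h_uniform)
```

Under these assumptions the identity matrix gives the minimum entropy (about 0) and the uniform matrix the maximum ($\log N$ for $N$ tokens), so any real attention map falls between the two.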

Abstract

Text-to-video (T2V) synthesis models, such as OpenAI's Sora, have garnered significant attention due to their ability to generate high-quality videos from a text prompt. In diffusion-based T2V models, the attention mechanism is a critical component. However, it remains unclear what intermediate features are learned and how attention blocks in T2V models affect various aspects of video synthesis, such as image quality and temporal consistency. In this paper, we conduct an in-depth perturbation analysis of the spatial and temporal attention blocks of T2V models using an information-theoretic approach. Our results indicate that temporal and spatial attention maps affect not only the timing and layout of the videos but also the complexity of spatiotemporal elements and the aesthetic quality of the synthesized videos. Notably, high-entropy attention maps are often key elements linked to superior video quality, whereas low-entropy attention maps are associated with the video's intra-frame structure. Based on our findings, we propose two novel methods to enhance video quality and enable text-guided video editing. These methods rely entirely on lightweight manipulation of the attention matrices in T2V models. The efficacy and effectiveness of our methods are further validated through experimental evaluation across multiple datasets.

Paper Structure

This paper contains 25 sections, 6 equations, 24 figures, 5 tables, 1 algorithm.

Figures (24)

  • Figure 1: Perturbation results on AnimateDiff. $I$ perturbation refers to replacing $A$ with $I$ at the $L$-th layer, while $U$ perturbation substitutes it with a uniform matrix. The input prompts are selected from VBench. Perturbation is conducted on both spatial and temporal attention layers, where only one layer is perturbed at a time.
  • Figure 2: Histogram of attention map perturbation results on AnimateDiff. From top to bottom: structural measure, temporal consistency difference before and after perturbation, and shift in aesthetics. Smaller values for LPIPS indicate better performance, while larger values are preferred for the other metrics. EP: entropy percentage.
  • Figure 3: Visualization results of attention maps in the spatial and temporal layers of AnimateDiff. "Original" refers to the visualization of the original attention map. "Threshold" indicates the visualization after using a threshold, allowing for a clearer view of the distribution of attention map values.
  • Figure 4: Entropy and energy of attention layers in AnimateDiff. Top to bottom: energy values, energy proportion, information entropy proportion. The red line divides proportion values into top 50$\%$ and bottom 50$\%$. Information entropy is normalized due to varying upper bounds of different layer sizes.
  • Figure 5: Perturbation results in CogVideoX-5B.
  • ...and 19 more figures
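The Figure 4 caption notes that information entropy is normalized because layers of different sizes have different upper bounds. One plausible way to do this, sketched below as an assumption rather than the authors' stated procedure, is to divide the mean row-wise Shannon entropy by its maximum $\log N$, where $N$ is the number of attended tokens. The function name `normalized_entropy` is hypothetical.

```python
import numpy as np

def normalized_entropy(attn, eps=1e-12):
    """Mean row-wise Shannon entropy divided by its upper bound log(N).

    Assumptions: rows of `attn` are probability distributions and N > 1.
    The result lies in [0, 1] regardless of the layer's token count.
    """
    n = attn.shape[-1]
    p = np.clip(attn, eps, 1.0)  # avoid log(0)
    h = -(p * np.log(p)).sum(axis=-1).mean()
    return float(h / np.log(n))

# Uniform maps of different sizes have different raw entropies
# (log 8 vs. log 64) but the same normalized value.
small = np.full((8, 8), 1.0 / 8)
large = np.full((64, 64), 1.0 / 64)
identity = np.eye(8)

ne_small = normalized_entropy(small)
ne_large = normalized_entropy(large)
ne_identity = normalized_entropy(identity)
print(ne_small, ne_large, ne_identity)
```

With this scaling, a low-resolution layer and a high-resolution layer can be compared on the same axis, which is what the per-layer entropy proportions in the figure require.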