Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion

Rui Hong, Shuxue Quan

Abstract

We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy: global attention in down-sampling and middle blocks for semantic stabilization, and motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving competitive results on WebVid validation when trained on 100K videos. We demonstrate that the standard denoising objective alone provides sufficient implicit temporal regularization, outperforming approaches that add explicit temporal consistency losses. Our ablation studies reveal a clear trade-off between noise correlation and motion amplitude, providing a practical inference-time control for diverse generation behaviors.
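
The noise-correlation control mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact construction, so the mixing rule below (a shared Gaussian component blended with independent per-frame Gaussian noise) is an assumption, as are the names `correlated_noise` and `alpha`. It is one common way to obtain temporally correlated noise, not necessarily the authors' formulation.

```python
import math
import random


def correlated_noise(num_frames, size, alpha, seed=0):
    """Return `num_frames` noise vectors of length `size` with a shared component.

    ASSUMPTION: `alpha` in [0, 1] is the fraction of variance shared by all
    frames. This mixing rule is illustrative and is not taken from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    # One noise vector shared by every frame.
    shared = [rng.gauss(0.0, 1.0) for _ in range(size)]
    a = math.sqrt(alpha)
    b = math.sqrt(1.0 - alpha)
    frames = []
    for _ in range(num_frames):
        # Blend the shared component with fresh per-frame noise.
        frames.append([a * s + b * rng.gauss(0.0, 1.0) for s in shared])
    return frames


# Example: 8 frames of latent noise, flattened to 4 * 32 * 32 values each.
noise = correlated_noise(num_frames=8, size=4 * 32 * 32, alpha=0.5)
```

Under this assumed rule each frame's noise keeps unit variance and any two frames have correlation `alpha`: `alpha = 0` gives independent per-frame noise, while `alpha = 1` gives identical noise for every frame.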

Paper Structure

This paper contains 31 sections, 7 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Architecture overview. The overall pipeline follows the Latent Diffusion Model framework: video frames are encoded into latent space by a frozen VAE encoder ($\mathcal{E}$), processed by the denoising UNet with injected temporal attention modules (shown in red), conditioned on text via CLIP, and the VAE decoder $\mathcal{D}$ reconstructs the frames back to pixel space. All spatial components (ResNet, Spatial Transformer) are frozen (snowflake); only temporal attention blocks are trained (flame). Down and mid blocks use global temporal attention; up blocks use motion-adaptive attention.
  • Figure 2: Temporal attention block variants. Left: Global mode (down/mid blocks), with uniform temporal self-attention and no motion conditioning. Right: Adaptive mode (up blocks), in which the motion-adaptive attention bias $\mathbf{b}_m$ and the motion-aware gate $g(m)$ are both conditioned on the per-video motion score $m$.
  • Figure 3: Qualitative comparison on two WebVid validation prompts. Each pair shows frame 1 and frame 8 of the 8-frame generated sequence at $256{\times}256$, highlighting temporal consistency across the full generation. AnimateDiff$\dagger$ denotes AnimateDiff retrained on 100K videos with $T{=}8$.
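
The adaptive mode in Figure 2 can be sketched in a few lines. The caption names the bias $\mathbf{b}_m$ and the gate $g(m)$ but not their functional forms, so everything below is hypothetical: a distance-proportional penalty for the bias, a sigmoid for the gate, a motion score $m$ in $[0, 1]$, and the parameters `strength`, `weight`, and `offset`. The sketch only illustrates the behavior described in the abstract (local attention for high motion, global attention for low motion).

```python
import math


def temporal_bias(num_frames, motion, strength=4.0):
    """Hypothetical b_m: penalize distant frame pairs more as motion grows."""
    return [[-strength * motion * abs(i - j) for j in range(num_frames)]
            for i in range(num_frames)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, motion, strength=4.0):
    """Add the motion-dependent bias to raw T x T scores, then normalize rows."""
    bias = temporal_bias(len(scores), motion, strength)
    return [softmax([s + b for s, b in zip(score_row, bias_row)])
            for score_row, bias_row in zip(scores, bias)]


def motion_gate(motion, weight=1.0, offset=0.0):
    """Hypothetical g(m): a sigmoid of the motion score.

    ASSUMPTION: in a trained model `weight` and `offset` would be learned;
    their sign and scale here are placeholders.
    """
    return 1.0 / (1.0 + math.exp(-(weight * motion + offset)))


# Example: 4 frames with identical raw attention scores.
flat_scores = [[0.0] * 4 for _ in range(4)]
global_like = attention_weights(flat_scores, motion=0.0)
local_like = attention_weights(flat_scores, motion=1.0)
```

With all raw scores equal, `motion=0.0` yields uniform weights over frames (global attention), while `motion=1.0` concentrates each row's weight near the diagonal (local attention). In this sketch the gate would scale the temporal branch before it is added back to the frozen spatial features, which is again an assumption about how $g(m)$ is applied.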