MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion

Xunnong Xu, Mengying Cao

TL;DR

The paper addresses the high computational cost of generating high-resolution videos with diffusion models by introducing a multi-scale spatio-temporal causal attention framework (MSC) for autoregressive video diffusion. It combines a two-branch spatial design (High-Res local attention and Low-Res global attention) with a Hi-Lo temporal scheme (local windowed attention for fine-scale motion and strided global attention for coarse motion), all within frame-level causal conditioning. A key innovation is noise scale modulated attention, where per-frame diffusion timesteps weight the contributions of each scale, enabling effective conditioning on noisy frames during training. The MSC framework reduces computational complexity, supports long video generation, and remains applicable to both pixel-space and latent-space diffusion models, offering a general, scalable approach to video diffusion.

Abstract

Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Like language, video data are auto-regressive by nature, so it is counter-intuitive to use an attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high and low frequencies in the temporal dimension to make attention computation efficient. Furthermore, attention blocks at multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames during diffusion training, based on the idea that noise destroys information at different rates at different resolutions. We show theoretically that our approach can greatly reduce computational complexity and improve training efficiency. The causal attention diffusion framework can also be used for auto-regressive long video generation without violating the natural order of frame sequences.
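
The idea that noise is destroyed at different rates at different resolutions can be illustrated with a small numerical check. The sketch below is not taken from the paper: it assumes i.i.d. Gaussian noise and plain average pooling as the down-sampling operation, and the helper name `avg_pool2d` is hypothetical. Under those assumptions, averaging over an r×r block reduces the noise standard deviation by a factor of r, while smooth image content is largely preserved.

```python
import numpy as np

def avg_pool2d(x, r):
    """Average-pool a 2D array of shape (H, W) by an integer factor r."""
    h, w = x.shape
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    # Split each axis into (blocks, r) and average within every r x r block.
    return x.reshape(h // r, r, w // r, r).mean(axis=(1, 3))

rng = np.random.default_rng(0)
sigma = 1.0  # assumed per-pixel noise level
noise = rng.normal(0.0, sigma, size=(256, 256))

for r in (1, 2, 4, 8):
    pooled = avg_pool2d(noise, r)
    # Measured noise level next to the sigma / r value expected for i.i.d. noise.
    print(f"r={r}: measured std={pooled.std():.3f}, expected={sigma / r:.3f}")
```

In this simplified setting, a frame down-sampled by r behaves like a frame with a smaller effective noise level, which is the intuition behind conditioning on low-resolution features of noisy frames.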

Paper Structure

This paper contains 7 sections, 5 figures.

Figures (5)

  • Figure 1: Multi-Resolution spatial learning framework: (a) two spatial branches in each transformer layer, with local sliding window attention for the High-Res branch and global attention for the Low-Res branch; (b) more down-sampling in the Low-Res branch of deeper transformer layers.
  • Figure 2: After temporal compression with 3D-VAE, the IPB frame structure of video data clearly shows two time scales: the low-frequency I frames with long-range dependency and the high-frequency P frames with short-range dependency.
  • Figure 3: Hi-Lo frequency temporal learning framework: (a) strided global attention for the Low-Res branch; (b) local sliding window attention for the High-Res branch. In both panels, the dark-red image patch is the attention query, while the light-red image patches are the attention keys and values.
  • Figure 4: In our proposed frame-wise causal attention, each token-to-be-predicted can only be affected by tokens in previous frames, but not by those in future frames. During diffusion training, we use independent noise timesteps for tokens in different frames. When down-sampling operations are applied, image features become less noisy, which is equivalent to using a smaller noise timestep for diffusion. In this way, conditioning on noisy frames in causal attention becomes more effective.
  • Figure 5: In each transformer layer, the timestep embedding is used to control the weights of the two parallel spatio-temporal branches. By stacking a series of such layers with different down-sampling factors $r$ and with independently learned weights, the full network is able to pass information efficiently for noisy frames during causal conditioning.
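
The temporal attention patterns described in Figures 3 and 4 can be sketched as frame-level boolean masks. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `window` and `stride` parameters, and the choice to let a frame attend to itself are all hypothetical, and the strided pattern here selects earlier frames at multiples of `stride` behind the query, which is only one possible reading of "strided global attention".

```python
import numpy as np

def causal_mask(num_frames):
    """mask[q, k] is True when query frame q may attend to key frame k (k <= q)."""
    idx = np.arange(num_frames)
    return idx[None, :] <= idx[:, None]

def local_window_mask(num_frames, window):
    """Causal sliding window: frame q sees itself and the previous window - 1 frames."""
    idx = np.arange(num_frames)
    dist = idx[:, None] - idx[None, :]  # q - k, non-negative for past frames
    return (dist >= 0) & (dist < window)

def strided_mask(num_frames, stride):
    """Causal strided pattern: frame q sees frames q, q - stride, q - 2*stride, ..."""
    idx = np.arange(num_frames)
    dist = idx[:, None] - idx[None, :]
    return (dist >= 0) & (dist % stride == 0)

num_frames = 16
full = causal_mask(num_frames)
local = local_window_mask(num_frames, window=4)
strided = strided_mask(num_frames, stride=4)

# Both sparse patterns stay inside the causal mask, so no future frame is visible.
print("local is causal:  ", bool(np.all(full | ~local)))
print("strided is causal:", bool(np.all(full | ~strided)))

# Number of (query, key) frame pairs each pattern attends to.
print("pairs:", int(full.sum()), int(local.sum()), int(strided.sum()))
```

Counting the attended frame pairs gives a rough sense of the savings: the windowed pattern grows linearly with the number of frames, and the strided pattern shrinks the quadratic term by roughly the stride, whereas full causal attention grows quadratically.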