FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran

TL;DR

This work proposes Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. On top of it, the authors build FrameDiT-G, a DiT architecture based on Matrix Attention, and FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion.

Abstract

High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens, which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
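
To make the idea of attending across frames rather than tokens more concrete, the following is a minimal NumPy sketch, not the paper's implementation. The abstract does not spell out the "matrix-native operations", so the sketch assumes a bilinear projection `L @ X_t @ R` of each frame matrix and a Frobenius inner product as the frame-to-frame score. The names `frame_level_attention`, `temporal_score_counts`, and all weight matrices are hypothetical. The second helper counts temporal attention scores under the conventional definitions of Full 3D attention (every token attends to every token) and factorized temporal attention (each spatial location attends over time).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_level_attention(X, Lq, Rq, Lk, Rk, Lv, Rv):
    """Illustrative frame-level attention (assumed form, not the paper's).

    X:  (T, N, D) video tokens: T frames, N tokens per frame, D channels.
    L*: (M, N) left weights mixing tokens; R*: (D, E) right weights mixing channels.
    """
    # Assumed matrix-native projection: each frame X_t becomes L @ X_t @ R.
    Q = np.einsum('mn,tnd,de->tme', Lq, X, Rq)
    K = np.einsum('mn,tnd,de->tme', Lk, X, Rk)
    V = np.einsum('mn,tnd,de->tme', Lv, X, Rv)
    _, M, E = Q.shape
    # One score per frame pair: Frobenius inner product of Q_t and K_s.
    scores = np.einsum('tme,sme->ts', Q, K) / np.sqrt(M * E)
    A = softmax(scores, axis=-1)          # (T, T) frame-to-frame weights
    out = np.einsum('ts,sme->tme', A, V)  # each output frame mixes whole frames
    return out, A

def temporal_score_counts(T, N):
    # Number of attention scores per layer under conventional definitions.
    return {
        "full_3d": (T * N) ** 2,          # all tokens attend to all tokens
        "local_factorized": N * T ** 2,   # per spatial location, over time
        "frame_level": T ** 2,            # one score per frame pair
    }

rng = np.random.default_rng(0)
T, N, D = 4, 6, 8
X = rng.normal(size=(T, N, D))
Lq, Lk, Lv = (rng.normal(size=(N, N)) for _ in range(3))
Rq, Rk, Rv = (rng.normal(size=(D, D)) for _ in range(3))
out, A = frame_level_attention(X, Lq, Rq, Lk, Rk, Lv, Rv)
counts = temporal_score_counts(T=16, N=64)
```

In this sketch the attention map `A` has shape `(T, T)` regardless of how many tokens each frame contains, which is one way to read the claim that frame-level attention stays close to Local Factorized Attention in cost while still relating whole frames to each other.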

Paper Structure (34 sections, 23 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Overview of the proposed FrameDiT, built on the Diffusion Transformer with interleaved Spatial and Temporal blocks. (a) Local: conventional Local Factorized Attention; (b) Global (ours): replaces temporal attention with Matrix Attention for frame-level temporal attention; (c) Global-Local Hybrid (ours): combines local and global temporal attention for unified spatio-temporal modeling.
  • Figure 2: Text-to-video generation comparison between Latte and our FrameDiT-H. We show 4 of 16 generated frames.
  • Figure 3: Scaling with video length. We compare Local Factorized attention, Full 3D attention, and our FrameDiT variants as video length increases from 16 to 128 frames on the $128\times128$ Taichi dataset. From left to right: FVD, FLOPs, inference latency, and peak memory. While Full 3D achieves competitive quality, it exhibits steep growth in computational and memory costs. In contrast, our models maintain comparable or better FVD while scaling more efficiently, with latency and memory close to Local Factorized attention.
  • Figure 4: FVD comparison of different models as model size increases. Each bubble represents a model variant; the y-axis reports FVD, and bubble diameter is proportional to GFLOPs.
  • Figure 5: Qualitative comparison on 128-frame Taichi-HD at $128\times128$ resolution. Local Factorized Attention exhibits severe temporal drift and collapsing human structure. In contrast, the Full 3D model, FrameDiT-G, and FrameDiT-H remain stable even at 128 frames, generating smooth and coherent motion. The slight blurring of small regions (hands, face) arises from the low-resolution encoding of the Stable Diffusion 2.0 autoencoder.
  • ...and 1 more figures