LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai

TL;DR

LinGen tackles the heavy computational burden of high-resolution, long-duration text-to-video generation by replacing quadratic-complexity self-attention with a linear-complexity block (MATE). MATE combines an MA-branch, which covers short-to-long-range correlations using a bidirectional Mamba2 block, Rotary Major Scan (RMS), and review tokens, with a TE-branch built on TEmporal Swin Attention. Through progressive, hybrid training and quality tuning, LinGen delivers minute-length videos at high resolution with linear scalability and substantial speedups over Diffusion Transformer baselines, while maintaining quality competitive with state-of-the-art models. The approach demonstrates strong efficiency, robust long-range consistency, and rapid adaptation to longer token sequences, paving the way for hour-length and real-time video generation. Overall, LinGen provides a practical, scalable path to high-quality, long-form text-to-video generation on a single GPU.

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically in the number of pixels. This makes minute-length video generation extremely expensive, limiting most existing models to generating videos of only 10-20 seconds in length. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly in the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces the computationally dominant and quadratic-complexity block, self-attention, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and improves the consistency of generated videos significantly. Experimental results show that LinGen outperforms DiT (with a 75.6% win rate) in video quality with up to 15$\times$ (11.5$\times$) FLOPs (latency) reduction. Furthermore, both automatic metrics and human evaluation demonstrate that our LinGen-4B yields video quality comparable to state-of-the-art models (with 50.5%, 52.1%, and 49.1% win rates with respect to Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68s video generation results and more examples on our project website: https://lineargen.github.io/.
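
As a rough illustration of the quadratic-versus-linear gap described in the abstract, the sketch below counts token-pair interactions for full self-attention and for attention restricted to fixed-size windows. The function names, the token counts, and the window size of 64 are assumptions chosen for illustration. This is a simple counting argument, not the FLOP model used in the paper, and it does not model the Mamba2 branch.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Every token attends to every token: N * N pairs."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens: int, window_tokens: int) -> int:
    """Tokens attend only inside their own fixed-size window: N * w pairs."""
    if num_tokens % window_tokens != 0:
        raise ValueError("num_tokens must be a multiple of window_tokens")
    num_windows = num_tokens // window_tokens
    return num_windows * window_tokens * window_tokens


if __name__ == "__main__":
    WINDOW_TOKENS = 64  # hypothetical window size, not taken from the paper
    for num_tokens in (4096, 8192, 16384):  # hypothetical sequence lengths
        full = full_attention_pairs(num_tokens)
        windowed = windowed_attention_pairs(num_tokens, WINDOW_TOKENS)
        print(f"{num_tokens:>6} tokens: full={full:>12,} "
              f"windowed={windowed:>10,} ratio={full // windowed}x")
```

Under these assumptions, doubling the number of tokens quadruples the full-attention count but only doubles the windowed count, so the gap between the two widens in proportion to the sequence length.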

Paper Structure

This paper contains 27 sections, 7 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: LinGen generates photorealistic high-resolution long videos with linear computational complexity. (a) High-quality videos generated using our LinGen model. (b) The computational cost scaling curves across different video resolutions and lengths. LinGen achieves 15$\times$ speed-up compared to the standard DiT when generating 68s-length videos at 512p resolution.
  • Figure 2: Overview of the LinGen denoising module. LinGen replaces self-attention layers with a MATE block, which inherits linear complexity from its two branches: MA-branch and TE-branch. The MA-branch consists of a bidirectional Mamba2 block, RMS, and review tokens to cover short-to-long-range correlations. The TE-branch is a TEmporal Swin Attention block that addresses the adjacency preservation issue and improves the consistency of generated videos significantly.
  • Figure 3: The bidirectional Mamba2 module. Native Mamba2 only generates the lower triangular part of the attention map due to its causal characteristic. Thus, we deploy bidirectional Mamba2 to obtain the complete attention map for vision tasks.
  • Figure 4: Rotary Major Scan (RMS). We apply different scan schedules across layers to preserve adjacency along various dimensions. Note that the scan is bidirectional in practice, but for clarity, only one direction is illustrated for each scan schedule.
  • Figure 5: TEmporal Swin Attention (TESA). We divide the token tensor into small windows and calculate self-attention within each window. The windows are alternately shifted across layers to cross the boundaries of local windows. The window size remains fixed across different resolutions, hence maintaining linear complexity.
  • ...and 16 more figures
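
The windowing idea in the Figure 5 caption can be sketched in one dimension. The example below assigns each position along a single (for instance temporal) axis to a window, with an optional shift applied on alternating layers. The helper names, the cyclic wrap-around used for the shift (borrowed from the original Swin Transformer), and the sizes are assumptions for illustration; the caption does not specify how LinGen handles window boundaries or what window sizes it uses.

```python
def window_ids(length: int, window: int, shift: int = 0) -> list[int]:
    """Return the window index of each position after a cyclic shift.

    Positions with the same index attend to each other; all others are ignored.
    """
    if length % window != 0:
        raise ValueError("length must be a multiple of window")
    return [((i + shift) % length) // window for i in range(length)]


def attended_pairs(ids: list[int]) -> int:
    """Count ordered (query, key) pairs that fall inside the same window."""
    return sum(1 for a in ids for b in ids if a == b)


if __name__ == "__main__":
    LENGTH, WINDOW = 8, 4  # hypothetical sizes, not taken from the paper
    for layer in range(2):
        # Alternate between unshifted and half-window-shifted partitions.
        shift = (WINDOW // 2) if layer % 2 == 1 else 0
        ids = window_ids(LENGTH, WINDOW, shift)
        print(f"layer {layer}: shift={shift} ids={ids} pairs={attended_pairs(ids)}")
```

In the unshifted layer, positions 3 and 4 sit in different windows; in the shifted layer they share one, which is how information crosses window boundaries over successive layers. The pair count stays at `LENGTH * WINDOW` in both cases, so with a fixed window size the cost grows linearly with the sequence length. The cyclic wrap also groups the first and last positions together, which a real implementation would typically mask or avoid.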