ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, Furu Wei

TL;DR

ARLON tackles long-video generation by uniting autoregressive transformers with diffusion transformers through a latent VQ-VAE bridge and a semantic-aware conditioning mechanism. It introduces robust training strategies (coarser latent tokens and uncertainty sampling) to tolerate AR-induced noise and demonstrates state-of-the-art performance on long-video benchmarks with improved inference speed. The approach delivers stronger dynamic content and temporal coherence than baselines, while enabling progressive prompt-driven long-video generation. This hybrid framework offers practical benefits for scalable, high-quality long-form video synthesis and points to future upgrades with stronger diffusion backbones.

Abstract

Text-to-video models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers (DiT) with autoregressive (AR) models for long video generation by using the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens, bridging the AR and DiT models and balancing learning complexity against information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To improve tolerance to the noise introduced by AR inference, the DiT model is trained with coarser visual latent tokens combined with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. See demos of ARLON at http://aka.ms/arlon.
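The first two ideas named in the abstract, mapping latents to discrete codebook tokens and injecting a condition through an adaptive norm, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`quantize`, `adaptive_norm_inject`), the plain-list data layout, the squared-L2 nearest-neighbour lookup, and the `(1 + scale)` modulation form are all assumptions chosen to show the generic techniques, and the uncertainty sampling module is left out because the abstract does not specify it.

```python
import math


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry.

    Hypothetical helper: generic vector quantization by squared L2 distance,
    the usual lookup step in a VQ-VAE. Ties go to the lowest index.
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((a - mean) ** 2 for a in x) / len(x)
    return [(a - mean) / math.sqrt(var + eps) for a in x]


def adaptive_norm_inject(hidden, cond, w_scale, w_shift):
    """Modulate normalized `hidden` with a scale and shift predicted from `cond`.

    Assumed form (generic adaptive layer norm, not taken from the paper):
        out = layer_norm(hidden) * (1 + W_scale @ cond) + W_shift @ cond
    `w_scale` and `w_shift` are lists of rows, one row per hidden feature.
    """
    scale = [sum(w * c for w, c in zip(row, cond)) for row in w_scale]
    shift = [sum(w * c for w, c in zip(row, cond)) for row in w_shift]
    normed = layer_norm(hidden)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


if __name__ == "__main__":
    # Toy 2-D codebook with three entries and three latent vectors.
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    latents = [[0.1, -0.2], [4.0, 6.0], [0.9, 1.2]]
    tokens = quantize(latents, codebook)

    # Use the first token's codebook embedding as the condition for one
    # 3-feature hidden vector; the projection weights are arbitrary here.
    cond = codebook[tokens[0]]
    w_scale = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
    w_shift = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]
    print(tokens, adaptive_norm_inject([1.0, 2.0, 3.0], cond, w_scale, w_shift))
```

In this sketch the condition only rescales and shifts normalized features, so zero projection weights reduce the layer to a plain layer norm. Real systems would use learned projections over tensors rather than Python lists.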

Paper Structure

This paper contains 27 sections, 10 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Generation process for long videos with autoregressive transformer and DiT.
  • Figure 2: Overview of the ARLON framework, which consists of three key components: Latent VQ-VAE Compression, Autoregressive Modeling, and Semantic-aware Condition Generation.
  • Figure 3: Semantic injection and uncertainty sampling.
  • Figure 4: Qualitative comparisons between StreamingT2V, FreeNoise, OpenSora, and ARLON. Each video contains 600 frames.
  • Figure 5: Comparison of qualitative results for text-to-video generation.
  • ...and 11 more figures