The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Aoxiong Yin; Kai Shen; Yichong Leng; Xu Tan; Xinyu Zhou; Juncheng Li; Siliang Tang

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang

TL;DR

LanDiff presents a coarse-to-fine framework that unites autoregressive language models and diffusion models for video generation. It introduces a video semantic tokenizer to compress high-level semantic content into 1D discrete tokens, an autoregressive semantic token generator, and a streaming diffusion model that refines semantics into perceptual features, decoded by a VAE. On VBench, LanDiff achieves state-of-the-art performance among open-source models and demonstrates strong long-video generation, outperforming larger models such as HunyuanVideo. The approach emphasizes semantic control and temporal coherence, enabling high-quality, semantically faithful videos from text with fewer tokens and efficient streaming inference, powered by a 5B-parameter model.

Abstract

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

TL;DR

Abstract

The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (18)