The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation

Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang

TL;DR

LanDiff presents a coarse-to-fine framework that unites autoregressive language models and diffusion models for video generation. It introduces a video semantic tokenizer to compress high-level semantic content into 1D discrete tokens, an autoregressive semantic token generator, and a streaming diffusion model that refines semantics into perceptual features, decoded by a VAE. On VBench, LanDiff achieves state-of-the-art performance among open-source models and demonstrates strong long-video generation, outperforming larger models such as HunyuanVideo. The approach emphasizes semantic control and temporal coherence, enabling high-quality, semantically faithful videos from text with fewer tokens and efficient streaming inference, powered by a 5B-parameter model.

Abstract

Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a $\sim$14,000$\times$ compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source model HunyuanVideo (13B) as well as commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.

Paper Structure

This paper contains 19 sections, 6 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: The rate-distortion curve illustrates how visual distortion decreases as the number of transmitted bits increases. With just a small number of bits representing high-level semantic features, we can already achieve relatively low visual distortion. Building on this information-theoretic insight, LanDiff combines the strengths of both paradigms: LLMs efficiently generate compact semantic features in the first stage, followed by diffusion models that add perceptual details in the second stage, before final decoding to pixels via VAE. Data from DDPM; the illustration is conceptual.
  • Figure 2: The architecture of LanDiff. Given text inputs, we first extract text embeddings and employ an LLM to generate semantic tokens in the first stage. Subsequently, we utilize a diffusion model to synthesize perceptual features conditioned on these semantic tokens, followed by a VAE decoder that transforms these features into the final video frames.
  • Figure 3: Proposed architecture of the video semantic tokenizer. We use query tokens to compress the semantic sequence length. Furthermore, we divide the frames into groups (3 frames per group in this figure). Within a group, the first frame is the IFrame and the remaining frames are PFrames. We use a different number of query tokens for each frame type. The attention mask design is shown on the right.
  • Figure 4: Proposed diffusion model structure. We use a ControlNet-style control module to guide the model to generate perceptual features conditioned on semantic features.
  • Figure 5: Comparison of qualitative results for text-to-video generation.
  • ...and 13 more figures
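
The frame grouping described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the group size of 3 mirrors the figure, while the query-token counts (`i_queries`, `p_queries`) and all function and class names are hypothetical values and identifiers chosen only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameSlot:
    frame_index: int
    kind: str         # "I" for the first frame of a group, "P" otherwise
    num_queries: int  # query tokens allotted to this frame


def layout_query_tokens(num_frames: int, group_size: int = 3,
                        i_queries: int = 8, p_queries: int = 2) -> List[FrameSlot]:
    """Assign a frame type and a query-token count to every frame.

    group_size=3 follows the Figure 3 illustration; i_queries and p_queries
    are hypothetical placeholders, not values taken from the paper.
    """
    if num_frames < 0 or group_size < 1:
        raise ValueError("num_frames must be >= 0 and group_size must be >= 1")
    slots = []
    for idx in range(num_frames):
        # The first frame of each group is the IFrame; the rest are PFrames.
        is_iframe = idx % group_size == 0
        slots.append(FrameSlot(frame_index=idx,
                               kind="I" if is_iframe else "P",
                               num_queries=i_queries if is_iframe else p_queries))
    return slots


def total_tokens(slots: List[FrameSlot]) -> int:
    """Length of the compressed 1D token sequence for the whole clip."""
    return sum(slot.num_queries for slot in slots)


if __name__ == "__main__":
    slots = layout_query_tokens(num_frames=6)
    print([slot.kind for slot in slots])  # ['I', 'P', 'P', 'I', 'P', 'P']
    print(total_tokens(slots))            # 2 * 8 + 4 * 2 = 24
```

Giving PFrames fewer query tokens than IFrames shortens the sequence the language model has to generate; how the tokens attend to one another is governed by the attention mask in the figure, which this sketch does not model.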