Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

Muyang He, Hanzhong Guo, Junxiong Lin, Yizhou Yu

Abstract

The rapid evolution of video generation has enabled models to simulate complex physical dynamics and long-horizon causalities, positioning them as potential world simulators. However, a critical gap remains between the theoretical capacity for world simulation and the heavy computational cost of spatiotemporal modeling. To address this, we comprehensively and systematically review video generation frameworks and techniques that treat efficiency as a crucial requirement for practical world modeling. We introduce a novel taxonomy along three dimensions: efficient modeling paradigms, efficient network architectures, and efficient inference algorithms. We further show that bridging this efficiency gap directly empowers interactive applications such as autonomous driving, embodied AI, and game simulation. Finally, we identify emerging research frontiers in efficient video-based world modeling, arguing that efficiency is a fundamental prerequisite for evolving video generators into general-purpose, real-time, and robust world simulators.

Paper Structure

This paper contains 65 sections, 10 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: A taxonomy of representative topics related to efficiency improvement for video generation-based world models.
  • Figure 2: Pipeline of cascaded video generation. Figure courtesy of zhang2025waver.
  • Figure 3: LoViC jiang2025lovic introduces FlexFormer, a flexible encoder that compresses context of arbitrary length under an adaptive compression ratio. The resulting compressed context features are fed into a DiT-based decoder to generate the current video chunk. Figure courtesy of jiang2025lovic.
  • Figure 4: Comparison of inference time among full attention, SVG xi2025sparse, MoBA lu2025moba, and VMoBA wu2025vmoba as sequence length increases. Figure courtesy of wu2025vmoba.
  • Figure 5: LongLive yang2025longlive processes sequential user prompts and generates a corresponding long video using efficient short window attention and frame sink. Figure courtesy of yang2025longlive.
  • ...and 4 more figures