Critiques of World Models

Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu

TL;DR

This work reframes world models as general-purpose simulators for actionable futures, arguing that current approaches overemphasize video generation at the expense of goal-directed reasoning. It critically analyzes five design dimensions—data, representation, architecture, objective, and usage—and presents theoretical and empirical critiques of leading WM approaches, notably JEPA, latent representations, and MPC/RL usage. Building on these critiques, it proposes the PAN architecture, a mixed discrete-continuous, hierarchical, multimodal WM with an enhanced LLM backbone and a diffusion-based latent predictor, trained with observation-grounded generative losses to enable long-horizon planning and agentic reasoning. PAN aims to enable efficient, flexible, and scalable imagined experience to train and inform autonomous agents, with mountaineering and other complex tasks serving as motivating demonstrations for future, broader generalization.

Abstract

The world model, the supposed algorithmic surrogate of the real-world environment that biological agents experience and act upon, has become an emerging topic in recent years because of the rising need to develop virtual agents with artificial (general) intelligence. There has been much debate about what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed sci-fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in the psychology literature, we offer critiques of several schools of thought on world modeling, and argue that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. Building on these critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations and a generative, self-supervised learning framework, with an outlook toward a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Paper Structure

This paper contains 26 sections, 4 theorems, 38 equations, 11 figures.

Key Result

Theorem 1

Assume real inputs $\mathbf{x} = [x_1, \dots, x_T]$, where $x_t \in \mathbb{R}^D$ and $\| x_t \| < K$. For any $\epsilon > 0$, there exists a language $L_\epsilon = (\mathcal{V}, N, f_\epsilon)$ with vocabulary $\mathcal{V}$, maximal sentence length $N < \infty$, and a mapping function $f_\epsilon: \dots$

Figures (11)

  • Figure 1: Familiar example of reasoning by simulation -- an individual (possibly self-serving) decides to offer help to a crying person by mentally simulating multiple possible outcomes, with the best expected reward in mind.
  • Figure 2: A possible definition of an optimal agent
  • Figure 3: An agent in the real world, where the ground-truth world state and universe are unavailable to experience or experiment with, so a world model is crucial for simulation.
  • Figure 4: Framework for world model proposed by a vocal school of thought.
  • Figure 5: Vocabulary-based tokens are an effective way to categorize perceptual inputs into discrete concepts for reasoning (left). We may scale up or scale out the discrete code to deal with increasing data complexity (right). Theorem 1 shows that either is effective, but scaling out is more efficient.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Theorem 1: Completeness of Language Representation
  • proof : Proof Sketch
  • Proposition 1: Collapse of Latent Reconstruction Loss
  • proof : Proof Sketch
  • Proposition 2: Non-Collapse of Generative Loss
  • proof : Proof Sketch
  • Theorem 2: Latent reconstruction is an upper-bounded surrogate for generative reconstruction
  • proof : Proof Sketch
  • proof
  • proof
  • ...and 2 more