WoW: Towards a World omniscient World model Through Embodied Interaction

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang

Abstract

Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
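The generate, evaluate, and refine cycle described in the abstract can be sketched as a small loop. This is only an illustrative sketch, not the authors' SOPHIA implementation: the names `refine_loop` and `Critique`, the three callables (`generate`, `critique`, `rewrite`), the numeric score, the threshold, and the round limit are all assumptions introduced here. In the paper's setting, `generate` would presumably wrap the DiT video model, and `critique` and `rewrite` would wrap vision-language model agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Critique:
    """Hypothetical critic verdict (not a structure from the paper)."""
    score: float    # assumed physical-plausibility score in [0, 1]
    feedback: str   # assumed free-text explanation of what looked wrong


def refine_loop(
    instruction: str,
    generate: Callable[[str], object],           # stand-in for the video generator
    critique: Callable[[object, str], Critique],  # stand-in for the VLM evaluator
    rewrite: Callable[[str, str], str],           # stand-in for the instruction refiner
    threshold: float = 0.8,                       # assumed acceptance threshold
    max_rounds: int = 3,                          # assumed iteration budget
) -> Tuple[object, str, List[Tuple[str, float]]]:
    """Regenerate a video until the critic accepts it or the budget runs out."""
    history: List[Tuple[str, float]] = []
    video = None
    for round_idx in range(max_rounds):
        # Generate a candidate rollout from the current instruction.
        video = generate(instruction)
        # Ask the critic how physically plausible the rollout is.
        verdict = critique(video, instruction)
        history.append((instruction, verdict.score))
        # Stop when the rollout is accepted or no rounds remain.
        if verdict.score >= threshold or round_idx == max_rounds - 1:
            break
        # Otherwise evolve the language instruction using the feedback.
        instruction = rewrite(instruction, verdict.feedback)
    return video, instruction, history
```

The loop returns the last generated video together with the instruction that produced it, so a downstream component (in the paper, the Inverse Dynamics Model) could consume the accepted rollout. Passing the three stages as callables keeps the sketch independent of any particular model API.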

Paper Structure

This paper contains 72 sections, 18 equations, 28 figures, 6 tables.

Figures (28)

  • Figure 1: WoW is a world model that integrates perception, prediction, judgement, reflection, and action. It learns from real-world interaction data and generates high-quality, physically consistent robot videos in both seen and out-of-distribution scenarios, enabling real-world robotic execution.
  • Figure 1: Comparative analysis of foundational video generation models. We benchmark our WoW-DiT against SOTA models using direct text-to-video generation. All metrics: higher is better. Best results are bold with highlight.
  • Figure 2: Developmental trajectory of world models, from modality-specific models (e.g., VGM, LLM) to unified models after a critical emergence point.
  • Figure 3: The pursuit of intrinsic physical consistency in world models has primarily followed two approaches. One of them is to construct a world model with inherent physical modeling capabilities for robots, which is mainly realized in two ways. The first combines generative AI with a differentiable physics engine. The second, grounded in video generation models, constructs a neural network-driven physics engine that offers both intrinsic physical consistency and high visual fidelity.
  • Figure 4: The architecture of an embodied agent with a world model. An intelligent agent perceives the environment through various sensory inputs (e.g., visual, sound, heat, force). These perceptions are processed by a World Model, which builds an internal, predictive representation of the environment. The model's predictions and past experiences, stored in short-term and long-term memory, inform Reasoning and Judgement. Based on this internal simulation, the Actor generates Actions that manipulate the real world. This closed-loop system allows the agent to learn the dynamics of its environment, plan for the future, and achieve complex goals. (Figure inspired by BrainCog)
  • ...and 23 more figures