LiveWorld: Simulating Out-of-Sight Dynamics in Generative Video World Models

Zicheng Duan, Jiatong Xia, Zeyu Zhang, Wenbo Zhang, Gengze Zhou, Chenhui Gou, Yefei He, Feng Chen, Xinyu Zhang, Lingqiao Liu

TL;DR

LiveWorld is a framework that extends video world models so the world keeps evolving even where the observer is not looking. It maintains persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation.

Abstract

Recent generative video world models aim to simulate visual environment evolution, allowing an observer to interactively explore the scene via camera control. However, they implicitly assume that the world only evolves within the observer's field of view. Once an object leaves the observer's view, its state is "frozen" in memory, and revisiting the same region later often fails to reflect events that should have occurred in the meantime. In this work, we identify and formalize this overlooked limitation as the "out-of-sight dynamics" problem, which impedes video world models from representing a continuously evolving world. To address this issue, we propose LiveWorld, a novel framework that extends video world models to support persistent world evolution. Instead of treating the world as static observational memory, LiveWorld models a persistent global state composed of a static 3D background and dynamic entities that continue evolving even when unobserved. To maintain these unseen dynamics, LiveWorld introduces a monitor-based mechanism that autonomously simulates the temporal progression of active entities and synchronizes their evolved states upon revisiting, ensuring spatially coherent rendering. For evaluation, we further introduce LiveBench, a dedicated benchmark for the task of maintaining out-of-sight dynamics. Extensive experiments show that LiveWorld enables persistent event evolution and long-term scene consistency, bridging the gap between existing 2D observation-based memory and true 4D dynamic world simulation. The baseline and benchmark will be publicly available at https://zichengduan.github.io/LiveWorld/index.html.

Paper Structure (45 sections, 10 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: LiveWorld enables persistent out-of-sight dynamics. Instead of freezing unobserved regions, our framework explicitly decouples world evolution from observation rendering. We register stationary Monitors to autonomously fast-forward the temporal progression of active entities (e.g., the dog and the person) in the background. As the observer explores the scene along the target trajectory (green cameras), our state-aware renderer projects the continuously evolved world states to synthesize the final observation. This ensures that dynamic events progress naturally, accurately reflecting the elapsed time even when entities are completely out of the observer's view.
  • Figure 2: World State Formulation. We approximate the intractable 4D world state $\mathcal{W}_t$ by decoupling it into two trackable representations: a temporally-invariant static 3D environment $\mathcal{M}_{static}$ via $T$-axis projection, and 2D video sequences of dynamic entities $\mathcal{M}_{dyn,t}$ via $Z$-axis projection.
  • Figure 3: LiveWorld overview. Our system explicitly decouples world modeling into two processes. (1) Static Accumulation (Blue): Temporally-invariant backgrounds are fused into a static 3D point cloud via SLAM. (2) Dynamic Evolution (Green): Stationary monitors use the Evolution Engine $G_{\theta}^{\text{evo}}$ to fast-forward the out-of-sight progression of active entities, lifting them into 4D point clouds. (3) State-aware Rendering (Purple): Both representations are projected onto the target camera trajectory. This geometric projection, alongside appearance references, guides the renderer $G_{\theta}^{\text{render}}$ to synthesize coherent observations reflecting the elapsed dynamics.
  • Figure 4: Given one or more preceding frames from the previous round, we first detect whether the scene visited by the observer contains active dynamic entities, using off-the-shelf VLMs and segmentors. Following a positive detection, we further validate whether the entity and scene are already registered by existing monitors.
  • Figure 5: Comparison with the latest state-of-the-art methods on LiveBench. As the camera view repeatedly moves rightward and back, only our method successfully maintains long-horizon (260 frames) out-of-sight dynamics, while the others fail. Different colors correspond to different evolving prompts of the event.
  • ...and 1 more figures
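
The monitor idea described in the captions for Figures 1, 3, and 4 can be illustrated with a toy sketch. This is not the paper's implementation: the names `Entity`, `Monitor`, and `World` are hypothetical, the world is reduced to one dimension, and a constant-velocity update stands in for the generative Evolution Engine. It only shows the bookkeeping: register a monitor once per entity, then fast-forward each entity by the elapsed time when the observer looks again.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical dynamic entity: a 1D position moving at constant velocity."""
    name: str
    position: float
    velocity: float


@dataclass
class Monitor:
    """Hypothetical stationary monitor that tracks when its entity was last evolved."""
    entity: Entity
    last_synced: int = 0

    def fast_forward(self, now: int) -> None:
        # Stand-in for the evolution engine: advance the entity by the elapsed time.
        elapsed = now - self.last_synced
        self.entity.position += self.entity.velocity * elapsed
        self.last_synced = now


@dataclass
class World:
    """Toy global state: only the dynamic part, keyed by entity name."""
    monitors: dict[str, Monitor] = field(default_factory=dict)

    def register(self, entity: Entity, now: int) -> bool:
        """Register a monitor unless the entity is already tracked."""
        if entity.name in self.monitors:
            return False
        self.monitors[entity.name] = Monitor(entity, last_synced=now)
        return True

    def observe(self, now: int, view: tuple[float, float]) -> list[str]:
        """Sync every monitor to `now`, then list the entities inside the view."""
        lo, hi = view
        for monitor in self.monitors.values():
            monitor.fast_forward(now)
        return sorted(
            name
            for name, monitor in self.monitors.items()
            if lo <= monitor.entity.position <= hi
        )


world = World()
dog = Entity("dog", position=0.0, velocity=1.0)
first = world.register(dog, now=0)           # new entity: a monitor is created
second = world.register(dog, now=0)          # already registered: nothing happens
seen_start = world.observe(now=0, view=(-1.0, 1.0))
seen_later = world.observe(now=12, view=(10.0, 20.0))
```

In this sketch the dog is never observed between step 0 and step 12, yet it appears in the second view because its state was advanced by the 12 elapsed steps. A model that froze the dog at its last observed position would report an empty view at step 12, which is the failure the abstract calls the out-of-sight dynamics problem.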