PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation
Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu
TL;DR
PerpetualWonder tackles long-horizon, action-conditioned 4D scene generation from a single image by introducing a Visual-Physical Aligned Particle (VPP) that tightly couples dynamics and appearance, enabling a true closed loop of forward physics simulation and backward neural refinement. A robust multi-view optimization initializes a full 3D scene and progressively leverages supervision from multiple viewpoints to resolve ambiguity, producing sequences that are physically plausible and visually coherent over time. The approach outperforms prior hybrid simulators and conditional video generators in 3D consistency, controllability, and long-horizon stability, and users strongly prefer its results in realism evaluations. This work advances embodied AI and interactive virtual worlds by enabling long-horizon, action-conditioned dynamics that remain coherent when rendered from arbitrary viewpoints.
Abstract
We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.
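To make the role of multi-view supervision concrete, the toy sketch below shows why a single viewpoint leaves an optimization ambiguous while a second viewpoint resolves it. It is not the PerpetualWonder implementation: the orthographic cameras `FRONT` and `SIDE`, the `refine` function, the squared-error loss, and the plain gradient-descent update are all assumptions chosen for illustration. A single 3D particle stands in for a unified state whose position drives both the physics and what every camera sees, so correcting the position from image-space targets also corrects the physical state.

```python
# Toy illustration only; names and update rule are hypothetical, not from the paper.

def project(point, view):
    """Orthographic projection: apply a 2x3 matrix (list of rows) to a 3D point."""
    return [sum(row[k] * point[k] for k in range(3)) for row in view]

def refine(position, views, targets, lr=0.25, steps=200):
    """Gradient descent on the summed squared 2D reprojection error over all views."""
    p = list(position)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for view, target in zip(views, targets):
            proj = project(p, view)
            residual = [proj[i] - target[i] for i in range(2)]
            # d/dp ||V p - y||^2 = 2 V^T (V p - y)
            for k in range(3):
                grad[k] += 2.0 * sum(view[i][k] * residual[i] for i in range(2))
        p = [p[k] - lr * grad[k] / len(views) for k in range(3)]
    return p

FRONT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # looks along z, observes (x, y)
SIDE = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]   # looks along x, observes (z, y)

true_p = [0.5, 0.2, -0.3]   # position implied by the refined frames
init_p = [0.0, 0.0, 0.0]    # position predicted before refinement

# One view: x and y are corrected, but depth (z) receives no gradient.
one_view = refine(init_p, [FRONT], [project(true_p, FRONT)])

# Two views: every coordinate is constrained, so the full 3D state is recovered.
two_view = refine(init_p, [FRONT, SIDE],
                  [project(true_p, FRONT), project(true_p, SIDE)])
```

With only the front view, the depth coordinate never moves from its initial value because no residual depends on it; adding the side view removes that ambiguity. A real system would optimize many particles and appearance parameters through a differentiable renderer, but the same reasoning about under-constrained directions applies.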
