PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu

TL;DR

PerpetualWonder tackles long-horizon, action-conditioned 4D scene generation from a single image. It introduces a Visual-Physical Aligned Particle (VPP) that tightly couples dynamics and appearance, enabling a true closed loop between forward physics simulation and backward neural refinement. A robust multi-view optimization initializes a full 3D scene and progressively draws supervision from multiple viewpoints to resolve ambiguity, producing sequences that are physically plausible and visually coherent over time. The approach outperforms prior hybrid simulators and conditional video generators in 3D consistency, controllability, and long-horizon stability, and users strongly prefer its results in realism evaluations. This work advances embodied AI and interactive virtual worlds by enabling long-horizon, action-conditioned dynamics that remain coherent when rendered from arbitrary viewpoints.

Abstract

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Existing methods fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements from updating the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and the appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that, from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions under long-horizon actions while maintaining physical plausibility and visual consistency.

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We propose PerpetualWonder, a hybrid generative simulator that generates a 4D scene from a single image and a sequence of long-horizon actions. Here we show a side-by-side comparison for a three-step action sequence (top to bottom, actions overlaid on the images). The left and right image blocks show renderings from two different viewpoints. PerpetualWonder outperforms the previous method. Video results are available at https://johnzhan2023.github.io/PerpetualWonder/.
  • Figure 2: Overview of PerpetualWonder. Given an input image, we reconstruct a 3D scene, represented with visual-physical aligned particles, from synthesized dense views. We then iterate between a forward physics pass and a backward neural optimization. The forward pass uses physical simulation to generate coarse scene dynamics. The backward optimization then updates the scene according to the multi-view videos refined by the video generation model. This closed-loop system enables long-horizon actions for the final 4D scene generation. The rendered results on the right show the generated scene after each consecutive action.
  • Figure 3: Qualitative results of the proposed PerpetualWonder. We show long-horizon scenes, each with three consecutive actions. The two action markers indicate a global force (gravity or a wind force field) and a 3D point force, respectively. All results are rendered from novel views, demonstrating our method's capability for long-horizon, action-conditioned 4D scene generation.
  • Figure 4: Qualitative comparisons between PerpetualWonder (ours) and the baseline methods. The top row shows the input images, actions, camera trajectories, and the text prompts describing the actions for the conditional video generators (Wan, Gen3C). For ease of comparison, only one time window is shown. The images from left to right illustrate the resulting scene dynamics and camera motion for each method.
  • Figure 5: Comparison of long-horizon actions between PerpetualWonder (top row) and WonderPlay (bottom row). For each method, the view changes over time, illustrating the results of four rounds of interaction on a castle scene. The applied actions are overlaid on the top row.
  • ...and 5 more figures
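
The alternation described in the Figure 2 caption (a forward physics pass followed by a backward optimization against multi-view supervision, repeated once per action) can be illustrated with a deliberately simplified toy. This is a sketch under stated assumptions, not the authors' implementation: `Particle`, `forward_physics`, `backward_refine`, `run_closed_loop`, and `make_targets` are hypothetical names; explicit Euler integration stands in for the physics simulator; and per-view 3D target positions stand in for the refined multi-view videos. Only positions are refined here, while the paper's refinement also corrects appearance.

```python
from dataclasses import dataclass
from typing import Callable, List

Vec = List[float]


@dataclass
class Particle:
    """Toy stand-in for an aligned particle: physical state plus appearance."""
    position: Vec  # [x, y, z]
    velocity: Vec  # [vx, vy, vz]
    color: Vec     # [r, g, b], carried along with the physical state


def forward_physics(particles: List[Particle], force: Vec,
                    dt: float = 0.1, steps: int = 10) -> List[Particle]:
    """Forward pass (assumed): explicit Euler under a global force, unit mass."""
    for _ in range(steps):
        for p in particles:
            p.velocity = [v + f * dt for v, f in zip(p.velocity, force)]
            p.position = [x + v * dt for x, v in zip(p.position, p.velocity)]
    return particles


def backward_refine(particles: List[Particle], view_targets: List[List[Vec]],
                    lr: float = 0.5, iters: int = 20) -> List[Particle]:
    """Backward pass (assumed): gradient descent on the squared distance to
    the targets, averaged over views. view_targets[v][i] is the position
    that view v proposes for particle i."""
    n_views = len(view_targets)
    for _ in range(iters):
        for i, p in enumerate(particles):
            # Gradient of (1 / n_views) * sum_v 0.5 * ||x - t_v||^2.
            grad = [
                sum(p.position[k] - view_targets[v][i][k]
                    for v in range(n_views)) / n_views
                for k in range(3)
            ]
            p.position = [x - lr * g for x, g in zip(p.position, grad)]
    return particles


def run_closed_loop(particles: List[Particle], actions: List[Vec],
                    make_targets: Callable[[List[Particle]], List[List[Vec]]]
                    ) -> List[Particle]:
    """One forward pass and one backward pass per action. The refined state
    is the starting point for the next action, which closes the loop."""
    for force in actions:
        forward_physics(particles, force)
        backward_refine(particles, make_targets(particles))
    return particles
```

Because `backward_refine` writes its correction into the same particles that `forward_physics` advances, the next action starts from the refined state rather than from the uncorrected simulation output. That shared state is the property the abstract attributes to the unified representation; everything else in the sketch is a placeholder.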