WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions

Zizhang Li, Hong-Xing Yu, Wei Liu, Yin Yang, Charles Herrmann, Gordon Wetzstein, Jiajun Wu

TL;DR

WonderPlay tackles action-conditioned dynamic 3D scene generation from a single image by marrying physics-based solvers with diffusion-based video generation. It introduces a hybrid generative simulator that produces coarse dynamics with physics, refines motion and appearance via a bimodal video generator conditioned on flow and the input image, and updates the scene through differentiable rendering. The approach supports diverse materials (rigid, elastic, cloth, liquids, gases, granular) and outperforms purely physics-based and purely video-based baselines on both quantitative metrics and human judgments. This framework enables intuitive user control while achieving high physical plausibility and visual realism, with potential impact on AR/VR, embodied AI, and interactive content creation.

Abstract

WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. While prior works are restricted to rigid body or simple elastic dynamics, WonderPlay features a hybrid generative simulator to synthesize a wide range of 3D dynamics. The hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elastic, and rigid bodies -- all using a single image input. Code will be made public. Project website: https://kyleleey.github.io/WonderPlay/
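The three-stage loop described in the abstract (coarse physics, video generation conditioned on that physics, then a scene update from the video) can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`Scene`, `coarse_physics`, `generate_video`, `update_scene`, `hybrid_simulate`) is hypothetical. The "scene" is a list of 1D points, the "video generator" is an arbitrary user-supplied function, and the scene update is a plain linear blend standing in for the differentiable-rendering update mentioned in the TL;DR.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scene:
    """Toy dynamic scene: frames[t][i] is the 1D position of point i at frame t."""
    frames: List[List[float]]


def coarse_physics(points: List[float], force: float, steps: int, dt: float = 0.1) -> Scene:
    """Stage 1 (placeholder solver): semi-implicit Euler, constant force, unit mass."""
    positions = list(points)
    velocities = [0.0] * len(positions)
    frames = [list(positions)]
    for _ in range(steps):
        velocities = [v + force * dt for v in velocities]
        positions = [p + v * dt for p, v in zip(positions, velocities)]
        frames.append(list(positions))
    return Scene(frames)


def generate_video(coarse: Scene, refine: Callable[[float], float]) -> Scene:
    """Stage 2 (placeholder generator): turn the coarse motion into 'refined' motion."""
    return Scene([[refine(p) for p in frame] for frame in coarse.frames])


def update_scene(coarse: Scene, video: Scene, weight: float = 0.5) -> Scene:
    """Stage 3 (placeholder update): blend the simulated scene toward the video.

    Assumption: a linear blend replaces the differentiable-rendering update.
    """
    return Scene([
        [(1.0 - weight) * c + weight * v for c, v in zip(cf, vf)]
        for cf, vf in zip(coarse.frames, video.frames)
    ])


def hybrid_simulate(points: List[float], force: float, steps: int,
                    refine: Callable[[float], float], weight: float = 0.5) -> Scene:
    """Run the three stages in order and return the updated scene."""
    coarse = coarse_physics(points, force, steps)
    video = generate_video(coarse, refine)
    return update_scene(coarse, video, weight)


if __name__ == "__main__":
    result = hybrid_simulate([0.0, 1.0], force=-9.8, steps=3, refine=lambda p: 1.05 * p)
    for t, frame in enumerate(result.frames):
        print(t, [round(p, 4) for p in frame])
```

The sketch only shows the data flow: the user's action enters through the physics stage, the generator sees the simulated motion rather than the raw action, and the final scene is derived from both.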

Paper Structure

This paper contains 16 sections, 5 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: We propose WonderPlay, a framework that takes a single image and actions as inputs, and then generates dynamic 3D scenes that depict the consequence of the actions. WonderPlay allows users to interact with various scenes of diverse physical materials, e.g., the hat and wine glass (rigid body), the hair (thin strands), the steam (gas), the mushroom (elastic), honey (liquid), and more. See https://kyleleey.github.io/WonderPlay/ for interactive video results.
  • Figure 2: Overview of WonderPlay. Given a single image, we first reconstruct the 3D scene and estimate material properties. Our hybrid generative simulator then uses a physics solver and the input actions to infer coarse 3D dynamics. The simulated appearance and motion signals condition the video generator through spatially varying bimodal control to synthesize realistic motion. The dynamic 3D scene is then refined using the synthesized video, completing the hybrid generative simulation.
  • Figure 3: Illustration of our spatially varying bimodal control, which drives the video generator with the input image $\mathbf{I}$, the pixel-space flow $\mathbf{F}$, and the simulation-rendered video $\tilde{\mathbf{V}}$.
  • Figure 4: Qualitative comparisons between WonderPlay (ours) and the baseline methods. The top row shows the input images, the actions, and the text describing the actions for CogVideoX [yang2024cogvideox].
  • Figure 5: Qualitative results of the proposed WonderPlay. The left column shows the input scene image and the actions, marked by icons for gravity, wind-field, and 3D point-force actions.
  • ...and 5 more figures