RealWonder: Real-Time Physical Action-Conditioned Video Generation

Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu

TL;DR

This work presents RealWonder, the first real-time system for action-conditioned video generation from a single image using physics simulation as an intermediate bridge, which opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning.

Abstract

Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, because they lack a structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is to use physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
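To make the "physics simulation as an intermediate bridge" idea concrete, the following is a minimal sketch of how an action (here a 3D force) could be turned into per-point optical flow. It is not the authors' implementation. The function names (`simulate_step`, `project`, `flow_from_motion`), the single-particle semi-implicit Euler step, and the pinhole camera parameters (`focal`, `cx`, `cy`) are all assumptions chosen for illustration; the abstract only states that actions are translated through a physics simulation into optical flow and RGB.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def simulate_step(pos: Vec3, vel: Vec3, force: Vec3, mass: float, dt: float) -> Tuple[Vec3, Vec3]:
    """Advance one particle by one semi-implicit Euler step under a constant force.

    Hypothetical stand-in for a real physics simulator.
    """
    new_vel = tuple(v + (f / mass) * dt for v, f in zip(vel, force))
    new_pos = tuple(p + v * dt for p, v in zip(pos, new_vel))
    return new_pos, new_vel


def project(point: Vec3, focal: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def flow_from_motion(before: List[Vec3], after: List[Vec3],
                     focal: float, cx: float, cy: float) -> List[Vec2]:
    """Optical flow per point: projected position after the step minus before it."""
    flows = []
    for p0, p1 in zip(before, after):
        u0, v0 = project(p0, focal, cx, cy)
        u1, v1 = project(p1, focal, cx, cy)
        flows.append((u1 - u0, v1 - v0))
    return flows


# One point 2 units in front of the camera, pushed along +x by a unit force.
# The principal point (416, 240) is the center of an 832-wide, 480-high frame.
before = [(0.0, 0.0, 2.0)]
after = [simulate_step(p, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), mass=1.0, dt=0.5)[0]
         for p in before]
flow = flow_from_motion(before, after, focal=400.0, cx=416.0, cy=240.0)
print(flow)  # [(50.0, 0.0)]
```

The point of the sketch is the interface: the video generator never sees the force vector itself, only its visual consequence (a flow field), which is the kind of signal a video model can be conditioned on.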

Paper Structure (13 sections, 5 equations, 12 figures, 5 tables)

This paper contains 13 sections, 5 equations, 12 figures, and 5 tables.

Figures (12)

  • Figure 1: Overview of RealWonder. (Left) Given a single image and a sequence of actions as input, we first reconstruct the 3D scene as point clouds. (Middle) We then estimate materials for the objects to be interacted with and maintain a physics simulation stream driven by the actions. (Right) Meanwhile, a second stream renders optical flow and an RGB preview to condition a few-step video generator, which produces the streaming physical action-conditioned video.
  • Figure 2: Training of Real-time Flow-Conditioned Video Model. A pretrained image-to-video model [wan2025] is first adapted to optical-flow conditioning through LoRA [hu2022lora] post-training. Next, it is distilled via distribution-matching training into a causal, flow-conditioned real-time video generator.
  • Figure 3: Qualitative results. In the left column we show the input scene image and initial actions, where the arrow indicates 3D force and the blue wind icon indicates force fields. Note that we always apply gravity in our simulation.
  • Figure 4: Comparison to baselines. In the first row, we show the input images and actions (arrows denote 3D point forces and a wind icon denotes force fields) together with the text prompts. We always apply gravity in our simulation.
  • Figure 5: Different actions on the same scene, leading to different generated physical outcomes.
  • ...and 7 more figures
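
The caption of Figure 2 mentions LoRA post-training. As a reminder of what that adaptation does at the level of a single linear layer, here is a minimal pure-Python sketch of a low-rank adapter. It is an illustration of the general LoRA technique, not the authors' training code: the class name `LoRALinear`, the initialization scale, and the `rank` and `alpha` values are assumptions made for the example.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update: y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight: Matrix, rank: int, alpha: float, seed: int = 0) -> None:
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A starts with small random values and B with zeros, so the adapter
        # contributes nothing until B is trained.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
print(layer.forward([1.0, 1.0]))  # [3.0, 7.0], identical to the frozen layer at initialization
```

Only `A` and `B` would be updated during post-training, which is why this kind of adapter is a comparatively cheap way to teach a pretrained model a new conditioning signal.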