Robot Learning from a Physical World Model

Jiageng Mao, Sicheng He, Hao-Ning Wu, Yang You, Shuyang Sun, Zhicheng Wang, Yanan Bao, Huizhong Chen, Leonidas Guibas, Vitor Guizilini, Howard Zhou, Yue Wang

TL;DR

PhysWorld addresses the physics gap in learning robotic manipulation from generated video demonstrations by building a physical world model from a single RGB-D image and a task command. It generates a task-conditioned video and reconstructs a physically interactable 4D scene, including metric depth via depth calibration with scale and shift $(\alpha,\beta)$, dynamic point clouds, and textured meshes with gravity alignment. It then learns an object-centric residual RL policy, trained with PPO, that adds a learned correction to a base action: $a_t = a^{base}_t + \pi_\theta(o_t)$. The pipeline enables zero-shot real-world manipulation without collecting real robot data, reaching an average success rate of 82% and outperforming zero-shot baselines. Object-pose tracking provides a stronger signal than point tracks or optical flow, and object-centric learning converges faster and reaches higher rewards than RL from scratch. The work bridges video generation and robot learning by converting visual demonstrations into physically feasible robotic trajectories. Its noted limitations are simulator fidelity and the sim-to-real gap, and it points to future work on synthesizing physically accurate videos for training.
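The scale-and-shift depth calibration mentioned above can be illustrated with a small sketch. It assumes (this is not stated in the summary) that $(\alpha,\beta)$ are chosen by ordinary least squares so that $\alpha \cdot d_{pred} + \beta$ matches the sensor depth on pixels with a valid reading. The function names `fit_scale_shift` and `calibrate` are hypothetical and are not taken from the PhysWorld code.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], metric: Sequence[float]) -> Tuple[float, float]:
    """Return (alpha, beta) minimizing sum((alpha * pred + beta - metric)^2).

    Assumption: a metric depth of 0 or less marks an invalid pixel and is skipped.
    """
    pairs = [(p, m) for p, m in zip(pred, metric) if m > 0.0]
    if len(pairs) < 2:
        raise ValueError("need at least two valid depth samples")
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0.0:
        raise ValueError("predicted depth is constant; scale is undetermined")
    cov = sum((p - mean_p) * (m - mean_m) for p, m in pairs)
    alpha = cov / var_p                 # closed-form least-squares slope
    beta = mean_m - alpha * mean_p      # closed-form least-squares intercept
    return alpha, beta


def calibrate(pred: Sequence[float], alpha: float, beta: float) -> List[float]:
    """Map relative depth values to metric depth with the fitted scale and shift."""
    return [alpha * p + beta for p in pred]


if __name__ == "__main__":
    relative = [0.1, 0.2, 0.4, 0.8, 0.9]
    sensor = [0.7, 0.9, 1.3, 2.1, 0.0]  # last reading is invalid (no depth return)
    a, b = fit_scale_shift(relative, sensor)
    print(a, b)
    print(calibrate(relative, a, b))
```

A closed-form fit is used here only because it is short and dependency-free; a robust estimator could be substituted if sensor depth contains outliers.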

Abstract

We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage (https://pointscoder.github.io/PhysWorld_Web/) for details.
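The residual formulation $a_t = a^{base}_t + \pi_\theta(o_t)$ behind the object-centric residual reinforcement learning can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the policy is any callable returning a correction vector, and the per-dimension clipping bound `max_residual` is a hypothetical safeguard that does not appear in the source. PPO training of the policy is out of scope here.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a policy maps an observation vector to a residual action.
Policy = Callable[[Sequence[float]], Sequence[float]]


def residual_action(
    base_action: Sequence[float],
    observation: Sequence[float],
    policy: Policy,
    max_residual: float = 0.05,
) -> List[float]:
    """Compose a_t = a_base_t + pi_theta(o_t).

    Assumption: each residual component is clipped to [-max_residual, max_residual]
    so the learned correction stays small relative to the base action.
    """
    residual = policy(observation)
    if len(residual) != len(base_action):
        raise ValueError("policy output and base action must have the same length")
    clipped = [max(-max_residual, min(max_residual, r)) for r in residual]
    return [b + r for b, r in zip(base_action, clipped)]


if __name__ == "__main__":
    # Stand-in for a trained policy: returns a fixed correction.
    def toy_policy(obs: Sequence[float]) -> List[float]:
        return [0.01, -0.2, 0.0]

    base = [0.30, 0.10, 0.50]  # e.g. a base end-effector position target
    print(residual_action(base, [0.0, 0.0, 0.0], toy_policy))
```

With a zero residual the executed action equals the base action, so the policy only has to learn a correction to it.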

Paper Structure

This paper contains 10 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: PhysWorld: a framework for robot learning from video generation. Given an image and a task prompt as inputs (column #1), our method generates a task-conditioned video (column #2) and reconstructs the underlying physical world to ground generated visual demonstrations into physically feasible robot actions (column #3), enabling zero-shot robotic manipulation in the real world (column #4).
  • Figure 2: PhysWorld pipeline. Given an RGB-D image and a task prompt, our framework (i) generates a task-conditioned video, (ii) reconstructs a geometry-aligned 4D representation from the generated video, (iii) generates textured object and background meshes, (iv) assembles them into a physically interactable scene through property estimation, gravity alignment, and collision optimization, (v) learns object-centric residual RL policies that transform visual demonstrations into feasible robotic actions, and (vi) deploys to the real world.
  • Figure 3: Qualitative evaluation of physical scene modeling from generated videos.
  • Figure 4: Quantitative evaluation of PhysWorld on real-world manipulation tasks.
  • Figure 5: Qualitative evaluation of PhysWorld on real-world manipulation tasks.
  • ...and 2 more figures