One-Shot Real-to-Sim via End-to-End Differentiable Simulation and Rendering

Yifan Zhu, Tianyi Xiang, Aaron Dollar, Zherong Pan

TL;DR

This work tackles the challenge of learning physically consistent world models from sparse robotic observations by jointly optimizing geometry, appearance, and physical parameters (GAP) of rigid objects. It introduces a differentiable pipeline that combines Shape-as-Points geometry with a grid-based appearance field, a Poisson-based occupancy representation, and a differentiable marching cubes renderer, all integrated with a differentiable rigid-body simulator. The proposed two-stage real-to-sim optimization leverages a geometry prior from web-scale models to recover plausible object shapes and physical properties from a single push, achieving accurate dynamics parameters and plausible novel-view renderings in both simulated and real environments. This approach yields a physically grounded world model suitable for planning and control in novel environments, with potential extensions to multi-object and richer appearance modeling.

Abstract

Identifying predictive world models from sparse online observations is essential for robot task planning and execution in novel environments. However, existing methods that leverage differentiable programming to identify world models are incapable of jointly optimizing the geometry, appearance, and physical properties of the scene. In this work, we introduce a novel rigid object representation that allows the joint identification of these properties. Our method employs a novel differentiable point-based geometry representation coupled with a grid-based appearance field, which allows differentiable object collision detection and rendering. Combined with a differentiable physical simulator, we achieve end-to-end optimization of world models, given the sparse visual and tactile observations of a physical motion sequence. Through a series of world model identification tasks in simulated and real environments, we show that our method can learn both simulation- and rendering-ready world models from only one robot action sequence. The code and additional videos are available at our project website: https://tianyi20.github.io/rigid-world-model.github.io/
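
To make the idea of identifying physical parameters by differentiating through a simulator concrete, here is a deliberately tiny sketch. It is not the paper's pipeline. It assumes a 1D block that always slides forward under a known push force, a known mass, and noise-free position observations, and it recovers only the friction coefficient. The names `simulate`, `loss_and_grad`, and `identify_friction` are hypothetical, and the gradient is obtained by hand-written forward-mode sensitivities rather than an autodiff library.

```python
G = 9.81  # gravitational acceleration in m/s^2


def simulate(mu, mass, forces, dt):
    """Roll out a 1D sliding block and the sensitivity of its position to mu.

    Assumption: the block always slides forward, so kinetic friction is
    constant and equal to -mu * mass * G (no stick-slip transitions).
    Returns (positions, d_positions_d_mu), one entry per time step.
    """
    x, v = 0.0, 0.0
    dx, dv = 0.0, 0.0  # forward-mode derivatives of x and v with respect to mu
    xs, dxs = [], []
    for u in forces:
        # Semi-implicit Euler step and its derivative with respect to mu.
        v += dt * (u / mass - mu * G)
        dv += dt * (-G)
        x += dt * v
        dx += dt * dv
        xs.append(x)
        dxs.append(dx)
    return xs, dxs


def loss_and_grad(mu, mass, forces, dt, observed):
    """Mean squared position error and its exact gradient with respect to mu."""
    xs, dxs = simulate(mu, mass, forces, dt)
    n = len(xs)
    loss = sum((x - o) ** 2 for x, o in zip(xs, observed)) / n
    grad = sum(2.0 * (x - o) * d for x, o, d in zip(xs, observed, dxs)) / n
    return loss, grad


def identify_friction(observed, mass, forces, dt, mu_init=0.1, lr=0.05, iters=200):
    """Plain gradient descent on mu through the differentiable rollout."""
    mu = mu_init
    for _ in range(iters):
        _, grad = loss_and_grad(mu, mass, forces, dt, observed)
        mu -= lr * grad
    return mu


# Synthetic "observation": a single push generated with a hidden friction value.
DT, MASS, MU_TRUE = 0.01, 1.0, 0.3
FORCES = [5.0] * 100  # constant 5 N push for 1 s (illustrative values)
OBSERVED, _ = simulate(MU_TRUE, MASS, FORCES, DT)
MU_EST = identify_friction(OBSERVED, MASS, FORCES, DT)
```

In this toy setting the loss is quadratic in the friction coefficient, so gradient descent converges quickly. The method described in the abstract optimizes shape, appearance, and physical parameters together from image-space losses, which is a much harder, non-convex problem.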

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: From the visual and tactile observations of a single robot push (top), our method jointly optimizes the shape, appearance, and physical parameters of a world model consisting of rigid objects in the form of a rigid body simulator (bottom, the robot arm is not rendered in this picture and the end-effector is treated as a floating blue sphere robot).
  • Figure 2: Overview of the proposed fully differentiable pipeline for world model identification from sparse robot observations. Our object representation couples an oriented point cloud $\mathcal{P}$ and a 3D appearance grid $\mathcal{\psi}$. Through a differentiable Poisson solver and differentiable marching cubes, the oriented point cloud is converted to an indicator grid $\chi$ and then a mesh, whose vertex textures are interpolated from the appearance grid $\mathcal{\psi}$. Feeding the object mesh, physical parameters $M$ and $\mu$, the terrain point cloud $\mathcal{P}_t$, and the robot pushing trajectory and control $\langle e^t,u^t \rangle$ into a differentiable rigid body simulator and renderer, the predicted scenes can be rendered. Calculating the loss against observed RGB-D images, the scene shape, appearance, and physical parameters are jointly optimized with gradient descent.
  • Figure 3: The experimental setups for the simulation (left) and physical (right) experiments. Nine objects are used for simulation with the PyBullet simulator, including 8 YCB objects and a green box. For the real-world experiments, three YCB objects (Drill, Mustard, and Sugar) are used. The hardware consists of a UR5e arm equipped with a pusher and an ATI Gamma F/T sensor, plus an overhead RealSense D435 RGB-D camera. Note that only the circled object in the real-world setup is the object of interest and everything else is treated as the static terrain.
  • Figure 4: The pushing trajectories used in the experiments. Left: The 8 starting locations of the floating spherical robot pushing trajectories and 3 pushing directions towards the robot at one of the starting locations for the Drill object in the simulation experiments. Middle and right: the training trajectory and 2 sample testing trajectories, with the first and last frames shown. [Best viewed in color.]
  • Figure 5: The predicted and ground-truth poses of the 5 different objects at the end of sampled testing trajectories for the simulation experiments. After training, the predicted poses are obtained by applying the control forces from the initial pose and integrating forward in time. The predicted object poses are highlighted with a yellow silhouette and overlaid with the ground-truth object, blue floating spherical robot, and background. [Best viewed in color.]
  • ...and 3 more figures
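
The Figure 2 caption says that vertex textures are interpolated from the appearance grid $\mathcal{\psi}$. The caption does not state the interpolation scheme, so the sketch below assumes trilinear interpolation over a regular lattice spanning the unit cube. The function name `trilinear_sample` and the nested-list grid layout are hypothetical choices for illustration. Because the output is a weighted sum of the eight surrounding grid entries, a loss on the rendered color has a simple gradient with respect to each entry (the corresponding weight), which is what makes such a lookup usable inside a gradient-based pipeline.

```python
def trilinear_sample(grid, point):
    """Interpolate per-node channels (e.g. RGB) at a point in the unit cube.

    grid[i][j][k] is a tuple of channels stored at lattice node (i, j, k).
    The lattice is assumed to have n >= 2 nodes per axis spanning [0, 1]^3.
    """
    n = len(grid)
    idx, frac = [], []
    for c in point:
        s = min(max(c, 0.0), 1.0) * (n - 1)  # continuous lattice coordinate
        i0 = min(int(s), n - 2)              # lower node, kept inside the grid
        idx.append(i0)
        frac.append(s - i0)
    channels = len(grid[0][0][0])
    out = [0.0] * channels
    # Accumulate the eight corner contributions with their trilinear weights.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((frac[0] if di else 1.0 - frac[0])
                     * (frac[1] if dj else 1.0 - frac[1])
                     * (frac[2] if dk else 1.0 - frac[2]))
                node = grid[idx[0] + di][idx[1] + dj][idx[2] + dk]
                for ch in range(channels):
                    out[ch] += w * node[ch]
    return tuple(out)


# Toy 2x2x2 grid whose "color" at each node equals the node's coordinates.
GRID = [[[(float(i), float(j), float(k)) for k in range(2)]
         for j in range(2)] for i in range(2)]
VERTEX_COLOR = trilinear_sample(GRID, (0.25, 0.5, 0.75))
```

Trilinear interpolation reproduces any function that is linear in each coordinate, so on this toy grid the sampled color equals the query point itself, which gives a quick sanity check.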