Scalable Real2Sim: Physics-Aware Asset Generation Via Robotic Pick-and-Place Setups

Nicholas Pfaff, Evelyn Fu, Jeremy Binagia, Phillip Isola, Russ Tedrake

TL;DR

The paper tackles the bottleneck of creating accurate, simulation-ready assets for real-world objects by proposing an automated Real2Sim pipeline that operates in a standard pick-and-place setup. It jointly reconstructs object visual and collision geometry and identifies inertial parameters using only joint-torque data and an external camera, leveraging alpha-transparent training for object-centric photometric reconstruction and a convex-optimization–based, information-driven excitation strategy. Key contributions include a general recipe for object-centric meshes from photometric methods, a practical augmented-Lagrangian solver for trajectory design under constraints, and extensive real-world validation plus a 20-object benchmark dataset. The approach enables scalable, intervention-free asset generation with potential to significantly accelerate sim-to-real research and data collection for physics-aware robotic manipulation.

Abstract

Simulating object dynamics from real-world perception shows great promise for digital twins and robotic manipulation but often demands labor-intensive measurements and expertise. We present a fully automated Real2Sim pipeline that generates simulation-ready assets for real-world objects through robotic interaction. Using only a robot's joint torque sensors and an external camera, the pipeline identifies visual geometry, collision geometry, and physical properties such as inertial parameters. Our approach introduces a general method for extracting high-quality, object-centric meshes from photometric reconstruction techniques (e.g., NeRF, Gaussian Splatting) by employing alpha-transparent training while explicitly distinguishing foreground occlusions from background subtraction. We validate the full pipeline through extensive experiments, demonstrating its effectiveness across diverse objects. By eliminating the need for manual intervention or environment modifications, our pipeline can be integrated directly into existing pick-and-place setups, enabling scalable and efficient dataset creation. Project page (with code and data): https://scalable-real2sim.github.io/.

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: An overview of our system. Objects are placed in the first bin, where the robot picks them up and reconstructs their geometries by moving them in front of a static camera while re-grasping to reduce occlusions (see the sections on visual geometry and collision geometry). Next, the robot identifies the object's physical parameters by following a trajectory designed to be informative for the inertial parameters (see the section on system identification). Finally, it places the object into the second bin and repeats the process with the next object. The extracted geometric and physical parameters are combined to generate a complete, simulatable object description.
  • Figure 2: Our object scanning method. We use re-grasps to display the object along two perpendicular axes, providing the camera with a complete view of the object.
  • Figure 3: Our object-centric visual reconstruction recipe. From the collected RGB images (a), we obtain the object masks (b) and gripper masks (c). Using only the object masks to ignore background pixels during training (d) results in density bleeding into unoccupied regions (g). Applying alpha-transparent training (e) mitigates density bleeding but incorrectly drives occluded object regions toward transparency (h). Ignoring pixels inside the gripper mask during training, combined with alpha-transparent training (f), successfully reconstructs an unoccluded object view with no density bleeding (i).
  • Figure 4: Our pick-and-place setup. It features a Kuka LBR iiwa 7 arm with a Schunk WSG-50 gripper and Toyota Research Institute Finray fingers. Workspace observations rely on three static RealSense D415 cameras (orange circles), while bin picking uses a RealSense D435 (green circle), and object scanning is performed with another D415 (red circle). All cameras capture 640×480 resolution RGBD images. Objects are picked from the right bin and placed into the left bin. A platform is used in the scanning workspace to enhance the iiwa’s kinematic range during re-grasping.
  • Figure 5: Real-world objects (left) and their reconstructed counterparts (right). Each object on the left was individually reconstructed using our pipeline. These assets were then manually arranged in simulation to approximately match their real-world poses and rendered to produce the image on the right. The strong visual similarity is notable, especially given that the reconstructions are rendered triangle meshes rather than neural renders.
  • ...and 3 more figures
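
The masking logic described in the Figure 3 caption can be illustrated with a small per-image loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `object_centric_loss`, its arguments, and the use of a random background color for compositing are all hypothetical choices made for illustration. Random-background compositing is one common way to supervise opacity, and the paper may do this differently. Background pixels are pushed toward transparency, while pixels inside the gripper mask contribute no loss at all, so occluded object regions are not driven toward transparency.

```python
import numpy as np


def object_centric_loss(pred_rgb, pred_alpha, gt_rgb, object_mask, gripper_mask, rng):
    """Illustrative photometric loss with alpha-transparent supervision.

    pred_rgb:     (H, W, 3) rendered color in [0, 1] (not premultiplied)
    pred_alpha:   (H, W)    rendered accumulated opacity in [0, 1]
    gt_rgb:       (H, W, 3) captured image in [0, 1]
    object_mask:  (H, W)    bool, True where the object is visible
    gripper_mask: (H, W)    bool, True where the gripper occludes the view
    rng:          numpy Generator used to draw the background color
    """
    # Assumption: a fresh random background color per call, so the model
    # cannot explain non-object pixels with a fixed opaque color.
    background = rng.uniform(0.0, 1.0, size=3)

    # Target: real color on object pixels, pure background elsewhere,
    # which corresponds to zero opacity outside the object.
    target = np.where(object_mask[..., None], gt_rgb, background)

    # Composite the prediction over the same background color.
    alpha = pred_alpha[..., None]
    composite = alpha * pred_rgb + (1.0 - alpha) * background

    # Pixels under the gripper are excluded from supervision entirely:
    # the true content there is unknown (it may be occluded object surface).
    supervised = ~gripper_mask
    if not supervised.any():
        return 0.0

    per_pixel = ((composite - target) ** 2).mean(axis=-1)
    return float(per_pixel[supervised].mean())
```

In this sketch, opaque density in a background pixel is penalized (the density bleeding case in panel g), while any prediction under the gripper mask leaves the loss unchanged (avoiding the failure in panel h).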