Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection

Jia-Feng Cai, Zibo Chen, Xiao-Ming Wu, Jian-Jian Jiang, Yi-Lin Wei, Wei-Shi Zheng

TL;DR

This work addresses the persistent gap between simulation and reality in 6-DoF grasp detection by introducing R2SGrasp, a real-to-sim framework. It trains a grasp detector on noiseless simulated data and performs inference-time adaptation to real data through two modules: R2SRepairer, which repairs depth maps to mitigate camera noise, and R2SEnhancer, which enriches real features with simulated geometric primitives via a memory-bank and cross-attention mechanism. The authors also present the R2Sim large-scale synthetic dataset to enable cost-efficient, scalable training. Across GraspNet-1Billion and real-world tests, R2SGrasp demonstrates strong transfer, surpassing many sim-to-real baselines and approaching or exceeding performance of models trained on real data, highlighting the practical potential of real-to-sim adaptation for robust, scalable grasping systems.

Abstract

For 6-DoF grasp detection, simulated data can be scaled up to train more powerful models, but it suffers from a large gap between simulation and the real world. Previous works bridge this gap in a sim-to-real way. However, this approach explicitly or implicitly forces the simulated data to adapt to noisy real data when training grasp detectors, and the positional drift and structural distortion caused by camera noise harm grasp learning. In this work, we propose a Real-to-Sim framework for 6-DoF Grasp detection, named R2SGrasp, whose key insight is to bridge this gap in a real-to-sim way, bypassing camera noise during grasp detector training through an inference-time real-to-sim adaptation. To achieve this adaptation, R2SGrasp introduces the Real-to-Sim Data Repairer (R2SRepairer), which mitigates the camera noise of real depth maps at the data level, and the Real-to-Sim Feature Enhancer (R2SEnhancer), which enhances real features with precise simulated geometric primitives at the feature level. To give our framework generalization ability, we cost-efficiently construct a large-scale simulated dataset to train our grasp detector, comprising 64,000 RGB-D images with 14.4 million grasp annotations. Extensive experiments show that R2SGrasp is powerful and that our real-to-sim perspective is effective. Real-world experiments further demonstrate the strong generalization ability of R2SGrasp. The project page is available at https://isee-laboratory.github.io/R2SGrasp.
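
The feature-level step, in which real features are enhanced with stored simulated features, can be illustrated with a minimal cross-attention sketch. This is not the authors' implementation: the function names `softmax` and `enhance_with_memory` are hypothetical, the memory bank is used as both keys and values, there are no learned projections, and the residual addition is an assumption. It is written in plain Python so that it runs without extra dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def enhance_with_memory(real_features, memory_bank):
    """Single-head cross-attention with a residual connection.

    real_features: list of D-dim query vectors (features from real data).
    memory_bank:   list of D-dim vectors standing in for stored simulated
                   features; used as both keys and values (an assumption).
    """
    dim = len(memory_bank[0])
    scale = 1.0 / math.sqrt(dim)
    enhanced = []
    for query in real_features:
        # Scaled dot-product similarity to every stored entry.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in memory_bank
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of the stored entries.
        attended = [
            sum(w * value[d] for w, value in zip(weights, memory_bank))
            for d in range(dim)
        ]
        # Residual: keep the real feature and add what was retrieved.
        enhanced.append([q + a for q, a in zip(query, attended)])
    return enhanced
```

A real feature that resembles a stored entry receives a larger attention weight for that entry, so the output is pulled toward the matching simulated feature while the original real feature is preserved through the residual term.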

Paper Structure (21 sections, 4 equations, 15 figures, 6 tables)


Figures (15)

  • Figure 1: Illustration of problems related to the gap between simulation and the real world. (a) shows mixed point clouds, including single-view point clouds of the scene and point clouds sampled from accurate object meshes. The real-world single-view point clouds exhibit positional drift and structural distortion caused by camera noise. (b) shows that camera noise disrupts the training of the grasp detector: the average precision of a detector trained on point clouds with real-world noise is lower than that of one trained on noiseless point clouds.
  • Figure 2: Overview of the R2SGrasp framework. In the inference phase, the Real-to-Sim Data Repairer (R2SRepairer) repairs the depth map from the RGB-D input, and a feature extractor then extracts local features from the single-view point cloud obtained from the repaired depth map. The Real-to-Sim Feature Enhancer (R2SEnhancer) then enhances the real features using the stored simulated structural features, and finally the grasp poses are predicted. In the training phase, we train the R2SRepairer on twin datasets and train the grasp detector with R2SEnhancer on our R2Sim dataset.
  • Figure 3: Illustration of the impact of camera noise. (a) shows the point cloud and noise map, where different colors in the noise map represent different noise amplitude ranges, with amplitude measured in millimeters. (b) shows the performance difference before and after depth map repair using ground truth; the green line represents the real-world training performance.
  • Figure 4: Comparison of camera noise before and after R2SRepairer. Different colors in the noise map represent different noise amplitude ranges, with amplitude measured in millimeters.
  • Figure 5: Experiments with different numbers of scenes. Top-$N$ denotes the $N$ grasp poses with the highest scores.
  • ...and 10 more figures
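
Two operations mentioned in the captions above can be sketched in a few lines: turning a depth map into a single-view point cloud (Figure 2) and mapping per-pixel noise amplitude in millimeters to ranges (Figures 3 and 4). This is only a sketch under stated assumptions. It uses the standard pinhole camera model, and the intrinsics `fx`, `fy`, `cx`, `cy`, the zero-means-missing convention, the bin edges, and both function names are hypothetical rather than taken from the paper.

```python
from bisect import bisect_right


def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth map (millimeters) to 3D points in meters.

    depth_mm is a list of rows. Pixels with depth <= 0 are treated as
    missing and skipped (a common convention, assumed here).
    """
    points = []
    for v, row in enumerate(depth_mm):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            z = d / 1000.0
            # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def noise_amplitude_bins(real_mm, reference_mm, edges_mm=(1, 2, 5, 10)):
    """Map per-pixel |real - reference| (mm) to an amplitude-range index.

    Index 0 means amplitude < edges_mm[0]; index i means
    edges_mm[i-1] <= amplitude < edges_mm[i]; len(edges_mm) means
    amplitude >= the last edge. Missing pixels (depth <= 0 in either
    map) are marked None. The edges are illustrative placeholders.
    """
    bins = []
    for real_row, ref_row in zip(real_mm, reference_mm):
        out_row = []
        for r, g in zip(real_row, ref_row):
            if r <= 0 or g <= 0:
                out_row.append(None)
            else:
                out_row.append(bisect_right(edges_mm, abs(r - g)))
        bins.append(out_row)
    return bins
```

Each range index can then be assigned a color to draw a noise map like the ones the captions describe, for example by comparing a captured depth map against a noiseless reference rendering of the same scene.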