Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection
Jia-Feng Cai, Zibo Chen, Xiao-Ming Wu, Jian-Jian Jiang, Yi-Lin Wei, Wei-Shi Zheng
TL;DR
This work addresses the persistent gap between simulation and reality in 6-DoF grasp detection by introducing R2SGrasp, a real-to-sim framework. It trains a grasp detector on noiseless simulated data and performs inference-time adaptation to real data through two modules: R2SRepairer, which repairs depth maps to mitigate camera noise, and R2SEnhancer, which enriches real features with simulated geometric primitives via a memory bank and a cross-attention mechanism. The authors also present R2Sim, a large-scale synthetic dataset that enables cost-efficient, scalable training. Across GraspNet-1Billion and real-world tests, R2SGrasp demonstrates strong transfer, surpassing many sim-to-real baselines and approaching or exceeding the performance of models trained on real data, highlighting the practical potential of real-to-sim adaptation for robust, scalable grasping systems.
Abstract
For 6-DoF grasp detection, simulated data can be scaled up to train more powerful models, but it suffers from a large gap between simulation and the real world. Previous works bridge this gap in a sim-to-real manner. However, this approach explicitly or implicitly forces the simulated data to adapt to the noisy real data when training grasp detectors, and the positional drift and structural distortion introduced by camera noise harm grasp learning. In this work, we propose a Real-to-Sim framework for 6-DoF Grasp detection, named R2SGrasp, with the key insight of bridging this gap in a real-to-sim manner, which bypasses camera noise during grasp detector training through an inference-time real-to-sim adaptation. To achieve this real-to-sim adaptation, R2SGrasp introduces the Real-to-Sim Data Repairer (R2SRepairer), which mitigates the camera noise of real depth maps at the data level, and the Real-to-Sim Feature Enhancer (R2SEnhancer), which enhances real features with precise simulated geometric primitives at the feature level. To endow our framework with generalization ability, we cost-efficiently construct a large-scale simulated dataset to train our grasp detector, which includes 64,000 RGB-D images with 14.4 million grasp annotations. Extensive experiments show that R2SGrasp is powerful and that our real-to-sim perspective is effective. Real-world experiments further demonstrate the strong generalization ability of R2SGrasp. The project page is available at https://isee-laboratory.github.io/R2SGrasp.
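To illustrate the feature-level idea, the following is a minimal sketch of how real features might attend over a memory bank of simulated geometric primitives using scaled dot-product cross-attention. It is not the authors' implementation: the function name `enhance_with_memory`, the tensor shapes, the absence of learned projection layers, and the residual addition are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enhance_with_memory(real_feats, memory_bank):
    """Hypothetical cross-attention between real features and a memory bank.

    real_feats:  (N, D) features extracted from real observations (queries).
    memory_bank: (M, D) stored simulated primitives (keys and values).
    Returns the enhanced features (N, D) and attention weights (N, M).
    """
    d = real_feats.shape[-1]
    # Similarity of every real feature to every memory entry.
    scores = real_feats @ memory_bank.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Weighted combination of memory entries for each real feature.
    retrieved = weights @ memory_bank
    # Assumption: the retrieved primitives are added as a residual.
    return real_feats + retrieved, weights


rng = np.random.default_rng(0)
real_feats = rng.normal(size=(5, 16))    # placeholder real features
memory_bank = rng.normal(size=(32, 16))  # placeholder simulated primitives
enhanced, weights = enhance_with_memory(real_feats, memory_bank)
```

In this sketch each real feature is refined by a convex combination of memory entries, so noisy real features are pulled toward geometry stored from clean simulated data. A practical version would presumably use learned query, key, and value projections.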
