Sim-to-Real Grasp Detection with Global-to-Local RGB-D Adaptation

Haoxiang Ma, Ran Qin, Modi Shi, Boyang Gao, Di Huang

TL;DR

This work tackles the sim-to-real gap in RGB-D grasp detection by casting it as a multi-modal domain adaptation problem. It introduces GL-MSDA, a two-stage framework that first applies self-supervised rotation pre-training and then performs global alignment of RGB and depth scene features plus local alignment of grasp features, aided by a grasp prototype adaptation module. The method improves results on the GraspNet-Planar benchmark and in real-robot experiments, and it is complemented by a large-scale simulated dataset to support future research. Its main contributions are explicit local-feature alignment and prototype-based matching between the simulated and real domains.

Abstract

This paper focuses on the sim-to-real issue of RGB-D grasp detection and formulates it as a domain adaptation problem. We present a global-to-local method to address hybrid domain gaps in RGB and depth data and insufficient multi-modal feature alignment. First, a self-supervised rotation pre-training strategy is adopted to deliver robust initialization for the RGB and depth networks. We then propose a global-to-local alignment pipeline with individual global domain classifiers for the scene features of RGB and depth images, as well as a local one that works specifically on grasp features in the two modalities. In particular, we propose a grasp prototype adaptation module, which facilitates fine-grained local feature alignment by dynamically updating and matching the grasp prototypes from the simulation and real-world scenarios throughout the training process. With these designs, the proposed method substantially reduces the domain shift and thus leads to consistent performance improvements. Extensive experiments are conducted on the GraspNet-Planar benchmark and in a physical environment, and the superior results demonstrate the effectiveness of our method.
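
Two ideas named in the abstract, the rotation pretext task and the dynamically updated grasp prototypes, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the rotation angles, the prototype update rule, nor the matching metric. The four 90-degree rotation classes, the exponential-moving-average update, the cosine distance, and all function names below are assumptions chosen for illustration.

```python
import math
import random


def rot90(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def make_rotation_sample(grid, rng):
    """Build one pretext-task sample: a rotated copy and its rotation label.

    Assumption: labels k in {0, 1, 2, 3} stand for a rotation of k * 90 degrees.
    A network would be trained to predict k from the rotated input.
    """
    k = rng.randrange(4)
    rotated = [list(row) for row in grid]
    for _ in range(k):
        rotated = rot90(rotated)
    return rotated, k


def update_prototype(prototype, features, momentum=0.9):
    """Move a prototype toward the mean of a batch of grasp features.

    Assumption: an exponential moving average stands in for the
    "dynamic updating" mentioned in the abstract.
    """
    dim = len(prototype)
    batch_mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, batch_mean)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity; a hypothetical matching cost between
    a simulation prototype and a real-world prototype."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-8)


if __name__ == "__main__":
    rng = random.Random(0)
    patch = [[1, 2], [3, 4]]
    rotated, label = make_rotation_sample(patch, rng)
    print("rotation label:", label, "rotated patch:", rotated)

    sim_proto = update_prototype([0.0, 0.0], [[1.0, 0.5], [0.8, 0.7]])
    real_proto = update_prototype([0.0, 0.0], [[0.9, 0.4], [0.7, 0.9]])
    print("prototype distance:", cosine_distance(sim_proto, real_proto))
```

In a training loop under these assumptions, the distance between matched simulation and real prototypes would be added to the loss so that grasp features from the two domains are pulled together as the prototypes evolve.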

Paper Structure (18 sections, 12 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: (a) The domain gap occurs in sim-to-real grasp detection and (b) the proposed GL-MSDA pipeline.
  • Figure 2: Overview of the proposed GL-MSDA method.
  • Figure 3: (a) Visualization of scenes rendered in the simulator by DR. (b) Scene-level grasp annotation.
  • Figure 4: (a) The 25 objects used for physical evaluation. (b) The 7-DoF Agile Diana-7 robot arm with an Intel RealSense D435i camera mounted at the end.
  • Figure 5: Visualization of the results on GraspNet-Planar.