Toward a Plug-and-Play Vision-Based Grasping Module for Robotics

François Hélénon; Johann Huber; Faïz Ben Amar; Stéphane Doncieux

TL;DR

This paper tackles the bottleneck of vision-based robotic grasping across multiple manipulators by coupling Quality-Diversity (QD) generated grasp repertoires with a modular perception pipeline for 6DoF object pose estimation. It introduces an integration workflow that rigidly transforms QD trajectories from simulation to the real object frame, enabling generalization across the robot's operational space without retraining. The key contributions include a cross-platform deployment framework, an open-source integration pipeline leveraging Detic, Megapose, and ICG, and empirical validation on FR3 and UR5 platforms with multiple hands and ten YCB objects. The results demonstrate that diverse, high-quality grasp trajectories can be effectively exploited in real scenes, achieving meaningful sim-to-real transfer and paving the way for plug-and-play vision-based grasping modules.

Abstract

Despite recent advancements in AI for robotics, grasping remains a partially solved challenge, hindered by the lack of benchmarks and reproducibility constraints. This paper introduces a vision-based grasping framework that can easily be transferred across multiple manipulators. Leveraging Quality-Diversity (QD) algorithms, the framework generates diverse repertoires of open-loop grasping trajectories, which improves adaptability. This framework addresses two main issues: the lack of an off-the-shelf vision module for detecting object pose and the generalization of QD trajectories to the whole robot operational space. The proposed solution combines multiple vision modules for 6DoF object detection and tracking while rigidly transforming QD-generated trajectories into the object frame. Experiments on a Franka Research 3 arm and a UR5 arm with an SIH Schunk hand demonstrate comparable performance when the real scene aligns with the simulation used for grasp generation. This work represents a significant stride toward building a reliable vision-based grasping module transferable to new platforms, while being adaptable to diverse scenarios without further training iterations.

Paper Structure (8 sections, 3 equations, 9 figures)

INTRODUCTION
RELATED WORKS
METHOD
Training
Deployment
EXPERIMENTS
RESULTS & DISCUSSION
CONCLUSIONS

Figures (9)

Figure 1: Overview of the proposed framework. It relies on 3D models of the robot and the target objects, together with RGB-D camera data. A diverse grasping repertoire is generated with ME-scs [huber2023quality] in simulation. The integration pipeline predicts the object pose through a sequence of perception modules [zhou2022detic, labbe2022megapose, stoiber2022icg]. The selected grasping trajectory is transformed into the object frame and fed to a motion planner to generalize the trajectory to the whole operational space. This adaptable framework is compatible with various manipulators with minimal engineering effort.
Figure 2: Notations and adaptation principle. The robot base frame $B$ and the world frame $W$ are assumed to be equal. The robot has to grasp a mug (frame $O$) with a pose estimated by an RGB-D camera (frame $C$) and perception modules. The trajectory $\tau$ has been generated in simulation with the object at $O_{sim}$. The path followed by the end-effector is adapted from one pose to another, resulting in the trajectory $\tau'$.
Figure 3: Object 6DoF pose detection pipeline. (a) The scene is first segmented to isolate the targeted object using Detic [zhou2022detic]; (b) Megapose [labbe2022megapose] performs 3D model matching to predict the 6DoF pose; (c) ICG [stoiber2022icg] tracks the object pose, extending 6DoF tracking to arbitrary poses and allowing retries after failure.
Figure 4: Experimental setups. To demonstrate the framework's flexibility across platforms, experiments were conducted on an FR3 arm with a parallel gripper and on a UR5 arm with a five-finger SIH hand. The 3 RGB-D cameras were used interchangeably to demonstrate robustness to both hardware and point of view. The 10 YCB objects [calli2015benchmarking] are used in both setups.
Figure 5: Adaptation of diverse trajectories. Results obtained in simulation on the FR3 robot by randomly picking 5 reach-and-grasp trajectories from a learned repertoire and applying them to different object poses. (Upper row): 2500 positions in the $xy$ grid at $z=0$, for a fixed orientation. (Lower row): 625 positions and 6 orientations per position (2 rotations around the $y$ axis and 3 around the $z$ axis). The maximum number of transferable trajectories per pose is therefore $5 \times 6 = 30$. The rigid transform adaptation method generalizes the grasps to the whole operational space. Failures occur when rotations prevent some grasps (e.g., collisions or reachability constraints).
...and 4 more figures
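
The adaptation principle described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that poses are stored as 4x4 homogeneous matrices and that each end-effector pose of $\tau$ is re-expressed relative to the simulated object frame $O_{sim}$, then mapped onto the observed object frame $O$. The function names (`make_pose`, `adapt_trajectory`), the restriction to rotations about $z$, and all numeric values are hypothetical and chosen for illustration only; the paper's own equations are not reproduced in this summary, so this is one plausible reading of "rigid transformation", not the authors' implementation.

```python
import numpy as np


def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z, then translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def adapt_trajectory(traj_world, T_W_Osim, T_W_O):
    """Map end-effector poses generated around O_sim onto the observed object pose O.

    Each pose keeps the same pose relative to the object:
    T_W_E' = T_W_O @ inv(T_W_Osim) @ T_W_E
    """
    delta = T_W_O @ np.linalg.inv(T_W_Osim)
    return [delta @ T_W_E for T_W_E in traj_world]


# Hypothetical values: object pose used in simulation (world frame W = base frame B).
T_W_Osim = make_pose(0.0, [0.5, 0.0, 0.0])

# Observed object pose: camera pose in the world composed with the
# object pose estimated in the camera frame C.
T_W_C = make_pose(0.0, [1.0, 0.0, 0.0])
T_C_O = make_pose(np.pi / 2, [-0.7, 0.2, 0.0])
T_W_O = T_W_C @ T_C_O

# A two-waypoint trajectory tau: approach pose, then grasp pose.
tau = [make_pose(0.0, [0.4, 0.0, 0.2]), make_pose(0.0, [0.5, 0.0, 0.05])]
tau_adapted = adapt_trajectory(tau, T_W_Osim, T_W_O)

for pose in tau_adapted:
    print(np.round(pose[:3, 3], 3))
```

In this sketch the adapted waypoints keep the same pose relative to the object as in simulation. Whether they are reachable and collision-free at the new location is a separate question, which the framework delegates to a motion planner according to the Figure 1 caption.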
