Toward a Plug-and-Play Vision-Based Grasping Module for Robotics
François Hélénon, Johann Huber, Faïz Ben Amar, Stéphane Doncieux
TL;DR
This paper tackles the bottleneck of vision-based robotic grasping across multiple manipulators by coupling Quality-Diversity (QD) generated grasp repertoires with a modular perception pipeline for 6DoF object pose estimation. It introduces an integration workflow that rigidly transforms QD trajectories from simulation to the real object frame, enabling generalization across the robot's operational space without retraining. The key contributions include a cross-platform deployment framework, an open-source integration pipeline leveraging Detic, Megapose, and ICG, and empirical validation on FR3 and UR5 platforms with multiple hands and ten YCB objects. The results demonstrate that diverse, high-quality grasp trajectories can be effectively exploited in real scenes, achieving meaningful sim-to-real transfer and paving the way for plug-and-play vision-based grasping modules.
Abstract
Despite recent advancements in AI for robotics, grasping remains a partially solved challenge, hindered by the lack of benchmarks and by reproducibility constraints. This paper introduces a vision-based grasping framework that can easily be transferred across multiple manipulators. Leveraging Quality-Diversity (QD) algorithms, the framework generates repertoires of open-loop grasping trajectories, enhancing adaptability while maintaining a diversity of grasps. The framework addresses two main issues: the lack of an off-the-shelf vision module for detecting object pose, and the generalization of QD trajectories to the whole robot operational space. The proposed solution combines multiple vision modules for 6DoF object detection and tracking, and rigidly transforms QD-generated trajectories into the object frame. Experiments on a Franka Research 3 arm and a UR5 arm with a SIH Schunk hand demonstrate comparable performance when the real scene aligns with the simulation used for grasp generation. This work represents a significant stride toward building a reliable vision-based grasping module that is transferable to new platforms and adaptable to diverse scenarios without further training iterations.
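To make the idea of rigidly transforming a trajectory into the object frame concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that every pose is a 4x4 homogeneous matrix expressed in the robot base frame, that the simulated object pose used during grasp generation is known, and that the vision pipeline returns the real object pose. All function names (`make_pose`, `rot_z`, `invert_pose`, `retarget_trajectory`) are hypothetical.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def invert_pose(pose):
    """Invert a rigid transform using R^T and -R^T t (no general matrix inverse)."""
    rotation, translation = pose[:3, :3], pose[:3, 3]
    return make_pose(rotation.T, -rotation.T @ translation)


def retarget_trajectory(traj_sim, obj_pose_sim, obj_pose_real):
    """Map end-effector waypoints recorded in simulation onto the real object.

    traj_sim:      (N, 4, 4) end-effector poses in the base frame (simulation).
    obj_pose_sim:  (4, 4) object pose in the base frame during grasp generation.
    obj_pose_real: (4, 4) object pose in the base frame as estimated by vision.

    Each waypoint keeps the same pose relative to the object:
        T_real_ee = T_real_obj @ inv(T_sim_obj) @ T_sim_ee
    """
    correction = obj_pose_real @ invert_pose(obj_pose_sim)
    # matmul broadcasts the (4, 4) correction over the N waypoints.
    return correction @ traj_sim


# Illustrative values only (assumed, not taken from the paper).
obj_pose_sim = make_pose(np.eye(3), [0.0, 0.0, 0.0])
obj_pose_real = make_pose(rot_z(np.pi / 2), [0.5, 0.2, 0.0])
traj_sim = np.stack([
    make_pose(np.eye(3), [0.1, 0.0, 0.3]),  # pre-grasp waypoint
    make_pose(np.eye(3), [0.1, 0.0, 0.1]),  # grasp waypoint
])
traj_real = retarget_trajectory(traj_sim, obj_pose_sim, obj_pose_real)
```

Because a single rigid correction is applied to the whole trajectory, the end-effector pose relative to the object is preserved at every waypoint. This property is what lets a repertoire generated for one object placement be reused elsewhere in the workspace, subject to reachability and collision constraints that this sketch does not check.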
