XGrasp: Gripper-Aware Grasp Detection with Multi-Gripper Data Generation
Yeonseo Lee, Jungwook Mun, Hyosup Shin, Guebin Hwang, Junhee Nam, Taeyeop Lee, Sungho Jo
TL;DR
XGrasp tackles the challenge of grasp detection across multiple gripper types by systematically augmenting existing datasets with multi-gripper annotations and proposing a real-time two-stage architecture. The Grasp Point Predictor (GPP) handles global scene and gripper information to select candidate locations, while the Angle-Width Predictor (AWP) refines the grasp angle and width using local features and a contrastive learning objective that enables zero-shot generalization to unseen grippers. Empirical results on the Jacquard dataset, simulation, and real-world experiments demonstrate competitive grasp success rates across diverse grippers and substantially faster inference than prior gripper-aware methods. The approach also integrates with vision foundation models, highlighting practical potential for scalable, language-enabled robotic grasping, and points to future work on extending to 6-DOF grasping and 3D spaces.
Abstract
Most robotic grasping methods are designed for a single gripper type, which limits their applicability in real-world scenarios requiring diverse end-effectors. We propose XGrasp, a real-time gripper-aware grasp detection framework that efficiently handles multiple gripper configurations. The proposed method addresses data scarcity by systematically augmenting existing datasets with multi-gripper annotations. XGrasp employs a hierarchical two-stage architecture. In the first stage, a Grasp Point Predictor (GPP) identifies optimal grasp locations using global scene information and gripper specifications. In the second stage, an Angle-Width Predictor (AWP) refines the grasp angle and width using local features. Contrastive learning in the AWP module enables zero-shot generalization to unseen grippers by learning fundamental grasping characteristics. The modular framework integrates seamlessly with vision foundation models, providing pathways for future vision-language capabilities. Experimental results demonstrate competitive grasp success rates across various gripper types, along with substantial improvements in inference speed over existing gripper-aware methods. Project page: https://sites.google.com/view/xgrasp
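The two-stage flow described above (a global stage that picks a grasp location, then a local stage that refines angle and width around it) can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: the abstract gives no architectural details, so the learned GPP and AWP networks are replaced here by hypothetical callables `heatmap_fn` and `score_fn`, and the function names, the patch size, the dictionary-based gripper description, and the discretized angle/width grid are all assumptions made for illustration.

```python
import numpy as np


def predict_grasp_point(heatmap):
    """Stage 1 stand-in: return the (row, col) of the highest-scoring pixel."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(row), int(col)


def crop_patch(image, center, size):
    """Cut a size x size window centered on `center`, zero-padding at borders."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the window has a center pixel")
    half = size // 2
    padded = np.pad(image, half, mode="constant")
    row, col = center
    # After padding, original pixel (row, col) sits at (row + half, col + half).
    return padded[row:row + size, col:col + size]


def refine_angle_width(scores, angles, widths):
    """Stage 2 stand-in: pick the best (angle, width) pair from a score grid."""
    a_idx, w_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return float(angles[a_idx]), float(widths[w_idx])


def detect_grasp(image, gripper, heatmap_fn, score_fn, angles, widths, patch_size=5):
    """Run both stages; heatmap_fn and score_fn are hypothetical model stubs."""
    heatmap = heatmap_fn(image, gripper)          # global scene + gripper info
    point = predict_grasp_point(heatmap)
    patch = crop_patch(image, point, patch_size)  # local features only
    scores = score_fn(patch, gripper)             # shape: (len(angles), len(widths))
    angle, width = refine_angle_width(scores, angles, widths)
    return {"point": point, "angle": angle, "width": width}
```

In this sketch the gripper description is passed to both stages, mirroring the abstract's statement that the first stage uses gripper specifications; how the actual AWP consumes gripper information, and how its contrastive objective is formulated, is not specified in the text above.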
