OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping
Li Meng, Zhao Qi, Lyu Shuchang, Wang Chunlei, Ma Yujing, Cheng Guangliang, Yang Chenguang
TL;DR
This work tackles open-vocabulary robotic grasping by introducing OVGrasping, a large-scale dataset that pairs base and novel objects with rich language descriptions, and OVGNet, a unified visual-linguistic framework for locating and grasping targets guided by language. The architecture combines a visual-linguistic perception module with a GraspNet-based grasping system, augmented by two alignment modules, IGLA and LGIA, to improve cross-modal alignment and generalization to unseen objects. Key contributions include the dataset construction (117 categories, 63,385 instances), the open-vocabulary learning framework, and ablation studies showing gains from the alignment modules and robust grasping performance in both simulated and open tests. The results demonstrate improved recognition and grasp success on novel objects, supporting practical deployment and providing a benchmark to spur further development in open-vocabulary robotic manipulation.
Abstract
Recognizing and grasping novel-category objects remains a crucial yet challenging problem in real-world robotic applications. Despite its significance, limited research has been conducted in this specific domain. To address this, we propose a novel framework that seamlessly integrates open-vocabulary learning into robotic grasping, enabling robots to handle novel objects. Our contributions are threefold. First, we present a large-scale benchmark dataset tailored for evaluating open-vocabulary grasping tasks. Second, we propose a unified visual-linguistic framework that guides robots in grasping both base and novel objects. Third, we introduce two alignment modules designed to enhance visual-linguistic perception in the robotic grasping process. Extensive experiments validate the efficacy and utility of our approach. Notably, our framework achieves average accuracies of 71.2% and 64.4% on the base and novel categories of our new dataset, respectively.
