OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping

Li Meng, Zhao Qi, Lyu Shuchang, Wang Chunlei, Ma Yujing, Cheng Guangliang, Yang Chenguang

TL;DR

This work tackles open-vocabulary robotic grasping by introducing OVGrasping, a large-scale dataset that pairs base and novel objects with rich language descriptions, and OVGNet, a unified visual-linguistic framework for locating and grasping targets guided by language. The architecture combines a visual-linguistic perception module with a GraspNet-based grasping system, augmented by two alignment modules, IGLA and LGIA, to improve cross-modal alignment and generalization to unseen objects. Key contributions include the dataset construction (117 categories, 63,385 instances), the open-vocabulary learning framework, and ablation studies showing gains from the alignment modules and robust grasping performance in both simulated and open tests. The results demonstrate improved recognition and grasp success on novel objects, supporting practical deployment and providing a benchmark to spur further development in open-vocabulary robotic manipulation.

Abstract

Recognizing and grasping novel-category objects remains a crucial yet challenging problem in real-world robotic applications. Despite its significance, limited research has been conducted in this domain. To address this, we propose a novel framework that seamlessly integrates open-vocabulary learning into robotic grasping, enabling robots to handle novel objects. Our contributions are threefold. First, we present a large-scale benchmark dataset tailored for evaluating open-vocabulary grasping. Second, we propose a unified visual-linguistic framework that guides robots in grasping both base and novel objects. Third, we introduce two alignment modules designed to enhance visual-linguistic perception in the robotic grasping process. Extensive experiments validate the efficacy and utility of our approach. Notably, our framework achieves an average accuracy of 71.2% on base categories and 64.4% on novel categories in our new dataset.

Paper Structure (25 sections, 6 equations, 5 figures, 6 tables)

This paper contains 25 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Diagram of open-vocabulary grasping. Objects and text in red and blue indicate the base and novel categories, respectively.
  • Figure 2: Samples from the OVGrasping dataset. Red boxes indicate the target objects, and green boxes denote the relative objects.
  • Figure 3: Overview of OVGNet. The visual-linguistic perception system locates the target object referred to by natural language. The grasping system generates a grasping pose for the target object. MHA stands for multi-head attention. "Feature constrain" denotes scaling the image features by the constraint score. LGQS denotes the language-guided query selection module.
  • Figure 4: Visualization on the OVGrasping dataset. Green boxes indicate the ground truth, and red boxes denote the detection results.
  • Figure 5: Case analysis. Green boxes indicate the ground truth, red boxes denote the predicted results, and the yellow area marks the defect.
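
The Figure 3 caption names two operations without detailing them: scaling image features by a constraint score, and language-guided query selection (LGQS). The sketch below is only an illustration of how such operations are commonly built, not OVGNet's confirmed implementation. It assumes dot-product similarity between image and text tokens, takes the best-matching text token as each image token's score, selects the top-k image tokens as queries, and multiplies each image token by its score. All function names and the toy features are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def token_scores(image_feats: Matrix, text_feats: Matrix) -> List[float]:
    """Score each image token by its best-matching text token (assumed similarity)."""
    return [max(dot(img, txt) for txt in text_feats) for img in image_feats]


def select_queries(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest-scoring image tokens (ties keep index order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order[:k]


def constrain_features(image_feats: Matrix, scores: Sequence[float]) -> List[List[float]]:
    """Scale every image token by its score, one scalar per token."""
    return [[s * x for x in feat] for feat, s in zip(image_feats, scores)]


# Toy example: three image tokens, one text token (hypothetical values).
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_feats = [[1.0, 0.0]]

scores = token_scores(image_feats, text_feats)
queries = select_queries(scores, k=2)
scaled = constrain_features(image_feats, scores)
```

In this toy case the first and third image tokens align best with the text token, so they are the ones chosen as queries, while the second token is scaled to zero. A real system would typically normalize the scores (for example with a sigmoid) before scaling; that step is omitted here because the source does not specify it.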