GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning

Huy Hoang Nguyen, An Vuong, Anh Nguyen, Ian Reid, Minh Nhat Vu

TL;DR

GraspMamba addresses the bottlenecks of language-driven robotic grasping in cluttered environments by proposing a Mamba-based vision-language framework that fuses multimodal features across multiple scales. The method introduces hierarchical feature fusion that integrates text embeddings at each stage of a four-stage Mamba backbone, enabling efficient long-range context modeling with linear complexity. Empirical results on the Grasp-Anything dataset and real-robot experiments show improved grasp accuracy and faster inference compared with strong baselines, along with analyses of fusion contributions and generalization. This work advances practical, real-time language-guided manipulation and offers a scalable path for deploying multimodal grasping systems in real-world robotics.

Abstract

Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images or lengthy textual descriptions, or suffer from slow inference. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with a Mamba vision backbone to tackle these challenges. By combining the rich visual features of the Mamba-based backbone with textual information, our approach enhances the fusion of multimodal features. GraspMamba is the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference. Extensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.
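
The hierarchical fusion idea summarized above (text features injected at each of four backbone stages, at progressively coarser scales) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `downsample` and `hierarchical_fusion`, the channel widths, the random weights, the per-pixel linear map standing in for a Mamba block, and the multiplicative fusion operator are all assumptions chosen only to show the data flow.

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution with 2x2 average pooling. x has shape (H, W, C)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def hierarchical_fusion(image_feat, text_emb, channels=(16, 32, 64, 128), seed=0):
    """Fuse a text embedding into visual features at every stage (hypothetical sketch).

    image_feat: (H, W, C) array with H and W divisible by 2**len(channels).
    text_emb:   (D,) array, e.g. a sentence embedding from any text encoder.
    Returns one fused feature map per stage, from fine to coarse.
    """
    rng = np.random.default_rng(seed)  # random weights stand in for learned ones
    fused_per_stage = []
    x = image_feat
    for c_out in channels:
        x = downsample(x)
        # Placeholder for a Mamba vision block (assumption): per-pixel linear map + tanh.
        w_vis = rng.standard_normal((x.shape[-1], c_out)) / np.sqrt(x.shape[-1])
        x = np.tanh(x @ w_vis)
        # Project the text embedding to this stage's channel width.
        w_txt = rng.standard_normal((text_emb.shape[0], c_out)) / np.sqrt(text_emb.shape[0])
        t = text_emb @ w_txt  # shape (c_out,)
        # Assumed fusion operator: channel-wise modulation broadcast over all positions.
        x = x * (1.0 + np.tanh(t))
        fused_per_stage.append(x)
    return fused_per_stage

if __name__ == "__main__":
    image = np.random.default_rng(1).standard_normal((64, 64, 3))
    text = np.ones(32)
    for i, f in enumerate(hierarchical_fusion(image, text), start=1):
        print(f"stage {i}: fused feature shape {f.shape}")
```

In this sketch the text embedding influences every scale rather than only the final feature map, which is the property the TL;DR attributes to the method. A real system would replace the placeholder block with actual Mamba layers and learn all projections end to end.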

Paper Structure

This paper contains 14 sections, 8 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: The overview of our GraspMamba framework for the language-driven grasp detection task.
  • Figure 2: Visualization of language-driven grasp detection results of different methods.
  • Figure 3: Visualization of the feature fusion results when different text inputs are used.
  • Figure 4: In-the-wild detection results. The images are from the YCB-Video dataset (fang2020graspnet) and the internet.
  • Figure 5: Failure cases of our method.
  • ...and 1 more figure