Language-driven Grasp Detection with Mask-guided Attention

Tuan Van Vo, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen

TL;DR

This work addresses language-driven grasp detection under occlusions by introducing a mask-guided attention Transformer that fuses visual features, segmentation mask cues, and natural language. The MaskGrasp framework leverages cross-modal attention among text, grasp-region visual features, and segmentation features, guided by a triplet correspondence loss and a grasp regression/classification objective. Key contributions include the mask-guided attention mechanism, a multimodal fusion pipeline with segmentation awareness, and comprehensive evaluations on the Grasp-Anything dataset plus real-robot experiments, showing clear performance gains over strong baselines. The method enhances robustness in cluttered scenes and supports on-demand object manipulation via natural language, offering practical benefits for language-enabled robotic systems.

Abstract

Grasp detection is an essential task in robotics with various industrial applications. However, traditional methods often struggle with occlusions and do not utilize language for grasping. Incorporating natural language into grasp detection remains challenging and largely unexplored. To address this gap, we propose a new method for language-driven grasp detection with mask-guided attention, which combines the transformer attention mechanism with semantic segmentation features. Our approach integrates visual data, segmentation mask features, and natural language instructions, significantly improving grasp detection accuracy. Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications. Extensive experiments show that our method outperforms other recent baselines by a clear margin, with a 10.0% improvement in success score. We further validate our method in real-world robotic experiments, confirming the effectiveness of our approach.
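To make the idea of mask-guided attention more concrete, here is a minimal NumPy sketch of one way segmentation cues could steer cross-modal attention: text tokens act as queries over visual patch features, and patches outside the segmentation mask are excluded before the softmax. This is an illustrative assumption, not the paper's formulation. The function name `mask_guided_cross_attention`, the hard boolean mask, and the absence of learned projections are all hypothetical simplifications.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def mask_guided_cross_attention(text_feats, visual_feats, patch_mask):
    """Hypothetical sketch: text queries attend only to masked-in patches.

    text_feats:   (T, d) text token features, used as queries.
    visual_feats: (P, d) visual patch features, used as keys and values.
    patch_mask:   (P,) boolean, True where a patch lies inside the
                  segmentation mask of the target object (assumed input).
    """
    d = text_feats.shape[-1]
    # Scaled dot-product similarity between every token and every patch.
    logits = text_feats @ visual_feats.T / np.sqrt(d)  # (T, P)
    # Assumption: patches outside the mask get a very negative logit,
    # so they receive (numerically) zero attention weight.
    logits = np.where(patch_mask[None, :], logits, -1e9)
    weights = softmax(logits, axis=-1)  # each row sums to 1
    # Each text token becomes a weighted mix of masked-in patch features.
    return weights @ visual_feats, weights


rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))    # 3 text tokens, feature dim 8
patches = rng.normal(size=(5, 8))  # 5 visual patches
mask = np.array([True, False, True, True, False])
fused, attn = mask_guided_cross_attention(text, patches, mask)
```

In this sketch the mask acts as a hard filter. A learned model would more plausibly embed the mask as features and let attention weigh them softly, which matches the abstract's mention of "segmentation mask features" more closely than a binary gate does.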

Paper Structure

This paper contains 14 sections, 11 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: We propose a mask-guided attention mechanism that learns the mask and language features to tackle the language-driven grasping task.
  • Figure 2: The overview of our mask-guided attention framework for the language-driven grasp detection task.
  • Figure 3: Language-driven grasp detection results.
  • Figure 4: The visualization comparison between using and not using our mask-guided attention.
  • Figure 5: t-SNE visualization of the grasp object feature representations. We apply t-SNE to cluster the grasp object feature representations $z^{vis}$ of Equation \ref{eq:5} when using and not using the correspondence loss with mask feature objects in our method.
  • ...and 3 more figures