HMT-Grasp: A Hybrid Mamba-Transformer Approach for Robot Grasping in Cluttered Environments

Songsong Xiong, Hamidreza Kasaei

TL;DR

This work tackles robotic grasping in clutter by balancing local geometric detail against global contextual information. It introduces a hybrid Mamba-Transformer architecture whose grasp encoder fuses CNN and Transformer features through Vision Mamba blocks and a 2D state-space fusion mechanism, and whose grasp decoder feeds a grasp synthesis step. The approach reports state-of-the-art results on Cornell, Jacquard, and OCID-Grasp, with roughly 99.5% accuracy on Cornell and 93.3% accuracy on Jacquard with RGB-D input, plus strong performance in cluttered real-robot trials. The authors present it as a practical, adaptable option for grasping in industrial and service robotics that generalizes better across diverse scenes and input modalities.

Abstract

Robot grasping, whether handling isolated objects, cluttered items, or stacked objects, plays a critical role in industrial and service applications. However, current visual grasp detection methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) often struggle to adapt to diverse scenarios, as they tend to emphasize either local or global features exclusively, neglecting complementary cues. In this paper, we propose a novel hybrid Mamba-Transformer approach to address these challenges. Our method improves robotic visual grasping by effectively capturing both global and local information through the integration of Vision Mamba and parallel convolutional-transformer blocks. This hybrid architecture significantly improves adaptability, precision, and flexibility across various robotic tasks. To ensure a fair evaluation, we conducted extensive experiments on the Cornell, Jacquard, and OCID-Grasp datasets, ranging from simple to complex scenarios. Additionally, we performed both simulated and real-world robotic experiments. The results demonstrate that our method not only surpasses state-of-the-art techniques on standard grasping datasets but also delivers strong performance in both simulation and real-world robot applications.

Paper Structure

This paper contains 25 sections, 2 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: Overview of the hybrid Mamba-Transformer architecture for robotic grasping in cluttered environments: The system comprises a grasp encoder, which integrates parallel CNN and Transformer networks, followed by Mamba blocks for enhanced feature fusion. A grasp decoder with skip connections further refines local and global feature extraction. The decoder upsamples to predict grasp quality, angle, and width, forming the grasp synthesis. Finally, based on the synthesized grasp prediction, the robot executes the grasp.
  • Figure 2: Multiple-object grasp detection using HMT. From left to right: grasp rectangles on RGB images, and heatmaps for grasp quality, width, and angle.
  • Figure 3: Comparison of grasp detection outcomes and quality across CNN, Transformer, Mamba, and HMT methods in complex multi-object scenes.
  • Figure 4: Experimental setups for robotic grasping with varying clutter: Light, Moderate, and High.
  • Figure 5: Simulation setups: (left) all simulation objects; (right) our robot grasps and releases a randomly placed object.
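
The captions for Figures 1 and 2 describe three per-pixel output maps (grasp quality, angle, and width) that are combined into grasp rectangles. The summary does not say how the authors do this, so the following is only a minimal sketch of one common way to read a single grasp from such maps: take the pixel with the highest quality and look up the angle and width at that location. The names `Grasp`, `synthesize_grasp`, `grasp_rectangle`, and the `jaw_size` parameter are hypothetical and chosen for illustration; the maps are plain nested lists so the example needs only the standard library.

```python
import math
from typing import List, NamedTuple, Tuple

Map = List[List[float]]  # one value per pixel, indexed as map[row][col]


class Grasp(NamedTuple):
    row: int        # pixel row of the grasp centre
    col: int        # pixel column of the grasp centre
    angle: float    # gripper rotation in radians (assumed convention)
    width: float    # gripper opening in pixels (assumed unit)
    quality: float  # predicted grasp quality at the centre


def synthesize_grasp(quality: Map, angle: Map, width: Map) -> Grasp:
    """Pick the highest-quality pixel and read angle and width there.

    Assumption: this argmax rule is a generic post-processing step,
    not necessarily the one used in the paper.
    """
    if not quality or not quality[0]:
        raise ValueError("quality map must be non-empty")
    best_r, best_c = 0, 0
    for r, row in enumerate(quality):
        for c, q in enumerate(row):
            if q > quality[best_r][best_c]:
                best_r, best_c = r, c
    return Grasp(best_r, best_c,
                 angle[best_r][best_c],
                 width[best_r][best_c],
                 quality[best_r][best_c])


def grasp_rectangle(g: Grasp, jaw_size: float) -> List[Tuple[float, float]]:
    """Return the four (x, y) corners of the rectangle for a grasp.

    The long side follows the grasp angle and has length g.width;
    the short side has length jaw_size (a hypothetical parameter).
    """
    cx, cy = float(g.col), float(g.row)
    # Half-extent along the grasp direction.
    ax = 0.5 * g.width * math.cos(g.angle)
    ay = 0.5 * g.width * math.sin(g.angle)
    # Half-extent perpendicular to the grasp direction.
    bx = -0.5 * jaw_size * math.sin(g.angle)
    by = 0.5 * jaw_size * math.cos(g.angle)
    return [
        (cx - ax - bx, cy - ay - by),
        (cx + ax - bx, cy + ay - by),
        (cx + ax + bx, cy + ay + by),
        (cx - ax + bx, cy - ay + by),
    ]
```

Calling `synthesize_grasp` on the three maps yields one candidate grasp, and `grasp_rectangle` turns it into a rectangle that can be drawn over the RGB image as in Figure 2. Detecting several grasps in a multi-object scene would need additional steps, such as taking several local maxima of the quality map, which this sketch leaves out.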