HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

Guanghu Xie, Yonglong Zhang, Zhiduo Jiang, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu

TL;DR

HTMNet addresses the core challenge of depth completion for transparent and reflective objects by integrating a dual-branch Transformer-CNN encoder with a Transformer-Mamba bottleneck fusion and a multi-scale decoder. The method leverages self-attention and state-space modeling to fuse multimodal features, achieving state-of-the-art results on TransCG, ClearGrasp, and STD datasets and producing detailed depth maps in challenging regions. This work advances robust depth perception in scenarios where standard RGB-D sensors fail, with practical implications for robotic grasping and manipulation in real-world environments. Overall, HTMNet demonstrates the benefits of combining attention-based fusion with state-space models to handle complex optical phenomena in depth sensing.

Abstract

Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is based on a dual-branch CNN-Transformer framework, the bottleneck fusion module adopts a Transformer-Mamba architecture, and the decoder is built upon a multi-scale fusion module. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture to transparent object depth completion and revealing its promising potential. Additionally, we design a multi-scale fusion module for the decoder that combines channel attention, spatial attention, and multi-scale feature extraction to integrate multi-scale features through a down-fusion strategy. Extensive evaluations on multiple public datasets demonstrate that our model achieves state-of-the-art (SOTA) performance, validating the effectiveness of our approach.

Paper Structure

This paper contains 23 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: (a) illustrates two typical types of errors encountered when capturing the depth of transparent objects using depth cameras: one is missing depth information, and the other is the erroneous acquisition of background depth; (b) depicts the general pipeline in which a depth completion model is used to recover the depth of transparent objects, which is subsequently fed into downstream tasks.
  • Figure 2: Depth completion plays a crucial role in dexterous grasping applications. When grasping transparent or specular objects, depth completion is first performed to obtain relatively complete depth information. The completed depth maps are then fed into a dexterous grasping network for grasp detection. This process effectively addresses the failure of grasp detection caused by missing depth information in transparent or specular objects.
  • Figure 3: HTMNet Architecture. Our method consists of a dual-branch encoder, a bottleneck fusion module, and a decoder. The Transformer-based backbone extracts RGB-D features, while the CNN-based backbone extracts depth features. The bottleneck fusion module performs multimodal fusion at the network bottleneck, and the decoder is composed of a multi-scale fusion module, convolutional layers, and upsampling operations.
  • Figure 4: Bottleneck fusion module (BFM): Composed of Transformer and Mamba modules. Multimodal features are summed pixel-wise and then processed sequentially through a self-attention block, a Mamba block, and an MLP block to output enhanced fused features.
  • Figure 5: Multi-scale fusion module (MSFM): Built on spatial attention, channel attention, and multi-scale feature extraction mechanisms, it fuses multi-scale features from the encoder for the decoder.
  • ...and 5 more figures
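
The data flow named in the Figure 4 caption (pixel-wise sum, then self-attention, then a Mamba block, then an MLP) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: all function names are hypothetical, the attention uses identity projections instead of learned weights, the Mamba block is replaced by a fixed linear state-space recurrence (real Mamba uses input-dependent, learned parameters), and the MLP is reduced to a token-wise ReLU. Normalization and any residual connections are omitted because the caption does not specify them.

```python
import math


def fuse_sum(rgbd_tokens, depth_tokens):
    """Element-wise sum of two token sequences with identical shape."""
    return [[a + b for a, b in zip(ta, tb)]
            for ta, tb in zip(rgbd_tokens, depth_tokens)]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections (no learned weights).
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def ssm_scan(tokens, decay=0.5):
    """Stand-in for the Mamba block: a fixed linear state-space recurrence.

    h_t = decay * h_{t-1} + (1 - decay) * x_t, with output y_t = h_t.
    Assumption: `decay` is a fixed scalar, not a learned selective parameter.
    """
    state = [0.0] * len(tokens[0])
    out = []
    for x in tokens:
        state = [decay * h + (1.0 - decay) * xi for h, xi in zip(state, x)]
        out.append(state)
    return out


def mlp_placeholder(tokens):
    """Stand-in for the MLP block: a token-wise ReLU with no learned weights."""
    return [[max(0.0, v) for v in t] for t in tokens]


def bottleneck_fusion(rgbd_tokens, depth_tokens):
    """Apply the four stages in the order given by the caption."""
    x = fuse_sum(rgbd_tokens, depth_tokens)
    x = self_attention(x)
    x = ssm_scan(x)
    return mlp_placeholder(x)


if __name__ == "__main__":
    # Three flattened bottleneck positions with two channels each (toy values).
    rgbd = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
    depth = [[0.1, 0.0], [0.0, 0.2], [0.3, -0.2]]
    for token in bottleneck_fusion(rgbd, depth):
        print(token)
```

The sketch treats the bottleneck feature map as a flattened sequence of tokens, which is the form both attention and state-space scans operate on; how HTMNet flattens and orders the spatial positions is not stated in the captions above.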