HDCNet: A Hybrid Depth Completion Network for Grasping Transparent and Reflective Objects

Guanghu Xie, Mingxu Li, Songwei Wu, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu

TL;DR

HDCNet addresses unreliable depth perception on transparent and reflective objects with a hybrid depth completion network. It combines a dual-branch Transformer-CNN encoder (a Transformer branch for RGB-D features and a CNN branch for depth features), a lightweight multimodal fusion module at the shallow encoder layers, and a Transformer-Mamba fusion block at the network bottleneck. The authors report state-of-the-art depth completion on multiple public datasets and, in robotic grasping experiments, grasp success rates for transparent and reflective objects that improve by up to 60%. Key contributions are the hierarchical multimodal fusion strategy and the demonstration that combining Transformer, CNN, and Mamba architectures yields robust, globally informed depth estimates. The method's effectiveness across real and synthetic datasets, together with real-world grasping validation, highlights the potential of hybrid fusion architectures for robust perception in complex optical environments.

Abstract

Depth perception of transparent and reflective objects has long been a critical challenge in robotic manipulation. Conventional depth sensors often fail to provide reliable measurements on such surfaces, limiting the performance of robots in perception and grasping tasks. To address this issue, we propose a novel depth completion network, HDCNet, which integrates the complementary strengths of Transformer, CNN, and Mamba architectures. Specifically, the encoder is designed as a dual-branch Transformer-CNN framework to extract modality-specific features. At the shallow layers of the encoder, we introduce a lightweight multimodal fusion module to effectively integrate low-level features. At the network bottleneck, a Transformer-Mamba hybrid fusion module is developed to achieve deep integration of high-level semantic and global contextual information, significantly enhancing depth completion accuracy and robustness. Extensive evaluations on multiple public datasets demonstrate that HDCNet achieves state-of-the-art (SOTA) performance in depth completion tasks. Furthermore, robotic grasping experiments show that HDCNet substantially improves grasp success rates for transparent and reflective objects, achieving up to a 60% increase.

Paper Structure

This paper contains 19 sections, 22 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Comparison of grasp detection for transparent and reflective objects with and without depth completion.
  • Figure 2: HDCNet Architecture. Our method consists of a dual-branch encoder, a bottleneck fusion module, and a decoder. The Transformer-based backbone extracts RGB-D features, while the CNN-based backbone extracts depth features. The bottleneck fusion module performs multimodal fusion at the network bottleneck, and the decoder is composed of a multi-scale fusion module, convolutional layers, and upsampling operations.
  • Figure 3: Depth Completion Visualizations of Different Models on the TransCG Dataset
  • Figure 4: Depth Completion Visualizations of Different Models on the ClearGrasp Real-world Dataset
  • Figure 5: Transparent and reflective objects in real-world grasping experiments. According to the sequence numbers, the objects are respectively: water bottle 1, reflective foam board, reflective box, beverage bottle 1, milk bottle, water bottle 2, beverage bottle 2, water bottle 3, and detergent bottle.
  • ...and 1 more figure