GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects

Yizhe Liu, Tong Jia, Da Cai, Hao Wang, Dongyue Chen

TL;DR

The paper tackles depth completion for transparent and specular objects, for which depth sensors return incomplete or noisy measurements. It introduces GAA-TSO, a geometry-aware framework with a 2D image branch and a 3D point-cloud branch (via PMP-Net). Features from the two branches are fused through gated cross-modal fusion modules, and an adaptive correlation aggregation mechanism aligns the 3D cues with the corresponding 2D features. The approach yields state-of-the-art results on four public datasets (ClearGrasp, OOD, TransCG, STD) and improves downstream robotic grasping, which supports its practical value for manipulating objects made of challenging materials. By explicitly modelling 3D structure and fusing it robustly with image features, the method provides sharper depth boundaries and better scene understanding for robotics in real-world settings.

Abstract

Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to their unique optical properties, the depth information on these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. Therefore, it is crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Because transparent and specular objects are poorly textured, these methods, which rely heavily on color information, tend to generate structure-less depth predictions. Moreover, these 2D methods cannot effectively explore the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, besides extracting 2D features from the RGB-D input, we back-project the input depth to a point cloud and build a 3D branch to extract hierarchical scene-level 3D structural features. To exploit 3D geometric information, we design several gated cross-modal fusion modules to effectively propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy to appropriately assign 3D features to the corresponding 2D features. Extensive experiments on the ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
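
The back-projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function name `backproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the rule that non-positive or non-finite depth marks a missing reading are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) to an (N, 3) point cloud.

    Assumes a pinhole camera. Pixels whose depth is non-positive or
    non-finite are treated as missing and dropped (an assumption here).
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Pixel grids of shape (H, W): u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = np.isfinite(depth) & (depth > 0)
    z = depth[valid]
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Toy 2x2 depth map with one missing reading (the zero entry).
depth = np.array([[1.0, 2.0],
                  [0.0, 4.0]])
points = backproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In this toy case three of the four pixels carry valid depth, so `points` has three rows. Missing readings on transparent or specular surfaces leave holes in the point cloud, just as they do in the raw depth map.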

Paper Structure

This paper contains 19 sections, 8 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: An example of depth completion in a scene containing transparent and specular objects. From the first to the third row, we show the RGB image, the raw depth map, and the depth completion result from FDCT [li2023fdct], respectively.
  • Figure 2: The overview of our proposed GAA-TSO architecture, which consists of three main components: image branch, point cloud branch, and gated cross-modal fusion (GCMF).
  • Figure 3: Details of the adaptive correlation aggregation strategy.
  • Figure 4: The structure of our gated cross-modal fusion module, which consists of self-attention, cross-attention, and a gated recurrent unit.
  • Figure 5: Qualitative results on the ClearGrasp dataset. From the first to the fifth column, we show the input RGB, raw depth, ground truth, results of FDCT, and ours.
  • ...and 3 more figures
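
The caption of Figure 4 names a gated recurrent unit as one part of the fusion module. As a rough illustration of what GRU-style gating does when 3D features update 2D features, here is a minimal sketch using the textbook GRU update. It is not the authors' module: the self-attention and cross-attention stages are omitted, biases are left out, the two feature sets are assumed to be already aligned row by row, and the names `gru_gated_fusion`, `W_z`, `W_r`, and `W_h` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_gated_fusion(feat_2d, feat_3d, params):
    """Update 2D features with 3D features through a textbook GRU gate.

    feat_2d, feat_3d: arrays of shape (N, C), assumed aligned row by row.
    params: dict with weight matrices W_z, W_r, W_h, each of shape (2C, C).
    """
    joint = np.concatenate([feat_2d, feat_3d], axis=1)
    z = sigmoid(joint @ params["W_z"])  # update gate: how much to overwrite
    r = sigmoid(joint @ params["W_r"])  # reset gate: how much 2D state to reuse
    cand_in = np.concatenate([r * feat_2d, feat_3d], axis=1)
    candidate = np.tanh(cand_in @ params["W_h"])
    # Convex combination of the old 2D state and the candidate state.
    return (1.0 - z) * feat_2d + z * candidate


rng = np.random.default_rng(0)
n, c = 5, 4
params = {k: rng.normal(scale=0.1, size=(2 * c, c)) for k in ("W_z", "W_r", "W_h")}
feat_2d = np.tanh(rng.normal(size=(n, c)))  # toy 2D features in (-1, 1)
feat_3d = rng.normal(size=(n, c))           # toy 3D features
fused = gru_gated_fusion(feat_2d, feat_3d, params)
```

The update gate `z` decides, per channel, how much of the image feature is kept and how much is replaced by the candidate computed from the 3D feature. Learned gating of this kind lets a network downweight unreliable geometric cues instead of adding them unconditionally.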