GAA-TSO: Geometry-Aware Assisted Depth Completion for Transparent and Specular Objects
Yizhe Liu, Tong Jia, Da Cai, Hao Wang, Dongyue Chen
TL;DR
The paper tackles depth completion for transparent and specular objects, where depth sensors produce incomplete or noisy data. It introduces GAA-TSO, a geometry-aware framework with a 2D image branch and a 3D point-cloud branch (via PMP-Net). Features from the two branches are fused through gated cross-modal modules, and an adaptive correlation aggregation mechanism aligns the 2D and 3D cues. The approach yields state-of-the-art results on four public datasets (ClearGrasp, OOD, TransCG, STD) and improves downstream robotic grasping, which supports its practical value for manipulating objects made of these challenging materials. By explicitly modelling 3D structure and fusing it with 2D image features, the method provides sharper depth boundaries and better scene understanding for robotics in real-world settings.
Abstract
Transparent and specular objects are frequently encountered in daily life, factories, and laboratories. However, due to their unique optical properties, the depth information captured for these objects is usually incomplete and inaccurate, which poses significant challenges for downstream robotics tasks. It is therefore crucial to accurately restore the depth information of transparent and specular objects. Previous depth completion methods for these objects usually use RGB information as an additional channel of the depth image to perform depth prediction. Because transparent and specular objects are poorly textured, methods that rely heavily on color information tend to generate depth predictions that lack structure. Moreover, these 2D methods cannot effectively explore the 3D structure hidden in the depth channel, resulting in depth ambiguity. To this end, we propose a geometry-aware assisted depth completion method for transparent and specular objects, which focuses on exploring the 3D structural cues of the scene. Specifically, in addition to extracting 2D features from the RGB-D input, we back-project the input depth to a point cloud and build a 3D branch to extract hierarchical scene-level 3D structural features. To exploit this 3D geometric information, we design several gated cross-modal fusion modules that propagate multi-level 3D geometric features to the image branch. In addition, we propose an adaptive correlation aggregation strategy that assigns 3D features to their corresponding 2D features. Extensive experiments on the ClearGrasp, OOD, TransCG, and STD datasets show that our method outperforms other state-of-the-art methods. We further demonstrate that our method significantly enhances the performance of downstream robotic grasping tasks.
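The back-projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a standard pinhole camera model; the function name `back_project_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the convention that non-positive depth marks a missing pixel are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project_depth(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Assumes a pinhole camera: pixel (u, v) with depth z maps to
    x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as missing and skipped
    (an assumed convention; sensors differ in how they mark invalid depth).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v indexes image rows
        for u, z in enumerate(row):      # u indexes image columns
            if z <= 0.0:
                continue                 # skip missing depth readings
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map in metres; the zero entry stands for a
# pixel where the sensor returned no depth.
toy_depth = [[1.0, 0.0],
             [2.0, 2.0]]
toy_points = back_project_depth(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

Under these assumptions, the missing pixels simply produce no points, so the resulting cloud is sparse exactly where transparent or specular surfaces defeat the sensor. That sparsity is the gap the 3D branch and the fusion modules described above are meant to help fill.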
