ClearDepth: Enhanced Stereo Perception of Transparent Objects for Robotic Manipulation

Kaixin Bai, Huajian Zeng, Lei Zhang, Yiwen Liu, Hongli Xu, Zhaopeng Chen, Jianwei Zhang

TL;DR

A vision transformer-based algorithm for stereo depth recovery of transparent objects is developed. It incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm.

Abstract

Transparent object depth perception poses a challenge in everyday life and logistics, primarily because standard 3D sensors cannot accurately capture depth on transparent or reflective surfaces. This limitation significantly affects applications that rely on depth maps and point clouds, especially robotic manipulation. We developed a vision transformer-based algorithm for stereo depth recovery of transparent objects. This approach is complemented by an innovative feature post-fusion module, which improves the accuracy of depth recovery by using structural features in the images. To address the high cost of dataset collection for stereo camera-based perception of transparent objects, our method incorporates a parameter-aligned, domain-adaptive, and physically realistic Sim2Real simulation for efficient data generation, accelerated by an AI algorithm. Our experimental results demonstrate the model's exceptional Sim2Real generalizability in real-world scenarios, enabling precise depth mapping of transparent objects to assist in robotic manipulation. Project details are available at https://sites.google.com/view/cleardepth/.

Paper Structure (33 sections, 8 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: ClearDepth leverages structure-aware stereo matching and synthetic training data to bridge the Sim2Real gap in transparent object grasping, achieving superior speed–accuracy trade-offs.
  • Figure 2: Our stereo depth recovery network for transparent objects. The feature encoder extracts appearance features from both left and right images, while a context encoder processes the left image to provide structural priors for disparity refinement. A correlation pyramid is then constructed by merging left–right features to capture correspondence cues. These features, together with structural priors, are iteratively refined through a GRU-based update loop, which integrates texture similarity and structural consistency. The network finally outputs a refined disparity map that is robust to transparency-induced ambiguities.
  • Figure 3: The SynClearDepth dataset, with diverse objects and various scene configurations.
  • Figure 4: Visualization results of our transparent-object stereo depth reconstruction method compared with other SOTA stereo depth estimation methods after fine-tuning on the SynClearDepth dataset.
  • Figure 5: Qualitative comparison of ClearGrasp (Sajjan et al., 2020), TransCG (Fang et al., 2022), ASGrasp (Shi et al., 2024), and the proposed ClearDepth on objects of different materials in single-object and cluttered scenes.
  • ...and 7 more figures
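
The Figure 2 caption describes a correlation pyramid built from left and right features, and a network that outputs a disparity map. The sketch below illustrates one common way such a pyramid can be built. It assumes a RAFT-Stereo-style row-wise all-pairs correlation with average pooling along the right-image axis, which the caption does not confirm. The function names, feature shapes, and camera values (`focal_px`, `baseline_m`) are hypothetical and chosen only for illustration. The last helper applies the standard rectified-stereo relation depth = focal length × baseline / disparity.

```python
import numpy as np


def correlation_volume(feat_left, feat_right):
    """Row-wise all-pairs correlation between left and right features.

    feat_left, feat_right: arrays of shape (C, H, W).
    Returns an array of shape (H, W, W), where entry [h, i, j] is the scaled
    dot product of the left feature at (h, i) and the right feature at (h, j).
    """
    channels = feat_left.shape[0]
    return np.einsum("chi,chj->hij", feat_left, feat_right) / np.sqrt(channels)


def correlation_pyramid(corr, num_levels=4):
    """Build a pyramid by average-pooling the right-image axis by 2 per level.

    This pooling scheme is an assumption, not taken from the paper.
    """
    pyramid = [corr]
    for _ in range(num_levels - 1):
        prev = pyramid[-1]
        even_width = prev.shape[-1] // 2 * 2  # drop a trailing odd column
        pooled = (
            prev[..., :even_width]
            .reshape(*prev.shape[:-1], even_width // 2, 2)
            .mean(axis=-1)
        )
        pyramid.append(pooled)
    return pyramid


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert disparity (pixels) to metric depth for a rectified stereo pair."""
    return focal_px * baseline_m / np.maximum(disparity, eps)


# Toy inputs: 8-channel features on a 4 x 16 grid (hypothetical sizes).
rng = np.random.default_rng(0)
feat_left = rng.standard_normal((8, 4, 16))
feat_right = rng.standard_normal((8, 4, 16))

corr = correlation_volume(feat_left, feat_right)
pyramid = correlation_pyramid(corr, num_levels=4)
shapes = [level.shape for level in pyramid]

# Hypothetical camera: 500 px focal length, 0.1 m baseline.
depth = disparity_to_depth(np.array([10.0, 25.0]), focal_px=500.0, baseline_m=0.1)
```

In an iterative scheme of the kind the caption describes, the current disparity estimate would index into each pyramid level to gather correlation features for the GRU update; that lookup and the update loop itself are omitted here.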