TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image

Haoxiao Wang, Kaichen Zhou, Binrui Gu, Zhiyuan Feng, Weijie Wang, Peilin Sun, Yicheng Xiao, Jianhua Zhang, Hao Dong

TL;DR

This work proposes a single-view RGB-D depth completion framework that leverages Denoising Diffusion Probabilistic Models to achieve material-agnostic object grasping in desktop scenes. It also proposes a novel training method that better aligns the noisy depth and RGB image features, which serve as conditions to refine the depth estimate step by step.

Abstract

Manipulating transparent objects presents significant challenges because their reflection and refraction properties considerably hinder accurate estimation of their 3D shapes. To address these challenges, we propose a single-view RGB-D depth completion framework, TransDiff, that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop scenes. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine depth estimation step by step. Finally, we use an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines on both synthetic and real-world benchmarks with acceptable inference time. The demo of our method can be found at https://wang-haoxiao.github.io/TransDiff/

Paper Structure

This paper contains 23 sections, 5 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of TransDiff. We introduce a novel approach to reconstructing the depth map of transparent objects from a single RGB-D image; it learns an iterative denoising process that transforms a random depth distribution into a depth map. Once the point cloud is obtained, we carry out grasp pose generation and grasp execution in the scene.
  • Figure 2: TransDiff pipeline. Given an RGB-D image, TransDiff first predicts the mask, the boundary, and the surface normals of transparent objects. Global optimization then generates an initially refined depth map, which is combined with the RGB image; feature fusion integrates the features from both into a combined representation. The Refined Visual Conditioned Denoising Block (RVCDB) iteratively refines the depth map through a denoising process guided by the fused visual conditions. The denoising process incorporates both the diffusion process and the guidance provided by the refined visual conditions, progressively improving the depth map's quality.
  • Figure 3: Comparison between TransDiff and other methods on the ClearGrasp real-world dataset.
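
The iterative denoising described in the abstract and in Figure 2 can be illustrated with a generic DDPM reverse loop. The following is a minimal sketch under stated assumptions, not the TransDiff implementation: the linear noise schedule, the step count, and the names `make_schedule`, `stub_noise_predictor`, and `reverse_diffusion` are all hypothetical. In particular, `stub_noise_predictor` stands in for the learned, visually conditioned denoiser; it simply treats the conditioning depth as the clean target, which is enough to show how a random depth sample is pulled toward the condition step by step.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption); returns betas, alphas, cumulative alphas."""
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def stub_noise_predictor(x_t, cond_depth, alpha_bar_t):
    """Hypothetical stand-in for a learned denoiser.

    Assumes the condition is the clean depth x_0 and inverts the forward relation
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps to recover eps.
    """
    s = math.sqrt(alpha_bar_t)
    n = math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * c) / n for x, c in zip(x_t, cond_depth)]


def reverse_diffusion(cond_depth, num_steps=50, seed=0):
    """Standard DDPM ancestral sampling, starting from Gaussian noise."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in cond_depth]  # random initial "depth"
    for t in reversed(range(num_steps)):
        eps_hat = stub_noise_predictor(x, cond_depth, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t])
                for xi, e in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # one common variance choice
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


# Toy flattened "depth map" in metres (made-up values for illustration).
initial_depth = [0.52, 0.55, 0.81, 1.20]
completed = reverse_diffusion(initial_depth)
```

With this stub the loop converges to the conditioning depth itself. In a real system the predictor would be a trained network that takes the fused RGB and depth features as input, so the output could differ from, and ideally improve on, the initial refinement.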