Monocular Depth Estimation and Segmentation for Transparent Object with Iterative Semantic and Geometric Fusion

Jiangyuan Liu, Hongxuan Ma, Yuxin Guo, Yuhao Zhao, Chi Zhang, Wei Sui, Wei Zou

TL;DR

This work tackles monocular perception of transparent objects by jointly estimating depth and segmentation from a single RGB image. It introduces a semantic and geometric fusion module that exchanges information between the two tasks, and an iterative refinement strategy that progressively sharpens predictions. The approach uses a Vision Transformer backbone, a reassemble module that builds multi-scale feature pyramids, and a shared-weight decoder updated through lightweight gates across iterations. Without extra modalities, it reports state-of-the-art performance on synthetic and real-world datasets, surpassing monocular, stereo, and multi-view baselines by about 38.8%-46.2%, which suggests practical potential for transparent-object perception in robotics and related applications.

Abstract

Transparent object perception is indispensable for numerous robotic tasks. However, accurately segmenting and estimating the depth of transparent objects remain challenging due to complex optical properties. Existing methods primarily address only one task using extra inputs or specialized sensors, neglecting the valuable interactions among tasks and the subsequent refinement process, leading to suboptimal and blurry predictions. To address these issues, we propose a monocular framework, which is the first to excel in both segmentation and depth estimation of transparent objects, with only a single-image input. Specifically, we devise a novel semantic and geometric fusion module, effectively integrating the multi-scale information between tasks. In addition, drawing inspiration from human perception of objects, we further incorporate an iterative strategy, which progressively refines initial features for clearer results. Experiments on two challenging synthetic and real-world datasets demonstrate that our model surpasses state-of-the-art monocular, stereo, and multi-view methods by a large margin of about 38.8%-46.2% with only a single RGB input. Code and models are publicly available at https://github.com/L-J-Yuan/MODEST.
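
The iterative strategy mentioned in the abstract can be pictured as a loop that reapplies one decoder and blends each new output into the running features through a gate. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `iterative_refine`, the scalar gate parameters (`w_feat`, `w_cand`, `bias`), and the convex-blend update rule are all hypothetical stand-ins for the learned lightweight gates.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def iterative_refine(feat, decoder, num_iters=3, w_feat=1.0, w_cand=1.0, bias=0.0):
    """Refine `feat` by applying the same `decoder` callable repeatedly.

    ASSUMPTION: the gate below is a hypothetical, GRU-like convex blend with
    scalar parameters. The paper only states that lightweight gates update a
    shared-weight decoder; the exact form is not given in this section.
    """
    for _ in range(num_iters):
        # The decoder is shared: the same callable is used at every iteration.
        candidate = decoder(feat)
        # Gate values lie in (0, 1) and decide how much of the candidate to accept.
        gate = sigmoid(w_feat * feat + w_cand * candidate + bias)
        # Convex combination of the new candidate and the previous features.
        feat = gate * candidate + (1.0 - gate) * feat
    return feat


if __name__ == "__main__":
    initial = np.ones((4, 8, 8))                 # toy (C, H, W) feature map
    refined = iterative_refine(initial, np.tanh)  # np.tanh stands in for a decoder
    print(refined.shape)
```

Because the update is a convex blend, each refined value stays between the previous feature and the decoder output, which keeps the toy loop stable regardless of the number of iterations.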

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Previous frameworks rely either on multi-view inputs or additional modalities (e.g., depth maps, thermal images) to make predictions. In contrast, we propose the first monocular framework that utilizes iterative cross-task fusion to improve both depth and segmentation performance.
  • Figure 2: Overview of our proposed end-to-end framework. (a) Given an RGB input, our model jointly predicts depth and a segmentation mask through encoding, reassembling, and iterative fusion decoding. (b) The encoder uses ViT [c38] to extract vision tokens from four layers. (c) In the reassemble module, the tokens are then transformed into multi-scale feature maps, forming two pyramids for depth and segmentation, respectively. (d) A novel semantic and geometric fusion module is designed in the decoder to better leverage the complementary information of both tasks. (e) The shared-weight decoder is updated iteratively by lightweight gates to gradually refine the initial results. Final predictions are obtained by two heads after the last iteration.
  • Figure 3: Illustration of the semantic and geometric fusion module (SGFM). $F_d$ and $F_s$ represent features of a certain layer of the depth and segmentation pyramid, respectively. The two feature maps are processed along both channel and spatial dimensions to adaptively emphasize semantic and geometric information. They are then cross-multiplied to achieve the fusion.
  • Figure 4: Qualitative comparison of depth and segmentation on the Syn-TODD dataset, where Seg and GT stand for segmentation and ground truth, respectively. SimNet and MVTrans take both RGB images as input, while the other methods take only the first one. Our predictions are visibly better than those of all other methods while using only a single RGB image as input.
  • Figure 5: Ablation studies on the iterative strategy.
  • ...and 1 more figure
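
The cross-multiplication described in the Figure 3 caption can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the SGFM as implemented by the authors: the helper names `attention_map` and `cross_fuse` are hypothetical, and parameter-free average pooling followed by a sigmoid stands in for whatever learned channel and spatial attention layers the module actually uses.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def attention_map(feat):
    """Combined channel and spatial attention for a (C, H, W) feature map.

    ASSUMPTION: pooling + sigmoid is a parameter-free placeholder for the
    learned attention in the paper.
    """
    channel = sigmoid(feat.mean(axis=(1, 2), keepdims=True))  # shape (C, 1, 1)
    spatial = sigmoid(feat.mean(axis=0, keepdims=True))       # shape (1, H, W)
    return channel * spatial                                   # broadcasts to (C, H, W)


def cross_fuse(f_d, f_s):
    """Modulate each task's features with attention computed from the other task."""
    fused_d = f_d * attention_map(f_s)  # depth features weighted by semantic cues
    fused_s = f_s * attention_map(f_d)  # segmentation features weighted by geometric cues
    return fused_d, fused_s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_d = rng.standard_normal((16, 32, 32))  # toy depth-pyramid level
    f_s = rng.standard_normal((16, 32, 32))  # toy segmentation-pyramid level
    fused_d, fused_s = cross_fuse(f_d, f_s)
    print(fused_d.shape, fused_s.shape)
```

The point of the "cross" in this sketch is that neither branch is weighted by its own attention: the depth features are rescaled by a map derived from $F_s$, and the segmentation features by a map derived from $F_d$.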