Power of Boundary and Reflection: Semantic Transparent Object Segmentation using Pyramid Vision Transformer with Transparent Cues

Tuan-Anh Vu, Hai Nguyen-Truong, Ziqiang Zheng, Binh-Son Hua, Qing Guo, Ivor Tsang, Sai-Kit Yeung

TL;DR

This work tackles the challenging problem of segmenting transparent and reflective objects by introducing TransCues, a pyramid Vision Transformer–based encoder–decoder that jointly leverages boundary and reflection cues. The Boundary Feature Enhancement (BFE) and Reflection Feature Enhancement (RFE) modules enrich features to better delineate glass boundaries and distinguish reflections, supervised by a Sobel-based boundary loss and a pseudo-ground-truth reflection loss. Across glass, mirror, and generic segmentation benchmarks, TransCues achieves state-of-the-art or competitive gains, with ablations showing complementary benefits from combining BFE and RFE. Limitations include fixed positional encoding and higher computational cost, with future work aiming at multi-modality extensions and real-time deployment in dynamic scenes. Overall, the approach demonstrates robust and scalable segmentation of transparent and reflective objects using a fully Transformer-based framework.
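As a rough illustration of how a Sobel operator can turn a segmentation mask into a boundary target, the sketch below marks every pixel where the mask's Sobel gradient is non-zero. The function name `sobel_boundary`, the replicate padding, and the non-zero threshold are assumptions made for this example; the summary above does not specify how the boundary loss is actually computed.

```python
from typing import List

Mask = List[List[int]]

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_boundary(mask: Mask) -> Mask:
    """Return a binary map that is 1 where the mask's Sobel gradient is non-zero.

    Hypothetical helper: replicate padding and the '> 0' threshold are
    illustrative choices, not taken from the paper.
    """
    h, w = len(mask), len(mask[0])

    def px(r: int, c: int) -> int:
        # Replicate padding: clamp out-of-range coordinates to the border.
        r = min(max(r, 0), h - 1)
        c = min(max(c, 0), w - 1)
        return mask[r][c]

    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = sum(SOBEL_X[i][j] * px(r + i - 1, c + j - 1)
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * px(r + i - 1, c + j - 1)
                     for i in range(3) for j in range(3))
            out[r][c] = 1 if gx * gx + gy * gy > 0 else 0
    return out


if __name__ == "__main__":
    # Left half background (0), right half object (1).
    half = [[0, 0, 1, 1] for _ in range(4)]
    for row in sobel_boundary(half):
        print(row)
```

In a training setup, a map like this would typically be compared against a predicted boundary map with a pixel-wise loss; which loss the authors use is not stated in this summary.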

Abstract

Glass is a prevalent material among solid objects in everyday life, yet segmentation methods struggle to distinguish it from opaque materials due to its transparency and reflection. While it is known that human perception relies on boundary and reflective-object features to distinguish glass objects, the existing literature has not yet sufficiently captured both properties when handling transparent objects. Hence, we propose incorporating both of these powerful visual cues via the Boundary Feature Enhancement and Reflection Feature Enhancement modules in a mutually beneficial way. Our proposed framework, TransCues, is a pyramidal transformer encoder-decoder architecture for segmenting transparent objects. We empirically show that these two modules can be used together effectively, improving overall performance across various benchmark datasets, including glass object semantic segmentation, mirror object semantic segmentation, and generic segmentation datasets. Our method outperforms the state-of-the-art by a large margin, achieving +4.2% mIoU on Trans10K-v2, +5.6% mIoU on MSD, +10.1% mIoU on RGBD-Mirror, +13.1% mIoU on TROSD, and +8.3% mIoU on Stanford2D3D, demonstrating its effectiveness on glass objects.
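For readers unfamiliar with the metric behind these numbers, mean Intersection-over-Union (mIoU) averages the per-class ratio of overlapping pixels to the union of predicted and ground-truth pixels. The following is a minimal sketch over flattened label lists; the function name `mean_iou` and the convention of skipping classes absent from both prediction and ground truth are assumptions for illustration, and individual benchmarks may define the average differently.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth.

    Hypothetical helper: skipping classes with an empty union is one common
    convention, assumed here rather than taken from the paper.
    """
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1]
    target = [0, 1, 1, 1]
    # Class 0: 1 overlapping pixel out of a union of 2.
    # Class 1: 2 overlapping pixels out of a union of 3.
    print(mean_iou(pred, target, num_classes=2))
```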

Paper Structure

This paper contains 32 sections, 11 equations, 17 figures, and 19 tables.

Figures (17)

  • Figure 1: Our method achieves competitive performance compared to previous methods across glass, mirror, and generic segmentation tasks. To maintain fairness, we only compare with methods that use the same input (only RGB image).
  • Figure 2: Overview of our TransCues method. An RGB image is processed by four FEM modules in the encoder for multi-scale feature extraction. These features are then refined by the decoder's FPM, BFE, and RFE modules, and ultimately converted into semantic labels via an MLP. Our main contributions, the BFE and RFE modules, are elaborated in Sections \ref{sec:BFE} and \ref{sec:RFE}.
  • Figure 3: Visualization of feature maps of our method. Zoom in for better visualization.
  • Figure 4: Comparison of glass segmentation methods on Trans10K-v2 (left), RGB-P (top-right), and GSD-S (bottom-right) datasets.
  • Figure 5: Qualitative comparison of our method with other methods on MSD, PMD, and RGBD-Mirror datasets.
  • ...and 12 more figures