GEM: Boost Simple Network for Glass Surface Segmentation via Vision Foundation Models

Jing Hao; Moyun Liu; Jinrong Yang; Kuo Feng Hung

TL;DR

Glass surface segmentation is hampered by transparency and reflections, and existing methods rely on limited data and complex architectures. The authors harness vision foundation models (SAM and Stable Diffusion with ControlNet) to generate a large synthetic dataset (S-GSD) and to build GEM, a lightweight SAM-based segmentor with a discerning query mechanism. S-GSD contains 168k image-mask pairs across four scales and yields strong zero-shot and transfer learning performance, with GEM achieving state-of-the-art IoU on GSD-S and benefiting from pretraining on S-GSD. The approach reduces data annotation costs and demonstrates robust glass segmentation on RGB imagery, potentially guiding perception systems in real-world applications.

Abstract

Detecting glass regions is a challenging task due to the inherent ambiguity of their transparency and reflective characteristics. Current solutions in this field remain rooted in conventional deep learning paradigms, which require constructing annotated datasets and designing network architectures. The evident drawback of these mainstream solutions is the time-consuming and labor-intensive process of curating datasets, alongside the increasing complexity of model structures. In this paper, we propose to address these issues by fully harnessing the capabilities of two existing vision foundation models (VFMs): Stable Diffusion and Segment Anything Model (SAM). First, we construct a Synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, without any labor cost via Stable Diffusion. The dataset spans four different scales and contains 168k images in total, each with a precise mask. In addition, building on the powerful segmentation ability of SAM, we devise a simple Glass surface sEgMentor named GEM, which follows a simple query-based encoder-decoder architecture. Comprehensive experiments are conducted on the large-scale glass segmentation dataset GSD-S. With the help of these two VFMs, our GEM establishes a new state of the art, surpassing the best-reported method, GlassSemNet, with an IoU improvement of 2.1%. Additionally, extensive experiments demonstrate that our synthetic dataset S-GSD performs remarkably well in zero-shot and transfer learning settings. Code, datasets and models are publicly available at: https://github.com/isbrycee/GEM
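To make the reported metric concrete, the following is a minimal sketch of how IoU, and the balanced error rate (BER) that the paper reports alongside it in Figure 1, are commonly computed for binary glass masks. It is not taken from the authors' evaluation code: the function names, the nested-list mask representation, and the handling of empty classes are assumptions made for illustration, and the BER formula is the one conventionally used in glass and mirror detection work.

```python
def confusion_counts(pred, gt):
    """Count TP, TN, FP, FN for two equally sized binary masks (nested lists of 0/1)."""
    tp = tn = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif not p and not g:
                tn += 1
            elif p and not g:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def iou(pred, gt):
    """Intersection over Union of the glass (foreground) class."""
    tp, _, fp, fn = confusion_counts(pred, gt)
    union = tp + fp + fn
    # Assumption: two empty masks count as a perfect match.
    return tp / union if union else 1.0


def ber(pred, gt):
    """Balanced error rate in percent: 100 * (1 - (TPR + TNR) / 2)."""
    tp, tn, fp, fn = confusion_counts(pred, gt)
    pos, neg = tp + fn, tn + fp
    # Assumption: a class that is absent from the ground truth contributes no error.
    tpr = tp / pos if pos else 1.0
    tnr = tn / neg if neg else 1.0
    return 100.0 * (1.0 - 0.5 * (tpr + tnr))


# Toy 2x4 example: 3 true positives, 1 false positive, 1 false negative.
gt = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 0, 0, 0]]
print(iou(pred, gt))  # 0.6
print(ber(pred, gt))  # 25.0
```

Higher IoU is better and lower BER is better, which is why Figure 1 plots 1/BER so that both axes grow with quality.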

Paper Structure (20 sections, 2 equations, 8 figures, 8 tables)

Introduction
Related works
Glass Surface Segmentation
Vision Foundation Models
The Applications of Vision Foundation Models
Proposed Synthetic Dataset
Data Construction
Data Analysis
Proposed Method
Model architecture
Discerning Query Selection
Loss Function
Experiments and Discussion
Experimental Setup
Comparisons with the State-of-the-Arts
...and 5 more sections

Figures (8)

Figure 1: Comparison of IoU (Intersection over Union) and 1/BER (BER: balanced error rate) on the GSD-S validation set across different pre-training datasets and the state of the art (SOTA). Pre-training on our S-GSD dataset yields larger improvements than pre-training on the two real glass surface datasets, GDD and GSD. GEM-T and GEM-B denote GEM-Tiny and GEM-Base, respectively. The SOTA result is from GlassSemNet (Lin et al., 2022). IoU measures localization accuracy, and BER measures the rate of classification errors.
Figure 2: The pipeline of glass surface segmentation with the help of VFMs. First, we use ControlNet with Stable Diffusion to generate a large number of high-quality images, using the mask priors from the real dataset as control conditions. This gives us large-scale and diverse image-mask pairs. Finally, we train the proposed GEM model on the synthetic dataset and evaluate it in zero-shot and transfer learning settings.
Figure 3: Visual examples of synthetic data. The first and second columns show the real data and the corresponding mask, respectively. The remaining columns show the synthetic data. We draw the edge of the mask in red on each synthesized image to demonstrate the precise alignment between the glass region and the mask.
Figure 4: The architecture of the proposed GEM. The discerning query selection module predicts the foreground, and the corresponding features are used to initialize the decoder’s queries.
Figure 5: Illustration of the discerning query selection module. (a) MaskDINO's query selection method chooses top-k queries based on separate intermediate features, potentially resulting in redundant information when different queries are located in the same receptive field. (b) Our approach involves merging different-level features before selecting queries on the integrated feature, minimizing redundancy. (c) Detailed structure of our discerning query selection module.
...and 3 more figures
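The idea in Figure 5(b), merging multi-level features first and then picking the top-k locations on the merged map, can be illustrated with a small sketch. This is a loose illustration only, not the module shown in Figure 5(c): the merge operation (nearest-neighbour upsampling followed by summation), the linear foreground scoring head with fixed `weights`, and all function names are assumptions introduced here.

```python
def upsample_nearest(level, out_h, out_w):
    """Nearest-neighbour resize of a grid of feature vectors to out_h x out_w."""
    in_h, in_w = len(level), len(level[0])
    return [[level[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def merge_levels(levels):
    """Merge all levels at the resolution of the first (finest) level.

    Assumption: levels are merged by element-wise summation.
    """
    out_h, out_w = len(levels[0]), len(levels[0][0])
    resized = [upsample_nearest(level, out_h, out_w) for level in levels]
    return [[[sum(channel) for channel in zip(*(r[y][x] for r in resized))]
             for x in range(out_w)]
            for y in range(out_h)]


def select_queries(merged, weights, k):
    """Score every location with a hypothetical linear foreground head, keep the top k.

    Returns a list of ((y, x), feature) pairs; the features would serve as
    the initial decoder queries.
    """
    scored = []
    for y, row in enumerate(merged):
        for x, feat in enumerate(row):
            score = sum(w * f for w, f in zip(weights, feat))
            scored.append((score, (y, x), feat))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(pos, feat) for _, pos, feat in scored[:k]]


# Two toy feature levels with 2 channels: a 2x2 fine map and a 1x1 coarse map.
fine = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
coarse = [[[0.5, 0.5]]]

merged = merge_levels([fine, coarse])
queries = select_queries(merged, weights=[1.0, 1.0], k=2)
print([pos for pos, _ in queries])  # [(1, 1), (0, 0)]
```

Because selection happens once on the merged map, each spatial location can be chosen at most once. Selecting top-k separately per level, as in Figure 5(a), can instead return several queries that cover the same receptive field.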
