GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis

Jing Hao; Moyun Liu; Kuo Feng Hung

TL;DR

This work addresses the challenging task of segmenting glass surfaces, whose transparency and reflections yield ambiguous boundaries. It proposes GEM, a lightweight, SAM-based segmentation framework with a discerning query selection module, coupled with S-GSD, a large-scale synthetic glass dataset generated via ControlNet and Stable Diffusion for transfer learning. Empirical results show GEM achieving state-of-the-art performance on GSD-S (IoU improvements up to +2.1%) and benefiting from synthetic pretraining, with further gains in zero-shot and finetuning when using S-GSD (e.g., IoU improvements of 0.026 and 0.018 for GEM-Tiny and GEM-Base). The study demonstrates the potential of combining visual foundation models with synthetic data for specialized perception tasks, while also revealing data-scale saturation effects and signaling directions for future AIGC-assisted segmentation research.

Abstract

Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. Transparent glass surfaces take on the visual appearance of both the arbitrary background scenes transmitted through them and the objects they reflect, and thus have no fixed patterns. Recent visual foundation models, which are trained on vast amounts of data, have shown strong performance in image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion. Specifically, we devise a simple glass surface segmentor named GEM, which consists only of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection module adaptively identifies glass surface features and assigns them as the initial queries of the mask decoder. We also propose S-GSD, a synthetic but photorealistic large-scale Glass Surface Detection dataset generated with a diffusion model at four scales: 1x, 5x, 10x, and 20x the size of the original real data. This dataset is a feasible source for transfer learning. The scale of the synthetic data has a positive impact on transfer learning, although the improvement gradually saturates as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state of the art on the GSD-S validation set (IoU +2.1%). Code and datasets are available at: https://github.com/isbrycee/GEM-Glass-Segmentor.
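To make the query-selection idea concrete, here is a minimal sketch in plain Python. It assumes that each spatial location of the feature map has a feature vector and a predicted foreground (glass) score, and that the highest-scoring locations supply the initial decoder queries. The top-k rule, the function name `select_queries`, and the toy numbers are illustrative assumptions; the abstract only states that glass surface features are identified adaptively and used to initialize the queries.

```python
from typing import List, Sequence, Tuple


def select_queries(
    features: Sequence[Sequence[float]],
    foreground_scores: Sequence[float],
    num_queries: int,
) -> Tuple[List[int], List[List[float]]]:
    """Pick the feature vectors with the highest foreground scores.

    Hypothetical helper: top-k selection is an assumption made for
    illustration, not necessarily the rule used by GEM.
    """
    if len(features) != len(foreground_scores):
        raise ValueError("one foreground score is required per feature vector")
    k = min(num_queries, len(features))
    # Rank locations by predicted foreground score, highest first.
    ranked = sorted(
        range(len(foreground_scores)),
        key=lambda i: foreground_scores[i],
        reverse=True,
    )
    top = ranked[:k]
    # The selected features become the initial queries of the mask decoder.
    return top, [list(features[i]) for i in top]


# Toy example: four flattened locations with 2-dimensional features.
toy_features = [[0.1, 0.2], [0.9, 0.8], [0.5, 0.5], [0.7, 0.1]]
toy_scores = [0.05, 0.92, 0.40, 0.77]
indices, queries = select_queries(toy_features, toy_scores, num_queries=2)
```

In this toy case the two locations with scores 0.92 and 0.77 are selected, so the decoder would start from features that already look like glass rather than from content-agnostic learned embeddings.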

Paper Structure (22 sections, 2 equations, 9 figures, 7 tables)


Introduction
Related works
Glass Surface Detection
Image Generation Foundation Model
Synthetic Data for Image Recognition
Method
Model architecture
Discerning Query Selection
Pre-trained Dataset Generation
Experiments
Experimental Setup
Comparisons with the State-of-the-Arts
Effectiveness of auto-generated S-GSD
The impact of synthetic data scale
Ablation Study
...and 7 more sections

Figures (9)

Figure 1: The upper part depicts the paradigms of glass surface segmentation with the assistance of two foundation models. The bottom part compares IoU (Intersection over Union) and 1/BER (the inverse of the balanced error rate) on the GSD-S validation set across different pre-training datasets and the state of the art (SOTA). Our S-GSD dataset yields larger improvements than two real glass surface datasets, GDD and GSD. GEM-T and GEM-Base denote the GEM-Tiny and GEM-Base models, respectively. The SOTA result is from GlassSegNet (Lin et al., 2022).
Figure 1: Visualization of feature distribution on our synthetic data and three real datasets utilizing the t-SNE algorithm. The image features are extracted using the CLIP image encoder.
Figure 2: The architecture of our proposed GEM. It employs a generic encoder-decoder structure, which consists of an image encoder, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection module predicts the foreground, and the corresponding features are used to initialize the decoder's queries.
Figure 2: Visual examples of synthetic data. The first and second columns represent the real data and their corresponding masks, respectively. The subsequent columns showcase the synthetic data.
Figure 3: Illustration of the discerning query selection module.
...and 4 more figures
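The two metrics named in the Figure 1 caption, IoU and BER, can be computed from the pixel-level confusion counts of a binary mask. The sketch below uses the definitions commonly applied in glass and mirror segmentation, IoU = TP / (TP + FP + FN) and BER = 1 - 0.5 * (TP / (TP + FN) + TN / (TN + FP)). The function names and the toy masks are illustrative, and BER is returned as a fraction here, whereas papers often report it multiplied by 100.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask, 1 = glass, 0 = background


def confusion_counts(pred: Mask, gt: Mask) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) over all pixels of two equally sized masks."""
    tp = fp = fn = tn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over Union of the glass class."""
    tp, fp, fn, _ = confusion_counts(pred, gt)
    union = tp + fp + fn
    # Convention chosen for this sketch: two empty masks agree perfectly.
    return 1.0 if union == 0 else tp / union


def ber(pred: Mask, gt: Mask) -> float:
    """Balanced error rate as a fraction in [0, 1]; lower is better."""
    tp, fp, fn, tn = confusion_counts(pred, gt)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("ground truth must contain both glass and background")
    return 1.0 - 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy 2x4 masks: TP=2, FP=1, FN=1, TN=4.
toy_pred = [[1, 1, 0, 0], [1, 0, 0, 0]]
toy_gt = [[1, 0, 0, 0], [1, 1, 0, 0]]
toy_iou = iou(toy_pred, toy_gt)
toy_ber = ber(toy_pred, toy_gt)
```

Because lower BER is better, plotting 1/BER as in Figure 1 lets both metrics be read in the same direction: higher means better.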
