GEM: Boost Simple Network for Glass Surface Segmentation via Segment Anything Model and Data Synthesis

Jing Hao, Moyun Liu, Kuo Feng Hung

TL;DR

This work addresses the challenging task of segmenting glass surfaces, whose transparency and reflections yield ambiguous boundaries. It proposes GEM, a lightweight, SAM-based segmentation framework with a discerning query selection module, coupled with S-GSD, a large-scale synthetic glass dataset generated via ControlNet and Stable Diffusion for transfer learning. Empirical results show GEM achieving state-of-the-art performance on GSD-S (IoU improvements up to +2.1%) and benefiting from synthetic pretraining, with further gains in zero-shot and finetuning when using S-GSD (e.g., IoU improvements of 0.026 and 0.018 for GEM-Tiny and GEM-Base). The study demonstrates the potential of combining visual foundation models with synthetic data for specialized perception tasks, while also revealing data-scale saturation effects and signaling directions for future AIGC-assisted segmentation research.

Abstract

Detecting glass regions is a challenging task due to the ambiguity of their transparency and reflection properties. Transparent glass surfaces take on the visual appearance of both the arbitrary background scenes they transmit and the objects they reflect, and thus have no fixed patterns. Recent visual foundation models, which are trained on vast amounts of data, have shown strong performance in image perception and image generation. To segment glass surfaces with higher accuracy, we make full use of two visual foundation models: Segment Anything (SAM) and Stable Diffusion. Specifically, we devise a simple glass surface segmentor named GEM, which consists only of a SAM backbone, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection module adaptively identifies glass surface features and assigns them as the initial queries of the mask decoder. We also propose a synthetic but photorealistic large-scale Glass Surface Detection dataset, dubbed S-GSD, generated with a diffusion model at four different scales: 1x, 5x, 10x, and 20x the size of the original real data. This dataset is a feasible source for transfer learning. The scale of the synthetic data has a positive impact on transfer learning, although the improvement gradually saturates as the amount of data increases. Extensive experiments demonstrate that GEM achieves a new state of the art on the GSD-S validation set (IoU +2.1%). Code and datasets are available at: https://github.com/isbrycee/GEM-Glass-Segmentor.
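The abstract describes the discerning query selection only at a high level: glass-like features are identified and reused as the decoder's initial queries. A minimal sketch of that idea follows, under loud assumptions: the linear scoring head with a sigmoid, the top-k rule, and the names `select_queries`, `weight`, `bias`, and `num_queries` are all hypothetical and are not taken from the paper or its repository. A real model would do this with tensor operations on feature maps rather than Python lists.

```python
import math
from typing import List, Sequence, Tuple


def select_queries(
    features: Sequence[Sequence[float]],
    weight: Sequence[float],
    bias: float,
    num_queries: int,
) -> Tuple[List[int], List[List[float]]]:
    """Return the indices and features of the highest-scoring tokens.

    Assumption: each token gets a foreground ("glass") probability from a
    hypothetical linear head followed by a sigmoid.
    """
    scores = []
    for token in features:
        logit = sum(w * x for w, x in zip(weight, token)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))

    # Rank token indices by descending score; sorted() is stable, so ties
    # keep their original order.
    ranked = sorted(range(len(features)), key=lambda i: -scores[i])
    top = ranked[:num_queries]

    # The selected token features become the decoder's initial queries.
    return top, [list(features[i]) for i in top]


# Toy example: four 2-dimensional tokens, keep the two most glass-like.
features = [[0.1, 0.2], [2.0, 1.5], [-1.0, 0.3], [1.2, 0.9]]
indices, queries = select_queries(
    features, weight=[1.0, 1.0], bias=0.0, num_queries=2
)
```

In this toy run the tokens at positions 1 and 3 have the largest logits (3.5 and 2.1), so their features are the ones passed on as queries.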

Paper Structure (22 sections, 2 equations, 9 figures, 7 tables)

This paper contains 22 sections, 2 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The upper part depicts the paradigms of glass surface segmentation assisted by two foundation models. The lower part compares IoU (Intersection over Union) and 1/BER (the inverse of the balanced error rate) on the GSD-S validation set for different pre-training datasets and the state of the art (SOTA). Our S-GSD dataset yields larger improvements than the two real glass surface datasets, GDD and GSD. GEM-T and GEM-Base denote GEM-Tiny and GEM-Base, respectively. The SOTA result is from GlassSegNet [lin2022exploiting].
  • Figure 1: Visualization of feature distribution on our synthetic data and three real datasets utilizing the t-SNE algorithm. The image features are extracted using the CLIP image encoder.
  • Figure 2: The architecture of our proposed GEM. It employs a generic encoder-decoder structure consisting of an image encoder, a simple feature pyramid, a discerning query selection module, and a mask decoder. The discerning query selection module predicts the foreground, and the corresponding features are used to initialize the decoder's queries.
  • Figure 2: Visual examples of synthetic data. The first and second columns represent the real data and their corresponding masks, respectively. The subsequent columns showcase the synthetic data.
  • Figure 3: Illustration of the discerning query selection module.
  • ...and 4 more figures
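The first figure caption compares models by IoU and 1/BER. As a reference for how these two metrics are commonly computed for binary masks, here is a minimal sketch using the standard definitions: IoU = TP / (TP + FP + FN) and BER = 1 - (TPR + TNR) / 2. The function name `iou_and_ber` and the conventions for empty masks are assumptions for illustration; the paper may report these values as percentages or average them differently across images.

```python
from typing import Sequence, Tuple


def iou_and_ber(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
) -> Tuple[float, float]:
    """Compute IoU and balanced error rate for two binary masks (1 = glass)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1

    union = tp + fp + fn
    # Assumed convention: a ratio with an empty denominator counts as perfect.
    iou = tp / union if union else 1.0
    tpr = tp / (tp + fn) if (tp + fn) else 1.0  # recall on glass pixels
    tnr = tn / (tn + fp) if (tn + fp) else 1.0  # recall on non-glass pixels
    ber = 1.0 - 0.5 * (tpr + tnr)
    return iou, ber


# Toy 2x4 masks: 2 true positives, 1 false positive, 1 false negative.
pred = [[1, 1, 0, 0], [1, 0, 0, 0]]
target = [[1, 1, 1, 0], [0, 0, 0, 0]]
iou, ber = iou_and_ber(pred, target)
```

Higher IoU is better and lower BER is better, which is why the figure plots 1/BER: both plotted quantities then increase as the model improves.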