Glass Segmentation with Fusion of Learned and General Visual Features

Risto Ojala; Tristan Ellison; Mo Chen

TL;DR

This paper presents a novel glass segmentation architecture built on a dual backbone that produces general visual features alongside task-specific learned visual features, achieving state-of-the-art results on several accuracy metrics.

Abstract

Glass surface segmentation from RGB images is a challenging task, since glass, as a transparent material, lacks distinctive visual characteristics. Glass segmentation is nonetheless critical for scene understanding and robotics, where transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation, built on a dual backbone that produces general visual features alongside task-specific learned visual features. The general visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated by a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction and fed into a Mask2Former decoder, which produces the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also has competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: https://github.com/ojalar/lgnet
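To make the channel-reduction step more concrete, the following is a minimal, dependency-free sketch of a squeeze-and-excitation gate followed by a 1x1 channel projection. It is an illustration under assumptions, not the authors' implementation: the abstract does not specify the block layout, so the placement of the residual connection (input plus gated input), the two-layer gate, and all names (`squeeze`, `excite`, `se_channel_reduction`, `proj`) are hypothetical. The released code at the URL above is the authoritative reference.

```python
import math


def squeeze(features):
    """Global average pool: one scalar per channel. features[c][y][x]."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]


def linear(vec, weights, bias):
    """Dense layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def excite(pooled, w1, b1, w2, b2):
    """Two-layer gate (ReLU, then sigmoid) giving one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    return [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2, b2)]


def se_channel_reduction(features, w1, b1, w2, b2, proj):
    """Gate channels, add a residual, then mix channels per pixel (1x1 conv).

    ASSUMPTION: the residual is applied as x + x * gate before projection.
    proj has shape [out_channels][in_channels].
    """
    gates = excite(squeeze(features), w1, b1, w2, b2)
    recalibrated = [
        [[v * (1.0 + g) for v in row] for row in channel]
        for channel, g in zip(features, gates)
    ]
    n_in = len(recalibrated)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [sum(p_row[c] * recalibrated[c][y][x] for c in range(n_in)) for x in range(width)]
            for y in range(height)
        ]
        for p_row in proj
    ]
```

In this sketch the spatial size is unchanged and only the number of channels shrinks, from `len(features)` to `len(proj)`. A practical implementation would use a tensor library and learned weights rather than nested lists.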

Paper Structure (19 sections, 1 equation, 6 figures, 4 tables)

This paper contains 19 sections, 1 equation, 6 figures, 4 tables.

Introduction
Related Work
Glass Segmentation
Recent Advancements in Computer Vision
Research Gap
Methods
Model Architecture
Learned Features Backbone (A)
General Features Backbone (B)
Squeeze-and-Excitation Channel Reduction (C)
Segmentation Decoder (D)
Experiments
Datasets
Implementation details
Accuracy Benchmark
...and 4 more sections

Figures (6)

Figure 1: High-level overview of the proposed L+GNet architecture. A dual-backbone is constructed, which utilizes a Learned Features Backbone for generating task-specific features, and a frozen General Features Backbone for context from a vision foundation model.
Figure 2: The proposed L+GNet architecture. The bottom right corner shows the detailed contents of an SE Channel Reduction block. Labels (A)-(D) refer to the corresponding subsections, which contain detailed descriptions of the specific blocks.
Figure 3: Samples of segmentation masks produced by L+GNet on the different testing sets. Model trained with the combined training sets. True positives overlaid in green, false positives overlaid in red, and false negatives overlaid in blue.
Figure 4: Visualizations of failure cases encountered with the L+GNet model. Model trained with the combined training data. True positives overlaid in green, false positives overlaid in red, and false negatives overlaid in blue.
Figure 5: Visualizations of L+GNet performance for images that the previous state of the art fails to correctly segment. L+GNet model trained with combined training data. GlassWizard results provided by the respective authors (Li et al., 2025). True positives overlaid in green, false positives overlaid in red, and false negatives overlaid in blue.
...and 1 more figure
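The color coding used in the Figure 3 to 5 captions (true positives in green, false positives in red, false negatives in blue) can be sketched as follows. This is an illustrative sketch only: the function names, the list-of-rows image representation, and the `alpha` blending parameter are assumptions made here, not details taken from the paper.

```python
# Overlay colors as described in the figure captions (RGB).
COLORS = {"tp": (0, 255, 0), "fp": (255, 0, 0), "fn": (0, 0, 255)}


def classify_pixels(pred, truth):
    """Label each pixel of two binary masks (lists of rows) as tp, fp, fn or tn."""
    names = {(1, 1): "tp", (1, 0): "fp", (0, 1): "fn", (0, 0): "tn"}
    return [
        [names[(int(bool(p)), int(bool(t)))] for p, t in zip(pred_row, truth_row)]
        for pred_row, truth_row in zip(pred, truth)
    ]


def overlay(image, pred, truth, alpha=0.5):
    """Blend the caption colors onto an RGB image given as rows of (r, g, b) tuples.

    alpha is a hypothetical blending weight; true negatives are left untouched.
    """
    labels = classify_pixels(pred, truth)
    result = []
    for image_row, label_row in zip(image, labels):
        out_row = []
        for pixel, label in zip(image_row, label_row):
            if label == "tn":
                out_row.append(tuple(pixel))
            else:
                color = COLORS[label]
                out_row.append(
                    tuple(int(round((1 - alpha) * p + alpha * c)) for p, c in zip(pixel, color))
                )
        result.append(out_row)
    return result
```

For example, `overlay(image, pred, truth, alpha=1.0)` paints the error classes in solid colors, which can be easier to inspect on small masks.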
