Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning

Annalena Blänsdorf, Tristan Wirth, Arne Rak, Thomas Pöllabauer, Volker Knauthe, Arjan Kuijper

TL;DR

This work tackles the challenge of semantically segmenting transparent drinking glasses, a task difficult due to background blending and varying viewpoints. It introduces TransCaGNet, a zero-shot segmentation model that replaces CaGNet's backbone with Trans4Trans to better handle transparency, and leverages semantic embeddings to enable unseen-class segmentation. A novel synthetic dataset of 7,000 images and a real-world 225-image evaluation set, along with a merging strategy and integration of SAM 2, constitute the key contributions, yielding up to 13.68% improvement in mean IoU and 17.88% in mean accuracy on synthetic data, and improved real-world transfer (IoU up to 5.55%, accuracy up to 5.72%). The approach demonstrates practical impact for robotic perception and accessibility tooling by enabling more robust segmentation of transparent objects under diverse conditions.

Abstract

Segmenting transparent structures in images is challenging because they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work, we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We replace its segmentation backbone with the architecture of Trans4Trans so that it can segment transparent objects. Since some types of glasses are rarely captured, we use zero-shot learning to create semantic segmentations of glass categories not seen during training. We propose a novel synthetic dataset covering a diverse set of environmental conditions. Additionally, we capture a real-world evaluation dataset, since most applications take place in the real world. Comparing our model with ZegClip, we show that TransCaGNet produces better mean IoU and accuracy values, while ZegClip outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation shows that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. This modification leads to an improvement of up to 13.68% in mean IoU and up to 17.88% in mean accuracy on the synthetic dataset. Trained on our difficult synthetic dataset, the models produce even better results on the real-world dataset, where mean IoU improves by up to 5.55% and mean accuracy by up to 5.72%.
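The two metrics quoted above can be made concrete with a small sketch. It assumes the usual definitions (per-class IoU and per-class pixel accuracy, averaged over the classes present in the ground truth); the source does not spell out the exact formulas used. The helper `relax_predictions` and its `allowed` mapping are hypothetical illustrations of what "assigning glasses multiple possible categories" could look like at evaluation time, not the authors' confirmed procedure.

```python
import numpy as np


def confusion_matrix(gt, pred, num_classes):
    """Count pixels for every (ground-truth class, predicted class) pair."""
    gt = np.asarray(gt).ravel()
    pred = np.asarray(pred).ravel()
    flat = gt * num_classes + pred
    counts = np.bincount(flat, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def mean_iou_and_accuracy(cm):
    """Return (mean IoU, mean per-class accuracy) from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    gt_total = cm.sum(axis=1)      # pixels that truly belong to each class
    pred_total = cm.sum(axis=0)    # pixels predicted as each class
    union = gt_total + pred_total - tp
    present = gt_total > 0         # ignore classes absent from the ground truth
    iou = tp[present] / union[present]
    acc = tp[present] / gt_total[present]
    return iou.mean(), acc.mean()


def relax_predictions(gt, pred, allowed):
    """Hypothetical relaxation: a prediction counts as correct when it falls
    into the set of acceptable categories for the ground-truth class."""
    gt = np.asarray(gt)
    relaxed = np.asarray(pred).copy()
    for gt_cls, acceptable in allowed.items():
        hit = (gt == gt_cls) & np.isin(relaxed, list(acceptable))
        relaxed[hit] = gt_cls
    return relaxed


# Toy example: class 0 is background, classes 1 and 2 are two similar glass types.
gt = np.array([[0, 1, 1], [0, 2, 2]])
pred = np.array([[0, 1, 2], [0, 2, 1]])

strict = mean_iou_and_accuracy(confusion_matrix(gt, pred, 3))
allowed = {1: {1, 2}, 2: {1, 2}}  # assumed: the two glass types are interchangeable
relaxed = mean_iou_and_accuracy(
    confusion_matrix(gt, relax_predictions(gt, pred, allowed), 3)
)
print("strict :", strict)
print("relaxed:", relaxed)
```

In this toy case the confusions between the two glass classes lower both metrics under strict scoring and disappear once the two categories are treated as interchangeable, which is the kind of effect the reported improvements describe.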

Paper Structure

This paper contains 18 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Example images and annotations from the different datasets. Two of the training scenes are illuminated with 'SUN' light, and a third training example uses 'SPOT' light. The validation example uses 'POINT' light, and the test example uses 'AREA' light. The images contain at least one of the 3D models for each category. The test scene includes examples of the glass shader, colored glass shader, frosted glass shader, and opaque shader.
  • Figure 2: Architecture of TransCaGNet, based on the original architecture of CaGNet [gu_context-aware_2020, gu_pixel_2022]. We replace the segmentation backbone of CaGNet with Trans4Trans [zhang_trans4trans_2021, zhang_trans4trans_2021-1]. Red arrows illustrate the training procedure of TransCaGNet with the seen classes. Green arrows represent the fine-tuning process, which includes the unseen classes.
  • Figure 3: From left to right: image, ground-truth annotation, and segmentation results of TransCaGNet and ZegClip using self-training with goblet as the unseen class and all modifications applied. The top row shows an example from the test dataset and the bottom row one from the real-world dataset.
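
The abstract states that the semantic predictions are combined with the segmentation results of SAM 2, but this page does not describe how. One plausible way to do it, shown here purely as an assumption, is a majority vote: every class-agnostic mask takes the semantic label that occurs most often inside it. The function `refine_with_masks` is a hypothetical name, and the boolean masks stand in for whatever SAM 2 would return; no SAM 2 API is called.

```python
import numpy as np


def refine_with_masks(semantic, masks, num_classes):
    """Assign each class-agnostic mask the most frequent semantic label inside it.

    semantic: (H, W) integer label map from the semantic segmentation model.
    masks:    iterable of (H, W) boolean arrays (assumed to come from SAM 2).
    """
    semantic = np.asarray(semantic)
    refined = semantic.copy()
    for mask in masks:
        mask = np.asarray(mask, dtype=bool)
        if not mask.any():
            continue  # an empty mask has no pixels to vote
        votes = np.bincount(semantic[mask], minlength=num_classes)
        refined[mask] = votes.argmax()
    return refined


# Toy example: one object mask in which a single pixel was mislabeled as class 2.
semantic = np.array([[1, 1, 2], [0, 1, 0]])
object_mask = np.array([[True, True, True], [False, True, False]])
refined = refine_with_masks(semantic, [object_mask], num_classes=3)
print(refined)
```

A vote like this gives each object one consistent label and sharper boundaries, at the cost of propagating the wrong class whenever the semantic model is mostly wrong inside a mask.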