Auto-Vocabulary Semantic Segmentation

Osman Ülger; Maksymilian Kulicki; Yuki Asano; Martin R. Oswald

TL;DR

Open-ended semantic segmentation has been limited by the need for a human to specify the vocabulary in advance. The authors propose AutoSeg, a framework for Auto-Vocabulary Semantic Segmentation (AVS) that generates an image-specific target vocabulary through BLIP-based local region captioning (via BBoost) and then uses the generated nouns as self-guidance to produce high-resolution masks. For evaluation, they introduce LAVE, a Large Language Model-based Auto-Vocabulary Evaluator that maps the auto-generated classes to the classes of a fixed-vocabulary dataset so that mIoU can be computed. Experiments on PASCAL VOC, Context, ADE20K, and Cityscapes show strong performance in zero-label settings and results competitive with existing Open-Vocabulary Segmentation methods, establishing a new benchmark for auto-generated vocabularies. The approach enables scalable open-ended scene understanding, with practical implications for robotics and real-world perception.

Abstract

Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated classes and their corresponding segments. With AVS, our method sets new benchmarks on the PASCAL VOC, Context, ADE20K, and Cityscapes datasets, while remaining competitive with OVS methods that require specified class names.
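The evaluation idea described above (map each auto-generated class name to a class of the fixed-vocabulary dataset, then score the remapped mask with mIoU) can be illustrated with a small sketch. This is not the authors' implementation: in LAVE the mapping is produced by a Large Language Model, whereas here it is a hard-coded, hypothetical dictionary, and the function names `remap_labels` and `mean_iou`, the tiny 2x3 masks, and the `"background"` fallback for unmapped names are all assumptions made for illustration.

```python
from collections import defaultdict


def remap_labels(pred_mask, auto_to_fixed, fallback="background"):
    """Replace each auto-generated class name with its fixed-vocabulary class.

    Names missing from the mapping become `fallback` (an assumption of this
    sketch, not a documented LAVE behaviour).
    """
    return [[auto_to_fixed.get(name, fallback) for name in row] for row in pred_mask]


def mean_iou(pred_mask, gt_mask, classes):
    """Mean intersection-over-union over the classes present in pred or gt."""
    inter = defaultdict(int)
    union = defaultdict(int)
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p == g:
                # Pixel agrees: counts once for intersection and union.
                inter[g] += 1
                union[g] += 1
            else:
                # Pixel disagrees: it enlarges the union of both classes.
                union[p] += 1
                union[g] += 1
    ious = [inter[c] / union[c] for c in classes if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical mapping; in LAVE this step is delegated to an LLM.
auto_to_fixed = {"dachshund": "dog", "couch": "sofa"}

# Toy per-pixel labels: predictions use auto-generated names,
# the ground truth uses the fixed dataset vocabulary.
pred = [["dachshund", "dachshund", "couch"],
        ["dachshund", "couch", "couch"]]
gt = [["dog", "dog", "sofa"],
      ["dog", "dog", "sofa"]]

remapped = remap_labels(pred, auto_to_fixed)
score = mean_iou(remapped, gt, classes=["dog", "sofa"])
print(f"mIoU after remapping: {score:.3f}")
```

Without the remapping step, the prediction `dachshund` would never match the ground-truth label `dog`, and every such pixel would count as an error even though the prediction is semantically more precise. This is why a fixed ground truth cannot be compared directly with open-ended predictions.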

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 7 tables, 1 algorithm.

Introduction
Related Work
Method
Local Region Captioning
Segmentation through Self-Guidance
Evaluation of Auto-Classes
Experiments
Experimental Setup
Ablations
Quantitative Analysis
Qualitative Analysis
Conclusion
Ablation on Denoising and Caption Filtering
Effects on Performance
Visualization of Embeddings in BBoost
...and 6 more sections

Figures (8)

Figure 1: AutoSeg Exemplary Results. AutoSeg is readily applicable to unseen images for open-ended segmentation of objects such as mascot and hole, as in the two images on the left. Furthermore, where established segmentation datasets have a fixed set of annotation categories, our method is able to identify and segment more semantically precise object categories beyond the fixed-set ground truth, such as dachshund, bed and pagoda. Images are from the Road Anomaly, PASCAL VOC and ADE20K datasets.
Figure 2: Semantic Segmentation Tasks in Comparison. In traditional Semantic Segmentation, an image is segmented into a fixed, predefined set of classes (fixed vocabulary). In Open-Vocabulary Segmentation, the user specifies which object categories (from the open vocabulary) should be segmented: 1) either via a human-provided prompt at runtime, or 2) the OV-method is trained to output the vocabulary of a human-annotated target dataset. In contrast, Auto-Vocabulary Segmentation automatically generates relevant object categories directly from the image. This enables true open-ended scene understanding without needing human input.
Figure 3: Method Overview. BLIP encodings are clustered, aligned and denoised before being decoded into nouns by BBoost. Generated nouns serve as self-guidance to a segmentor, which predicts the final mask. When evaluating (purple), our custom evaluator LAVE processes the output, mapping predicted nouns to the fixed-vocabulary annotations.
Figure 4: Segmentation with VLMs. Example outputs on PASCAL VOC/Context (top), ADE (middle) and Cityscapes (bottom) obtained (left to right) by directly using BBoost embeddings as masks, by feeding plain BLIP embeddings to X-Decoder, and by AutoSeg. Notably, our method segments images in the most comprehensive and semantically accurate manner.
Figure 5: Qualitative Results. AutoSeg shows remarkable capability to identify out-of-vocabulary categories, such as hawk or coke, and segment them accurately across different datasets. Images are from the VOC/PC, ADE and CS datasets.
...and 3 more figures
