Auto-Vocabulary Semantic Segmentation
Osman Ülger, Maksymilian Kulicki, Yuki Asano, Martin R. Oswald
TL;DR
Open-ended semantic segmentation has been limited by the need to specify the target vocabulary in advance. The authors propose AutoSeg, a framework for Auto-Vocabulary Semantic Segmentation (AVS) that autonomously generates an image-specific target vocabulary from BLIP-based local region captioning (via BBoost) and uses self-guidance to produce high-resolution masks. For evaluation, they introduce LAVE, a Large Language Model-based Auto-Vocabulary Evaluator that maps auto-generated classes to the class names of fixed-vocabulary datasets so that mIoU can be computed. Experiments on PASCAL VOC, Context, ADE20K, and Cityscapes show strong performance in zero-label settings and results competitive with existing Open-Vocabulary Segmentation methods, establishing a new benchmark for auto-generated vocabularies. The approach enables scalable open-ended scene understanding, with practical implications for robotics and real-world perception.
Abstract
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the need to predefine object categories for segmentation. Our approach, AutoSeg, is a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and then segments them. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated classes and their corresponding segments. With AVS, our method sets new benchmarks on the PASCAL VOC, Context, ADE20K, and Cityscapes datasets, while remaining competitive with OVS methods that require specified class names.
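To make the evaluation idea concrete, the following is a minimal sketch of how auto-generated class names might be scored against a fixed ground truth once a name mapping is available. It is not the authors' implementation: the function names `remap_to_fixed_vocabulary` and `mean_iou` are hypothetical, the hand-written dictionary `name_to_fixed` merely stands in for the mapping that LAVE obtains from a Large Language Model, and the IoU is computed on a single toy image rather than accumulated over a dataset.

```python
import numpy as np


def remap_to_fixed_vocabulary(pred_mask, auto_names, name_to_fixed,
                              fixed_classes, ignore_index=255):
    """Relabel a mask of auto-vocabulary ids with fixed-vocabulary ids."""
    fixed_id = {name: i for i, name in enumerate(fixed_classes)}
    # Lookup table: auto-vocabulary id -> fixed-vocabulary id.
    # Names without a mapping keep ignore_index and count as misses.
    lut = np.full(len(auto_names), ignore_index, dtype=np.int64)
    for auto_id, name in enumerate(auto_names):
        target = name_to_fixed.get(name)
        if target in fixed_id:
            lut[auto_id] = fixed_id[target]
    return lut[pred_mask]


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes present in pred or gt."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy example (all values are illustrative assumptions).
fixed_classes = ["background", "dog", "sofa"]
auto_names = ["grass", "puppy", "couch"]        # image-specific vocabulary
name_to_fixed = {"grass": "background",         # stand-in for an LLM mapping
                 "puppy": "dog",
                 "couch": "sofa"}

pred_mask = np.array([[0, 1], [2, 2]])          # ids index into auto_names
gt_mask = np.array([[0, 1], [2, 1]])            # ids index into fixed_classes

remapped = remap_to_fixed_vocabulary(pred_mask, auto_names,
                                     name_to_fixed, fixed_classes)
score = mean_iou(remapped, gt_mask, num_classes=len(fixed_classes))
```

In this toy case the per-class IoUs are 1, 1/2, and 1/2, so `score` is 2/3. The key design point is that the predicted vocabulary never has to match the dataset's class list; only the mapping step ties the two together.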
