Vocabulary-free Image Classification and Semantic Segmentation

Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

TL;DR

This work formalizes Vocabulary-free Image Classification and Vocabulary-free Semantic Segmentation, enabling open-ended semantic reasoning without pre-specified vocabularies. It introduces CaSED, a training-free pipeline that retrieves caption-based candidates from external databases using a frozen vision-language model and performs multimodal scoring to select labels, with UpperCaSED adding prompt ensembling for robustness. For segmentation, CaSED is extended via three variants, including DenseCaSED, which builds dense local representations from multi-scale patches and applies CaSED without training. Across diverse benchmarks, CaSED and its variants outperform many open-vocabulary baselines while using far fewer parameters, demonstrating the viability of retrieval-augmented, vocabulary-free semantic understanding in vision tasks. The work highlights practical gains for open-world perception, though it notes biases in retrieval data and opportunities for memory and finer granularity in future work.

Abstract

Large vision-language models have revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other, more complex vision-language models on classification and semantic segmentation benchmarks while using far fewer parameters.
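
The two-step procedure described in the abstract (retrieve the most similar captions, then score the extracted candidates) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 3-dimensional lookup table `TOY_WORD_VECS` stands in for a frozen vision-language model, candidate extraction is reduced to stopword filtering, and the image-to-text and text-to-text scores are combined by an equally weighted average of softmax scores (`alpha=0.5`). None of these choices is taken from the paper's implementation.

```python
import math
import re

STOPWORDS = {"a", "an", "the", "on", "in", "of"}  # toy filter (assumption)

# Hypothetical 3-d word vectors standing in for a frozen VLM text encoder.
TOY_WORD_VECS = {
    "dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0], "car": [0.0, 0.0, 1.0],
    "grass": [0.5, 0.1, 0.4], "sofa": [0.3, 0.6, 0.1], "road": [0.1, 0.0, 0.9],
}


def tokenize(text):
    """Lowercase, split into words, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def embed_text(text):
    """Toy text encoder: normalized sum of the known word vectors."""
    vec = [0.0, 0.0, 0.0]
    for word in tokenize(text):
        for d, x in enumerate(TOY_WORD_VECS.get(word, [0.0, 0.0, 0.0])):
            vec[d] += x
    return normalize(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cased_sketch(image_emb, captions, k=2, alpha=0.5):
    """Retrieve top-k captions, extract candidates, and score them."""
    image_emb = normalize(image_emb)
    caption_embs = [embed_text(c) for c in captions]

    # Step 1: retrieve the k captions most similar to the image embedding.
    order = sorted(range(len(captions)),
                   key=lambda i: dot(image_emb, caption_embs[i]), reverse=True)
    top = order[:k]

    # Step 2: extract candidate categories from the retrieved captions.
    candidates = sorted({w for i in top for w in tokenize(captions[i])})

    # Step 3: the centroid of retrieved captions acts as the textual
    # counterpart of the image.
    dim = len(image_emb)
    centroid = normalize(
        [sum(caption_embs[i][d] for i in top) / len(top) for d in range(dim)])

    # Step 4: combine image-to-text and text-to-text scores.
    cand_embs = [embed_text(c) for c in candidates]
    s_img = softmax([dot(image_emb, e) for e in cand_embs])
    s_txt = softmax([dot(centroid, e) for e in cand_embs])
    scores = [alpha * a + (1 - alpha) * b for a, b in zip(s_img, s_txt)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], dict(zip(candidates, scores))


captions = ["a dog on grass", "a dog on a sofa",
            "a cat on a sofa", "a car on a road"]
label, scores = cased_sketch([0.9, 0.2, 0.1], captions)
print(label, scores)
```

In a realistic setting the lookup table would be replaced by the text and vision encoders of a pre-trained vision-language model, and the caption list by a large external database queried with an approximate nearest-neighbor index.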

Paper Structure (14 sections, 9 equations, 10 figures, 15 tables)


Figures (10)

  • Figure 1: Vision-Language Model (VLM)-based classification (a) assumes a pre-defined set of target categories, i.e., the vocabulary, while our novel task (b) lifts this assumption by directly operating on the unconstrained language-induced semantic space. $f^v_\mathtt{VLM}$ and $f^t_\mathtt{VLM}$ denote the pre-trained vision and text models of a VLM, respectively. In this work, we also extend this paradigm to the task of semantic segmentation.
  • Figure 2: CaSED. Given an input image, CaSED retrieves the most relevant captions from an external database and filters them to extract candidate categories. We classify image-to-text and text-to-text, using the centroid of the retrieved captions as the textual counterpart of the input image.
  • Figure 3: Extending CaSED for Semantic Segmentation. We follow three strategies: (a) a class-agnostic segmenter (SAM) segments all objects, then CaSED labels each mask independently; (b) CaSED provides candidate categories for the image that are fed as input to an open-vocabulary segmentation model (SAN); (c) DenseCaSED, where we directly accumulate visual features from multi-scale patches, and perform CaSED locally.
  • Figure 4: Results of our preliminary study, showing the top-1 accuracy when matching semantic descriptions to ground-truth class names in ten different datasets. We compare BLIP-2 (VQA) and BLIP-2 (Captioning) with Closest Caption and Captions Centroid, i.e., the average representation of the retrieved captions. We additionally highlight the Upper bound for zero-shot CLIP. Representing the large semantic space as VLDs and retrieving captions from it produces outputs that are semantically closer to the ground-truth labels than those obtained by querying VQA-enabled VLMs, while requiring 10 times fewer parameters.
  • Figure 5: Qualitative results of CaSED on Caltech-101. The first three samples are success cases; the last two show failure cases.
  • ...and 5 more figures
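
The multi-scale accumulation mentioned for DenseCaSED in Figure 3(c) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the square sliding windows, the stride, and the toy `mean_intensity_encoder` are hypothetical stand-ins for a frozen vision encoder and are not taken from the paper. Each patch feature is added to every pixel the patch covers, and the sums are then averaged per pixel.

```python
def mean_intensity_encoder(patch):
    """Toy patch encoder (assumption): returns [mean, 1 - mean]."""
    total = sum(sum(row) for row in patch)
    mean = total / (len(patch) * len(patch[0]))
    return [mean, 1.0 - mean]


def accumulate_dense_features(image, encode_patch, patch_sizes=(2, 4), stride=2):
    """Average patch features over all multi-scale patches covering each pixel."""
    height, width = len(image), len(image[0])
    dim = len(encode_patch([[image[0][0]]]))  # probe the feature dimension
    sums = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]

    for size in patch_sizes:
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                patch = [row[left:left + size] for row in image[top:top + size]]
                feat = encode_patch(patch)
                # Add this patch's feature to every pixel it covers.
                for y in range(top, top + size):
                    for x in range(left, left + size):
                        counts[y][x] += 1
                        for d in range(dim):
                            sums[y][x][d] += feat[d]

    # Pixels covered by no patch keep a zero feature vector.
    return [[[s / max(counts[y][x], 1) for s in sums[y][x]]
             for x in range(width)] for y in range(height)]


# Toy 4x4 image: dark left half, bright right half.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
dense = accumulate_dense_features(image, mean_intensity_encoder)
print(dense[0][0], dense[0][3])
```

Under this sketch, each per-pixel feature would then take the place of the global image embedding as the query for retrieval and scoring, which is one way to read "perform CaSED locally".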