Vocabulary-free Image Classification

Alessandro Conti, Enrico Fini, Massimiliano Mancini, Paolo Rota, Yiming Wang, Elisa Ricci

TL;DR

Vocabulary-free Image Classification (VIC) removes the need for a predefined category vocabulary at test time: an image must instead be classified within an unconstrained language-induced semantic space. The proposed CaSED method retrieves captions from a large external vision-language database, extracts candidate categories from them, and scores the candidates with a frozen vision-language model in a training-free fashion, combining image-to-text and centroid-based text-to-text signals. Across coarse- and fine-grained datasets, CaSED outperforms more complex VLMs such as BLIP-2 while using far fewer parameters. Because the external caption corpus acts as the prior over categories, the approach remains applicable when the semantic context is unknown or evolving.

Abstract

Recent advances in large vision-language models have revolutionized the image classification paradigm. Despite showing impressive zero-shot capabilities, these models assume a pre-defined set of categories, a.k.a. the vocabulary, at test time for composing the textual prompts. However, such an assumption can be impractical when the semantic context is unknown and evolving. We thus formalize a novel task, termed Vocabulary-free Image Classification (VIC), where we aim to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary. VIC is a challenging task as the semantic space is extremely large, containing millions of concepts, with hard-to-discriminate fine-grained categories. In this work, we first empirically verify that representing this semantic space by means of an external vision-language database is the most effective way to obtain semantically relevant content for classifying the image. We then propose Category Search from External Databases (CaSED), a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner. CaSED first extracts a set of candidate categories from captions retrieved from the database based on their semantic similarity to the image, and then assigns to the image the best matching candidate category according to the same vision-language model. Experiments on benchmark datasets validate that CaSED outperforms other complex vision-language frameworks, while being efficient with far fewer parameters, paving the way for future research in this direction.
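
The two stages described in the abstract (caption retrieval, then candidate scoring with image-to-text and centroid-based text-to-text signals) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the toy 3-dimensional embeddings, the softmax normalization, and the mixing weight `alpha` are all assumptions, and the step that parses candidate category names out of the retrieved captions is omitted. In practice the embeddings would come from the frozen vision and text encoders of the VLM.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def retrieve_captions(image_emb, caption_embs, k):
    """Return indices of the k database captions most similar to the image."""
    sims = l2_normalize(caption_embs) @ l2_normalize(image_emb)
    return np.argsort(-sims)[:k]


def score_candidates(image_emb, retrieved_caption_embs, candidate_embs, alpha=0.5):
    """Combine image-to-text and centroid-based text-to-text scores.

    alpha is a hypothetical mixing weight (an assumption of this sketch).
    """
    image_emb = l2_normalize(image_emb)
    candidate_embs = l2_normalize(candidate_embs)
    # The centroid of the retrieved captions acts as a textual proxy for the image.
    centroid = l2_normalize(l2_normalize(retrieved_caption_embs).mean(axis=0))
    image_to_text = softmax(candidate_embs @ image_emb)
    text_to_text = softmax(candidate_embs @ centroid)
    return alpha * image_to_text + (1.0 - alpha) * text_to_text


# Toy data standing in for encoder outputs (purely illustrative values).
image_emb = np.array([1.0, 0.0, 0.0])
caption_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
])
candidate_names = ["candidate_a", "candidate_b"]  # hypothetical parsed categories
candidate_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

top_idx = retrieve_captions(image_emb, caption_embs, k=2)
scores = score_candidates(image_emb, caption_embs[top_idx], candidate_embs)
prediction = candidate_names[int(np.argmax(scores))]
```

Both score vectors are normalized before mixing so that neither signal dominates purely because of scale; other ways of combining them would fit the description in the abstract equally well.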

Paper Structure

This paper contains 15 sections, 6 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Vision-Language Model (VLM)-based classification (a) assumes a pre-defined set of target categories, i.e. the vocabulary, while our novel task (b) lifts this assumption by directly operating on the unconstrained language-induced semantic space, without a known vocabulary. $f^v_\mathtt{VLM}$ and $f^t_\mathtt{VLM}$ denote the pre-trained vision and text models of a VLM, respectively.
  • Figure 2: Results of our preliminary study, showing the top-1 accuracy when matching semantic descriptions to ground-truth class names in ten different datasets. We compare BLIP-2 (VQA) and BLIP-2 (Captioning) with Closest Caption and Captions Centroid, i.e. the average representation of the retrieved captions. We additionally highlight the Upper bound for zero-shot CLIP. Representing the large semantic space as VLDs and retrieving captions from it produces outputs that are semantically closer to the ground-truth labels than those obtained by querying VQA-enabled VLMs, while requiring 10 times fewer parameters.
  • Figure 3: Overview of CaSED. Given an input image, CaSED retrieves the most relevant captions from an external database and filters them to extract candidate categories. We classify image-to-text and text-to-text, using the centroid of the retrieved captions as the textual counterpart of the input image.
  • Figure 4: Ablation on the number of retrieved captions. We report Cluster accuracy (%), Semantic similarity, and Semantic IoU (%).
  • Figure 5: Qualitative results on Caltech-101. The first three samples represent success cases, the last two show failure cases.
  • ...and 2 more figures