What's in a Name? Beyond Class Indices for Image Recognition

Kai Han; Xiaohu Huang; Yandong Li; Sagar Vaze; Jie Li; Xuhui Jia

TL;DR

This work defines Semantic Category Discovery (SCD): assigning semantic class names to images from an unconstrained vocabulary rather than a fixed label set. It combines non-parametric clustering on self-supervised features with a vision-language model (e.g., CLIP) to vote on candidate names for each cluster, and iteratively refines names and clusters; the method can use text augmentation from external sources (CC12M) and supports unsupervised and partially supervised settings. A constrained variant (CSS-$k$-means) using a Minimum Cost Flow objective improves clustering stability when labels are available, and a linear-assignment step with the Hungarian algorithm ensures unique semantic names across clusters. Across ImageNet, Stanford Dogs, and CUB, the approach yields substantial improvements over baselines (notably ~50% relative gains on ImageNet in the unsupervised setting) and demonstrates that textual features can boost clustering performance, highlighting a practical pathway toward open-vocabulary, human-aligned recognition systems.
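The unique-name step mentioned above can be illustrated with a small sketch. It assumes a hypothetical vote-count matrix (rows are clusters, columns are candidate names) and, to stay dependency-free, uses a brute-force search over permutations as a stand-in for the Hungarian algorithm; it is only practical for tiny inputs and is not the authors' implementation. The function name `assign_unique_names` and the toy counts are illustrative assumptions.

```python
from itertools import permutations


def assign_unique_names(vote_counts, names):
    """Pick one distinct name per cluster, maximising the total vote count.

    vote_counts[c][j] is the number of votes cluster c gave to names[j].
    Brute force stands in for the Hungarian algorithm here (assumption:
    the input is small and len(names) >= number of clusters).
    """
    n_clusters = len(vote_counts)
    best_total, best_pick = -1, None
    for pick in permutations(range(len(names)), n_clusters):
        total = sum(vote_counts[c][j] for c, j in enumerate(pick))
        if total > best_total:
            best_total, best_pick = total, pick
    return [names[j] for j in best_pick]


# Toy example: clusters 0 and 1 both prefer "dog", so one must give way.
names = ["dog", "wolf", "cat"]
vote_counts = [
    [5, 2, 0],  # cluster 0
    [4, 3, 0],  # cluster 1
    [0, 0, 6],  # cluster 2
]
unique_names = assign_unique_names(vote_counts, names)
print(unique_names)
```

In this toy case, cluster 1 falls back to its second choice because giving "dog" to cluster 0 yields the higher total. For realistic sizes, a polynomial-time linear-assignment solver would replace the permutation loop.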

Abstract

Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a 'zero-shot' manner, though they are once again provided a pre-defined set of candidate names at test-time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large (essentially unconstrained) vocabulary of categories as prior information. We leverage non-parametric methods to establish meaningful relationships between images, allowing the model to automatically narrow down the pool of candidate names. Our proposed approach entails iteratively clustering the data and employing a voting mechanism to determine the most suitable class names. Additionally, we investigate the potential of incorporating additional textual features to enhance clustering performance. To achieve this, we employ the CLIP vision and text encoders to retrieve relevant texts from an external database, which can provide supplementary semantic information to inform the clustering process. Furthermore, we tackle this problem both in unsupervised and partially supervised settings, as well as with a coarse-grained and fine-grained search space as the unconstrained dictionary. Remarkably, our method leads to a roughly 50% improvement over the baseline on ImageNet in the unsupervised setting.
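The voting mechanism described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the feature vectors are hand-written toy values standing in for vision-language embeddings, cosine similarity is assumed as the matching score, and the names `cosine` and `vote_cluster_names` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def vote_cluster_names(image_feats, cluster_ids, vocab_feats):
    """Return {cluster_id: most frequently predicted name in that cluster}.

    Each image first picks its nearest name from the whole vocabulary;
    each cluster then keeps the name that its images chose most often.
    """
    votes = {}
    for feat, cid in zip(image_feats, cluster_ids):
        best = max(vocab_feats, key=lambda name: cosine(feat, vocab_feats[name]))
        votes.setdefault(cid, Counter())[best] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


# Toy data (assumption): 2-D stand-ins for image and text embeddings.
vocab_feats = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "wolf": [0.8, 0.6]}
image_feats = [[0.9, 0.1], [1.0, 0.05], [0.7, 0.6], [0.1, 0.9], [0.2, 1.0]]
cluster_ids = [0, 0, 0, 1, 1]

voted = vote_cluster_names(image_feats, cluster_ids, vocab_feats)
print(voted)
```

Here one image in cluster 0 individually matches "wolf", but the cluster-level vote overrides it with "dog". Relabelling images with the voted names and re-clustering would then form the next iteration.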

Paper Structure (30 sections, 5 equations, 6 figures, 7 tables)

Introduction
Related work
Clustering and category discovery
Semantic representation learning
Object discovery
Semantic category discovery
Baseline: zero-shot transfer
Unsupervised setting
Initial clustering
Name voting
Semantic refinement
Partially supervised setting
Constraining k-means
Enhancing the clustering feature with text
Experiments
...and 15 more sections

Figures (6)

Figure 1: An illustration of how our proposed tasks extend existing image recognition settings. Left: A model is trained to predict class indices for a pre-defined set of categories (e.g., supervised recognition, unsupervised clustering, or category discovery). Middle: A vision-language model is given a pre-defined set of class names to recognize images in a 'zero-shot' manner. Right: In our proposed setting, a model must predict an image's class name given only a large, unconstrained vocabulary. Note that the leftmost setting (blue model) uses only a visual representation, while the middle and right settings (orange models) use vision-language representations.
Figure 2: An illustration of our proposed method. Left: We first perform non-parametric clustering on deep features to get initial cluster assignments (see the "Initial clustering" section). Middle: For each cluster, we use a VL model to assign a class name for each image from the entire open vocabulary. We select one class name for each cluster based on the most common occurrence (see the "Name voting" section). Right: Based on the voted class names, we label each image as one of these, using these assignments to form new clusters. We then iterate between name voting and re-clustering. Note: Here, we have not illustrated refinement with top-$k$ voting (see the "Semantic refinement" section).
Figure 3: Enhancing the clustering feature with text. We retrieve the top-$r$ most relevant texts from the CC12M dataset for each image using the CLIP vision and text encoders. The textual features of the retrieved texts are combined with the visual features (obtained by a pre-trained vision encoder such as CLIP, DINO, or GCD) for the non-parametric clustering.
Figure 4: Sorted cluster sizes obtained by different initial clustering algorithms. Results are reported on the ImageNet-100 dataset with DINO features.
Figure 5: Qualitative results on unlabelled instances from unknown classes. Top row: correct predictions; Middle row: partially correct predictions; Bottom row: incorrect predictions. P: prediction; L: label; S: Soft semantic similarity score.
...and 1 more figures
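
The text-enhanced clustering feature described in the Figure 3 caption might look roughly like the sketch below. The caption does not say how the textual and visual features are combined, so averaging the top-$r$ retrieved text features and concatenating the result to the visual feature is purely an assumption made for illustration. All vectors are toy values, and `text_enhanced_feature` is a hypothetical name.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def text_enhanced_feature(clip_image_feat, visual_feat, text_feats, r=2):
    """Build a clustering feature from visual and retrieved text features.

    clip_image_feat: image embedding used only for retrieval.
    visual_feat:     feature from the clustering backbone.
    text_feats:      embeddings of captions in an external text database.
    Assumption: the top-r text features are averaged, then concatenated.
    """
    ranked = sorted(text_feats, key=lambda t: cosine(clip_image_feat, t), reverse=True)
    top = ranked[:r]
    mean_text = [sum(col) / len(top) for col in zip(*top)]
    return list(visual_feat) + mean_text


# Toy data (assumption): three caption embeddings, one query image.
clip_image_feat = [1.0, 0.0]
visual_feat = [0.5, 0.5, 0.5]
text_feats = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]

enhanced = text_enhanced_feature(clip_image_feat, visual_feat, text_feats, r=2)
print(enhanced)
```

The retrieval embedding and the clustering backbone are kept as separate arguments because the caption allows the visual feature to come from a different encoder than the one used for retrieval.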
