Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

Yitong Chen, Wenhao Yao, Lingchen Meng, Sihong Wu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

This work tackles open-world object detection under vast vocabularies, where prior vision-language alignment with coarse class names degrades as the vocabulary expands. It introduces Prova, a simple multi-modal prototype classifier that initializes alignment using detailed textual descriptions and reference images to form textual and visual prototypes, which are then fused with a conventional classifier. By employing two projection layers and four additional matrix multiplications, Prova can be plugged into diverse detectors and yields substantial gains in both supervised and open-vocabulary settings, achieving state-of-the-art results on V3Det (e.g., base AP 32.8 and novel AP 11.0) with a significantly lighter backbone compared to previous methods. The approach demonstrates strong generalization across detectors (Faster R-CNN, FCOS, DINO) and datasets (V3Det, LVIS), offering a practical, efficient path toward scalable, real-world recognition with vast vocabularies.

Abstract

Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabularies during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes to initialize the alignment classifiers, addressing the recognition failures that arise with vast vocabularies. On V3Det, this simple method substantially improves performance across one-stage, two-stage, and DETR-based detectors in both supervised and open-vocabulary settings, adding only projection layers. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP, respectively, in the supervised setting of V3Det. In the open-vocabulary setting, Prova achieves a new state of the art with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 AP, respectively, over previous methods.
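
As a rough illustration of the classifier described above, the following sketch shows one way textual and visual prototypes could be fused with a conventional classifier. It is not the authors' implementation: the function name `prototype_logits`, the weight names, the cosine-similarity scoring, the summation used for fusion, and the `scale` parameter are all assumptions made for illustration. Counting two projections plus two similarity products is only one plausible reading of the "two projection layers and four additional matrix multiplications" mentioned in the TL;DR.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def prototype_logits(obj_feats, w_cls, w_text, w_vis,
                     text_protos, vis_protos, scale=1.0):
    """Fuse conventional logits with prototype similarities (illustrative only).

    obj_feats:   (N, D)  object features from the detector
    w_cls:       (D, C)  conventional classifier weights
    w_text:      (D, Dt) hypothetical projection into the text-prototype space
    w_vis:       (D, Dv) hypothetical projection into the visual-prototype space
    text_protos: (C, Dt) one textual prototype per class
    vis_protos:  (C, Dv) one visual prototype per class
    """
    conventional = obj_feats @ w_cls                    # (N, C)

    text_feats = l2_normalize(obj_feats @ w_text)       # extra matmul 1
    vis_feats = l2_normalize(obj_feats @ w_vis)         # extra matmul 2

    text_sim = text_feats @ l2_normalize(text_protos).T  # extra matmul 3
    vis_sim = vis_feats @ l2_normalize(vis_protos).T     # extra matmul 4

    # Assumption: fusion by simple summation of the three score maps.
    return conventional + scale * (text_sim + vis_sim)


# Toy usage with random tensors standing in for real detector features.
rng = np.random.default_rng(0)
n_objects, feat_dim, n_classes, proto_dim = 8, 16, 5, 12
demo_logits = prototype_logits(
    rng.normal(size=(n_objects, feat_dim)),
    rng.normal(size=(feat_dim, n_classes)),
    rng.normal(size=(feat_dim, proto_dim)),
    rng.normal(size=(feat_dim, proto_dim)),
    rng.normal(size=(n_classes, proto_dim)),
    rng.normal(size=(n_classes, proto_dim)),
)
print(demo_logits.shape)
```

In this sketch the prototypes enter as fixed per-class matrices, so the only new parameters are the two projection matrices; whether the prototypes are later fine-tuned is left open here.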

Paper Structure

This paper contains 36 sections, 10 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Left: When the category vocabulary scales up, conventional closed-set classifiers and class-name alignment classifiers struggle to distinguish fine-grained classes such as Exotic Shorthair and British Shorthair. Our key idea is to extract more detailed multi-modal prototypes to replace the coarse, template-based class names in the alignment classifier. Right: AP on the V3Det validation set in the supervised setting. The class-name alignment classifier performs even worse than the conventional classifier, and the multi-modal prototype classifier, i.e. Prova, achieves the best results.
  • Figure 2: Feature space visualization of (a) Class names with fixed templates; (b) Detailed class descriptions; (c) Reference images of each class. We use the CLIP-ViT-Large text encoder to extract class name prototypes, LongCLIP-Large to extract description prototypes, and the CLIP-ViT-Large visual encoder to extract visual prototypes, and then apply t-SNE after $k$-means. Compared to class name prototypes, description and visual prototypes have a cleaner feature space, which is preferable for an alignment classifier.
  • Figure 3: (a) Overview of Entire Model. Images are processed by a detector, e.g. Faster R-CNN, FCOS or DINO, to extract object features (RoI features for Faster R-CNN, class features for FCOS and object queries for DINO), then these object features are utilized by the bounding box prediction head to predict bounding boxes and the Multi-modal Prototypes Classifier to generate corresponding category predictions. (b) Prototype Extraction. Textual prototypes are encoded by the LongCLIP-Large text encoder using detailed descriptions from the V3Det Challenge, and visual prototypes are encoded by the CLIP-ViT-Large visual encoder using images from the V3Det training dataset and examples.
  • Figure 4: Convergence curves of DINO with conventional classifier and Prova for 24 epochs on V3Det val set.
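
The visualization procedure named in the Figure 2 caption ($k$-means followed by t-SNE) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the authors' script: random vectors stand in for real CLIP or LongCLIP prototypes, the helper name `cluster_and_embed` is hypothetical, and the cluster count, perplexity, and the choice to use the $k$-means labels only for coloring the 2-D points are all guesses about a procedure the caption does not spell out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE


def cluster_and_embed(prototypes, n_clusters=5, perplexity=10.0, seed=0):
    """Cluster prototypes with k-means, then project them to 2-D with t-SNE.

    prototypes: (num_classes, dim) array, one prototype per class.
    Returns (labels, coords): cluster id per class and its 2-D coordinates.
    """
    # Unit-normalize so Euclidean distance tracks cosine similarity.
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)

    # Assumption: cluster labels are used only to color the scatter plot.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(protos)

    # perplexity must be smaller than the number of prototypes.
    coords = TSNE(n_components=2, perplexity=perplexity, init="pca",
                  random_state=seed).fit_transform(protos)
    return labels, coords


# Stand-in data: 60 fake "class prototypes" of dimension 32.
rng = np.random.default_rng(0)
fake_prototypes = rng.normal(size=(60, 32))
labels, coords = cluster_and_embed(fake_prototypes)
print(labels.shape, coords.shape)
```

Running the same function on class-name, description, and visual prototypes and plotting `coords` colored by `labels` would give three panels comparable in spirit to Figure 2.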