Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability

Yusuke Hosoya, Masanori Suganuma, Takayuki Okatani

TL;DR

Under equal conditions in OD pretraining, there is little difference between OVD and COD for object classes with low text-describability. Although OVD can learn from more diverse data than OD-specific data, thereby increasing the volume of training data, this can be counterproductive for classes with low text-describability.

Abstract

Open-vocabulary object detection (OVD), which detects specific classes of objects using only their linguistic descriptions (e.g., class names) without any image samples, has garnered significant attention. However, in real-world applications, the target class concept is often hard to describe in text, and the only way to specify target objects is to provide image examples, yet obtaining a sufficient number of samples is often challenging. Thus, there is high demand from practitioners for few-shot object detection (FSOD). A natural question arises: Can the benefits of OVD extend to FSOD for object classes that are difficult to describe in text? Compared to traditional methods that learn only predefined classes (referred to in this paper as closed-set object detection, COD), can the extra cost of OVD be justified? To answer these questions, we propose a method to quantify the "text-describability" of object detection datasets using the zero-shot image classification accuracy of CLIP. This allows us to categorize various OD datasets by text-describability and empirically evaluate the FSOD performance of OVD and COD methods within each category. Our findings reveal that: i) there is little difference between OVD and COD for object classes with low text-describability under equal conditions in OD pretraining; and ii) although OVD can learn from more diverse data than OD-specific data, thereby increasing the volume of training data, it can be counterproductive for classes with low text-describability. These findings provide practitioners with valuable guidance amidst the recent advancements of OVD methods.
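
The text-describability score mentioned above can be illustrated with a minimal sketch. It assumes that each object instance and each class name has already been encoded into a shared embedding space (for example by CLIP's image and text encoders), and it scores a dataset by the fraction of instances whose most similar class-name embedding is the ground-truth class. The function name `zero_shot_accuracy`, the use of cosine similarity over precomputed embeddings, and the toy vectors are assumptions made for illustration; the abstract does not specify the exact protocol (cropping, prompt templates, or averaging), so this is not the authors' implementation.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def zero_shot_accuracy(image_embeddings, labels, class_text_embeddings):
    """Hypothetical text-describability proxy.

    image_embeddings: one embedding per object instance (assumed to come from
        an image encoder such as CLIP's).
    labels: ground-truth class index for each instance.
    class_text_embeddings: one embedding per class name (assumed to come from
        the matching text encoder).
    Returns the fraction of instances whose nearest class-name embedding
    (by cosine similarity) is the ground-truth class.
    """
    if len(image_embeddings) != len(labels):
        raise ValueError("need exactly one label per image embedding")
    if not labels:
        raise ValueError("need at least one labeled instance")

    text = [_normalize(t) for t in class_text_embeddings]
    correct = 0
    for embedding, label in zip(image_embeddings, labels):
        image = _normalize(embedding)
        similarities = [sum(a * b for a, b in zip(image, t)) for t in text]
        predicted = max(range(len(similarities)), key=similarities.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Toy 2-D embeddings standing in for real encoder outputs (illustrative only).
toy_text = [[1.0, 0.0], [0.0, 1.0]]
toy_images = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
toy_labels = [0, 1, 1, 0]
toy_score = zero_shot_accuracy(toy_images, toy_labels, toy_text)
```

In this toy case three of the four instances are closest to their own class name, so the score is 0.75. Under the reading sketched here, a lower score would suggest that class names alone identify the objects poorly, i.e. lower text-describability.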

Paper Structure

This paper contains 32 sections, 10 figures, 11 tables.

Figures (10)

  • Figure 1: An overview of model architectures for (a) closed-set object detection (COD) and (b) open-vocabulary object detection (OVD).
  • Figure 2: Datasets (35 in total from ODinW Elevater) sorted by our metric for the difficulty of describing object classes in text. The datasets are categorized and ranked from S1 to S3, indicating decreasing text-describability.
  • Figure 3: AP ratio of OVD/COD. DyHead vs. GLIP(A) (top) and Faster RCNN vs. F-ViT (bottom).
  • Figure 4: Detection accuracy of state-of-the-art finetuning approaches for FSOD. Results for $K=3$ are shown. OVD methods are shaded in gray.
  • Figure 5: Detection performance for DyHead with TFA.
  • ...and 5 more figures