Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Ying Liu, Yijing Hua, Haojiang Chai, Yanbo Wang, TengQi Ye

TL;DR

The paper tackles fair evaluation for open-vocabulary object detection by introducing 3F-OVD, a fine-grained variant where novel classes are fine-grained categories. It contributes NEU-171K, a large-scale dataset for both supervised and open-vocabulary detection across two domains, plus a practical post-processing technique to reduce false positives from captions. Through extensive experiments with classic detectors and open-vocabulary models, the authors show that 3F-OVD is more challenging than existing FG-OVD and OV-VG benchmarks, underscoring the need for stronger fine-grained visual-language alignment. Overall, the work provides a new benchmark and dataset, addressing data leakage and annotation overhead while enabling more realistic evaluation of open-vocabulary FG-OD models.

Abstract

Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images rather than on classes, detectors cannot make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

Paper Structure

This paper contains 18 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Ground truth annotations for different evaluation tasks. Base classes are highlighted in red, while novel classes are in blue. FG-OD is the abbreviation of supervised Fine-Grained Object Detection. From left to right, the fine-grained classes of the cars are: "Ford Focus", "Nissan Eida", "Ford Focus".
  • Figure 2: Bounding boxes predicted by ViLD on NEU-171K-RP before and after applying our post-processing method. The evaluated class is "Dried Fruit Liuliumei Green" with the caption: "The packaging of Liuliumei Green is a plum-shaped plastic bag predominantly green ... and the lower-right corner contains a line of small text." Each bounding box is labeled with the corresponding word and confidence score.
  • Figure 3: Class distribution of images in NEU-171K-RP (red) and NEU-171K-C (blue). The X-axis represents class IDs, sorted in descending order by frequency, while the Y-axis indicates the number of images per class.
  • Figure 4: Predicted bounding boxes from ViLD before and after applying our post-processing method across all classes. Each bounding box is labeled with the number of occurrences and the three most frequently predicted words.