The devil is in the fine-grained details: Evaluating open-vocabulary object detectors for fine-grained understanding

Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, Fabrizio Falchi

TL;DR

The paper asks whether state-of-the-art open-vocabulary detectors understand fine-grained object properties and introduces FG-OVD, a benchmark built on per-object dynamic vocabularies that pair one positive caption with $N$ attribute-based negative captions. Positive captions are generated by an LLM from a PACO-derived dataset; the evaluation protocol applies per-object vocabularies, post-processes detections with class-agnostic NMS, and reports mAP and Median Rank. The experiments show that most detectors struggle with hard negatives and fine-grained attributes: color is comparatively easy, whereas attributes such as pattern and transparency remain challenging. OWL and ViLD often perform best in the hard settings, while Detic's strong results on LVIS do not carry over to FG-OVD. The authors propose future work on few-shot contrastive fine-tuning and latent attribute representations, and release data and code to foster further research on fine-grained open-vocabulary understanding.

Abstract

Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty that probes different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD/.
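
The dynamic vocabulary idea described in the abstract can be illustrated with a minimal sketch: one positive caption plus hard negatives obtained by swapping a single attribute word. The attribute pools, the word-level substitution, and the function names `make_negatives` and `build_vocabulary` are assumptions made for illustration; they are not taken from the authors' released code, which may construct negatives differently.

```python
import random

# Hypothetical attribute pools, for illustration only.
# The benchmark's real attribute lists are not reproduced here.
ATTRIBUTE_POOLS = {
    "color": ["grey", "red", "blue", "green", "yellow"],
    "material": ["stone", "metal", "wood", "plastic", "leather"],
}


def make_negatives(caption, attribute_type, n_negatives, seed=0):
    """Return up to n_negatives captions that differ from `caption`
    in exactly one word of the given attribute type."""
    pool = ATTRIBUTE_POOLS[attribute_type]
    words = caption.split()
    # Positions of words that belong to the chosen attribute pool.
    positions = [i for i, w in enumerate(words) if w.lower() in pool]
    if not positions:
        return []  # nothing to substitute for this attribute type
    target = positions[0]
    original = words[target].lower()
    substitutes = [a for a in pool if a != original]
    random.Random(seed).shuffle(substitutes)
    negatives = []
    for substitute in substitutes[:n_negatives]:
        new_words = list(words)
        new_words[target] = substitute
        negatives.append(" ".join(new_words))
    return negatives


def build_vocabulary(positive, attribute_type, n_negatives, seed=0):
    """Vocabulary layout assumed here: positive caption first, negatives after."""
    return [positive] + make_negatives(positive, attribute_type, n_negatives, seed)


vocab = build_vocabulary("A light grey stone bench", "material", n_negatives=3)
for entry in vocab:
    print(entry)
```

Each negative differs from the positive only in the material word, so a detector can rank the positive first only if it actually resolves that attribute rather than just the object category.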

Paper Structure

This paper contains 33 sections, 20 figures, and 11 tables.

Figures (20)

  • Figure 1: We propose a benchmark suite to evaluate fine-grained open-vocabulary detection (FG-OVD). We build several sets of dynamic object-specific vocabularies, each composed of one positive and several negative captions, to probe the ability of open-vocabulary detectors to discern detailed object properties, like color, pattern, or material. We craft positive captions from semi-structured descriptions of objects and their parts employing a Large Language Model (LLM), while negative captions of different difficulty levels are built via attribute substitution. By manipulating negative sets according to their difficulty levels or the types of attributes altered (the Difficulty-based and Attribute-based benchmarks, respectively), we gain a nuanced understanding of each detector's capabilities across various scenarios.
  • Figure 2: Examples of Dynamic Vocabularies: the image $I$ features two distinct object groups $\mathcal{G}_1$ and $\mathcal{G}_2$, each one associated with a set of captions. The positive captions $c^\text{pos}$ (marked with ✓), "A light grey stone bench" and "A metal handbag in grey color", are juxtaposed with three negative captions $c^\text{neg}$ (indicated by ✗). These positive and negative captions collectively form two vocabularies, namely $\mathcal{V}^{\mathcal{G}_1}$ assigned to $o_1$, and $\mathcal{V}^{\mathcal{G}_2}$ assigned to $o_2$ and $o_3$. The open-vocabulary detector is then applied to $I$ two times, once for each vocabulary: $\psi(I, \mathcal{V}^{\mathcal{G}_1})$ and $\psi(I, \mathcal{V}^{\mathcal{G}_2})$.
  • Figure 3: Benchmarks Examples: each benchmark tests different properties by crafting negative captions via attribute substitution.
  • Figure 4: Effect of the number of negative captions. For each of the eight proposed benchmarks, we report the mAP (rows 1-2) and the Rank (rows 3-4) of the probed detectors as the number $N$ of negative captions varies. Note that Pattern and Transparency admit only a limited number of possible negatives (7 and 2, respectively).
  • Figure 5: Output scores from the probed detectors. We report examples of how detectors score vocabulary entries for a specific object. The first caption (green) is the positive, while the others (red) are the negatives. First row, Hard vs Trivial: we show the difference in score distributions when detectors are challenged with Hard (left) or Trivial (right) negatives. Second row, Attributes: we show the behavior when changing specific attributes, like color (left) or material (right). Third row, Varying the number of negatives: we show how increasing the number of negatives (left: $N=2$, right: $N=5$) strongly challenges the fine-grained discriminative abilities of many detectors.
  • ...and 15 more figures
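
The rank-based view used in Figures 4 and 5 can be sketched as follows. This is a minimal illustration that assumes the rank of an object is the 1-based position of its positive caption when the vocabulary entries are sorted by descending detector score, with ties resolved in favor of the positive; the paper's exact Median Rank definition may differ. The score values below are made up for the example and are not results from the paper.

```python
from statistics import median


def positive_rank(scores, positive_index=0):
    """1-based rank of the positive caption among all vocabulary entries.

    Assumption: only strictly higher-scoring negatives push the positive
    down, so ties are resolved in favor of the positive caption.
    """
    positive_score = scores[positive_index]
    higher = sum(
        1
        for i, score in enumerate(scores)
        if i != positive_index and score > positive_score
    )
    return 1 + higher


def median_rank(per_object_scores):
    """Median of the positive-caption ranks over all evaluated objects."""
    return median(positive_rank(scores) for scores in per_object_scores)


# Made-up scores for three objects; index 0 is always the positive caption.
example_scores = [
    [0.9, 0.2, 0.1],  # positive ranked 1st
    [0.3, 0.5, 0.4],  # positive ranked 3rd
    [0.4, 0.6, 0.1],  # positive ranked 2nd
]
print(median_rank(example_scores))
```

Under this definition a rank of 1 means the detector preferred the correct description over every negative, and appending more negatives to a vocabulary can only keep the rank the same or make it worse.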