
GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection

Jiaming Li, Zhijia Liang, Weikai Chen, Lin Ma, Guanbin Li

Abstract

Fine-grained open-vocabulary object detection (FG-OVD) aims to detect novel object categories described by attribute-rich texts. While existing open-vocabulary detectors show promise at the base-category level, they underperform in fine-grained settings due to the semantic entanglement of subjects and attributes in pretrained vision-language model (VLM) embeddings -- leading to over-representation of attributes, mislocalization, and semantic drift in embedding space. We propose GUIDED, a decomposition framework specifically designed to address the semantic entanglement between subjects and attributes in fine-grained prompts. By separating object localization and fine-grained recognition into distinct pathways, GUIDED aligns each subtask with the module best suited to its role. Specifically, given a fine-grained class name, we first use a language model to extract a coarse-grained subject and its descriptive attributes. The detector is then guided solely by the subject embedding, ensuring stable localization unaffected by irrelevant or over-represented attributes. To selectively retain helpful attributes, we introduce an attribute embedding fusion module that incorporates attribute information into detection queries in an attention-based manner. This mitigates over-representation while preserving discriminative power. Finally, a region-level attribute discrimination module compares each detected region against full fine-grained class names using a refined vision-language model with a projection head for improved alignment. Extensive experiments on the FG-OVD and 3F-OVD benchmarks show that GUIDED achieves new state-of-the-art results, demonstrating the benefits of disentangled modeling and modular optimization. Our code will be released at https://github.com/lijm48/GUIDED.
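The abstract describes combining the detector's coarse (subject-level) confidence with a region-level attribute similarity to produce the final score via a weighted multiplication. A minimal sketch of that combination rule, assuming a single weighting exponent `alpha` (the function name and parameterization are illustrative assumptions, not the paper's exact formulation):

```python
def fused_score(coarse_conf: float, attr_sim: float, alpha: float = 0.5) -> float:
    """Hypothetical weighted multiplication of the two scores described
    in the abstract: the detector's coarse subject confidence and the
    fine-grained attribute similarity from the refined CLIP head.
    `alpha` balances the two terms; alpha=0.5 weights them equally."""
    return (coarse_conf ** alpha) * (attr_sim ** (1.0 - alpha))


# Example: a region with high subject confidence but low attribute
# similarity is down-weighted relative to one matching on both.
scores = [fused_score(0.9, 0.2), fused_score(0.8, 0.8)]
```

With equal weighting this reduces to a geometric mean, so a region must score well on both localization and fine-grained discrimination to rank highly.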

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: (a) The t-SNE visualization of CLIP text embeddings on LVIS classes and the fine-grained classes. The figure shows that some CLIP embeddings of fine-grained variants are positioned far apart. (b) The visualization of predictions of an OV detector with different class prompts ("A dog" vs. "A dog with a head"). The detector focuses on the head instead of the dog, leading to incorrect localization. (c) The mean classification scores and the mean IoU between predicted and ground-truth boxes for OWL-ViT [minderer2022simple] under coarse-grained and fine-grained class queries. The detector performs better on both classification and localization with coarse-grained queries than with fine-grained queries.
  • Figure 2: An overview of the proposed GUIDED framework. GUIDED adopts a three-stage pipeline, which consists of subject identification, coarse-grained object detection, and fine-grained attribute discrimination. In subject identification, GUIDED employs an LLM to extract the coarse-grained subject and its attribute embeddings. For detection, coarse-grained subject embeddings are adopted as queries to localize the candidate regions with coarse confidence. An attribute embedding fusion module selectively integrates attribute embeddings into queries. In the discrimination stage, GUIDED estimates the fine-grained score for each detected region with the full fine-grained class names using a refined CLIP with a projection head. The final score is obtained from the weighted multiplication of the detector's coarse confidence and the attribute similarity scores.
  • Figure 3: (a) Illustration of the prompt for subject identification. The prompt for extracting the associated attributes is shown in our supplementary document. (b) The architecture of the attribute-fused attention layer.
  • Figure 4: Illustration of the LLM prompt for attribute identification.
  • Figure 5: A failure case of GUIDED, illustrating our limitations.
  • ...and 1 more figure