HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Yuqi Ma, Mengyin Liu, Chao Zhu, Xu-Cheng Yin

TL;DR

This paper proposes a universal, explicit approach for frozen mainstream OVD models that highlights fine-grained attributes in an explicit linear space, uniformly improving the fine-grained attribute-level detection of various mainstream models.

Abstract

Open-vocabulary object detection (OVD) models are considered to be Large Multi-modal Models (LMMs) because of their extensive training data and large number of parameters. Mainstream OVD models prioritize the coarse-grained category of an object over its fine-grained attributes, e.g., colors or materials, and thus fail to identify objects specified with certain attributes. However, OVD models are pretrained on large-scale image-text pairs with rich attribute words, so their latent feature space can represent the global text feature as a linear composition of fine-grained attribute tokens, albeit without highlighting them. Therefore, we propose a universal and explicit approach for frozen mainstream OVD models that boosts their attribute-level detection capabilities by highlighting fine-grained attributes in an explicit linear space. First, an LLM is leveraged to highlight attribute words within the input text as a zero-shot prompted task. Second, by strategically adjusting the token masks, the text encoders of OVD models extract both global text features and attribute-specific features, which are then explicitly composed as two vectors in linear space to form a new attribute-highlighted feature for detection tasks; the corresponding scalars, which reweight the two vectors, are either hand-crafted or learned. Notably, these scalars can be seamlessly transferred among different OVD models, which indicates that such an explicit linear composition is universal. Empirical evaluation on the FG-OVD dataset demonstrates that our proposed method uniformly improves fine-grained attribute-level OVD of various mainstream models and achieves new state-of-the-art performance.
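
The explicit linear composition described in the abstract can be illustrated with a minimal sketch. It assumes the global and attribute-specific text features are already available as plain vectors of equal length. The function name `compose_text_feature`, the weight names `w_global` and `w_attri`, and the optional renormalization step are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def compose_text_feature(
    u_global: Sequence[float],
    u_attri: Sequence[float],
    w_global: float = 1.0,
    w_attri: float = 1.0,
    renormalize: bool = False,
) -> List[float]:
    """Weighted sum of a global and an attribute-specific text feature.

    w_global and w_attri are hypothetical names for the scalars that the
    abstract says are hand-crafted or learned.
    """
    if len(u_global) != len(u_attri):
        raise ValueError("both features must have the same dimension")

    # Element-wise linear composition: w_global * U_global + w_attri * U_attri.
    u_new = [w_global * g + w_attri * a for g, a in zip(u_global, u_attri)]

    # Optional L2 renormalization (an assumption; useful when the detector
    # compares text and region features by cosine similarity).
    if renormalize:
        norm = math.sqrt(sum(x * x for x in u_new))
        if norm > 0.0:
            u_new = [x / norm for x in u_new]
    return u_new
```

Because the composition is a plain weighted sum, a pair of scalars found for one detector can be passed unchanged to another, which is the kind of transfer the abstract reports.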

Paper Structure (18 sections, 13 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Difference between (a) open-vocabulary object detection for fine-grained category names, (b) fine-grained open-vocabulary object detection for attribute-specific descriptions, and (c) our proposed HA-FGOVD method.
  • Figure 2: The overall architecture of our proposed HA-FGOVD approach. (a) First, an LLM follows the prompt to highlight the attributes in the input text, which are then converted into attribute positions $\Phi_{A,T}$. (b) Second, $\Phi_{A,T}$ are employed to mask the attention map $QK^\top$ to obtain the attribute-specific feature $U_\mathrm{attri}$. (c) Finally, explicit linear composition yields a new feature $U_\mathrm{new}$ from $U_\mathrm{attri}$ and $U_\mathrm{global}$ with greater emphasis on attributes, which enhances the final detection results.
  • Figure 3: Attribute word extraction. The LLM is configured through the system message, which establishes a general dialogue background: it defines the model's role as extracting attribute words and provides the definition of attribute words as well as the output format. In addition, 15 in-context examples are given to help the LLM comprehend the output rules, with the aim of improving the precision of attribute word extraction and reducing the risk of hallucinations.
  • Figure 4: 2D attention masks in a text encoder with the BERT architecture. (a) Default mask for the global feature $U_\mathrm{global}$ and (b) attribute mask for the attribute-specific feature $U_\mathrm{attri}$. The 2nd token "[T2]" is taken as the attribute in this example.
  • Figure 5: 2D attention masks in a text encoder with the CLIP architecture. (a) Default mask for the global feature $U_\mathrm{global}$ and (b) attribute mask for the attribute-specific feature $U_\mathrm{attri}$. The 2nd token "[T2]" is taken as the attribute in this example.
  • ...and 1 more figure
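
The attention masks described in the captions for Figures 4 and 5 can be sketched as follows. This is one plausible reading, not the paper's exact mask layout: it assumes the default mask is fully bidirectional for a BERT-style encoder and lower-triangular (causal) for a CLIP-style encoder, and that the attribute mask keeps only the key columns at the attribute positions. The function names, the 0-based indexing, and the handling of special tokens (none here) are all assumptions.

```python
from typing import List, Sequence

Mask = List[List[int]]  # mask[i][j] == 1 means query token i may attend to key token j


def default_mask(n_tokens: int, causal: bool) -> Mask:
    """Default 2D mask: all ones (BERT-style) or lower-triangular (CLIP-style)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


def attribute_mask(n_tokens: int, attr_positions: Sequence[int], causal: bool) -> Mask:
    """Restrict the default mask to key columns at the attribute positions.

    attr_positions are 0-based token indices (hypothetical stand-in for the
    attribute positions produced from the LLM output).
    """
    keep = set(attr_positions)
    if any(p < 0 or p >= n_tokens for p in keep):
        raise ValueError("attribute position out of range")
    base = default_mask(n_tokens, causal)
    # Zero out every key column that is not an attribute token.
    return [
        [base[i][j] if j in keep else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


# Example mirroring the captions: 4 tokens, the 2nd token (index 1) is the attribute.
bert_attr = attribute_mask(4, [1], causal=False)
clip_attr = attribute_mask(4, [1], causal=True)
```

Under the causal variant, query rows that precede the first attribute token end up with no permitted keys. A real implementation would need to handle such rows before the softmax, for example by ignoring their outputs; this sketch leaves them empty.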