Parameter-Efficient Semantic Augmentation for Enhancing Open-Vocabulary Object Detection

Weihao Cao, Runqi Wang, Xiaoyue Duan, Jinchao Zhang, Ang Yang, Liping Jing

Abstract

Open-vocabulary object detection (OVOD) enables models to detect any object category, including unseen ones. Benefiting from large-scale pre-training, existing OVOD methods achieve strong detection performance in general scenarios (e.g., OV-COCO) but suffer severe performance drops when transferred to downstream tasks with substantial domain shifts. This degradation stems from the scarcity and weak semantics of category labels in domain-specific tasks, as well as the inability of existing models to capture auxiliary semantics beyond coarse-grained category labels. To address these issues, we propose HSA-DINO, a parameter-efficient semantic augmentation framework for enhancing open-vocabulary object detection. Specifically, we propose a multi-scale prompt bank that leverages image feature pyramids to capture hierarchical semantics and select domain-specific local semantic prompts, progressively enriching textual representations from coarse- to fine-grained levels. Furthermore, we introduce a semantic-aware router that dynamically selects the appropriate semantic augmentation strategy during inference, thereby preventing parameter updates from degrading the generalization ability of the pre-trained OVOD model. We evaluate HSA-DINO on OV-COCO, several vertical-domain datasets, and modified benchmark settings. The results show that HSA-DINO performs favorably against previous state-of-the-art methods, achieving a superior trade-off between domain adaptability and open-vocabulary generalization.
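The prompt-bank selection described in the abstract can be illustrated with a minimal sketch: at each feature-pyramid level, the pooled image feature is matched against learnable prompt keys, and the best-matching prompts are gathered to augment the text embeddings. This is our reading of the mechanism, not the paper's implementation; all names, shapes, and the cosine-similarity matching rule are assumptions.

```python
import numpy as np

def select_prompts(pyramid_feats, prompt_keys, prompt_bank, top_k=2):
    """Illustrative sketch (not the paper's code): for each pyramid level,
    match the pooled image feature against prompt keys via cosine
    similarity and gather the top-k prompts from the bank."""
    selected = []
    for level_feat in pyramid_feats:              # one (C,) vector per scale
        q = level_feat / (np.linalg.norm(level_feat) + 1e-8)
        k = prompt_keys / (np.linalg.norm(prompt_keys, axis=1, keepdims=True) + 1e-8)
        sim = k @ q                               # (N,) key-query similarities
        idx = np.argsort(-sim)[:top_k]            # best-matching bank entries
        selected.append(prompt_bank[idx])         # (top_k, M, C) prompt tokens
    return np.concatenate(selected, axis=0)       # prompts from all scales

# toy shapes: bank of N=8 prompts, each M=4 tokens of dimension C=16
rng = np.random.default_rng(0)
keys, bank = rng.normal(size=(8, 16)), rng.normal(size=(8, 4, 16))
feats = [rng.normal(size=16) for _ in range(3)]   # 3 pyramid levels
prompts = select_prompts(feats, keys, bank)
print(prompts.shape)  # (6, 4, 16): 3 levels x top-2 prompts each
```

The coarse-to-fine enrichment in the paper corresponds to drawing prompts at every pyramid scale, so lower-resolution levels contribute global context while higher-resolution levels contribute local detail.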


Paper Structure

This paper contains 13 sections, 9 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Pre-trained OVOD models perform well on the general domain (e.g., OV-COCO [lin2014microsoft]) but fail to generalize to vertical domains (e.g., ArTaxOr [drange2019arthropod], DIOR [li2020object], UODD [jiang2021underwater]) in the zero-shot setting. Although fine-tuning improves performance on vertical domains, it causes significant degradation on the general domain.
  • Figure 2: Method motivation. (a) Previous methods use predefined templates or learnable vectors prepended to the category label embeddings, ignoring detailed semantics from image features. Our multi-scale prompt bank uses hierarchical semantics from the multi-scale feature pyramid to select auxiliary prompts for the category labels. (b) The dynamic routing method [yu2024boosting] uses the reconstruction error of an input, obtained from multiple autoencoders, as an indicator of its domain. However, the large overlap in reconstruction errors across different domains can confuse the model, leading to incorrect parameter selection. Our method explicitly models both the content and the domain of the inputs, effectively reducing this overlap and enabling more accurate routing decisions.
  • Figure 3: Overview of the proposed HSA-DINO framework. We incorporate LoRA into the image encoder during training on downstream datasets. We also introduce a multi-scale prompt bank to further enhance the model's adaptability to downstream tasks. In the test stage, we propose a semantic-aware router that can accurately identify different data distributions, enabling more precise selection between pre-trained semantics and augmented semantics for detection.
  • Figure 4: Ablation study on (a) bank size $N$, (b) prompt length $M$, (c) key matching loss weight $\lambda_{\mathrm{m}}$, and (d) orthogonal loss weight $\lambda_{\mathrm{p}}$.
  • Figure 5: Ablation on the routing decision threshold $\tau$.
  • ...and 3 more figures
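The routing decision described in the Figure 2 and Figure 5 captions can be sketched as a prototype match with a confidence threshold: instead of comparing autoencoder reconstruction errors, the input is compared to per-domain prototypes, and only a sufficiently confident match (above a threshold τ) triggers the domain-augmented branch. This is a hedged reconstruction of the idea; the prototype representation, the cosine-similarity score, and the fallback convention are all assumptions.

```python
import numpy as np

def semantic_route(feat, domain_protos, tau=0.6):
    """Sketch of a semantic-aware routing rule (names and scoring are
    assumptions). Compare the input feature to per-domain prototypes;
    return the matched domain index only if the best cosine similarity
    exceeds tau, else -1 to keep the pre-trained semantics."""
    f = feat / (np.linalg.norm(feat) + 1e-8)
    p = domain_protos / (np.linalg.norm(domain_protos, axis=1, keepdims=True) + 1e-8)
    sims = p @ f                       # one similarity score per domain
    best = int(np.argmax(sims))
    return best if sims[best] > tau else -1   # -1 = pre-trained branch

protos = np.eye(3, 8)                  # 3 toy domain prototypes in 8-d space
x = np.array([0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(semantic_route(x, protos))       # 0: confidently routed to domain 0
print(semantic_route(np.ones(8), protos))  # -1: no confident match
```

Raising τ makes the router more conservative, preserving the pre-trained model's open-vocabulary behavior at the cost of applying the augmented semantics less often, which matches the trade-off ablated in Figure 5.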