HDINO: A Concise and Efficient Open-Vocabulary Detector

Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li

TL;DR

This paper proposes HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction.

Abstract

Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage, a lightweight feature fusion module is applied to the aligned representations to enhance sensitivity to linguistic semantics. Under the Swin Transformer-T setting, HDINO-T achieves 49.2 mAP on COCO using 2.2M training images from two publicly available detection datasets, without any manual data curation or grounding data, surpassing Grounding DINO-T and T-Rex2 by 0.8 mAP and 2.8 mAP, respectively, which are trained on 5.4M and 6.5M images, respectively. After fine-tuning on COCO, HDINO-T and HDINO-L further achieve 56.4 mAP and 59.2 mAP, highlighting the effectiveness and scalability of our approach. Code and models are available at https://github.com/HaoZ416/HDINO.
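
To make the first-stage idea concrete, the sketch below shows one plausible way to turn a single ground-truth box into several positive noisy samples that all share the same category text, which is the one-to-many pairing the abstract describes. It is a minimal illustration, not the authors' implementation: the function names, the corner-jitter scheme, the `max_shift` bound, and the `min_iou` acceptance threshold are all assumptions introduced here.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def make_noisy_positives(gt_box, label, num_samples=4, max_shift=0.2,
                         min_iou=0.5, max_tries=100, seed=0):
    """Jitter a ground-truth box and keep candidates that stay close to it.

    Hypothetical helper: max_shift bounds each corner shift as a fraction of
    the box width/height, and min_iou is an assumed constraint that keeps the
    noisy box on the same object. Every kept box inherits the GT label.
    """
    rng = random.Random(seed)
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    samples, tries = [], 0
    while len(samples) < num_samples and tries < max_tries:
        tries += 1
        cand = (
            x1 + rng.uniform(-max_shift, max_shift) * w,
            y1 + rng.uniform(-max_shift, max_shift) * h,
            x2 + rng.uniform(-max_shift, max_shift) * w,
            y2 + rng.uniform(-max_shift, max_shift) * h,
        )
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # degenerate box, skip
        if iou(gt_box, cand) >= min_iou:
            samples.append((cand, label))  # same text label as the GT box
    return samples


if __name__ == "__main__":
    gt = (10.0, 20.0, 110.0, 220.0)
    for box, text in make_noisy_positives(gt, "dog"):
        print(text, [round(v, 1) for v in box], round(iou(gt, box), 3))
```

Under these assumptions, one text embedding ("dog") ends up paired with the ground-truth box plus up to `num_samples` nearby boxes, so the alignment signal comes from several visual instances rather than a single one.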

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Comparison of different semantic alignment paradigms. Classifier-based methods align textual features only with their corresponding objects, and fusion-based methods perform alignment on global features; both exhibit limited semantic alignment capability, whereas the proposed HDINO leverages positive noisy samples to facilitate visual–textual alignment.
  • Figure 2: Overview of the HDINO model. Components in light blue are inherited from the original DINO model, modules in light red are newly introduced in HDINO, and components in light green are shared by both. The Feat Map is a linear layer that projects textual features into a unified embedding dimension. We adopt the Language-Guided Query Selection strategy proposed in Grounding DINO to select initial anchors, and the auxiliary queries are removed during inference.
  • Figure 3: Visualization of positive noisy sample generation. Red boxes indicate ground-truth annotations, and blue boxes denote generated noisy samples obtained by constraining the perturbation of bounding box coordinates.
  • Figure 4: Comparison between the standard focal loss and the proposed Difficulty Weighted Classification Loss (DWCL) for positive samples. The loss is plotted as a function of the predicted probability $p$. For focal loss, $\alpha=0.25$ and $\gamma=2$ are used. For DWCL, the detection difficulty is fixed to $\mathrm{IoU}=0.5$ with $\beta_1=1$ and $\beta_2=2$, and the weighting factor $\alpha_{\mathrm{dwcl}}$ is normalized by $\mathbb{E}[1-\mathrm{IoU}]=0.25$.
  • Figure 5: 3D visualization of the proposed Difficulty Weighted Classification Loss (DWCL) for positive samples. In this visualization, hyperparameters are fixed to $\beta_1=1$ and $\beta_2=2$, and $\alpha_{\mathrm{dwcl}}$ is normalized by $\mathbb{E}[1-\mathrm{IoU}]=0.25$.
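
As a reference point for the baseline curve in Figure 4, the standard focal loss for a positive sample is $\mathrm{FL}(p) = -\alpha (1-p)^{\gamma} \log p$. The short sketch below evaluates it with the caption's settings $\alpha=0.25$ and $\gamma=2$. The captions state the DWCL hyperparameters but not its formula, so DWCL is deliberately not reproduced here; the function name and the clamping constant are illustrative choices.

```python
import math


def focal_loss_positive(p, alpha=0.25, gamma=2.0):
    """Standard focal loss for a positive sample with predicted probability p."""
    p = min(max(p, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


if __name__ == "__main__":
    # The loss shrinks quickly as p approaches 1, which is how focal loss
    # down-weights easy positives.
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"p={p:.1f}  focal={focal_loss_positive(p):.5f}")
```

Note that this weighting depends only on the predicted probability $p$, whereas the captions describe DWCL as additionally depending on detection difficulty measured through IoU.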