OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

TL;DR

OV-DINO tackles open-vocabulary detection by unifying diverse data sources into a detection-centric pre-training framework and introducing language-aware selective fusion to align region-level visual embeddings with language prompts. The Unified Data Integration (UniDI) pipeline converts detection, grounding, and image-text data into a common triplet format, eliminating pseudo-label noise on image-text data, while the Language-Aware Selective Fusion (LASF) module dynamically selects and fuses text-related region embeddings to enhance cross-modal alignment. On COCO and LVIS, OV-DINO achieves state-of-the-art zero-shot AP (50.6% on COCO, 40.1% on LVIS) and attains 58.4% AP after fine-tuning on COCO, demonstrating strong generalization with a one-stage, end-to-end training regime. In practice, the approach removes the noisy pseudo-labeling stage and improves language-guided region understanding for open-world detection tasks.

Abstract

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data format. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enhance the cross-modality alignment through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art results with an AP of 50.6% on the COCO benchmark and 40.1% on the LVIS benchmark in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO is available at https://github.com/wanghao9610/OV-DINO.
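
The idea behind the Unified Data Integration pipeline can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the `Sample` class, the three converter functions, and the prompt template are hypothetical names chosen for this example. It also assumes that an image-text pair is turned into a detection-style sample by treating the caption as the only category and the whole image as its box, which is one way to avoid generating pseudo labels.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    """Hypothetical unified triplet: image reference, boxes, one text prompt per box."""
    image_path: str
    boxes: List[Box]
    prompts: List[str]


def from_detection(image_path: str, boxes: Sequence[Box], class_names: Sequence[str],
                   template: str = "a photo of a {}.") -> Sample:
    # Detection data: each box carries a category name, wrapped in a prompt
    # template (the template text is an illustrative assumption).
    return Sample(image_path, list(boxes), [template.format(n) for n in class_names])


def from_grounding(image_path: str, boxes: Sequence[Box], phrases: Sequence[str]) -> Sample:
    # Grounding data: each box is already paired with a phrase from the caption.
    return Sample(image_path, list(boxes), list(phrases))


def from_image_text(image_path: str, width: int, height: int, caption: str) -> Sample:
    # Image-text data (assumption): the caption acts as a single category and
    # the whole image is its bounding box, so no pseudo labels are needed.
    return Sample(image_path, [(0.0, 0.0, float(width), float(height))], [caption])


if __name__ == "__main__":
    print(from_detection("a.jpg", [(10.0, 20.0, 110.0, 220.0)], ["cat"]))
    print(from_image_text("b.jpg", 640, 480, "a dog running on a beach"))
```

Because all three converters return the same structure, a single detection-style training loop could consume every source without a separate pseudo-labeling stage.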

Paper Structure

This paper contains 14 sections, 4 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of OV-DINO with Previous Methods. (a) Previous methods (e.g., GLIP [li2022glip], GLIPv2 [zhang2022glipv2], G-DINO [liu2023gdino]) adopt a two-stage paradigm: they first pre-train on large-scale Detection and Grounding data, then generate pseudo labels on Image-Text data, potentially introducing noise (red circle). (b) OV-DINO is a one-stage detection-centric method that integrates various data sources into a unified detection data format through a Unified Data Integration pipeline. It is pre-trained end to end via region-text alignment within a unified detection framework.
  • Figure 2: Illustration of Language-Aware Selective Fusion (LASF). We illustrate typical cross-modality fusion in G-DINO [liu2023gdino] alongside language-aware selective fusion. LASF comprises query selection and query fusion: it selects the object embeddings related to the text input and fuses them with the learnable content queries to improve prediction accuracy. In contrast, G-DINO directly fuses the query with the text embedding. OV-DINO with LASF achieves higher accuracy than G-DINO (e.g., 87% vs. 63% for "person", 93% vs. 55% for "tennis racket"), highlighting the effectiveness of LASF.
  • Figure 3: Overall Framework of OV-DINO. The pre-training of OV-DINO draws on three primary data sources (Detection, Grounding, Image-Text). OV-DINO has three main components: a text encoder, an image encoder, and a language-aware detection decoder. First, we process the text inputs with the Unified Data Integration pipeline to ensure consistent embedding representations across these data sources. Then, the unified prompted text inputs go through a Text Encoder to extract the text embedding, and the original image inputs pass through an Image Encoder and several Encoder Layers to output the multi-scale refined image embedding. Subsequently, we employ Language-Aware Query Selection to select the image embeddings most relevant to the text embedding as the object embedding. The selected object embedding and the learnable content queries go through the Language-Aware Decoder, which fuses the content queries dynamically. Finally, OV-DINO outputs the classification scores by calculating the similarity of the projected query embedding with the text embedding through region-text alignment, and the regressed bounding boxes via an MLP layer.
  • Figure 4: Architecture of the Language-Aware Selective Fusion (LASF). The LASF module consists of two main components: language-aware query selection $\bm{\Phi_{\text{QS}}}$ and language-aware query fusion $\bm{\Phi_{\text{QF}}}$. We illustrate three variants of the LASF module based on the insertion location of the object embedding: (a) Later-LASF, (b) Middle-LASF, and (c) Early-LASF. We also illustrate (d) Typical-CMF, proposed in G-DINO [liu2023gdino], for comparison.
  • Figure 5: Illustration of the Noise in the Image-caption Dataset. The upper figure is the image, and the bottom text is the related caption for each sample. The sample on the left shows a high score of image-text similarity, while the sample on the right shows a lower score.
  • ...and 2 more figures
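
The query selection and region-text alignment steps described in the captions for Figures 3 and 4 can be illustrated with a small, dependency-free sketch. It is not the paper's code. The function names `select_queries` and `alignment_scores` are hypothetical, and two details are assumptions made for illustration: relevance is measured as the maximum dot product between an image embedding and any text embedding, and classification scores are the sigmoid of the dot product. The query fusion step inside the decoder is omitted.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(image_emb: Sequence[Vector], text_emb: Sequence[Vector], k: int) -> List[int]:
    """Return indices of the k image embeddings most related to the text.

    Assumption: relevance is the highest dot product with any text embedding.
    """
    relevance = [max(dot(v, t) for t in text_emb) for v in image_emb]
    order = sorted(range(len(image_emb)), key=lambda i: relevance[i], reverse=True)
    return order[:k]


def alignment_scores(query_emb: Sequence[Vector], text_emb: Sequence[Vector]) -> List[List[float]]:
    """Score every query against every text prompt.

    Assumption: a sigmoid over the dot product stands in for region-text alignment.
    """
    return [[1.0 / (1.0 + math.exp(-dot(q, t))) for t in text_emb] for q in query_emb]


if __name__ == "__main__":
    # Four toy image embeddings and two toy text embeddings (2-D for readability).
    image_emb = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.3], [-1.0, 0.0]]
    text_emb = [[1.0, 0.0], [0.0, 2.0]]

    selected = select_queries(image_emb, text_emb, k=2)
    print(selected)  # [1, 0]

    object_emb = [image_emb[i] for i in selected]
    print(alignment_scores(object_emb, text_emb))
```

In the full model the selected object embeddings would be fused with the learnable content queries before scoring. Here they are scored directly to keep the example short.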