OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision

Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian, Huifeng Chang, Yong Xu

TL;DR

Open-vocabulary detection faces a persistent confidence bias toward base categories, undermining novel-object detection. OV-DQUO proposes a unified DETR-based framework that (i) surfaces unknown objects via open-world pseudo-labeling, (ii) uses wildcard text embeddings to supervise unknown proposals, (iii) employs denoising text query training to distinguish novel objects from background, and (iv) balances proposal recall with Region of Query Interests. The approach achieves state-of-the-art results on OV-COCO and OV-LVIS benchmarks and demonstrates strong cross-dataset generalization, all without extra training data. Collectively, these contributions offer a practical pathway to robust open-world detection by tightly integrating open-world supervision, flexible text guidance, and discriminative training strategies.

Abstract

Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an Open-Vocabulary DETR with Denoising text Query training and open-world Unknown Objects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box pairs from open-world unknown objects to train the detector through contrastive learning, enhancing its ability to distinguish novel objects from the background. We conducted extensive experiments on the challenging OV-COCO and OV-LVIS benchmarks, achieving new state-of-the-art results of 45.6 AP50 and 39.3 mAP on novel categories, respectively, without the need for additional training data. Models and code are released at https://github.com/xiaomoguhz/OV-DQUO
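The abstract does not say how the foreground and background query-box pairs are synthesized. The sketch below is one plausible reading, borrowed from the contrastive denoising idea used in DETR-style detectors: each unknown-object box is perturbed twice, lightly for a foreground pair and heavily for a background pair, and both are paired with a wildcard text query. The names `WILDCARD`, `jitter_box`, `make_denoising_pairs`, and the noise thresholds `fg_noise` and `bg_noise` are all hypothetical and are not taken from the paper or its released code.

```python
import random

# Hypothetical wildcard text query; the actual prompt is an assumption.
WILDCARD = "object"


def jitter_box(box, lo, hi, rng):
    """Perturb a (cx, cy, w, h) box; each noise term has magnitude in [lo, hi]."""
    cx, cy, w, h = box

    def noise():
        # Random sign, magnitude drawn uniformly from [lo, hi].
        return rng.choice((-1.0, 1.0)) * rng.uniform(lo, hi)

    new_cx = cx + noise() * w  # shift the center by a fraction of the box size
    new_cy = cy + noise() * h
    new_w = max(w * (1.0 + noise()), 1e-6)  # rescale, keeping the size positive
    new_h = max(h * (1.0 + noise()), 1e-6)
    return (new_cx, new_cy, new_w, new_h)


def make_denoising_pairs(unknown_boxes, fg_noise=0.4, bg_noise=1.0, seed=0):
    """Build one foreground and one background query-box pair per unknown box.

    Assumption: light noise (< fg_noise) keeps the box on the object (target 1),
    heavier noise (fg_noise..bg_noise) pushes it toward background (target 0).
    """
    rng = random.Random(seed)
    pairs = []
    for box in unknown_boxes:
        pairs.append({"query": WILDCARD,
                      "box": jitter_box(box, 0.0, fg_noise, rng),
                      "target": 1})
        pairs.append({"query": WILDCARD,
                      "box": jitter_box(box, fg_noise, bg_noise, rng),
                      "target": 0})
    return pairs


if __name__ == "__main__":
    for pair in make_denoising_pairs([(0.5, 0.5, 0.2, 0.2)]):
        print(pair)
```

In a full training loop, the detector would be asked to score each pair, and a contrastive or classification loss would pull foreground pairs toward high confidence and background pairs toward low confidence; that loss is omitted here because the abstract does not specify it.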

Paper Structure (15 sections, 16 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: (a) Detector confidence bias is a primary reason for suboptimal detection performance on novel categories. (b) Existing pseudo-labeling methods mainly focus on establishing region-text alignment from external caption datasets while ignoring the confidence bias. (c) Instead, this work directly tackles the confidence bias by using the open-world detector to discover novel unknown objects during training and learning to match them with wildcard text embeddings.
  • Figure 2: Overview of OV-DQUO. (a) Open-world pseudo labeling pipeline, which iteratively trains the detector, generates unknown object proposals, estimates foreground probabilities, and updates the training set. (b) Denoising text query training, which enables contrastive learning with synthetic noisy query-box pairs from open-world unknown objects. (c) RoQIs selection module, which takes into account both objectness and region-text similarity for selecting regions of interest.
  • Figure 3: Visualization of confidence score distributions.
  • Figure 4: T-SNE visualization of embedding distributions.
  • Figure 5: Visualization of the predicted confidence scores for the baseline detector and OV-DQUO across each novel category in the OV-COCO benchmark.
  • ...and 4 more figures
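
The Figure 2(c) caption says the RoQIs selection module considers both objectness and region-text similarity, but not how the two are combined. As a rough illustration only, the sketch below fuses them with a weighted geometric mean and keeps the top-k regions. The fusion rule, the weight `beta`, the rescaling of cosine similarity to [0, 1], and the function names `cosine` and `select_roqis` are assumptions made for this example, not details confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_roqis(objectness, region_embs, text_embs, k, beta=0.5):
    """Return indices of the k regions with the highest fused score.

    Assumed fusion: objectness ** (1 - beta) * similarity ** beta, where
    similarity is the best cosine match against any text embedding,
    rescaled from [-1, 1] to [0, 1].
    """
    scores = []
    for obj, emb in zip(objectness, region_embs):
        best_sim = max(cosine(emb, text) for text in text_embs)
        sim01 = (best_sim + 1.0) / 2.0
        scores.append((obj ** (1.0 - beta)) * (sim01 ** beta))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    objectness = [0.9, 0.3]
    region_embs = [[0.0, 1.0], [1.0, 0.0]]  # toy 2-D region embeddings
    text_embs = [[1.0, 0.0]]                # toy text embedding
    print(select_roqis(objectness, region_embs, text_embs, k=1, beta=0.5))
    print(select_roqis(objectness, region_embs, text_embs, k=1, beta=0.9))
```

With a small `beta` the class-agnostic objectness dominates the ranking; with a large `beta` the region that best matches the text query wins even when its objectness is lower, which is the kind of trade-off the caption describes.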