MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

Kuo Wang, Lechao Cheng, Weikai Chen, Pingping Zhang, Liang Lin, Fan Zhou, Guanbin Li

TL;DR

MarvelOVD tackles the noise in Vision-Language Model (VLM)–generated pseudo-labels for open-vocabulary object detection by tightly integrating an object detector as contextual guidance. It introduces online pseudo-label mining, stratified label assignment, and adaptive proposal reweighting to refine training targets and balance learning between base and novel categories. The approach achieves substantial gains on COCO and LVIS over prior pseudo-label–based methods, demonstrating robust novel-object recognition while preserving base-category performance. This detector–VLM collaboration enables scalable open-vocabulary detection without requiring additional data or supervision, with practical implications for real-world recognition of unseen objects.

Abstract

Learning from pseudo-labels generated with Vision-Language Models (VLMs) has been shown in recent studies to be a promising way to assist open-vocabulary detection (OVD). However, due to the domain gap between VLMs and vision-detection tasks, pseudo-labels produced by VLMs are prone to noise, and the training design of the detector further amplifies the bias. In this work, we investigate the root cause of VLMs' biased predictions in the OVD context. Our observations lead to a simple yet effective paradigm, dubbed MarvelOVD, that generates significantly better training targets and optimizes the learning procedure in an online manner by marrying the capability of the detector with the vision-language model. Our key insight is that the detector itself can act as strong auxiliary guidance to compensate for the VLM's inability to understand both the "background" and the context of a proposal within the image. Building on this insight, we greatly purify the noisy pseudo-labels via Online Mining and propose Adaptive Reweighting to effectively suppress biased training boxes that are not well aligned with the target object. In addition, we identify a neglected "base-novel-conflict" problem and introduce stratified label assignment to prevent it. Extensive experiments on the COCO and LVIS datasets demonstrate that our method outperforms other state-of-the-art methods by significant margins. Code is available at https://github.com/wkfdb/MarvelOVD

Paper Structure (35 sections, 7 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Improvements achieved by incorporating the detector into pseudo-label generation and the subsequent learning phase. (a) The distribution of pseudo-labels generated by CLIP and by our method. "Mis-class" means boxes labeled with the wrong category, and "noise" indicates boxes that should not be considered pseudo-labels. The VLM (CLIP) has a low "mis-class" rate but fails to distinguish noisy boxes. Our method filters out the noise by drawing on the characteristics of the detector, and hence significantly improves the quality of the pseudo-labels. (b) The red box indicates the pseudo-label and the blue boxes represent the matched training boxes. Adaptive proposal reweighting computes an independent weight for each box from the detector's prediction and the pseudo-label's confidence, leading the training to focus on more reliable instances (e.g., the lower-right training box).
  • Figure 2: The framework of our method, which improves the quality of pseudo-labels while optimizing the subsequent learning process by dynamically incorporating the detector during training. We first assign candidate boxes to the images with CLIP and a proposal generator. Then we select noisy pseudo-labels according to the CLIP scores to burn in the detector. After burn-in, the detector has an initial capacity to recognize novel concepts. Based on this, we dynamically estimate the novelty of each candidate box and combine it with the corresponding CLIP prediction to select precise pseudo-labels. We adopt stratified label assignment to generate training boxes, while the loss weights for the novel training boxes are computed independently based on the detector's prediction.
  • Figure 3: Visualization of the quality of our dynamically generated pseudo-labels, with red dashed lines indicating the quality of the original CLIP-based pseudo-labels.
  • Figure 4: Effects of different values of the dependence controllers $\lambda$ and $\lambda^\prime$.
  • Figure 5: Visualization of stratified label assignment and online object mining.
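
The captions for Figures 1, 2, and 4 describe combining the detector's novelty estimate with the CLIP score to select pseudo-labels, and computing per-box loss weights from the detector's prediction and the pseudo-label confidence. The sketch below illustrates that idea only. The exact formulas are not given in this excerpt, so the weighted geometric mean, the threshold, the `Candidate` type, and all function names are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical container for one candidate box (not from the paper)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    clip_score: float  # CLIP confidence for the top novel category, in [0, 1]
    novelty: float     # detector's estimate that the box holds a novel object, in [0, 1]


def fused_score(c: Candidate, lam: float) -> float:
    # ASSUMPTION: a weighted geometric mean stands in for the paper's
    # combination rule. lam = 0 relies on CLIP alone, lam = 1 on the detector alone.
    return (c.clip_score ** (1.0 - lam)) * (c.novelty ** lam)


def mine_pseudo_labels(cands: List[Candidate], lam: float = 0.5,
                       threshold: float = 0.6) -> List[Candidate]:
    # Keep only candidates whose fused score clears an (assumed) threshold.
    return [c for c in cands if fused_score(c, lam) >= threshold]


def proposal_weight(pseudo_conf: float, det_pred: float, lam_prime: float) -> float:
    # ASSUMPTION: the same fusion form, applied to a training box matched to a
    # pseudo-label, yields that box's independent loss weight.
    return (pseudo_conf ** (1.0 - lam_prime)) * (det_pred ** lam_prime)


if __name__ == "__main__":
    candidates = [
        Candidate((0, 0, 10, 10), clip_score=0.9, novelty=0.1),  # likely background noise
        Candidate((5, 5, 50, 60), clip_score=0.8, novelty=0.8),  # plausible novel object
    ]
    kept = mine_pseudo_labels(candidates, lam=0.5)
    print([c.box for c in kept])
    print(proposal_weight(pseudo_conf=0.8, det_pred=0.5, lam_prime=0.5))
```

With `lam = 0` the selection reduces to thresholding the CLIP score, which corresponds to the burn-in stage described in the Figure 2 caption. Larger values let the detector veto boxes that CLIP scores highly but that look like background.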