NoOVD: Novel Category Discovery and Embedding for Open-Vocabulary Object Detection

Yupeng Zhang, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

Abstract

Despite remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background: some proposals are prematurely filtered out by the RPN, while others are misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, which significantly reduces recall and ultimately weakens novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.

Paper Structure (16 sections, 10 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Training process for OVD using frozen CLIP. (a) Commonly used training process, (b) Our training process.
  • Figure 2: Illustration of the training process of NoOVD. We use K-FPN to extract pyramid embeddings from frozen CLIP to identify latent novel-category objects. In addition to image-text alignment with $\mathcal{L}_{\text{cons}}$, we align the features of the RoI head with the features from K-FPN via knowledge self-distillation with $\mathcal{L}_{\text{kd}}$.
  • Figure 3: Overview of K-FPN (using a CLIP image encoder with ViT-B/16 as an example).
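
The Figure 2 caption describes a self-distillation term $\mathcal{L}_{\text{kd}}$ that aligns RoI-head features with features from K-FPN. Its exact form is not given in this excerpt, so the following is only a minimal sketch of what such a feature-alignment loss could look like. It assumes a mean L1 distance between L2-normalized embeddings, a common choice in distillation-based OVD. The names `kd_loss`, `student_feats`, and `teacher_feats` are hypothetical and do not come from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def kd_loss(student_feats, teacher_feats):
    """Assumed form of a feature-distillation loss, not the paper's definition.

    student_feats: (num_proposals, dim) embeddings from the trainable RoI head.
    teacher_feats: (num_proposals, dim) embeddings from the frozen teacher
                   (here standing in for K-FPN features). In real training
                   no gradient would flow into the teacher.
    Returns the mean L1 distance between the L2-normalized embeddings.
    """
    s = l2_normalize(np.asarray(student_feats, dtype=np.float64))
    t = l2_normalize(np.asarray(teacher_feats, dtype=np.float64))
    return float(np.abs(s - t).mean())

# Toy usage with random embeddings for 4 proposals of dimension 8.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8))

loss_close = kd_loss(student, teacher)   # small but positive
loss_same = kd_loss(teacher, teacher)    # identical features give zero loss
```

Because both inputs are normalized, the loss in this sketch depends only on the direction of each embedding and not on its magnitude, which matches how CLIP-style embeddings are usually compared.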