Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

Zhao Wang, Aoxue Li, Fengwei Zhou, Zhenguo Li, Qi Dou

TL;DR

This paper tackles open-vocabulary object detection by addressing two problems: overfitting to base classes and unreliable matching between proposals and class embeddings. It introduces MIC, which combines two components: meta prompt learning, which simulates novel-class emergence to learn robust foreground and background prompts, and an instance contrastive learning objective that uses a class-balanced memory bank. MIC achieves state-of-the-art results on LVIS without knowledge distillation, ensemble models, or extra training data, and transfers well to COCO and Objects365. The approach offers a data-efficient path toward open-vocabulary detection with improved discriminability among visually similar classes.
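To make the "class-balanced memory bank" idea concrete, here is a minimal sketch under stated assumptions. The summary does not say how the bank is implemented, so the per-class FIFO queue, the class name `ClassBalancedMemoryBank`, and the `per_class_capacity` parameter are all hypothetical. The point it illustrates is only that a per-class cap stops frequent classes from crowding rare ones out of the bank.

```python
from collections import defaultdict, deque


class ClassBalancedMemoryBank:
    """Hypothetical sketch: one fixed-size FIFO queue of features per class."""

    def __init__(self, per_class_capacity=4):
        # Assumption: every class gets the same capacity, which is what
        # keeps the bank balanced on a long-tailed dataset.
        self.per_class_capacity = per_class_capacity
        self._queues = defaultdict(lambda: deque(maxlen=per_class_capacity))

    def add(self, feature, label):
        # Appending to a full deque silently drops the oldest feature.
        self._queues[label].append(feature)

    def count(self, label):
        # Membership test first, so that querying does not create a queue.
        return len(self._queues[label]) if label in self._queues else 0

    def items(self):
        # Flatten to (feature, label) pairs for use in a contrastive loss.
        return [(f, y) for y, q in self._queues.items() for f in q]
```

With a capacity of 4, adding ten "dog" features and one "puffin" feature leaves four "dog" entries and one "puffin" entry, rather than a bank dominated ten to one by the frequent class.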

Abstract

Classical object detectors cannot detect objects of novel classes that were not encountered during training. Open-Vocabulary Object Detection (OVOD) was proposed to address this issue; it aims to detect the objects in a candidate class list. However, current OVOD models suffer from overfitting to the base classes, heavy reliance on large-scale extra data, and complex training processes. To overcome these issues, we propose a novel framework with Meta prompt and Instance Contrastive learning (MIC) schemes. First, we simulate a novel-class-emerging scenario to help the prompt learner, which learns class and background prompts, generalize to novel classes. Second, we design an instance-level contrastive strategy to promote intra-class compactness and inter-class separation, which benefits the generalization of the detector to novel-class objects. Without using knowledge distillation, ensemble models, or extra training data during detector training, our proposed MIC outperforms previous SOTA methods trained with these complex techniques on LVIS. Most importantly, MIC shows strong generalization ability on novel classes, e.g., with $+4.3\%$ and $+1.9\% \ \mathrm{AP}$ improvement compared with previous SOTA on COCO and Objects365, respectively.
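The instance-level contrastive strategy can be illustrated with a small sketch. The abstract does not give the loss, so this assumes a supervised, InfoNCE-style formulation over cosine similarities; the function names, the `temperature` value, and the list-of-pairs format for stored instances are illustrative assumptions, not the authors' implementation. It shows how pulling a proposal toward same-class instances and away from other classes yields intra-class compactness and inter-class separation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def instance_contrastive_loss(query, query_label, stored, temperature=0.2):
    """Assumed supervised contrastive loss for one proposal feature.

    stored: list of (feature, label) pairs to contrast against.
    Returns 0.0 when no stored instance shares the query's label.
    """
    logits = [cosine(query, feat) / temperature for feat, _ in stored]
    positives = [l for l, (_, y) in zip(logits, stored) if y == query_label]
    if not positives:
        return 0.0
    # Log-sum-exp over all stored instances, shifted for numerical stability.
    peak = max(logits)
    log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Average negative log-probability of picking a same-class instance.
    return -sum(p - log_denom for p in positives) / len(positives)
```

In this sketch, a proposal feature that lies close to stored instances of its own class receives a small loss, while the same feature labeled as a different, distant class receives a much larger one.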

Paper Structure

This paper contains 42 sections, 6 equations, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: (a) In OVOD, the detector aims to detect any object within an object vocabulary in an input image. Previous methods, e.g., DetPro, can easily misclassify highly similar classes (puffin vs. bird). Our method improves the model's generalization ability, making it more discriminative between such similar categories. Note that every point indicates a category in the latent space; (b) The error rate of predicting novel objects as base ones.
  • Figure 2: Overview of our proposed method. The training stage is divided into two consecutive parts: i) meta prompt learning and ii) detector training. During i) meta prompt learning, to simulate a novel-class-emerging scenario, we sample a batch-wise varying object vocabulary $\mathcal{C}_S$ from $\mathcal{C}_B$, which improves the generalization ability of the learned foreground prompt. We also integrate a learnable background prompt to help the model distinguish foreground and background proposals. Further, in ii) detector training, we introduce an instance-level contrastive learning scheme to promote intra-class compactness and inter-class separation. During the iii) inference stage, we use the learned foreground prompt representation to generate class embeddings for novel classes.
  • Figure 3: Sampling strategy in MPL. We study the effect of sampled classes.
  • Figure 4: t-SNE visualization of class embeddings of LVIS. We randomly sample 200 novel and base classes from LVIS and use t-SNE to visualize the class embeddings.
  • Figure 5: Qualitative detection results of our proposed method MIC and DetPro. Our method better distinguishes similar classes, detects smaller objects, and produces fewer false positives in diverse, complex scenarios.
  • ...and 3 more figures
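
The batch-wise vocabulary sampling described in the Figure 2 caption can be sketched as follows. This is a rough illustration only: the caption says a vocabulary $\mathcal{C}_S$ is sampled from $\mathcal{C}_B$ per batch, but not how. The sketch assumes that classes annotated in the current batch are always kept and the remaining slots are filled uniformly at random; the function name and the `num_sampled` parameter are hypothetical.

```python
import random


def sample_vocabulary(base_classes, batch_classes, num_sampled, rng):
    """Assumed sampler: draw a per-batch vocabulary C_S from base classes C_B.

    Classes present in the batch are always included (an assumption), so
    every ground-truth box still has a matching class in the vocabulary.
    """
    required = sorted(set(batch_classes))
    missing = set(required) - set(base_classes)
    if missing:
        raise ValueError(f"batch classes not in base classes: {sorted(missing)}")
    # Candidates for the remaining slots: base classes absent from the batch.
    pool = [c for c in base_classes if c not in set(required)]
    num_extra = min(len(pool), max(0, num_sampled - len(required)))
    vocabulary = required + rng.sample(pool, num_extra)
    # Shuffle so class order carries no information across batches.
    rng.shuffle(vocabulary)
    return vocabulary
```

Because the sampled vocabulary changes from batch to batch, classes left out of one batch's vocabulary play the role of unseen classes for that step, which is one way to read the "novel-class-emerging scenario" in the caption.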