Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection

Chuhan Zhang, Chaoyang Zhu, Pingcheng Dong, Long Chen, Dong Zhang

TL;DR

This work tackles open-vocabulary object detection by eliminating the need for extra supervision and instead leveraging a cyclic transfer between language-derived priors and VLM-based regional features. It introduces semantic priors injected into language queries and a regional contrastive distillation loss to align detector region embeddings with the VLM visual-semantic space, forming a cross-modal loop through a DETR-like architecture. The approach yields state-of-the-art results on OV-COCO and competitive results on OV-LVIS, with performance steadily improving as the teacher model strengthens, and without incurring inference-time overhead. The findings highlight the effectiveness of harnessing VLMs/MLLMs’ regional structure for robust base-to-novel generalization and offer a data-efficient path for open-vocabulary perception in real-world settings.

Abstract

In pursuit of detecting unrestricted objects that extend beyond predefined categories, prior work on open-vocabulary object detection (OVD) typically relies on pretrained vision-language models (VLMs) for base-to-novel category generalization. However, to mitigate the misalignment between upstream image-text pretraining and downstream region-level perception, existing methods require additional supervision, e.g., image-text pairs or pseudo annotations generated via self-training strategies. In this work, we propose CCKT-Det, which is trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer between language queries and visual region features extracted from VLMs, which forces the detector to closely align with the visual-semantic space of VLMs. Specifically, we 1) prefilter and inject semantic priors to guide the learning of queries, and 2) introduce a regional contrastive loss to improve the awareness of queries on novel objects. CCKT-Det consistently improves in performance as the scale of VLMs increases, while keeping the detector's computational overhead at a moderate level. Comprehensive experimental results demonstrate that our method achieves gains of +2.9% and +10.2% AP50 over the previous state of the art on the challenging COCO benchmark, without and with a stronger teacher model, respectively.
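To make the regional contrastive loss mentioned in the abstract more concrete, the following is a minimal sketch of an InfoNCE-style objective between detector (student) region embeddings and VLM (teacher) region embeddings. The exact formulation used by CCKT-Det is not given in this summary, so the function name `regional_contrastive_loss`, the temperature of 0.07, and the toy embeddings are all assumptions for illustration only.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def regional_contrastive_loss(student, teacher, temperature=0.07):
    """InfoNCE-style loss (an assumed form, not the paper's exact loss).

    student[i] is pulled toward teacher[i] (its positive) and pushed away
    from teacher[j] for j != i (its negatives). Returns the mean loss.
    """
    total = 0.0
    for i, s in enumerate(student):
        logits = [cosine(s, t) / temperature for t in teacher]
        # Numerically stable log-sum-exp over all teacher regions.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(student)

# Hypothetical toy data: three teacher region embeddings and two student sets.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # deliberately misaligned

loss_aligned = regional_contrastive_loss(aligned, teacher)
loss_shuffled = regional_contrastive_loss(shuffled, teacher)
```

In this toy setup, student embeddings that point in the same direction as their paired teacher embeddings yield a much smaller loss than the misaligned set, which is the behavior a distillation objective of this kind is meant to reward.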

Paper Structure

This paper contains 12 sections, 7 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The evolution of extracting novel concepts in OVD models. In contrast to existing methods that use extra captions (a) or pseudo annotations and self-training strategies (b), we propose leveraging semantic priors to reveal novel concepts and employing a contrastive knowledge distillation paradigm (c) to align the enriched teacher space with the region-aware student space.
  • Figure 2: The overall architecture of our CCKT-Det. By querying the existence of object categories within an input image, we dynamically guide object queries to explore novel concepts using semantic priors, which enables awareness of novel categories.
  • Figure 3: Illustration of our contrastive knowledge transfer scheme. We first align semantic-enriched regional embeddings with teacher's visual-semantic space through a contrastive loss. Regional embeddings with the lowest Hungarian matching cost are then considered as positive pairs for distillation, enabling explicit alignment of base objects and implicit learning of novel objects.
  • Figure 4: While most models tend to enhance performance with stronger backbones, typically resulting in increased computational demands, our method achieves competitive results utilizing the default ResNet50 backbone.
  • Figure 5: More visualization results of CCKT-Det++.
  • ...and 2 more figures
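
The caption of Figure 3 states that regional embeddings with the lowest Hungarian matching cost are taken as positive pairs for distillation. The sketch below illustrates that selection step on a small, made-up cost matrix. It uses a brute-force search over assignments as a stand-in for the Hungarian algorithm (both return a minimum-cost one-to-one assignment, but brute force is only practical for tiny inputs); the function name `min_cost_matching` and the cost values are hypothetical.

```python
from itertools import permutations

def min_cost_matching(cost):
    """Return the minimum-cost one-to-one assignment as (query, gt) pairs.

    cost[q][g] is the matching cost between query q and ground-truth
    region g. Assumes there are at least as many queries as regions.
    Brute force stands in for the Hungarian algorithm here.
    """
    num_queries, num_gt = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every way of picking a distinct query for each ground-truth region.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total = total
            best_pairs = [(q, g) for g, q in enumerate(queries)]
    return sorted(best_pairs)

# Hypothetical costs for 4 object queries and 2 annotated base objects.
cost = [
    [0.9, 0.8],
    [0.1, 0.7],
    [0.6, 0.2],
    [0.5, 0.9],
]
pairs = min_cost_matching(cost)  # matched queries become positive pairs
```

Here queries 1 and 2 are matched to regions 0 and 1, so their embeddings would serve as the positives; the unmatched queries 0 and 3 receive no explicit target. In a real detector the matching is usually computed with a dedicated solver such as `scipy.optimize.linear_sum_assignment` rather than by enumeration.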