CP-DETR: Concept Prompt Guide DETR Toward Stronger Universal Object Detection

Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen

TL;DR

CP-DETR tackles universal object detection by integrating language-anchored prompts with a DETR-based detector through a prompt visual hybrid encoder. It introduces progressive single-scale fusion and multi-scale fusion gating to enable efficient cross-modal interaction, aided by auxiliary supervision and a prompt multi-label loss. The model supports text prompts, visual prompts, and optimized prompts via a unified concept prompt generator, achieving strong zero-shot and competitive full-shot performance across LVIS, COCO, and ODinW benchmarks using only publicly available data. This approach reduces alignment bias and demonstrates practical impact by delivering high accuracy with a single pre-training weight and enabling interactive visual-prompt guidance for labeling workflows.

Abstract

Recent research on universal object detection aims to introduce language into a state-of-the-art (SoTA) closed-set detector and then generalize to open-set concepts by constructing large-scale text-region datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to generalize to objects, and (ii) how to reduce alignment bias in downstream tasks. Both lead to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios with only one pre-training weight. Specifically, we design an efficient prompt-visual hybrid encoder that enhances the information interaction between prompt and visual features through scale-by-scale and multi-scale fusion modules. The hybrid encoder is then encouraged to fully utilize the prompt information through a prompt multi-label loss and an auxiliary detection head. In addition to text prompts, we design two practical concept prompt generation methods, visual prompt and optimized prompt, which extract abstract concepts from concrete visual examples and stably reduce alignment bias in downstream tasks. With these designs, CP-DETR demonstrates superior universal detection performance across a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val through interactive detection, and the optimized prompt achieves 73.1 full-shot AP on ODinW13.
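
The scale-by-scale fusion idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain dot-product cross-attention without learned projections and a single sigmoid gate controlled by a hypothetical `gate_logit` parameter, and all function names are made up for illustration. It only shows how visual tokens at each scale could receive prompt context through a gated residual update.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, keys_values):
    """Each query attends over keys_values, used as both keys and values."""
    dim = len(queries[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        out.append([sum(w * v[i] for w, v in zip(weights, keys_values))
                    for i in range(dim)])
    return out

def gated_fusion(visual_tokens, prompt_tokens, gate_logit=0.0):
    """Residual update: visual + sigmoid(gate_logit) * prompt context.

    Assumption: one scalar gate per call; the paper's gating may differ.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    context = cross_attend(visual_tokens, prompt_tokens)
    return [[v + gate * c for v, c in zip(vt, ct)]
            for vt, ct in zip(visual_tokens, context)]

def fuse_scales(feature_pyramid, prompt_tokens, gate_logit=0.0):
    """Apply the gated fusion to every scale, one scale at a time."""
    return [gated_fusion(tokens, prompt_tokens, gate_logit)
            for tokens in feature_pyramid]

# Toy pyramid: a finer scale with 3 tokens and a coarser scale with 1 token.
pyramid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5]],
]
prompts = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_scales(pyramid, prompts)
```

Each scale keeps its token count and channel width after fusion. A strongly negative `gate_logit` drives the gate toward zero, so the visual tokens pass through almost unchanged.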

Paper Structure

This paper contains 34 sections, 8 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overall architecture of CP-DETR. First, the concept prompt generator (shown in the green dashed box) encodes textual descriptions, referring boxes, or annotations as concept prompts. Then, the detector encodes the image as multi-scale feature maps and performs cross-modal fusion of concepts and images using the proposed hybrid encoder (shown in the red dashed box). Finally, the transformer decoder predicts the results.
  • Figure 2: The overall architecture of the visual prompt encoder. Coordinates of 2D boxes are encoded as query and query position vectors, and the concept prompt is aggregated from image features via three layers of deformable cross-attention.
  • Figure 3: Ablation results for the super-class representation length of optimized prompt in CP-DETR-T.
  • Figure 4: Visualizations of CP-DETR-L zero-shot outputs.
  • Figure 5: Visualizations of CP-DETR-L visual prompt outputs. Row 1 uses boxes of one class as inputs. Row 2 uses boxes of two classes as inputs. Row 3 uses boxes of one class together with the text "person.tree" as inputs.
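
The visual prompt encoder described in the Figure 2 caption can be approximated with a simplified sketch. Note the substitution: instead of three layers of deformable cross-attention, this sketch averages the feature cells that fall inside each example box, and the sinusoidal box embedding is one common choice rather than the paper's confirmed formulation. The function names, the `(cx, cy, w, h)` normalized box format, and `num_freqs` are assumptions made for illustration.

```python
import math

def sine_embed(value, num_freqs=2):
    """Sinusoidal embedding of one normalized coordinate in [0, 1]."""
    out = []
    for k in range(num_freqs):
        freq = (2 ** k) * math.pi
        out.extend([math.sin(freq * value), math.cos(freq * value)])
    return out

def box_position_vector(box, num_freqs=2):
    """Concatenate embeddings of (cx, cy, w, h) into one position vector."""
    vec = []
    for coord in box:
        vec.extend(sine_embed(coord, num_freqs))
    return vec

def pool_box_features(feature_map, box):
    """Average feature vectors of cells whose centers lie inside the box.

    Stand-in for deformable cross-attention (an assumption of this sketch).
    feature_map: H x W x C nested lists; box: normalized (cx, cy, w, h).
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    cx, cy, w, h = box
    x0, x1, y0, y1 = cx - w / 2, cx + w / 2, cy - h / 2, cy + h / 2
    total, count = [0.0] * channels, 0
    for row in range(height):
        for col in range(width):
            px, py = (col + 0.5) / width, (row + 0.5) / height
            if x0 <= px <= x1 and y0 <= py <= y1:
                for c in range(channels):
                    total[c] += feature_map[row][col][c]
                count += 1
    if count == 0:
        raise ValueError("box covers no feature cell")
    return [t / count for t in total]

def concept_prompt(feature_map, boxes):
    """Average the pooled features over all example boxes of one class."""
    pooled = [pool_box_features(feature_map, b) for b in boxes]
    return [sum(col) / len(pooled) for col in zip(*pooled)]

# Toy 2 x 2 feature map with one channel and two example boxes of one class.
features = [[[1.0], [2.0]], [[3.0], [4.0]]]
boxes = [(0.25, 0.25, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
prompt = concept_prompt(features, boxes)
position = box_position_vector(boxes[0])
```

Here `prompt` plays the role of the aggregated concept prompt for one class, and `position` corresponds to the query position vector derived from a box. In the sketch the position vector is computed separately and does not influence the pooling.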