
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun

Abstract

Open-Vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. To address these issues, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that tackles both challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.
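The core efficiency idea in the abstract is that class semantics are precomputed offline into a reusable prototype bank, so inference needs no text encoder. The following is a minimal sketch of that pattern, not the paper's actual implementation: all function names, shapes, and the simple per-class averaging are illustrative assumptions.

```python
import numpy as np

def build_prototype_bank(desc_embeddings, labels, num_classes):
    """Offline step (hypothetical sketch): average CLIP-aligned description
    embeddings per class into unit-norm prototypes, done once and reused."""
    dim = desc_embeddings.shape[1]
    bank = np.zeros((num_classes, dim))
    for c in range(num_classes):
        proto = desc_embeddings[labels == c].mean(axis=0)
        bank[c] = proto / np.linalg.norm(proto)  # unit-normalize each prototype
    return bank

def classify_queries(query_feats, bank):
    """Inference step: score projected detector queries against the fixed
    prototype bank by cosine similarity -- no text encoder in the loop."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sims = q @ bank.T  # (num_queries, num_classes) cosine similarities
    return sims.argmax(axis=1), sims
```

Because the bank is just a fixed matrix, extending the vocabulary amounts to appending rows, without retraining or re-running a text encoder per image.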

Paper Structure

This paper contains 32 sections, 27 equations, 2 figures, 8 tables, and 3 algorithms.

Figures (2)

  • Figure 1: Three-staged pipeline of DeCo-DETR. (a) DHCP constructs a hierarchical prototype memory from region-level descriptions via LLaVA generation and CLIP-based filtering, capturing both coarse and fine-grained semantics. (b) Hi-Know DPA projects detector queries into the shared embedding space and enhances them through prototype aggregation for efficient open-set knowledge transfer. (c) PD-DuGi decouples localization and semantic alignment into two optimization streams, mitigating task interference. At inference, the prototype pool provides semantic priors, and the decoupled decoder jointly predicts bounding boxes and category semantics without a text encoder.
  • Figure 2: Qualitative comparison between DeCo-DETR and the baseline. DeCo-DETR shows stronger open-vocabulary generalization by accurately localizing and recognizing novel categories.
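The Figure 1 caption describes Hi-Know DPA as projecting detector queries into the shared embedding space and enhancing them via prototype aggregation. A minimal sketch of one plausible form of such aggregation is shown below, assuming similarity-weighted attention over the prototype pool with a residual update; the function name, temperature, and residual form are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for the attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_prototypes(queries, prototypes, tau=0.07):
    """Hypothetical prototype aggregation: each detector query attends over
    the prototype pool by cosine similarity (sharpened by temperature tau)
    and is enhanced with the aggregated semantic context via a residual add."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    attn = softmax(q @ p.T / tau, axis=1)  # (num_queries, num_prototypes)
    context = attn @ p                     # aggregated semantic context
    return queries + context               # residual enhancement of queries
```

With a small temperature the attention is nearly one-hot, so each query is enriched mainly by its closest prototype, which matches the caption's intent of transferring open-set knowledge into the queries before the decoupled decoder predicts boxes and category semantics.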