MR-GDINO: Efficient Open-World Continual Object Detection

Bowen Dong, Zitong Huang, Guanglei Yang, Lei Zhang, Wangmeng Zuo

TL;DR

This work defines Open-World Continual Object Detection (OW-COD) and introduces the OW-COD benchmark, which evaluates detectors on old, new, and unseen categories under few-shot continual updates. It proposes MR-GDINO, a memory-based baseline built on a frozen open-world detector. MR-GDINO keeps two compact memories (a concept memory and a visual-language (VL) interaction memory) in a scalable memory pool and uses a retrieval mechanism to select the best memories for each input. Experiments show that existing continual detectors suffer severe forgetting for unseen categories, while MR-GDINO substantially mitigates forgetting with only about 0.1% extra activated parameters and achieves state-of-the-art performance on old, new, and unseen categories. The approach offers a flexible, scalable, and efficient path toward robust open-world continual detection suitable for real-world deployment.

Abstract

Open-world (OW) recognition and detection models show strong zero- and few-shot adaptation abilities, inspiring their use as initializations in continual learning methods to improve performance. Despite promising results on seen classes, these OW abilities degrade substantially on unseen classes due to catastrophic forgetting. To tackle this challenge, we propose an open-world continual object detection task, which requires detectors to generalize to old, new, and unseen categories in continual learning scenarios. Based on this task, we present a challenging yet practical OW-COD benchmark to assess detection abilities. The goal is to motivate OW detectors to simultaneously preserve learned classes, adapt to new classes, and maintain open-world capabilities under few-shot adaptations. To mitigate forgetting in unseen categories, we propose MR-GDINO, a strong, efficient, and scalable baseline that applies memory and retrieval mechanisms over a highly scalable memory pool. Experimental results show that existing continual detectors suffer from severe forgetting for both seen and unseen categories. In contrast, MR-GDINO largely mitigates forgetting with only 0.1% activated extra parameters, achieving state-of-the-art performance for old, new, and unseen categories.

Paper Structure

This paper contains 25 sections, 5 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: (a) Pretrained open-world (OW) detectors [liu2023grounding] show strong generalization abilities on unseen data but cannot benefit from few-shot annotations. (b) Continual detectors [deng2024zero] built on OW detectors with continual learning show improved mAP on seen data but suffer from forgetting on unseen objects. (c) Our OW continual detector MR-GDINO uses memory and retrieval to improve detection abilities on seen classes while preserving OW abilities on unseen classes.
  • Figure 2: Overview of our proposed MR-GDINO. MR-GDINO is based on a frozen pretrained open-world object detector with explicit visual-language interaction modules (e.g., Grounding DINO [liu2023grounding]). During each training step $t$, MR-GDINO initializes the concept memory $\theta^{t}_{\text{con}}$ and the visual-language interaction memory $\theta^{t}_{\text{inc}}$ from the corresponding parameters of step $t-1$, and optimizes both on the $t$-th training set. After training, $\theta^{t}_{\text{con}}$ and $\theta^{t}_{\text{inc}}$ are stored in the memory pool $\mathbb{B}$. During open-world inference, MR-GDINO uses the global embedding of the input image $\mathbf{I}$ to retrieve the optimal parameters $(\psi^{\text{opt}}, \theta^{\text{opt}}_{\text{con}}, \theta^{\text{opt}}_{\text{inc}})$ and uses them for accurate predictions.
  • Figure 3: Overview of the proposed visual-language interaction memory. Specifically, MR-GDINO adopts LoRA [hu2022lora] modules as $\theta_{\text{inc}}$ in the Q/K/V projections of the VL feature enhancer $f_{\mathbf{VL}}$.
  • Figure 4: Qualitative results of zero-shot Grounding DINO (ZS GDINO) [liu2023grounding], ZiRa [deng2024zero], and MR-GDINO. Compared to ZS GDINO and the state-of-the-art ZiRa, MR-GDINO generates more accurate bounding boxes with higher confidence on both seen and unseen classes.
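
The memory-and-retrieval flow described in the captions of Figures 2 and 3 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The names `MemoryEntry`, `retrieve`, and `lora_project` are hypothetical, and so is the use of cosine similarity against a stored key standing in for $\psi$. The `min_similarity` fallback to the frozen detector is an assumption about how open-world behavior might be preserved, and its default value is arbitrary. The LoRA update itself follows the standard low-rank form $Wx + s \cdot B(Ax)$.

```python
import math
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


@dataclass
class MemoryEntry:
    """One slot of the memory pool, written after training step t."""
    step: int
    key: List[float]  # stands in for psi; assumed to be an image-level embedding
    theta_con: Any    # placeholder for the concept memory parameters
    theta_inc: Any    # placeholder for the VL interaction memory (e.g. LoRA weights)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    pool: Sequence[MemoryEntry],
    image_embedding: Vector,
    min_similarity: float = 0.5,  # assumed threshold, not taken from the paper
) -> Optional[MemoryEntry]:
    """Return the most similar entry, or None to fall back to the frozen detector."""
    best: Optional[MemoryEntry] = None
    best_sim = -1.0
    for entry in pool:
        sim = cosine(entry.key, image_embedding)
        if sim > best_sim:
            best, best_sim = entry, sim
    if best is None or best_sim < min_similarity:
        return None
    return best


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_project(w: Matrix, a: Matrix, b: Matrix, x: Vector, scale: float = 1.0) -> List[float]:
    """Frozen projection W x plus the low-rank LoRA update scale * B (A x)."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [p + scale * q for p, q in zip(base, update)]


# Toy usage: two stored steps with orthogonal keys.
pool = [
    MemoryEntry(step=1, key=[1.0, 0.0], theta_con="con-1", theta_inc="inc-1"),
    MemoryEntry(step=2, key=[0.0, 1.0], theta_con="con-2", theta_inc="inc-2"),
]
hit = retrieve(pool, [0.9, 0.1])     # closest to the step-1 key
miss = retrieve(pool, [-1.0, -1.0])  # dissimilar to every key, so no memory is used
```

In this sketch a `None` result means the input is treated as unseen and the frozen pretrained weights are used unchanged. Only the retrieved entry's parameters would be active for a given image.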