Dynamic Object Queries for Transformer-based Incremental Object Detection

Jichuan Zhang, Wei Li, Shuang Cheng, Ya-Li Li, Shengjin Wang

TL;DR

This work addresses catastrophic forgetting in incremental object detection (IOD) with the Dynamic object Query-based DEtection TRansformer (DyQ-DETR). The method adds a new group of object queries for the classes of each phase, uses isolated bipartite matching and disentangled self-attention to decouple old and new knowledge, and applies risk-balanced partial calibration during exemplar replay to handle incomplete labels. Key contributions are dynamic query integration with phase-wise losses, an efficient decoder design, and a risk-aware exemplar strategy. On COCO 2017, the authors report that DyQ-DETR significantly surpasses state-of-the-art IOD methods with limited parameter overhead. The approach offers a scalable route to the stability-plasticity trade-off in continual visual learning, with potential applications in robotics, autonomous systems, and open-world detection.

Abstract

Incremental object detection (IOD) aims to learn new classes sequentially while maintaining the capability to locate and identify old ones. Because the training data in each phase arrives with annotations only for the new classes, IOD suffers from catastrophic forgetting. Prior methods mainly tackle forgetting through knowledge distillation and exemplar replay, ignoring the conflict between limited model capacity and increasing knowledge. In this paper, we explore dynamic object queries for incremental object detection built on the Transformer architecture. We propose the Dynamic object Query-based DEtection TRansformer (DyQ-DETR), which incrementally expands the model's representation ability to achieve a stability-plasticity trade-off. First, a new set of learnable object queries is fed into the decoder to represent new classes. These new object queries are aggregated with those from previous phases so that the model adapts to both old and new knowledge. Second, we propose isolated bipartite matching for object queries in different phases, based on disentangled self-attention. The interaction among object queries from different phases is eliminated to reduce inter-class confusion. Thanks to the separate supervision and computation over object queries, we further present risk-balanced partial calibration for effective exemplar replay. Extensive experiments demonstrate that DyQ-DETR significantly surpasses state-of-the-art methods with limited parameter overhead. Code will be made publicly available.
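
As an illustration of the disentangled self-attention described in the abstract, the following minimal sketch builds a block-diagonal mask in which a query may attend only to queries from its own phase. It is not the authors' implementation: the function name `disentangled_attention_mask`, the group-by-group query layout, and the toy group sizes are assumptions made for this example.

```python
from typing import List


def disentangled_attention_mask(group_sizes: List[int]) -> List[List[bool]]:
    """Return an N x N mask where True means "query i may attend to query j".

    Assumption: queries are stored group by group (phase 1 first), and a query
    interacts only with queries of its own phase group.
    """
    # Map every query index to the phase group it belongs to.
    group_of = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    n = len(group_of)
    # Attention is allowed only when both queries share the same group.
    return [[group_of[i] == group_of[j] for j in range(n)] for i in range(n)]


# Toy example: 2 queries from phase 1 and 3 queries from phase 2.
mask = disentangled_attention_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal: the first two queries see only each other, and the last three see only each other. In an attention layer, the blocked positions would typically be excluded before the softmax.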

Paper Structure

This paper contains 21 sections, 5 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of DyQ-DETR. DyQ-DETR builds on the Detection Transformer: a new set of queries is assigned to the newly arriving classes at each step, and each group of queries is responsible for detecting the classes annotated in its corresponding step. $Q_t$ denotes the query group in step $t$. SA and CA refer to the self-attention and cross-attention modules, respectively.
  • Figure 2: Overview of the proposed DyQ-DETR. The dynamic object queries serve as the input to the Transformer decoder in incremental learning. At each incremental time step $t$, for an image $x \in D_t$, the training loss is computed independently for each query group. The total loss is the weighted sum of the knowledge distillation losses $\mathcal{L}_i^{DETR}$ ($1 \leq i < t$), computed from pseudo labels, and the standard DETR loss $\mathcal{L}_t^{DETR}$, computed from ground-truth labels.
  • Figure 3: Illustration of risk-balanced exemplar selection. Samples with moderate risk scores (the middle part of the distribution) are chosen as exemplars.
  • Figure 4: IOD results ($AP/AP_{50}, \%$) in the multi-phase 40+20×2 and 40+10×4 settings. Results for all other methods are taken from liu2023continual.
  • Figure 5: Comparison of parameter overhead (left) and complexity overhead (right) when 100 queries are added at each step. DSA denotes disentangled self-attention.
  • ...and 5 more figures
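
The selection rule in the Figure 3 caption (keep the samples with moderate risk scores) can be sketched as follows. This is a hedged illustration only: how the risk score is computed is not given in this excerpt, so the scores are treated as precomputed inputs, and the function name `select_moderate_risk`, the image identifiers, and the centred-window rule are assumptions.

```python
from typing import Dict, List


def select_moderate_risk(risk: Dict[str, float], k: int) -> List[str]:
    """Pick k sample ids from the middle of the risk-score ranking.

    Assumption: `risk` maps a sample id to a precomputed risk score; samples
    with the lowest and the highest scores are skipped.
    """
    # Rank sample ids from lowest to highest risk.
    ranked = sorted(risk, key=risk.get)
    k = min(k, len(ranked))
    # Centre a window of size k on the middle of the ranking.
    start = (len(ranked) - k) // 2
    return ranked[start:start + k]


# Toy scores for five hypothetical images.
scores = {"img_a": 0.1, "img_b": 0.9, "img_c": 0.5, "img_d": 0.3, "img_e": 0.7}
print(select_moderate_risk(scores, 3))  # ['img_d', 'img_c', 'img_e']
```

With five candidates and a budget of three, the lowest-risk and highest-risk images are dropped and the three in between are kept.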