CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

Mingyi Guo, Yuyang Liu, Zhiyuan Yan, Zongying Lin, Peixi Peng, Yonghong Tian

TL;DR

This work tackles catastrophic forgetting in incremental object detection (IOD) caused by background shift. It introduces CASA, which builds a class-agnostic shared attribute base from LLM-generated textual attributes on top of a vision-language foundation model. For each task, an assignment mechanism selects the attributes relevant to the current classes, and attributes retained from earlier tasks are frozen. Only attribute usage and embedding alignment are updated across tasks, which preserves performance on old classes while integrating new ones. CASA achieves state-of-the-art results on COCO in both two-phase and multi-phase settings with a 0.7% increase in parameter storage. By sharing attributes across classes through cross-modal representations, it generalizes to evolving class sets and offers an efficient route toward open-world incremental detection.

Abstract

Incremental object detection is fundamentally challenged by catastrophic forgetting. A major factor contributing to this issue is background shift, where background categories in sequential tasks may overlap with either previously learned or future unseen classes. To address this, we propose a novel method called Class-Agnostic Shared Attribute Base (CASA) that encourages the model to learn category-agnostic attributes shared across incremental classes. Our approach leverages an LLM to generate candidate textual attributes, selects the most relevant ones based on the current training data, and records their importance in an assignment matrix. For subsequent tasks, the retained attributes are frozen, and new attributes are selected from the remaining candidates, ensuring both knowledge retention and adaptability. Extensive experiments on the COCO dataset demonstrate the state-of-the-art performance of our method.
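
The select, freeze, and extend loop described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The scoring rule (cosine similarity between per-class mean visual features and attribute text embeddings), the top-k selection, and every name here (`select_attributes`, `incremental_step`, `attr_emb`, `class_feats`, `k`) are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def select_attributes(attr_emb, class_feats, frozen_idx, k):
    """Pick up to k candidate attributes not retained by earlier tasks.

    attr_emb:    (n_attr, d) candidate attribute text embeddings (hypothetical)
    class_feats: (n_cls, d) mean visual feature per current class (hypothetical)
    frozen_idx:  indices of attributes retained from previous tasks
    """
    # Assumed relevance score: cosine similarity between classes and attributes.
    sim = l2_normalize(class_feats) @ l2_normalize(attr_emb).T  # (n_cls, n_attr)
    relevance = sim.max(axis=0)  # best match over the current classes
    # Retained attributes stay frozen, so they are excluded from re-selection.
    relevance[np.asarray(frozen_idx, dtype=int)] = -np.inf
    k = min(k, attr_emb.shape[0] - len(frozen_idx))
    new_idx = np.argsort(-relevance)[:k]
    return [int(i) for i in new_idx], sim


def incremental_step(attr_emb, class_feats, frozen_idx, prev_assign, k):
    """Run one task: extend the attribute list and the assignment matrix."""
    new_idx, sim = select_attributes(attr_emb, class_feats, frozen_idx, k)
    all_idx = list(frozen_idx) + new_idx  # new attributes appended after old ones
    n_old = prev_assign.shape[0]
    assign = np.zeros((n_old + class_feats.shape[0], len(all_idx)))
    # Rows of old classes are copied unchanged (frozen knowledge).
    assign[:n_old, :prev_assign.shape[1]] = prev_assign
    # Current classes may use both retained and newly selected attributes.
    assign[n_old:, :] = sim[:, all_idx]
    return all_idx, assign
```

A first task would call `incremental_step(attr_emb, class_feats, [], np.zeros((0, 0)), k)`; each later task passes the returned index list and assignment matrix back in, so earlier rows and columns are never modified.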

Paper Structure (36 sections, 6 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of our proposed Class-Agnostic Shared Attribute (CASA). We leverage LLMs to generate the shared attribute base $E_a$ and then select the most relevant ones $\hat{E}_a^t$ based on the current training data, documenting their significance in an attribute assignment matrix $A^t$. In subsequent tasks, we retain and freeze these selected attributes, continuing the process by choosing from the remaining candidates and appending them after $\hat{E}_a^{t-1}$, and updating the attribute assignment matrix.
  • Figure 2: Attribute scores in the two-phase setting. The first five classes belong to the initial phase, while the latter five classes are part of the second phase.
  • Figure 3: Impact of the hyperparameters $\lambda_1$ and $\lambda_2$.
  • Figure 4: Visualization results for the 40+40 setting. The red boxes show object classes learned in the previous phase, while the green boxes represent those learned in the current phase.