Incremental Object Detection with CLIP

Ziyue Huang; Yupeng He; Qingjie Liu; Yunhong Wang

TL;DR

The paper addresses forward compatibility in incremental object detection, where images contain current-stage classes $C_i$, base classes $C_{base}$, and potential future novel classes $C_{novel}$, causing data ambiguity. It proposes Incremental Object Detection with CLIP (IODC), which uses CLIP to build a growable language space for base, broad, and novel classes and aligns visual features with text embeddings via a linear classifier, while simultaneously identifying unknown objects with the CLIP image encoder to generate pseudo-annotations. The method introduces three key components: text feature alignment, broad-class knowledge transfer via category mapping, and CLIP-based unknown object detection, all designed to improve learning of novel classes in early stages. Empirically, IODC outperforms state-of-the-art incremental detection methods on PASCAL VOC 2007 across multiple two-stage settings, particularly enhancing performance on novel classes, and reduces the dependency on additional datasets by exploiting CLIP’s open-vocabulary capabilities.

Abstract

In contrast to the incremental classification task, the incremental detection task is characterized by the presence of data ambiguity, as an image may have differently labeled bounding boxes across multiple continuous learning stages. This phenomenon often impairs the model's ability to effectively learn new classes. However, existing research has paid less attention to the forward compatibility of the model, which limits its suitability for incremental learning. To overcome this obstacle, we propose leveraging a visual-language model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ super-classes to replace the unavailable novel classes in the early learning stage to simulate the incremental scenario. Finally, we utilize the CLIP image encoder to accurately identify potential objects. We incorporate the finely recognized detection boxes as pseudo-annotations into the training process, thereby further improving the detection performance. We evaluate our approach on various incremental learning settings using the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for recognizing the new classes.
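The text-embedding idea in the abstract can be illustrated with a small sketch. It assumes that classification logits are cosine similarities between a linearly projected region feature and one text embedding per class, and that moving from $T_1$ to $T_2$ swaps the broad-class embeddings for novel-class embeddings. The class names, the toy 3-dimensional vectors standing in for CLIP text features, the projection `W`, and the `temperature` value are all hypothetical; this is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(W, x):
    """Linear map from visual space to text space (W has one row per text dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def classify(region_feature, W, text_embeddings, temperature=0.01):
    """Return one cosine-similarity logit per class text embedding."""
    z = l2_normalize(project(W, region_feature))
    logits = []
    for t in text_embeddings:
        t = l2_normalize(t)
        logits.append(sum(a * b for a, b in zip(z, t)) / temperature)
    return logits


# Toy stand-ins for CLIP text features (assumption: real ones come from the
# CLIP text encoder and have a much higher dimension).
base = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
broad = {"animal": [0.0, 0.0, 1.0]}   # placeholder for unseen classes in T1
novel = {"sheep": [0.6, 0.0, 0.8]}    # becomes available in T2

# Hypothetical learned projection from a 2-d visual feature to the 3-d text space.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
region_feature = [2.0, 0.1]

# Stage T1: base + broad classes.
names_t1 = list(base) + list(broad)
logits_t1 = classify(region_feature, W, [*base.values(), *broad.values()])
pred_t1 = names_t1[max(range(len(logits_t1)), key=logits_t1.__getitem__)]

# Stage T2: base + novel classes; the base-class logits are unchanged because
# their text embeddings and the projection are reused as-is.
names_t2 = list(base) + list(novel)
logits_t2 = classify(region_feature, W, [*base.values(), *novel.values()])
```

Because the classifier is just a set of text embeddings, changing the label space only changes which rows are compared against; no classifier weights need to be re-initialised in this sketch.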

Paper Structure (16 sections, 2 figures, 3 tables)

Introduction
Related work
Methodology
Problem Formulation
Text Feature Alignment
Broad Classes and Category Mapping
Identify Unknown Classes
Experiments
Datasets and Evaluation
Experimental Settings
Implementation Details
Results
Discussions and Analysis
Ablation Studies
Motivation of CLIP
...and 1 more sections

Figures (2)

Figure 1: Approach overview. The model is trained incrementally over two stages, $T_1$ and $T_2$, and the labeled boxes available at each stage differ. (a) CLIP text encoder: text features are generated from the base and broad classes in $T_1$, and from the base and novel classes in $T_2$. (b) CLIP image encoder: proposals whose highest prediction is the background category are sent to the CLIP image encoder for identification. The ground-truth labels of proposals identified as broad classes in $T_1$, or as base classes in $T_2$, are modified accordingly; R2U1 and R2B2 (in yellow) illustrate this process. In addition, identified proposals with prediction scores higher than $tr$ are added to the dataset as pseudo bounding boxes after an NMS step. When the same image is encountered again later, only a few pseudo bounding boxes are sampled for training.
Figure 2: Experimental results of IODML on PASCAL VOC 2007 under the 15+5 setting. As the distillation weight increases, only the mAP of the novel classes in stage $T_2$ declines significantly.
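
The pseudo-box step described in the Figure 1 caption (threshold at $tr$, NMS, then sampling a few boxes when the image is seen again) can be sketched as follows. This is a minimal illustration, not the authors' code: the candidate dictionaries, the class-agnostic greedy NMS, the IoU threshold of 0.5, and the helper names are all assumptions, and the `score` field stands in for whatever prediction score the CLIP-based identification produces.

```python
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(candidates, iou_threshold=0.5):
    """Greedy class-agnostic NMS: keep the highest-scoring box of each cluster."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(iou(cand["box"], k["box"]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


def select_pseudo_boxes(identified, tr, iou_threshold=0.5):
    """Keep identified proposals scoring above tr, then de-duplicate with NMS."""
    confident = [c for c in identified if c["score"] > tr]
    return nms(confident, iou_threshold)


def sample_for_training(pseudo_boxes, k, rng):
    """Sample at most k stored pseudo boxes when the image is seen again."""
    return rng.sample(pseudo_boxes, min(k, len(pseudo_boxes)))


# Hypothetical identified proposals for one image.
identified = [
    {"box": (10, 10, 50, 50), "label": "animal", "score": 0.90},
    {"box": (12, 12, 52, 52), "label": "animal", "score": 0.80},   # near-duplicate
    {"box": (100, 100, 150, 150), "label": "vehicle", "score": 0.70},
    {"box": (200, 200, 220, 220), "label": "animal", "score": 0.30},  # below tr
]

pseudo = select_pseudo_boxes(identified, tr=0.5)
sampled = sample_for_training(pseudo, k=1, rng=random.Random(0))
```

With these toy inputs the near-duplicate box is suppressed by NMS and the low-scoring box is dropped by the threshold, leaving two pseudo boxes from which one is sampled.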
