VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model

Junsu Kim; Yunhoe Ku; Jihyeon Kim; Junuk Cha; Seungryul Baek

TL;DR

This paper tackles catastrophic forgetting in Class Incremental Object Detection (CIOD) by introducing Vision-Language Model assisted Pseudo-Labeling (VLM-PL). It derives pseudo ground-truths (GTs) from a pre-trained detector and then uses a VLM with carefully designed prompts to verify and filter these pseudo labels, so that training does not depend on unverified predictions from the previous model. The refined pseudo GTs are combined with real GTs from new tasks to train a new detector, achieving state-of-the-art results on Pascal VOC and MS COCO in both multi-scenario and dual-scenario settings without replay. The approach leverages prompt-tuning and region-aware QA from Ferret, along with CLIP-based visual features, to maintain robust performance as new classes are added.

Abstract

In the field of Class Incremental Object Detection (CIOD), creating models that can continuously learn like humans is a major challenge. Pseudo-labeling methods, although initially powerful, struggle with multi-scenario incremental learning due to their tendency to forget past knowledge. To overcome this, we introduce a new approach called Vision-Language Model assisted Pseudo-Labeling (VLM-PL). This technique uses a Vision-Language Model (VLM) to verify the correctness of pseudo ground-truths (GTs) without requiring additional model training. VLM-PL starts by deriving pseudo GTs from a pre-trained detector. Then, we generate a custom query for each pseudo GT using carefully designed prompt templates that combine image and text features. This allows the VLM to classify the correctness of each pseudo GT through its responses. Furthermore, VLM-PL integrates the refined pseudo GTs with the real GTs of the upcoming training task, effectively combining new and old knowledge. Extensive experiments on the Pascal VOC and MS COCO datasets show that VLM-PL achieves state-of-the-art results in both multi-scenario and dual-scenario settings.
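The filtering and merging steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Annotation` type, the `ask_vlm` callable, and the function names are all hypothetical, and the real system queries a VLM rather than the stub used here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


@dataclass(frozen=True)
class Annotation:
    """Hypothetical container for one labeled box."""
    label: str
    box: Box
    is_pseudo: bool = False


def refine_pseudo_gts(
    pseudo_gts: List[Annotation],
    ask_vlm: Callable[[str, Box], str],
) -> List[Annotation]:
    """Keep only the pseudo GTs that the VLM confirms with a 'yes' answer."""
    kept = []
    for ann in pseudo_gts:
        answer = ask_vlm(ann.label, ann.box)
        if answer.strip().lower().startswith("yes"):
            kept.append(ann)
    return kept


def build_training_targets(
    pseudo_gts: List[Annotation],
    new_gts: List[Annotation],
    ask_vlm: Callable[[str, Box], str],
) -> List[Annotation]:
    """Merge verified pseudo GTs (old classes) with real GTs (new classes)."""
    return refine_pseudo_gts(pseudo_gts, ask_vlm) + list(new_gts)


def stub_vlm(label: str, box: Box) -> str:
    """Stand-in for a real VLM call: confirms only 'cat' boxes."""
    return "Yes." if label == "cat" else "No."


pseudo = [
    Annotation("cat", (10, 10, 50, 50), is_pseudo=True),
    Annotation("bicycle", (0, 0, 20, 20), is_pseudo=True),
]
new = [Annotation("sofa", (30, 30, 90, 90))]
targets = build_training_targets(pseudo, new, stub_vlm)
```

With the stub above, the rejected `bicycle` box is dropped and the resulting targets contain the verified `cat` pseudo GT followed by the real `sofa` GT. No model is trained during verification, which matches the abstract's claim that the check needs no additional training.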

Paper Structure (18 sections, 4 figures, 6 tables)

Introduction
Related works
Continual learning
Vision-Language Models
Preliminaries
Transformer-based detector
Region-specific conversation
Methods
Process configuration
Pseudo-labeling
Vision-Language Model assistance
Experiments
Dataset and metrics
Implementation and experiments
Results and analysis
...and 3 more sections

Figures (4)

Figure 1: Workflow of our proposed method: This schematic illustrates the sequential steps of our method. It starts with pseudo-labeling by a pre-trained model $\mathbf{M}_{old}$, followed by refining through the Vision-Language Model. Custom-generated prompts are used for each pseudo ground-truth (GT). This refining process filters out incorrect pseudo GTs to yield reliable pseudo GTs. These annotations are then used to train a detector $\mathbf{M}_{new}$, incorporating previous knowledge with the updated dataset.
Figure 2: Overview of the VLM-assisted Pseudo Labeling: The sequence begins with the detector $\mathbf{M}_{old}$, applying pseudo labeling to identify potential objects (e.g., Boat, Car, and Cat within the input image), alongside their corresponding bounding box locations. Each identified object and its location are encapsulated into a prompt template. This template integrates placeholders for <image feature> and <region feature>, where the former is substituted with the overall image features and the latter with features corresponding to the specific region of interest. The prompts are classified by the VLM for reliability, using responses such as 'yes' or 'no' to verify each pseudo GT. Subsequently, the refined pseudo GTs are combined with new GTs from the new task for training the detector $\mathbf{M}_{new}$.
Figure 3: Illustration of incorrect pseudo GT examples generated during a multi-incremental learning scenario (i.e., 5+5+5+5) on the Pascal VOC dataset [everingham2010pascal]. (a) depicts an incorrect pseudo GT where a 'bicycle' is mislabeled; (b) shows both 'bicycle' and 'bird' misidentified; (c) highlights a case where all annotations are incorrect; and (d) indicates mislabeled 'cow' and 'bird' instances.
Figure 4: Qualitative results of conventional pseudo labeling, as used in OW-DETR [gupta2022ow] and SDDGR [kim2024SDDGR], and of VLM-assisted pseudo labeling in a multi-incremental scenario (for example, 5(T1)+5(T2)+5(T3)+5(T4) on the Pascal VOC dataset). The effects of VLM assistance can be observed from (a) to (b), (c) to (d), and (e) to (f). This is especially noticeable in (a) to (b) of the second row and (e) to (f) of the third row, which indicate that all pseudo GTs are incorrect.
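The prompt construction described in the Figure 2 caption could look something like the sketch below. The template wording, the coordinate format, and the helper names are assumptions made for illustration; only the two placeholders and the yes/no responses come from the caption. In the actual method the placeholders are replaced by image and region features inside the VLM, whereas this sketch only builds the text side of the query.

```python
from typing import Tuple

# Placeholders named in the Figure 2 caption; the VLM is assumed to swap
# them for the whole-image and region-of-interest features.
IMAGE_TOKEN = "<image feature>"
REGION_TOKEN = "<region feature>"

# Hypothetical template wording, not taken from the paper.
TEMPLATE = (
    "{image} Is the object {region} in the region "
    "[{x1}, {y1}, {x2}, {y2}] a {label}? Answer yes or no."
)


def build_query(label: str, box: Tuple[int, int, int, int]) -> str:
    """Fill the template for one pseudo GT (class label plus box)."""
    x1, y1, x2, y2 = box
    return TEMPLATE.format(
        image=IMAGE_TOKEN,
        region=REGION_TOKEN,
        x1=x1, y1=y1, x2=x2, y2=y2,
        label=label.lower(),
    )


def is_verified(response: str) -> bool:
    """Treat any response that begins with 'yes' as a confirmation."""
    return response.strip().lower().startswith("yes")


query = build_query("Boat", (10, 20, 110, 220))
```

One query is generated per pseudo GT, so an image with three candidate boxes (for example Boat, Car, and Cat) would produce three independent yes/no checks.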
