DeCLIP: Decoupled Prompting for CLIP-based Multi-Label Class-Incremental Learning

Kaile Du, Zihan Ye, Junzhou Xie, Yixi Shen, Yuyang Li, Fuyuan Hu, Ling Shao, Guangcan Liu, Joost van de Weijer, Fan Lyu

TL;DR

DeCLIP is a replay-free and parameter-efficient framework that decouples CLIP representations via a one-to-one class-specific prompting scheme. The scheme prevents semantic confusion across labels and decouples multi-label images into per-class views compatible with CLIP pre-training.

Abstract

Multi-label class-incremental learning (MLCIL) continuously expands the label space while recognizing multiple co-occurring classes, making it prone to catastrophic forgetting and high false-positive rates (FPR). Extending CLIP to MLCIL is non-trivial because co-occurring categories violate CLIP's single image-text alignment paradigm and task-level partial labeling induces high FPR. We propose DeCLIP, a replay-free and parameter-efficient framework that decouples CLIP representations via a one-to-one class-specific prompting scheme. By assigning each category its own prompt space, DeCLIP prevents semantic confusion across labels and decouples multi-label images into per-class views compatible with CLIP pre-training. The learned prompts are preserved as knowledge anchors, mitigating catastrophic forgetting without replay. We further introduce Adaptive Similarity Tempering (AST), a task-aware strategy that suppresses FPR without dataset-specific tuning. Experiments on MS-COCO and PASCAL VOC show that DeCLIP consistently outperforms prior methods with minimal trainable parameters.
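
To make the one-to-one prompting idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical `PromptBank` class (with hypothetical methods `add_task`, `update`, and `finish_task`) in which every class owns exactly one prompt entry, and entries from finished tasks are frozen so they can serve as knowledge anchors without any stored images. Plain Python lists stand in for the learnable prompt tensors, and the prompt dimension is an arbitrary placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PromptBank:
    """Hypothetical one-to-one store: exactly one prompt entry per class."""

    prompts: dict = field(default_factory=dict)  # class name -> prompt vector
    frozen: set = field(default_factory=set)     # classes from finished tasks

    def add_task(self, class_names, dim=4):
        # Each new class gets its own prompt; nothing is shared between classes.
        for name in class_names:
            if name in self.prompts:
                raise ValueError(f"class {name!r} already has a prompt")
            self.prompts[name] = [0.0] * dim  # placeholder initialisation

    def update(self, name, new_prompt):
        # Prompts of earlier tasks are fixed anchors and are never rewritten.
        if name in self.frozen:
            raise RuntimeError(f"prompt for {name!r} is frozen")
        self.prompts[name] = list(new_prompt)

    def finish_task(self):
        # Freeze every prompt learned so far; no replay buffer is involved.
        self.frozen.update(self.prompts)


bank = PromptBank()
bank.add_task(["person", "dog"])           # task 1 introduces two classes
bank.update("dog", [0.1, 0.2, 0.3, 0.4])   # stands in for a training step
bank.finish_task()
bank.add_task(["car"])                     # task 2 only adds a new prompt
bank.update("car", [1.0, 0.0, 0.0, 0.0])
```

Because the bank only grows by one entry per new class and earlier entries are never modified, the number of trainable values per task in this sketch scales with the number of new classes rather than with the size of the backbone.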

Paper Structure

This paper contains 11 sections, 6 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison of prompt-based methods. (a) Many-to-many prompts (L2P [wang2022learning], DualPrompt [dualprompt2022wang]) couple co-occurring classes, causing semantic confusion and overconfident false positives (FP). (b) One-to-many prompts (MULTI-LANE [de2024less]) still share prompt space across co-occurring classes. (c) DeCLIP uses one-to-one class-specific prompts for semantic decoupling and AST for false-positive suppression.
  • Figure 2: The overall pipeline of DeCLIP. (a) Training stage on task $t$. For each class $c \in \mathcal{C}^{t}$, positive and negative prompts guide the frozen encoders to decouple features into class-specific components, from which positive and negative similarities are computed for optimization. (b) Inference stage after task $t$. All learned prompts corresponding to classes $c \in \mathcal{C}^{1:t}$ are fixed and jointly applied. The positive and negative similarities are adaptively calibrated through the AST module to produce the final prediction.
  • Figure 3: (a) Confidence distributions with and without AST, showing that AST significantly suppresses false positives and reduces the FPR from 25.4% to 2.4% on VOC B4-C2. (b) Comparison between AL and AST, indicating that AST is tailored to CLIP-based MLCIL and yields better performance under both the Last and Average evaluations.
  • Figure 4: Parameter-performance comparison of methods on MS-COCO B40-C10.
  • Figure 5: Ablation of class-specific prompts for decoupling and AST under the Last and Average metrics (%).
  • ...and 3 more figures
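
The inference stage described in the Figure 2 caption can be illustrated with a rough sketch. It assumes that each class already has a decoupled image feature plus a positive and a negative text embedding, and that the two cosine similarities are combined with a two-way softmax; that combination rule, the function names, the toy vectors, and the fixed `temperature` are all assumptions made for illustration. In particular, the constant temperature is only a stand-in: the AST calibration rule is not given in this summary and is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def class_probability(pos_sim, neg_sim, temperature):
    # Two-way softmax over (positive, negative) similarity for one class,
    # which reduces to a sigmoid of the scaled difference.
    # `temperature` is a fixed placeholder, NOT the adaptive AST rule.
    return 1.0 / (1.0 + math.exp(-(pos_sim - neg_sim) / temperature))


def predict(class_features, pos_text, neg_text, temperature=0.05, threshold=0.5):
    """Score every seen class independently and return (probabilities, labels)."""
    probs = {}
    for name, feature in class_features.items():
        pos_sim = cosine(feature, pos_text[name])
        neg_sim = cosine(feature, neg_text[name])
        probs[name] = class_probability(pos_sim, neg_sim, temperature)
    labels = [name for name, p in probs.items() if p > threshold]
    return probs, labels


# Toy 2-D vectors standing in for frozen-encoder outputs (hypothetical values).
class_features = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
pos_text = {"dog": [0.9, 0.1], "car": [0.8, 0.2]}
neg_text = {"dog": [0.1, 0.9], "car": [0.1, 0.9]}

probs, labels = predict(class_features, pos_text, neg_text)
```

In this toy setup the "dog" feature lies closer to its positive embedding and the "car" feature closer to its negative one, so only "dog" clears the threshold. Scoring each class against its own prompt pair keeps the decision for one label independent of the others.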