
Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors

Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong

TL;DR

This work addresses catastrophic forgetting in class-incremental learning with CLIP by introducing Textual Prototype-guided Prompt Tuning (TPPT). TPPT-V anchors visual prompts to fixed textual prototypes, mitigating drift, while TPPT-VT further learns textual prompts and enforces diversity to maintain embedding space health. Across multiple benchmarks, TPPT variants outperform prior prompt-based continual learning methods, especially on fine-grained tasks, with favorable efficiency and robustness. The approach leverages CLIP's intrinsic text–image structure to achieve strong cross-modal alignment during continual adaptation, offering a practical and scalable solution.

Abstract

Continual learning (CL) enables deep networks to acquire new knowledge while avoiding catastrophic forgetting. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks; these models provide rich multi-modal embeddings that support lightweight, incremental prompt tuning. Existing methods often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementations, that introduce additional (and possibly unnecessary) complexity and underutilize CLIP's intrinsic capabilities. In this paper, we propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we jointly optimize visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
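To make the anchoring idea concrete, the following is a minimal NumPy sketch of a prototype-anchored contrastive loss with supervision in both directions (image to text and text to image). It is an illustration under stated assumptions, not the paper's loss: the exact formulation is not given in this summary, and the function name `prototype_anchor_loss`, the temperature value, and the equal weighting of the two directions are all hypothetical choices.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length, as CLIP-style similarities assume.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def prototype_anchor_loss(visual_feats, labels, text_protos, temperature=0.07):
    """Hypothetical bidirectional loss against fixed textual prototypes.

    visual_feats: (batch, dim) image features from the prompted encoder.
    labels:       (batch,) integer class ids.
    text_protos:  (num_classes, dim) text embeddings used as fixed anchors.
    """
    v = l2_normalize(visual_feats)
    t = l2_normalize(text_protos)
    logits = v @ t.T / temperature            # (batch, num_classes)
    rows = np.arange(len(labels))
    # Image -> text: each image should select its own class prototype.
    i2t = -log_softmax(logits, axis=1)[rows, labels].mean()
    # Text -> image: each prototype should select images of its class,
    # normalising over the batch instead of over the classes.
    t2i = -log_softmax(logits, axis=0)[rows, labels].mean()
    return 0.5 * (i2t + t2i)                  # assumed equal weighting

protos = np.eye(3)                            # toy anchors, one per class
labels = np.array([0, 1, 2, 0])
aligned = protos[labels]                      # features sitting on their anchors
loss_aligned = prototype_anchor_loss(aligned, labels, protos)
loss_mismatched = prototype_anchor_loss(aligned, np.array([1, 2, 0, 1]), protos)
```

In this toy setup the loss is lower when features lie on their own class anchors than when the labels are mismatched, which is the behaviour an anchoring objective is meant to reward.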

Paper Structure

This paper contains 20 sections, 6 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: Conceptual illustrations of: (a) standard Cross-Entropy (CE), (b) our proposed TPPT-V, (c) a naïve multi-modal extension of TPPT-V, and (d) our proposed TPPT-VT. (a) Prior methods [coop, vpt, l2p, dualprompt, codaprompt, attriclip, proof] use CE loss to adapt PTMs, but suffer from representation drift [gama2014survey, lu2018learning], leading to forgetting. (b) TPPT-V introduces a textual prototypical contrastive loss to anchor visual features and mitigate drift. (c) A naïve extension that also tunes textual prompts may improve textual prototype quality but risks collapse to trivial solutions [minderer2205simple, kim2022transferring, maple, liang2022mind]. (d) TPPT-VT addresses this by regularizing multi-modal prompt learning with diversity constraints on textual prototypes.
  • Figure 2: Analysis of (a) representation drift (lower is better) and (b) feature embedding diversity (higher is better to prevent collapse). In (a), we measure how far Task 1 embeddings deviate after learning all tasks. Training with CE loss alone leads to significant drift (Fig. 1(a)), whereas our proposed $\mathcal{L}_\text{TPCL}$ mitigates this (Fig. 1(b)). In (b), we assess embedding diversity via average pairwise distances over incremental stages. The red dotted line represents the diversity of the pre-trained CLIP model, denoted as ZS-CLIP. Naïve multi-modal training reduces diversity, especially on CUB, resulting in lower scores and potential collapse (Fig. 1(c)). Our diversity-regularized approach alleviates this, as shown in Fig. 1(d).
  • Figure 3: The overall framework of our two proposed methods. (1) The learned visual representations are guided by static textual prototypes (TPPT-V). We alleviate the forgetting issue by guiding visual representations with consistent textual prototypes, preventing drift of representations in the embedding space. (2) To improve upon the static textual prototypes, we propose to learn textual prompts for prototypes (TPPT-VT), and regularize the learning process by encouraging diversity. (3) Benefiting from the textual prototype anchors, our proposed methods remain simple yet effective, unlike previous methods that use delicate, complex designs.
  • Figure 4: Experiment results across incremental stages. All methods are trained with the same exemplar size using the same pre-trained weights and backbone model of CLIP with ViT-B/16.
  • Figure 5: Representation drift $\downarrow$ (lower is better) across incremental stages for (a) CUB and (b) Aircraft. Representation drift measures the divergence in the means of visual features for each class compared to the last incremental stage.
  • ...and 7 more figures
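
The two diagnostics named in the captions of Figures 2 and 5 can be sketched as follows. The captions describe drift as the divergence in per-class means of visual features between stages, and diversity as an average pairwise distance. The choice of Euclidean distance, the averaging over classes, and the function names `representation_drift` and `embedding_diversity` are assumptions made for this illustration; the paper may define them differently.

```python
import numpy as np

def class_means(feats, labels):
    # Mean feature vector per class, ordered by sorted class id.
    classes = np.unique(labels)
    return np.stack([feats[labels == c].mean(axis=0) for c in classes])

def representation_drift(feats_before, feats_after, labels):
    # Average Euclidean shift of each class mean between two stages
    # (assumed metric; lower means less drift).
    mu_before = class_means(feats_before, labels)
    mu_after = class_means(feats_after, labels)
    return float(np.linalg.norm(mu_after - mu_before, axis=1).mean())

def embedding_diversity(embeddings):
    # Average Euclidean distance over all unordered pairs of embeddings
    # (assumed metric; values near zero indicate collapse).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(embeddings), k=1)
    return float(dists[upper].mean())

# Toy check: shifting every feature by (3, 4) moves each class mean by 5.
feats = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
labels = np.array([0, 0, 1, 1])
drift_none = representation_drift(feats, feats, labels)
drift_shift = representation_drift(feats, feats + np.array([3.0, 4.0]), labels)

spread = embedding_diversity(np.array([[0.0, 0.0], [3.0, 4.0]]))
collapsed = embedding_diversity(np.ones((4, 2)))
```

Under these assumptions, identical features give zero drift and identical embeddings give zero diversity, matching the intuition that lower drift and higher diversity are the desirable directions.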