Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning

Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu

TL;DR

We address class-incremental learning with CLIP by decoupling vision and text updates to preserve cross-modal alignment. The proposed DMC framework trains in two stages: the image encoder is first adapted against a frozen text anchor, and then the image encoder is frozen while soft prompts are learned, with a Gaussian memory used for replay. An enhanced variant, DMC-OT, adds optimal-transport calibration to align memory statistics across encoder updates and incorporates task-augmented prompts to boost inter-task separability. Experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 show state-of-the-art results, with DMC-OT improving accuracy over DMC by about 1.80% on average, supporting the benefit of cross-modal calibration and task-aware prompting for continual learning in multimodal models.

Abstract

Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder from the optimization of textual soft prompts. Each stage is trained with the other modality frozen, so that the frozen modality acts as a stable semantic anchor and cross-modal alignment is preserved. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.
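
To make the idea of storing class-wise Gaussian statistics for generative replay concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal covariance (one standard deviation per feature dimension) and uses tiny hand-written 2-D feature vectors; the function names `fit_class_gaussian` and `sample_replay`, the class labels, and the data are all hypothetical. It uses only the Python standard library so that it runs without extra dependencies.

```python
import random
import statistics


def fit_class_gaussian(features):
    """Return per-dimension (means, stds) for one class's feature vectors.

    Assumption: a diagonal Gaussian, which is a simplification chosen here
    for illustration only.
    """
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds


def sample_replay(memory, n_per_class, rng):
    """Draw pseudo-features for every stored class from its Gaussian."""
    samples = []
    for label, (means, stds) in memory.items():
        for _ in range(n_per_class):
            feat = [rng.gauss(m, s) for m, s in zip(means, stds)]
            samples.append((feat, label))
    return samples


# Hypothetical image features of two old classes (raw data is then discarded).
old_features = {
    "cat": [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1]],
    "dog": [[-1.0, 0.5], [-1.1, 0.4], [-0.9, 0.6]],
}

# The memory keeps only the statistics, not the original samples.
memory = {label: fit_class_gaussian(f) for label, f in old_features.items()}

rng = random.Random(0)
replay = sample_replay(memory, n_per_class=4, rng=rng)
```

The sampled `replay` pairs can be mixed with features of the current task when training the text-side classifier. Note that statistics computed under an older encoder no longer match features from an updated encoder, which is the drift that the calibration in DMC-OT is meant to address.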

Paper Structure

This paper contains 25 sections, 14 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Comparison between hard and soft prompts in CLIP-based class-incremental learning. Handcrafted hard prompts (top) offer stable but inflexible representations, leading to consistent yet limited adaptation. In contrast, soft prompts (bottom) are adaptive but biased toward newly introduced classes, causing severe performance degradation for earlier tasks.
  • Figure 2: Overview of the proposed DMC-OT training framework. (a) Image Encoder Adaptation: The image encoder is fine-tuned while freezing the text encoder with handcrafted hard prompts. This stage strengthens visual representations through the contrastive loss $\mathcal{L}_{\text{CLIP}}$. (b) Task-Augmented Soft Prompt Training: With the image encoder frozen, we train soft textual prompts via the cross-entropy loss $\mathcal{L}_{\text{CE}}$, combining class-specific and task-specific prompts through element-wise addition. The task-specific prompts are regularized by an orthogonality loss $\mathcal{L}_{\text{Ortho}}$ to ensure inter-task separability and feature diversity. During this stage, old calibrated class prototypes are modeled as multivariate Gaussians, from which features are sampled to implicitly replay past knowledge.
  • Figure 3: Hyperparameter sensitivity analysis.
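
The task-augmented prompt design described in the Figure 2 caption (class-specific and task-specific prompts combined by element-wise addition, with an orthogonality regularizer on the task-specific prompts) can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: prompts are flattened to plain vectors, and the penalty is taken to be the sum of squared cosine similarities between distinct task prompts, since the exact form of $\mathcal{L}_{\text{Ortho}}$ is not given in this section. The names `combine_prompts` and `orthogonality_penalty` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def combine_prompts(class_prompt, task_prompt):
    """Element-wise addition of a class-specific and a task-specific prompt."""
    return [c + t for c, t in zip(class_prompt, task_prompt)]


def orthogonality_penalty(task_prompts):
    """Sum of squared cosine similarities over all pairs of task prompts.

    Assumed form of the regularizer: it is zero exactly when the task
    prompts are mutually orthogonal and grows as they align.
    """
    unit = [normalize(p) for p in task_prompts]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


# Hypothetical 2-D prompts for illustration.
combined = combine_prompts([1.0, 2.0], [0.5, -1.0])
separated = orthogonality_penalty([[1.0, 0.0], [0.0, 1.0]])  # orthogonal tasks
collapsed = orthogonality_penalty([[1.0, 0.0], [2.0, 0.0]])  # parallel tasks
```

Under this assumed form, orthogonal task prompts incur no penalty while parallel ones incur the maximum per pair, which matches the stated goal of keeping tasks separable.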