Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning

Haoran Chen; Houze Xu; Micah Goldblum; Daoguo Dong; Zuxuan Wu

TL;DR

We address class-incremental learning with CLIP by decoupling vision and text updates to preserve cross-modal alignment. The proposed DMC framework trains in two stages: it first adapts the image encoder against a frozen text anchor, then freezes the image encoder and learns soft prompts, using a Gaussian memory for replay. An enhanced variant, DMC-OT, adds optimal-transport calibration to align memory statistics across encoder updates and incorporates task-augmented prompts to boost inter-task separability. Experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 show state-of-the-art results, with DMC-OT improving on DMC by about 1.80% on average, which supports the benefit of cross-modal calibration and task-aware prompting for continual learning in multimodal models.

Abstract

Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when data from earlier tasks are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder from the optimization of textual soft prompts. Each stage is trained with the other modality frozen, so that one modality acts as a stable semantic anchor for the other and cross-modal alignment is preserved. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport-guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.
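To make the idea of class-wise Gaussian statistics for generative replay concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal covariance (one standard deviation per feature dimension) and plain Python lists as feature vectors; the names `GaussianMemory`, `store`, `sample`, and `replay_batch` are hypothetical and chosen only for illustration. The abstract does not specify how DMC parameterizes its memory, and the optimal-transport calibration of DMC-OT is not shown here.

```python
import random
import statistics


class GaussianMemory:
    """Per-class Gaussian statistics over feature vectors (diagonal covariance assumed)."""

    def __init__(self):
        # Maps a class label to (mean vector, standard-deviation vector).
        self.stats = {}

    def store(self, label, features):
        # `features` is a list of equal-length feature vectors for one class.
        dims = list(zip(*features))  # transpose: one tuple per feature dimension
        mean = [statistics.fmean(d) for d in dims]
        std = [statistics.pstdev(d) for d in dims]
        self.stats[label] = (mean, std)

    def sample(self, label, n, rng):
        # Draw n pseudo-features from the stored Gaussian of one class.
        mean, std = self.stats[label]
        return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n)]


def replay_batch(memory, per_class, rng):
    # Build a labeled batch of pseudo-features covering every stored class.
    batch = []
    for label in memory.stats:
        for vec in memory.sample(label, per_class, rng):
            batch.append((vec, label))
    return batch
```

In a class-incremental setting, such a memory would be filled after each task from the features of that task's classes, and sampled batches would stand in for unavailable data from earlier tasks. Because the stored statistics are tied to the encoder that produced them, they become stale once the encoder is updated, which is the drift that the abstract says DMC-OT is designed to correct.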
