Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion
Linlan Huang, Xusheng Cao, Haori Lu, Xialei Liu
TL;DR
This paper tackles class-incremental learning with CLIP, addressing forgetting through two innovations: an adaptive representation adjustment, guided by textual features, that better separates neighboring old and new categories, and a decomposed parameter fusion strategy that stabilizes adapter fine-tuning without increasing the parameter count. The method, RAPF, jointly optimizes a hinge-based, text-guided separation loss and a cross-entropy objective, then fuses the adapter parameters after training in an orthonormal basis to balance plasticity and stability. The class probability $p(y_i|x_i)$ is computed from the cosine similarity between image-adapter outputs and text embeddings, scaled by a temperature $\tau$; conflicts between neighboring categories are mitigated with $\mathcal{L}_{hinge}$ and Gaussian-replay samples, so incremental updates do not require storing all past data. Empirical results on CIFAR100, ImageNet100, ImageNet-R, and CUB200 show state-of-the-art performance, with notable gains over strong baselines and robustness in exemplar-free regimes. The work demonstrates that leveraging language information and structured parameter fusion can substantially mitigate forgetting in CLIP-based continual learning, with practical impact for scalable, privacy-preserving continual vision systems.
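To make the classification rule in the summary concrete, the following is a minimal sketch of how $p(y_i|x_i)$ could be obtained from cosine similarities and a temperature $\tau$. It is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `class_probabilities` are hypothetical, the feature vectors are toy values, and dividing the similarity by $\tau$ before the softmax follows the usual CLIP convention rather than anything stated in this summary.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_probabilities(image_feature, text_features, tau=0.01):
    # image_feature: adapter output for one image (hypothetical toy vector).
    # text_features: one text embedding per class seen so far.
    # Assumption: logits are cosine similarities divided by tau.
    img = l2_normalize(image_feature)
    logits = []
    for t in text_features:
        txt = l2_normalize(t)
        cos = sum(a * b for a, b in zip(img, txt))
        logits.append(cos / tau)
    # Numerically stable softmax over all classes.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: three classes in a 2-D feature space.
probs = class_probabilities(
    [1.0, 0.2],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    tau=0.1,
)
predicted_class = probs.index(max(probs))
```

A smaller $\tau$ sharpens the distribution toward the most similar text embedding, while a larger $\tau$ flattens it.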
Abstract
Class-incremental learning is a challenging problem in which the goal is to train a model that can classify data from an increasing number of classes over time. Vision-language pre-trained models such as CLIP demonstrate good generalization ability, which allows them to excel in class-incremental learning even with completely frozen parameters. However, further adaptation to downstream tasks by simply fine-tuning the model leads to severe forgetting. Most existing works with pre-trained models assume that the forgetting of old classes is uniform when the model acquires new knowledge. In this paper, we propose a method named Adaptive Representation Adjustment and Parameter Fusion (RAPF). While training on new data, we use textual features to measure the influence of new classes on old ones and adjust the representations accordingly. After training, we employ a decomposed parameter fusion to further mitigate forgetting during adapter module fine-tuning. Experiments on several conventional benchmarks show that our method achieves state-of-the-art results. Our code is available at https://github.com/linlany/RAPF.
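One way to picture how textual features might be used to measure the influence of new classes on old ones is to compare class text embeddings and flag old/new pairs that lie close together. The sketch below is only an illustration under assumptions. The abstract does not specify the similarity measure, the threshold, or the selection rule, so the cosine criterion, the `threshold` value, the function name `neighboring_pairs`, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighboring_pairs(old_text, new_text, threshold=0.8):
    # old_text / new_text: dicts mapping class name -> text embedding.
    # Assumption: an old/new pair counts as "neighboring" when the cosine
    # similarity of their text embeddings reaches the threshold.
    pairs = []
    for old_name, old_vec in old_text.items():
        for new_name, new_vec in new_text.items():
            sim = cosine(old_vec, new_vec)
            if sim >= threshold:
                pairs.append((old_name, new_name, sim))
    # Most similar (most likely to be confused) pairs first.
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Toy embeddings, invented for illustration only.
old = {"wolf": [0.9, 0.1, 0.0], "truck": [0.0, 0.1, 0.9]}
new = {"husky": [0.8, 0.3, 0.0], "bus": [0.1, 0.0, 0.95]}
pairs = neighboring_pairs(old, new, threshold=0.9)
pair_names = [(p[0], p[1]) for p in pairs]
```

Pairs selected this way would be the candidates for representation adjustment, while old classes with no close new neighbor would be left untouched.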
