Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion

Linlan Huang, Xusheng Cao, Haori Lu, Xialei Liu

TL;DR

This paper tackles class-incremental learning with CLIP by addressing forgetting through two innovations: adaptive representation adjustment guided by textual features to better separate neighboring old and new categories, and a decomposed parameter fusion strategy that stabilizes adapter fine-tuning without expanding parameter count. The method, RAPF, jointly optimizes a hinge-based text-guided separation loss and a cross-entropy objective, followed by a post-training fusion of adapter parameters in an orthonormal basis to balance plasticity and stability. Empirical results on CIFAR100, ImageNet100, ImageNet-R, and CUB200 show state-of-the-art performance, with notable gains over strong baselines and robustness in exemplar-free regimes. The work demonstrates that leveraging language information and structured parameter fusion can substantially mitigate forgetting in CLIP-based continual learning, with practical impact for scalable, privacy-preserving continual vision systems. The class probability $p(y_i|x_i)$ is computed from the cosine similarity between image-adapter outputs and text embeddings, scaled by a temperature $\tau$. Conflicts between neighboring categories are mitigated using the hinge loss $\mathcal{L}_{hinge}$ and Gaussian-replay samples, enabling effective incremental updates without storing all past data.
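The classification rule mentioned at the end of the summary can be illustrated with a minimal NumPy sketch. It assumes the usual CLIP-style form, a softmax over temperature-scaled cosine similarities, $p(y_i|x_i) = \exp(\cos(f_i, t_{y_i})/\tau) / \sum_j \exp(\cos(f_i, t_j)/\tau)$, where $f_i$ is the adapted image feature and $t_j$ the text embedding of class $j$. The function name, the symbols $f_i$ and $t_j$, and the default value of `tau` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def class_probabilities(image_feat, text_feats, tau=0.01):
    """Softmax over temperature-scaled cosine similarities.

    image_feat: (d,) adapted image feature (assumed already computed).
    text_feats: (num_classes, d) text embeddings, one row per class.
    tau: temperature; 0.01 is an assumed default, not the paper's value.
    """
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = txt @ img / tau          # cosine similarity divided by tau
    logits -= logits.max()            # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

A smaller `tau` sharpens the distribution toward the most similar text embedding; a larger one flattens it.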

Abstract

Class-incremental learning is a challenging problem, where the goal is to train a model that can classify data from an increasing number of classes over time. Vision-language pre-trained models such as CLIP demonstrate good generalization ability, which allows them to excel in class-incremental learning even with completely frozen parameters. However, further adaptation to downstream tasks by simply fine-tuning the model leads to severe forgetting. Most existing works with pre-trained models assume that the forgetting of old classes is uniform when the model acquires new knowledge. In this paper, we propose a method named Adaptive Representation Adjustment and Parameter Fusion (RAPF). During training on new data, we use textual features to measure the influence of new classes on old ones and adjust the representations accordingly. After training, we employ a decomposed parameter fusion to further mitigate forgetting during adapter module fine-tuning. Experiments on several conventional benchmarks show that our method achieves state-of-the-art results. Our code is available at https://github.com/linlany/RAPF.
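The idea of measuring the influence of new classes on old ones through textual features can be sketched roughly as follows. This is only an illustration under stated assumptions: neighboring old/new pairs are taken to be those whose text features exceed a cosine-similarity threshold, old-class features are replayed from a stored Gaussian (mean and covariance), and the hinge penalty asks each replayed feature to stay closer to its own class text than to the neighboring new class text by a margin. The function names, the `threshold` and `margin` values, and the exact form of the hinge term are hypothetical and may differ from the paper's $\mathcal{L}_{hinge}$.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def neighbor_pairs(old_text, new_text, threshold=0.8):
    """(old_idx, new_idx) pairs whose text features are highly similar.

    threshold is an assumed hyperparameter.
    """
    sim = normalize(old_text) @ normalize(new_text).T
    return [(int(i), int(j)) for i, j in zip(*np.where(sim > threshold))]

def sample_old_class(mean, cov, n, seed=0):
    """Replay n pseudo-features of an old class from its stored Gaussian."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n)

def hinge_separation(samples, own_text, neighbor_text, margin=0.1):
    """Mean hinge penalty over replayed old-class features (assumed form)."""
    s = normalize(samples)
    pos = s @ normalize(own_text)        # similarity to the old class text
    neg = s @ normalize(neighbor_text)   # similarity to the neighboring new class text
    return float(np.maximum(0.0, margin - pos + neg).mean())
```

In this sketch the penalty is zero once every replayed feature is at least `margin` more similar to its own class text than to the neighbor, so only conflicting pairs contribute to training.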

Paper Structure (20 sections, 11 equations, 7 figures, 15 tables)

Figures (7)

  • Figure 1: Semantically similar categories across tasks pose significant challenges in CIL. Language information can help identify adjacent old and new classes when new data are encountered, so that the image feature representation of the old category can be adjusted accordingly. Additionally, a decomposed parameter fusion strategy is adopted to reduce forgetting: we decompose the parameters learned from two consecutive tasks into shared knowledge and task-specific knowledge, then fuse the parameters based on this decomposition.
  • Figure 2: The framework of our method. The neighboring categories separation module computes the similarity of text features to identify neighboring categories. We sample from the distribution of the old class and calculate the hinge loss. In the parameter fusion module, we first decompose $\mathbf{W}_{t}$ and $\mathbf{W}_{t-1}$ in the same orthonormal basis $\mathbf{B}$. Then, we calculate a soft mask $\mathbf{M}$ from the difference between the decomposed parameters $\mathbf{R}_t$ and $\mathbf{R}_{t-1}$, which acts as the fusion weight. Finally, we reconstruct the parameter $\mathbf{W}$ from the fused parameter $\mathbf{R}$ and the basis $\mathbf{B}$.
  • Figure 3: Accuracy curves of our method and other SOTA baselines on CIFAR100, ImageNet100, and ImageNet-R.
  • Figure 4: Comparison of methods in terms of accuracy and learnable parameters.
  • Figure 5: The confusion matrices of the first 5 tasks in the ImageNet100 B0 Inc10 experiment and their differences. We only show the first 5 tasks for better readability.
  • ...and 2 more figures
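The parameter fusion steps in the Figure 2 caption (shared basis $\mathbf{B}$, decomposed parameters $\mathbf{R}_t$ and $\mathbf{R}_{t-1}$, soft mask $\mathbf{M}$, reconstruction of $\mathbf{W}$) can be sketched as below. Several choices here are assumptions made only for illustration: the basis is taken from the SVD of the previous weights, the mask is the absolute difference scaled to $[0, 1]$, and the mask weights the new parameters while its complement weights the old ones. The caption does not specify these details, so the paper's actual construction may differ.

```python
import numpy as np

def fuse_parameters(w_prev, w_new):
    """Fuse two consecutive adapter weight matrices in a shared basis.

    w_prev, w_new: (m, n) weights after tasks t-1 and t.
    Returns the fused weights and the soft mask.
    """
    # Assumption: orthonormal basis B = left singular vectors of w_prev.
    b, _, _ = np.linalg.svd(w_prev, full_matrices=True)   # b is (m, m)

    # Coordinates of both weight matrices in the shared basis.
    r_prev = b.T @ w_prev
    r_new = b.T @ w_new

    # Assumption: soft mask = normalized absolute change per coordinate.
    diff = np.abs(r_new - r_prev)
    peak = diff.max()
    mask = diff / peak if peak > 0 else np.zeros_like(diff)

    # Assumption: large changes keep more of the new value,
    # small changes stay close to the old value.
    r_fused = mask * r_new + (1.0 - mask) * r_prev

    # Reconstruct W from the fused coordinates and the basis.
    return b @ r_fused, mask
```

Because `b` is a full orthogonal matrix, `b @ (b.T @ w)` reproduces `w` exactly (up to floating-point error), so the fusion changes the weights only through the mask and the parameter count stays the same.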