Sculpting [CLS] Features for Pre-Trained Model-Based Class-Incremental Learning

Murat Onur Yildirim, Elif Ceren Gok Yildirim, Joaquin Vanschoren

TL;DR

This work tackles class-incremental learning (CIL) with pre-trained models (PTMs), addressing catastrophic forgetting through a parameter-efficient approach. It introduces LuCA, a module that pairs a residual adapter with a calibrator, and TOSCA, which deploys a sparse LuCA module on the final [CLS] token so that the PTM’s feature hierarchy is preserved while task-specific refinement remains possible. Training freezes the backbone and optimizes a sparse LuCA module alongside a prototypical classifier; inference uses entropy-based fusion across task-specific branches. The method reports state-of-the-art results across six benchmarks with significantly fewer parameters and lower runtime than layer-wise adapters or prompt-based methods, supporting a stability-plasticity balance and practical applicability for exemplar-free CIL. The work is further motivated by theoretical arguments on feature-manifold preservation and by neuroscientific inspiration, and the results indicate strong generalization and efficiency for PTM-based continual learning.
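The entropy-based inference step mentioned above can be illustrated with a minimal sketch. It assumes the simplest reading, in which each task-specific branch produces its own logits and the branch with the lowest predictive entropy is trusted; the function names (`softmax`, `entropy`, `select_branch`) are hypothetical, and the paper's actual fusion rule may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_branch(branch_logits):
    """Return (index, probs) of the branch with the lowest entropy.

    branch_logits: one logit vector per task-specific branch.
    Assumption: the lowest-entropy branch is treated as the most confident.
    """
    best_idx, best_probs, best_h = None, None, float("inf")
    for idx, logits in enumerate(branch_logits):
        probs = softmax(logits)
        h = entropy(probs)
        if h < best_h:
            best_idx, best_probs, best_h = idx, probs, h
    return best_idx, best_probs
```

Called on one near-uniform and one sharply peaked logit vector, `select_branch` returns the index of the peaked one, since a confident prediction has lower entropy.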

Abstract

Class-incremental learning requires models to continually acquire knowledge of new classes without forgetting old ones. Although pre-trained models have demonstrated strong performance in class-incremental learning, they remain susceptible to catastrophic forgetting when learning new concepts. Excessive plasticity in the models breaks generalizability and causes forgetting, while strong stability results in insufficient adaptation to new classes. This necessitates effective adaptation with minimal modifications to preserve the general knowledge of pre-trained models. To address this challenge, we first introduce a new parameter-efficient fine-tuning module 'Learn and Calibrate', or LuCA, designed to acquire knowledge through an adapter-calibrator couple, enabling effective adaptation with well-refined feature representations. Second, for each learning session, we deploy a sparse LuCA module on top of the last token just before the classifier, which we refer to as 'Token-level Sparse Calibration and Adaptation', or TOSCA. This strategic design improves the orthogonality between the modules and significantly reduces both training and inference complexity. By leaving the generalization capabilities of the pre-trained models intact and adapting exclusively via the last token, our approach achieves a harmonious balance between stability and plasticity. Extensive experiments demonstrate TOSCA's state-of-the-art performance while introducing ~8 times fewer parameters compared to prior methods.
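To make the adapter-calibrator couple more concrete, the sketch below applies a LuCA-style module to a single [CLS] feature vector. The abstract does not give the exact formulation, so the details here are assumptions: both parts are ReLU bottlenecks of width r, the adapter adds its output residually, and the calibrator rescales the adapted feature with a sigmoid gate. All function names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    """Element-wise sigmoid, written to avoid overflow for large |a|."""
    out = []
    for a in v:
        if a >= 0:
            out.append(1.0 / (1.0 + math.exp(-a)))
        else:
            e = math.exp(a)
            out.append(e / (1.0 + e))
    return out


def bottleneck(x, W_down, W_up):
    """Down-project d -> r, apply ReLU, up-project r -> d."""
    return matvec(W_up, relu(matvec(W_down, x)))


def luca(cls_feature, adapter, calibrator):
    """Apply an assumed LuCA-style module to one [CLS] feature.

    adapter, calibrator: (W_down, W_up) pairs with shapes (r x d, d x r).
    """
    # Adapter: residual update of the frozen backbone's [CLS] feature.
    delta = bottleneck(cls_feature, *adapter)
    adapted = [x + d for x, d in zip(cls_feature, delta)]
    # Calibrator: element-wise gate in (0, 1) that rescales the adapted feature.
    gate = sigmoid(bottleneck(adapted, *calibrator))
    return [a * g for a, g in zip(adapted, gate)]
```

Because the module only touches the final [CLS] feature, the backbone stays frozen and one such module can be kept per learning session. With all-zero up-projections the adapter is the identity and the gate is 0.5 everywhere, so the output is half the input; a real implementation would likely choose its initialization differently.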

Paper Structure

This paper contains 32 sections, 11 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: Prompt-based methods influence the self-attention process of a PTM, either from the input layer alone or across all layers. Adapter-based methods enable task-specific adaptations by inserting lightweight neural modules into the PTM’s layers. In contrast, we propose a single trainable module that operates exclusively on the final [CLS] token representation, efficiently adapting and calibrating features just before classification. This design offers a streamlined and effective alternative to both prompt- and adapter-based methods.
  • Figure 2: LuCA.
  • Figure 3: Performance curves of different methods under different settings. All methods are initialized with ViT-B/16-IN1K. The relative improvement of TOSCA over the runner-up method at the last incremental stage is annotated numerically.
  • Figure 4: Performance evaluation of TOSCA from different perspectives. (a) Memory and computational cost highlights TOSCA’s efficiency, (b) Hyperparameter analysis illustrates the effect of $\ell_1$ strength ($\lambda$) and projection dimension ($r$) on accuracy, (c) Design and component ablation presents the impact of different components and flows on accuracy.
  • Figure A: Performance curves of different methods on different benchmarks. All methods are initialized with ViT-B/16-IN21K. The relative improvement of TOSCA over the runner-up method at the last incremental stage is annotated numerically.
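
The two hyperparameters in Figure 4(b) can be read against a generic sparsity-regularized objective. The form below is an assumption, since this page does not state the loss: $\mathcal{L}_{\text{task}}$ stands for the classification loss, $\theta$ for the trainable LuCA weights, and $d$ for the width of the [CLS] feature.

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \lVert \theta \rVert_1, \qquad |\theta| \approx 4dr$$

The parameter count assumes two bias-free bottlenecks (adapter and calibrator), each with a $d \times r$ down-projection and an $r \times d$ up-projection. Under that assumption, $r$ sets the module's size and $\lambda$ controls how many of its weights are pushed toward zero.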