Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers

Xuan Rao, Simian Xu, Zheng Li, Bo Zhao, Derong Liu, Mingming Ha, Cesare Alippi

TL;DR

This work tackles distribution drift in class-incremental learning (CIL) with pre-trained Vision Transformers (ViTs) by introducing Sequential Learning with Drift Compensation (SLDC). SLDC models the latent-space evolution between consecutive tasks with two operator variants: a linear α1-SLDC and a weakly nonlinear α2-SLDC, plus distillation-enhanced β1/β2 forms to curb overwriting and preserve prior knowledge; they are complemented by an auxiliary data enrichment (ADE) strategy to improve operator estimation when data are scarce. The method refines a classifier after each task by sampling Gaussian features from compensated distributions and, in some configurations, augments backbone updates with a feature distillation loss and a norm constraint. Extensive experiments on CIFAR-100, ImageNet-R, CUB-200, and Cars-196 show that SLDC substantially improves SeqFT, and when combined with KD and ADE, approaches the performance of joint training across long sequences, highlighting its practical potential for continual learning with large pre-trained models. Code is released at the authors’ repository.

Abstract

Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the stored feature distributions of previously learned classes and the features produced by the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features extracted before fine-tuning to those extracted after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate for distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
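
As a rough illustration of the linear variant described in the abstract, the sketch below fits a linear operator with a plain ridge-regularized least-squares solve and then pushes a stored class Gaussian through it. The function names, the row-vector convention, the `gamma` value, and the synthetic data are all assumptions made for illustration; the paper's exact objective and regularizer may differ.

```python
import numpy as np

def fit_linear_operator(feats_old, feats_new, gamma=1e-3):
    """Ridge solution of min_W ||feats_old @ W - feats_new||^2 + gamma * ||W||^2.

    Assumption: plain Tikhonov regularization; not necessarily the paper's form.
    """
    d = feats_old.shape[1]
    gram = feats_old.T @ feats_old + gamma * np.eye(d)
    return np.linalg.solve(gram, feats_old.T @ feats_new)

def compensate_gaussian(mean, cov, W):
    """Push a stored class Gaussian through the row-vector map x -> x @ W."""
    return mean @ W, W.T @ cov @ W

# Synthetic stand-in for features of the same inputs before and after fine-tuning.
rng = np.random.default_rng(0)
d = 8
true_drift = np.eye(d) + 0.1 * rng.standard_normal((d, d))
feats_old = rng.standard_normal((2000, d))
feats_new = feats_old @ true_drift + 0.01 * rng.standard_normal((2000, d))

W = fit_linear_operator(feats_old, feats_new)

# Hypothetical stored statistics of one previously learned class.
old_mean, old_cov = np.ones(d), np.eye(d)
new_mean, new_cov = compensate_gaussian(old_mean, old_cov, W)
```

Because a linear map of a Gaussian is again Gaussian, only the stored mean and covariance of each old class need updating; no old-class samples are required in this sketch.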

Paper Structure

This paper contains 41 sections, 45 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of the SLDC framework. The framework consists of three phases: (1) sequential fine-tuning with optional distillation (SeqFT/SeqKD); (2) distribution compensation using an approximated transition operator, either linear ($\alpha_{1}$-SLDC) or weakly nonlinear ($\alpha_{2}$-SLDC), which maps the feature distributions of previous classes into the updated feature space; (3) classifier refinement using synthetic features sampled from the compensated Gaussian distributions.
  • Figure 2: Performance comparison of SLDC methods on a 20-task sequence, demonstrating state-of-the-art results both with and without knowledge distillation.
  • Figure 3: Comparative performance of SLDC methods on hybrid CIL tasks comprising four distinct datasets: CIFAR-100, Cars-196, CUB-200, and ImageNet-R.
  • Figure 4: Performance comparison with varying temperature parameters $\alpha_{\rm temp}$ in $\alpha_{1}$-SLDC.
  • Figure 5: Performance evaluation of $\alpha_{2}$-SLDC with varying regularization coefficients $\gamma_{\alpha_{2}} \in \{0.1, 0.5, 1.0, 2.0\}$.
  • ...and 4 more figures
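
The third phase in the Figure 1 caption, classifier refinement on features sampled from per-class Gaussians, might look roughly like the following. This is a minimal sketch under stated assumptions: the softmax-regression classifier, the learning rate, the step count, and the toy class statistics are hypothetical choices and are not taken from the paper.

```python
import numpy as np

def sample_class_features(means, covs, n_per_class, rng):
    """Draw synthetic features from one Gaussian per class."""
    feats, labels = [], []
    for c, (mu, cov) in enumerate(zip(means, covs)):
        feats.append(rng.multivariate_normal(mu, cov, size=n_per_class))
        labels.append(np.full(n_per_class, c))
    return np.concatenate(feats), np.concatenate(labels)

def refine_classifier(feats, labels, n_classes, lr=0.5, steps=200):
    """Fit a linear softmax classifier with full-batch gradient descent."""
    n, d = feats.shape
    weights = np.zeros((d, n_classes))
    bias = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = feats @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        weights -= lr * feats.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

# Toy compensated statistics: three well-separated classes in four dimensions.
rng = np.random.default_rng(0)
d, n_classes = 4, 3
means = [3.0 * np.eye(d)[c] for c in range(n_classes)]
covs = [0.25 * np.eye(d) for _ in range(n_classes)]

feats, labels = sample_class_features(means, covs, 200, rng)
weights, bias = refine_classifier(feats, labels, n_classes)
accuracy = np.mean((feats @ weights + bias).argmax(axis=1) == labels)
```

In a real pipeline the `means` and `covs` would be the compensated statistics of all classes seen so far, so the classifier is retrained without access to old images.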

Theorems & Definitions (6)

  • Definition 1: Latent Space Transition Operator
  • proof
  • proof
  • proof
  • Remark 1
  • proof