Representation Calibration and Uncertainty Guidance for Class-Incremental Learning based on Vision Language Model

Jiantao Tan, Peixian Ma, Tong Yu, Wentao Zhang, Ruixuan Wang

TL;DR

Class-incremental learning with Vision-Language Models (VLMs) suffers from forgetting and cross-task class confusion as new classes are added. The authors propose a two-stage framework: first, task-specific adapters are trained on a frozen VLM image encoder; second, a cross-task representation calibration step based on a Mixture of Projectors (MoP) maps task-specific embeddings into a unified space. At inference, an entropy-based strategy selects the most appropriate calibrated feature. Empirical results on CIFAR100, ImageNet-R, Cars196, Skin40, and Mini-ImageNet demonstrate state-of-the-art performance under exemplar-free settings and good generalization across pre-trained backbones. The approach is parameter-efficient and its multi-branch inference can be parallelized, suggesting practical applicability for scalable continual learning with VLMs.

Abstract

Class-incremental learning requires a learning system to continually learn new classes while preserving previously learned knowledge of old classes. Current state-of-the-art methods based on Vision-Language Models (VLMs) still struggle to differentiate classes across learning tasks. Here, a novel VLM-based continual learning framework for image classification is proposed. In this framework, task-specific adapters are added to the pre-trained and frozen image encoder to learn new knowledge, and a novel cross-task representation calibration strategy based on a mixture of light-weight projectors is used to better separate all learned classes in a unified feature space, alleviating class confusion across tasks. In addition, a novel inference strategy guided by prediction uncertainty is developed to more accurately select the most appropriate image feature for class prediction. Extensive experiments on multiple datasets under various settings demonstrate the superior performance of our method compared to existing ones.
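The uncertainty-guided inference step can be illustrated with a minimal sketch. The abstract only states that prediction uncertainty guides the choice of image feature; the concrete rule used here (compute a softmax over class logits for each task-specific branch, then keep the branch with the lowest entropy) is an assumption, and the names `select_by_entropy` and `branch_logits` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(branch_logits):
    """Return (branch_index, predicted_class) for the least uncertain branch.

    branch_logits[k] holds the class logits obtained from the k-th
    task-specific branch. Assumption: the branch whose prediction has
    the lowest entropy is taken as the most appropriate one.
    """
    best_branch, best_class, best_entropy = None, None, float("inf")
    for idx, logits in enumerate(branch_logits):
        probs = softmax(logits)
        h = entropy(probs)
        if h < best_entropy:
            best_branch, best_entropy = idx, h
            best_class = max(range(len(probs)), key=probs.__getitem__)
    return best_branch, best_class


# Toy example: branch 0 is nearly uniform, branch 1 is confident in class 1.
branch_logits = [
    [1.0, 1.1, 0.9, 1.0],
    [0.2, 5.0, 0.1, 0.3],
]
branch, predicted_class = select_by_entropy(branch_logits)
```

In this toy input the second branch wins because its distribution is sharply peaked. Since each branch is scored independently, the per-branch computations could run in parallel, which matches the parallelizable multi-branch inference mentioned in the TL;DR.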

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Left: Performance comparison of different methods on Cars196 and ImageNet-R under the 10-task setting. Right: The image-text similarities of a class's image across all classes on CIFAR100 after incremental learning.
  • Figure 2: The proposed framework for exemplar-free class-incremental learning. For each new task, task-specific adapters are first trained (Training Stage-I), and then task-shared Mixture of Projectors (MoP) is optimized to help calibrate image embeddings from different task-specific adapted image encoders. A gating network in MoP adaptively adjusts image embeddings from each task-specific image encoder to help separate classes in the task-shared feature space. During inference, a novel feature selection strategy based on prediction uncertainty is proposed for more accurate class prediction.
  • Figure 3: Sensitivity analysis of $M$ (Left) and $\boldsymbol{N_p}$ (Right) on ImageNet-R (red) and Cars196 (blue) under the 10-task setting.
  • Figure 4: Further analysis of MoP (Left) and the inference strategy (Right) on CIFAR100 and ImageNet-R. The leftmost bar in each group serves as the baseline, and the numbers on the other bars give the relative change compared to that baseline. The black error bars represent the standard deviation.
  • Figure 5: Performance of different methods over the whole continual learning process on ImageNet-R and Cars196 under the 5-, 10- and 20-task settings.
  • ...and 1 more figure
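
The Figure 2 caption describes a gating network in the Mixture of Projectors (MoP) that adaptively adjusts image embeddings. A minimal sketch of one common gated-mixture form follows. It is not taken from the paper: the linear projectors, the softmax gate, and the names `mixture_of_projectors` and `gate_weights` are all assumptions made for illustration.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_projectors(embedding, projectors, gate_weights):
    """Gate-weighted sum of projector outputs for one image embedding.

    projectors: list of square matrices, one per light-weight projector
        (assumed linear here).
    gate_weights: matrix with one row per projector; the gating network
        is assumed to be a single linear layer followed by softmax.
    """
    gates = softmax(matvec(gate_weights, embedding))
    outputs = [matvec(p, embedding) for p in projectors]
    dim = len(outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, outputs)) for i in range(dim)]


# Toy example: two projectors (identity and 2x identity), uninformative gate.
embedding = [1.0, 2.0]
projectors = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
gate_weights = [[0.0, 0.0], [0.0, 0.0]]  # equal gate scores for both projectors
calibrated = mixture_of_projectors(embedding, projectors, gate_weights)
```

With zero gate weights both projectors receive a gate of 0.5, so the calibrated embedding is the average of the two projections. A trained gate would instead weight the projectors differently for each input embedding.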