
Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

Zitong Huang, Ze Chen, Zhixing Chen, Erjin Zhou, Xinxing Xu, Rick Siow Mong Goh, Yong Liu, Wangmeng Zuo, Chunmei Feng

TL;DR

This paper tackles Few-shot Class-Incremental Learning (FSCIL) by leveraging Vision-Language priors, specifically CLIP, and introducing Learning Prompt with Distribution-based Feature Replay (LP-DiF). The approach uses learnable prompts to adapt CLIP across sessions and constructs per-class Gaussian feature distributions estimated from real and VAE-synthesized features to enable pseudo-feature replay, reducing catastrophic forgetting. Key contributions include a VAE-guided feature synthesis module, a diagonal-covariance Gaussian model for old classes, and a training objective that combines current-session learning with replay of old knowledge. Empirical results on CIFAR-100, mini-ImageNet, CUB-200, SUN-397, and CUB-200* show LP-DiF achieving SOTA performance and closely approaching a joint-training upper bound, with lightweight memory requirements and efficient training time. The work highlights the practical value of combining CLIP-based foundations with prompt tuning and probabilistic replay for robust FSCIL performance.

Abstract

Few-shot Class-Incremental Learning (FSCIL) aims to continuously learn new classes from very limited training data without forgetting previously encountered ones. Existing studies rely solely on purely visual networks, whereas this paper addresses FSCIL by leveraging a Vision-Language model (e.g., CLIP) and proposes a simple yet effective framework named Learning Prompt with Distribution-based Feature Replay (LP-DiF). We observe that simply using CLIP for zero-shot evaluation can substantially outperform the most influential methods. We then apply prompt tuning to further improve its adaptation ability, allowing the model to continually capture session-specific knowledge. To prevent the learnable prompt from forgetting old knowledge in a new session, we propose a pseudo-feature replay approach. Specifically, we preserve the old knowledge of each class by maintaining a feature-level Gaussian distribution with a diagonal covariance matrix, estimated from the image features of training images and from synthesized features generated by a VAE. When progressing to a new session, pseudo-features sampled from the old-class distributions are combined with the training images of the current session to optimize the prompt, enabling the model to learn new knowledge while retaining old knowledge. Experiments on three prevalent benchmarks (CIFAR-100, mini-ImageNet, and CUB-200) and on two more challenging benchmarks proposed in this paper (SUN-397 and CUB-200$^*$) demonstrate the superiority of LP-DiF, which achieves a new state of the art (SOTA) in FSCIL. Code is publicly available at https://github.com/1170300714/LP-DiF.
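
The per-class distribution described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the toy feature dimension, and the random arrays standing in for CLIP image features and VAE-synthesized features are all assumptions made for illustration. It only shows the general idea of fitting a diagonal-covariance Gaussian to a class's features and sampling pseudo-features from it.

```python
import numpy as np


def estimate_diag_gaussian(real_feats, synth_feats=None, eps=1e-6):
    """Fit a diagonal-covariance Gaussian to the features of one class.

    real_feats:  (n_real, d) array of features from real training images.
    synth_feats: optional (n_synth, d) array of synthesized features
                 (here a stand-in for VAE outputs; hypothetical).
    Returns the mean vector and the per-dimension variance, both of shape (d,).
    """
    if synth_feats is None:
        feats = real_feats
    else:
        feats = np.concatenate([real_feats, synth_feats], axis=0)
    mean = feats.mean(axis=0)
    # Population variance per dimension; eps keeps every variance positive.
    var = feats.var(axis=0) + eps
    return mean, var


def sample_pseudo_features(mean, var, n, rng):
    """Draw n pseudo-features from N(mean, diag(var))."""
    noise = rng.standard_normal((n, mean.shape[0]))
    return mean + np.sqrt(var) * noise


rng = np.random.default_rng(0)
# Toy data: 5 "real" features (5-shot) and 50 "synthesized" ones, 8 dimensions.
real = rng.normal(loc=1.0, scale=0.5, size=(5, 8))
synth = rng.normal(loc=1.0, scale=0.5, size=(50, 8))

mean, var = estimate_diag_gaussian(real, synth)
pseudo = sample_pseudo_features(mean, var, n=4, rng=rng)
print(pseudo.shape)
```

Under this sketch, only the mean and variance vectors need to be kept per old class, that is, two vectors of length d rather than the training images themselves.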

Paper Structure (12 sections, 11 equations, 8 figures, 11 tables)

This paper contains 12 sections, 11 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Comparison of FSCIL methods in terms of Average Accuracy (%) on the test set of the mini-ImageNet benchmark [russakovsky2015imagenet] under the 5-shot setting. Red bars indicate SOTA vision-based models (e.g., CNNs [he2016deep]), while orange bars show V-L pretrained models applied to FSCIL, which significantly outperform their vision-based counterparts. Our method, marked in green, achieves 93.76%, surpassing CLIP+BDF by 9.13% and coming close to the upper bound (UB, highlighted in blue) obtained by learning the prompt in a joint-training manner.
  • Figure 2: Overview of our proposed LP-DiF. (a) In each session, we first train a VAE [kingma2013auto, wang2023improving] comprising the V-L model and lightweight components, i.e., MLPs and a learnable prompt, using the few training samples and the textual information of this session. (b) We preserve the knowledge of each class by estimating its feature-level statistical distribution. The mean vector and diagonal covariance matrix of the distribution are estimated from both the features of real images and the features synthesized by the trained VAE. (c) The prompt is trained jointly on the real images of the current session and the pseudo-features sampled from the old-class distributions.
  • Figure 3: Histogram visualization of the statistical distribution of image features. As an example, we take several dimensions (dim) of the image features of classes $c_1$ and $c_2$ from the mini-ImageNet benchmark [russakovsky2015imagenet], extracted by the image encoder of CLIP (ViT-B/16) [radford2021learning]. Each sub-figure shows the histogram of the corresponding random variable $Z_{cd}$, where $c$ and $d$ denote the class index and the feature dimension, respectively. The histograms suggest that 1) each dimension of the per-class image features is approximately Gaussian; 2) the distribution of the same dimension varies across classes, e.g., $Z_{c_{1}1}$ vs. $Z_{c_{2}1}$.
  • Figure 4: Accuracy curves of LP-DiF and its counterparts on (a) SUN-397 and (b) CUB-200*. LP-DiF significantly surpasses both CLIP and BiDistFSCIL and attains performance very close to the respective upper bounds.
  • Figure 5: Ablation studies of our LP-DiF. (a) Comparison with the method of incorporating a Linear Classifier (LC) into a pre-trained image encoder for training on three common benchmarks. (b) Analysis of $M$ on three common benchmarks. (c) Analysis of $B$ and $\lambda_\text{o}$ in terms of Avg on CUB-200.
  • ...and 3 more figures
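
Panel (c) of Figure 2 describes training the prompt on real images of the current session together with pseudo-features replayed from old classes. A minimal sketch of what such a combined objective could look like is given below. It is an assumption-laden illustration, not the paper's exact loss: the CLIP-style cosine-similarity logits, the temperature value, the function names, and the reading of $\lambda_\text{o}$ (a symbol that appears in Figure 5(c)) as a weight on the old-class term are all hypothetical choices, and random arrays stand in for image and text features.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def clip_style_logits(image_feats, text_feats, temperature=0.01):
    """Cosine similarity between image and class text features, scaled by 1/temperature."""
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T / temperature


def cross_entropy(logits, labels):
    """Mean cross-entropy with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def joint_loss(new_feats, new_labels, old_feats, old_labels, text_feats, lambda_o=1.0):
    """Current-session loss plus a weighted replay loss on old-class pseudo-features.

    lambda_o as the replay weight is an assumed interpretation.
    """
    loss_new = cross_entropy(clip_style_logits(new_feats, text_feats), new_labels)
    loss_old = cross_entropy(clip_style_logits(old_feats, text_feats), old_labels)
    return loss_new + lambda_o * loss_old


rng = np.random.default_rng(0)
d, n_classes = 8, 6
text_feats = rng.standard_normal((n_classes, d))   # one text feature per class (toy)
new_feats = rng.standard_normal((10, d))           # real features, current session
new_labels = rng.integers(3, 6, size=10)           # new classes: 3, 4, 5
old_feats = rng.standard_normal((12, d))           # sampled pseudo-features
old_labels = rng.integers(0, 3, size=12)           # old classes: 0, 1, 2

loss = joint_loss(new_feats, new_labels, old_feats, old_labels, text_feats, lambda_o=0.5)
print(float(loss))
```

In an actual prompt-tuning setup the text features would depend on the learnable prompt, and the gradient of this loss with respect to the prompt would drive the update; the sketch only evaluates the loss value.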