Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion
Haidong Kang, Ketong Qian, Yi Lu
TL;DR
This paper tackles catastrophic forgetting and the growth of training cost in Few-Shot Class-Incremental Learning (FSCIL) by proposing a training-free paradigm that replaces gradient updates with a conditional diffusion process. A frozen image-space diffusion model, conditioned on LLM-generated textual priors encoded by CLIP, synthesizes high-fidelity class exemplars; these are fused with the real few-shot observations to form robust class prototypes in CLIP space. The resulting CD-FSCIL framework achieves state-of-the-art performance with lower computational and memory overhead on miniImageNet, CIFAR-100, and CUB-200, suggesting a practical shift toward training-free continual adaptation. Because novel classes are learned by combining multimodal priors with diffusion-based generation rather than by gradient-based optimization, knowledge of the base classes is preserved.
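The prototype step described above can be illustrated with a minimal sketch. It is not the CD-FSCIL implementation: the embeddings are toy 3-D vectors standing in for CLIP image features, the diffusion generation itself is not shown, and the fusion rule (a convex combination of the mean real and mean synthetic embedding, weighted by a hypothetical `alpha`) is an assumption. All function names are hypothetical. The point it shows is that adding a class is a pure write of a new prototype, so existing prototypes are never modified by an optimization step.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (CLIP-style features are compared on the unit sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse_prototype(
    real: Sequence[Sequence[float]],
    synthetic: Sequence[Sequence[float]],
    alpha: float = 0.5,
) -> Vector:
    """Hypothetical fusion rule: convex combination of real and synthetic class means."""
    r = mean([l2_normalize(v) for v in real])
    s = mean([l2_normalize(v) for v in synthetic])
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(r, s)])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def classify(query: Sequence[float], prototypes: Dict[str, Vector]) -> str:
    """Nearest-prototype classification by cosine similarity."""
    return max(prototypes, key=lambda c: cosine(query, prototypes[c]))


# Base session: one (toy) base-class prototype.
prototypes: Dict[str, Vector] = {"base_cat": l2_normalize([1.0, 0.0, 0.0])}
base_snapshot = list(prototypes["base_cat"])

# Incremental session: add a novel class from one real shot plus two
# embeddings of generated exemplars, with no gradient step.
prototypes["novel_fox"] = fuse_prototype(
    real=[[0.1, 0.9, 0.0]],
    synthetic=[[0.0, 1.0, 0.1], [0.2, 0.8, 0.0]],
    alpha=0.5,
)

print(classify([0.05, 0.95, 0.0], prototypes))  # novel_fox
print(classify([0.9, 0.1, 0.0], prototypes))    # base_cat
```

Setting `alpha` closer to 1 trusts the few real shots more; closer to 0 trusts the generated exemplars more. How the paper actually weights the two sources is not specified in this summary.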
Abstract
Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the explosion in training cost that inevitably accompanies a growing number of novel classes, a consequence of relying on gradient-based learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the conditional diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that replaces the conventional gradient update with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representations under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural-language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample-scarcity issue and improves generalization across novel classes. Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.
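The abstract does not spell out the connection between gradient updates and conditional diffusion. One standard way to see such a link, offered here as an illustrative analogy and not necessarily the authors' derivation, is the score-based (Langevin) view of sampling: each step moves a sample along the gradient of log p(x | c) and adds Gaussian noise, so its drift term has the same form as a gradient-descent step on the loss -log p(x | c) with step size η/2. The symbols η (step size), c (conditioning signal), and z_k (noise) are introduced for illustration.

```latex
% Gradient descent on parameters \theta with loss \mathcal{L}:
\theta_{k+1} = \theta_k - \eta \, \nabla_{\theta} \mathcal{L}(\theta_k)

% Langevin-type sampling of x under condition c (score-based view of diffusion):
x_{k+1} = x_k + \frac{\eta}{2} \, \nabla_{x} \log p(x_k \mid c) + \sqrt{\eta} \, z_k,
\qquad z_k \sim \mathcal{N}(0, I)

% Bayes' rule splits the conditional score; p(c) does not depend on x:
\nabla_{x} \log p(x \mid c) = \nabla_{x} \log p(x) + \nabla_{x} \log p(c \mid x)
```

The key difference in this analogy is where the update acts: gradient descent changes the parameters θ, whereas the sampling step changes the sample x while the generative model's weights stay frozen, which is consistent with the training-free framing above.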
