Breaking Forgetting: Training-Free Few-Shot Class-Incremental Learning via Conditional Diffusion

Haidong Kang, Ketong Qian, Yi Lu

TL;DR

This paper tackles catastrophic forgetting and training-cost growth in FSCIL by proposing a training-free paradigm that replaces gradient updates with a conditional diffusion process. A frozen image-space diffusion model, conditioned on LLM-generated textual priors encoded by CLIP, synthesizes high-fidelity class exemplars, which are fused with real few-shot observations to form robust prototypes in CLIP space. The CD-FSCIL framework demonstrates state-of-the-art performance and reduced computation/memory overhead on miniImageNet, CIFAR-100, and CUB-200, indicating a practical shift toward training-free continual adaptation. By integrating multimodal priors with diffusion-based generation, the approach preserves base knowledge while enabling effective learning of novel classes without gradient-based optimization.
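The prototype step summarized above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the convex fusion weight `alpha`, the helper names `fuse_prototype` and `classify`, and the random vectors standing in for CLIP features of real and diffusion-generated images are all assumptions made for illustration. The sketch only shows how a prototype could be formed from two feature sets and used for nearest-prototype classification without any gradient step.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def fuse_prototype(real_feats, synth_feats, alpha=0.5):
    """Fuse real few-shot features with synthetic-exemplar features.

    `alpha` is a hypothetical mixing weight; the source does not state
    how the two feature sets are combined.
    """
    real_mean = l2_normalize(real_feats).mean(axis=0)
    synth_mean = l2_normalize(synth_feats).mean(axis=0)
    return l2_normalize(alpha * real_mean + (1.0 - alpha) * synth_mean)


def classify(query_feats, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    sims = l2_normalize(query_feats) @ l2_normalize(prototypes).T
    return sims.argmax(axis=1)


# Toy stand-ins for CLIP features: three well-separated class directions.
rng = np.random.default_rng(0)
dim = 8
centers = np.eye(dim)[:3] * 5.0

prototypes = np.stack([
    fuse_prototype(
        c + 0.1 * rng.normal(size=(5, dim)),   # 5 "real" shots per class
        c + 0.1 * rng.normal(size=(20, dim)),  # 20 "synthetic" exemplars
    )
    for c in centers
])

queries = centers + 0.1 * rng.normal(size=(3, dim))
predictions = classify(queries, prototypes)
print(predictions)
```

Adding a new class in this sketch means appending one more row to `prototypes`; existing rows are never modified, which is the sense in which a prototype-based scheme avoids overwriting earlier classes.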

Abstract

Efforts to overcome catastrophic forgetting in Few-Shot Class-Incremental Learning (FSCIL) have primarily focused on developing more effective gradient-based optimization strategies. In contrast, little attention has been paid to the training cost explosion that inevitably arises as the number of novel classes increases, a consequence of relying on gradient learning even under extreme data scarcity. More critically, since FSCIL typically provides only a few samples for each new class, gradient-based updates not only induce severe catastrophic forgetting on base classes but also hinder adaptation to novel ones. This paper seeks to break this long-standing limitation by asking: Can we design a training-free FSCIL paradigm that entirely removes gradient optimization? We provide an affirmative answer by uncovering an intriguing connection between gradient-based optimization and the Conditional Diffusion process. Building on this observation, we propose a Conditional Diffusion-driven FSCIL (CD-FSCIL) framework that substitutes the conventional gradient update process with a diffusion-based generative transition, enabling training-free incremental adaptation while effectively mitigating forgetting. Furthermore, to enhance representation under few-shot constraints, we introduce a multimodal learning strategy that integrates visual features with natural language descriptions automatically generated by Large Language Models (LLMs). This synergy substantially alleviates the sample scarcity issue and improves generalization across novel classes. Extensive experiments on mainstream FSCIL benchmarks demonstrate that our method not only achieves state-of-the-art performance but also drastically reduces computational and memory overhead, marking a paradigm shift toward training-free continual adaptation.

Paper Structure

This paper contains 21 sections, 12 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of the proposed CD-FSCIL framework. (a) An image-level conditional diffusion model learns to denoise visual samples $\mathbf{v}_t$ into clean images $\mathbf{v}_0$, guided by textual prototypes $\mathbf{p}_c$. (b) A frozen CLIP encoder extracts visual features $\mathbf{x}$ and textual embeddings $\mathbf{p}_c$, aligning them in a shared semantic space. (c) During inference, the diffusion model generates representative visual samples, which are then encoded by the CLIP encoder from (b) to form prototypes $\hat{\mathbf{x}}_c$. Classification is performed via cosine similarity between the query feature $\mathbf{x}_q$ and $\hat{\mathbf{x}}_c$, achieving training-free incremental adaptation.
  • Figure 2: CD-FSCIL vs. peer competitors on the FSCIL task (CUB-200 dataset).
  • Figure 3: CD-FSCIL vs. peer competitors on the FSCIL task (CIFAR-100 dataset).
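
The classification step described in the Figure 1 caption can be written as a nearest-prototype rule. The symbols $\mathbf{x}_q$ and $\hat{\mathbf{x}}_c$ come from the caption; the predicted label $\hat{y}$ and the set $\mathcal{C}$ of all classes seen so far are notation assumed here for illustration:

$$
\hat{y} = \arg\max_{c \in \mathcal{C}} \; \frac{\mathbf{x}_q^{\top} \hat{\mathbf{x}}_c}{\lVert \mathbf{x}_q \rVert \, \lVert \hat{\mathbf{x}}_c \rVert}
$$

Under this reading, learning a new class only enlarges $\mathcal{C}$ with one more prototype, and no parameters are updated.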