Towards Lifelong Few-Shot Customization of Text-to-Image Diffusion

Nan Song, Xiaofeng Yang, Ze Yang, Guosheng Lin

TL;DR

This study divides catastrophic forgetting in lifelong few-shot customization into two categories, relevant concepts forgetting and previous concepts forgetting, and develops an In-Context Generation (ICGen) paradigm that conditions the diffusion model on an input vision context, which facilitates few-shot generation and mitigates previous concepts forgetting.

Abstract

Lifelong few-shot customization for text-to-image diffusion aims to continually adapt existing models to new tasks with minimal data while preserving old knowledge. Current customized diffusion models excel at few-shot tasks but suffer from catastrophic forgetting in lifelong generation. In this study, we identify two categories of catastrophic forgetting: relevant concepts forgetting and previous concepts forgetting. To address these challenges, we first devise a data-free knowledge distillation strategy to tackle relevant concepts forgetting. Unlike existing methods that rely on additional real data or offline replay of original concept data, our approach enables on-the-fly knowledge distillation to retain the previous concepts while learning new ones, without accessing any previous data. Second, we develop an In-Context Generation (ICGen) paradigm that conditions the diffusion model on an input vision context, which facilitates few-shot generation and mitigates previous concepts forgetting. Extensive experiments show that the proposed Lifelong Few-Shot Diffusion (LFS-Diffusion) method can produce high-quality and accurate images while maintaining previously learned knowledge.

Paper Structure

This paper contains 21 sections, 4 equations, 11 figures, 8 tables, 2 algorithms.

Figures (11)

  • Figure 1: Lifelong few-shot text-to-image diffusion. Lifelong few-shot learning aims to continually learn multiple sessions of tasks without forgetting the previous ones. In each session, the training dataset contains a few images of a new concept. The model learns the new concept during the training phase, enabling it to generate both the novel concept from the current session and previously learned concepts from past sessions. Current customization diffusion models suffer from catastrophic forgetting problems in lifelong generation tasks.
  • Figure 2: We identify two catastrophic forgetting problems in lifelong generation: Relevant Concepts Forgetting (RCF) and Previous Concepts Forgetting (PCF). Left: RCF refers to forgetting the relevant concepts related to the new concepts. After training on session 1 with "$V_1$ cat", the model tends to generate images of the relevant concept "cat" that all resemble "$V_1$ cat". Right: PCF denotes forgetting the concepts learned in previous sessions. For instance, in session $i$, the previous concept "blue $V_2$ chair" can no longer be generated after learning a new concept "$V_i$ flower".
  • Figure 3: Our framework for lifelong few-shot text-to-image diffusion includes two main stages: (1) the Diffusion Model Training stage, which learns the diffusion model ${\Phi}_{\epsilon}$ using the training data in the current session $\mathcal{D}_{train}^i$, and (2) the In-Context Generation stage, which generates images using the test prompt $P_{test}^i$ for session $S^i$ and all the previous sessions. The Diffusion Model Training stage includes the normal diffusion process with the few-shot training data ($L_{DM}$) and the denoising process that distills knowledge from the previous session to the current session ($L_{KD}$). A vision context $Z_0^{S^i}$ is also sampled to support In-Context Generation. During inference for session $S^i$, a vision context-guided text-to-image diffusion is performed without additional training. In-Context Generation can improve few-shot learning performance and prevent forgetting of concepts from previous sessions.
  • Figure 4: Data-free knowledge distillation. Top: the reverse diffusion process of the teacher and student models without distillation. The student model tends to forget the relevant "cat" concept and can only generate "$V_1$ cat" instead. Bottom: Data-free knowledge distillation for timestep $t$ and timestep $t-1$. At each timestep $t$, both models start from the same teacher latent ${z}_{t}^*$, which prevents discrepancies between the teacher and student outputs from accumulating. The knowledge distillation loss is applied between the teacher model's noise prediction and the student model's noise prediction, regularizing the student model while preserving the relevant concept.
  • Figure 5: Analysis of data-free knowledge distillation along different trajectories. We present the student model's decoded results at the last DDPM timestep. We employ a DDPM scheduler with 25 timesteps during training for distillation. (a) Following the student model's reverse diffusion trajectory causes artifacts as the number of training steps increases. (b) Using the teacher latent as input for both the teacher and student models at each timestep, followed by distillation between the model outputs, helps eliminate the artifacts.
  • ...and 6 more figures
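
The teacher-trajectory distillation described in the captions of Figures 4 and 5 can be illustrated with a toy sketch. This is not the authors' implementation: the linear "noise predictors", the simplified reverse step, and all names (`kd_pass`, `run_demo`, `STEP_SIZE`, `LR`) are hypothetical stand-ins chosen so the example runs with NumPy alone. A real system would use text-conditioned U-Nets and a DDPM scheduler. The only parts taken from the captions are that each pass starts from noise rather than stored data, that both models receive the same teacher latent at every timestep, that the loss compares their noise predictions, and that 25 timesteps are used.

```python
import numpy as np

DIM = 4          # toy latent dimension (assumption)
NUM_STEPS = 25   # number of timesteps, as in the Figure 5 caption
LR = 0.05        # toy learning rate (assumption)
STEP_SIZE = 0.1  # toy reverse-step size standing in for a scheduler (assumption)


def kd_pass(w_student, w_teacher, z_start):
    """Run one reverse pass along the teacher trajectory and distill at each step."""
    z = z_start.copy()
    losses = []
    for _ in range(NUM_STEPS):
        # Both models see the SAME teacher latent, so errors cannot accumulate.
        eps_teacher = w_teacher @ z
        eps_student = w_student @ z
        diff = eps_student - eps_teacher
        losses.append(float(np.mean(diff ** 2)))  # KD loss on noise predictions
        # Gradient of the mean squared error with respect to the student weights.
        grad = (2.0 / diff.size) * np.outer(diff, z)
        w_student = w_student - LR * grad
        # Advance the latent with the teacher prediction only (toy reverse step).
        z = z - STEP_SIZE * eps_teacher
    return w_student, float(np.mean(losses))


def run_demo(num_passes=200, seed=0):
    """Distill a drifted student back toward the teacher without any stored data."""
    rng = np.random.default_rng(seed)
    w_teacher = 0.5 * np.eye(DIM)  # frozen toy teacher
    w_student = w_teacher + 0.2 * rng.normal(size=(DIM, DIM))  # drifted student
    before = float(np.linalg.norm(w_student - w_teacher))
    for _ in range(num_passes):
        z_start = rng.normal(size=DIM)  # data-free: every pass starts from noise
        w_student, _ = kd_pass(w_student, w_teacher, z_start)
    after = float(np.linalg.norm(w_student - w_teacher))
    return before, after


before, after = run_demo()
print(f"student-teacher gap before: {before:.4f}, after: {after:.6f}")
```

In this toy setting the gap between student and teacher weights shrinks because each update only ever corrects the student on latents the teacher actually visits. Swapping the last line of the loop to use `eps_student` instead of `eps_teacher` would correspond to following the student trajectory, the variant that Figure 5(a) reports as causing artifacts in the real model.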