Reward Incremental Learning in Text-to-Image Generation

Maorong Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki

TL;DR

This paper proposes Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead and enables stable performance across sequential reward tasks. Experiments demonstrate that RID achieves consistent, high-quality generation in Reward Incremental Learning (RIL) scenarios.

Abstract

The recent success of denoising diffusion models has significantly advanced text-to-image generation. While these large-scale pretrained models show excellent performance in general image synthesis, downstream objectives often require fine-tuning to meet specific criteria such as aesthetics or human preference. Reward gradient-based strategies are promising in this context, yet existing methods are limited to single-reward tasks, restricting their applicability in real-world scenarios that demand adapting to multiple objectives introduced incrementally over time. In this paper, we first define this more realistic and unexplored problem, termed Reward Incremental Learning (RIL), in which models are expected to adapt to multiple downstream objectives incrementally. Additionally, as the models adapt to a stream of new objectives, we observe a unique form of catastrophic forgetting in diffusion model fine-tuning, which degrades image quality both in metrics and in visual structure. To address this catastrophic forgetting challenge, we propose Reward Incremental Distillation (RID), a method that mitigates forgetting with minimal computational overhead, enabling stable performance across sequential reward tasks. The experimental results demonstrate the efficacy of RID in achieving consistent, high-quality generation in RIL scenarios. The source code of our work will be publicly available upon acceptance.

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: We define Reward Incremental Learning (RIL), a novel task that aims to fine-tune a diffusion model on a sequence of downstream reward tasks incrementally. We adapted the current state-of-the-art diffusion fine-tuning method (DRaFT/AlignProp) to the RIL setting to form the baseline method (cf. Sec. \ref{subsec:baseline}). We observed catastrophic forgetting in both visual structure and metrics (cf. Sec. \ref{subsec:forgetting}) as the adapted baseline model is fine-tuned on more target objectives. Moreover, we propose Reward Incremental Distillation (RID), a computationally efficient approach that leverages EMA distillation and a LoRA adapter group to mitigate forgetting.
  • Figure 2: Overview of the proposed RID. RID has two main components: the LoRA adapter group and momentum (EMA) distillation. By combining the two, RID achieves improved robustness against forgetting, generating images that not only align well with target objectives but also maintain high general quality.
  • Figure 3: Comparison of the conventional LoRA adapter and the proposed LoRA adapter group in a pretrained diffusion layer. Unlike the existing diffusion fine-tuning strategies that train a single pair of LoRA adapters, the adapter group expands and initializes a pair of new LoRA matrices when a new reward task arrives. The adapter group separates parameters for different reward tasks, for better knowledge retention.
  • Figure 4: Comparison between the naive full-step distillation strategy and the proposed last-step distillation strategy. (a) The full-step strategy aligns pixel-wise outputs across all denoising steps, starting from random noise $\bm{z}_N$, but suffers from high computational cost and error accumulation over time steps. (b) The last-step strategy only aligns the final diffusion step, reducing the computation and mitigating error accumulation, leading to a more efficient and stable fine-tuning process.
  • Figure 5: Qualitative comparison of generation results using a fixed task sequence in the RIL setting. For generation, we use novel prompts from the test dataset rather than reusing prompts from the training dataset. The adapted baseline suffers from severe forgetting as more tuning tasks are introduced, while RID consistently improves upon the original Stable Diffusion model.
  • ...and 1 more figure
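
The captions for Figures 2 to 4 name three mechanisms: a LoRA adapter group that grows by one pair of matrices per reward task, an EMA (momentum) teacher, and a distillation loss applied only at the last denoising step. The NumPy sketch below is a toy illustration of those ideas and is not the authors' implementation. The class and function names, the zero initialization of one LoRA matrix, the summation of all adapter pairs onto the frozen weight, and the use of mean squared error as the distillation loss are all assumptions made for illustration.

```python
import numpy as np


class LoRAAdapterGroup:
    """Frozen weight plus one low-rank (B, A) pair per reward task (hypothetical)."""

    def __init__(self, weight, rank=2, seed=0):
        self.weight = np.asarray(weight, dtype=float)  # frozen pretrained weight
        self.rank = rank
        self.rng = np.random.default_rng(seed)
        self.adapters = []  # one (B, A) pair per reward task seen so far

    def add_task(self):
        """Expand the group with a fresh pair when a new reward task arrives."""
        out_dim, in_dim = self.weight.shape
        A = self.rng.normal(0.0, 0.01, size=(self.rank, in_dim))
        B = np.zeros((out_dim, self.rank))  # zero init: new pair starts as a no-op
        self.adapters.append((B, A))

    def forward(self, x):
        # Assumption: contributions of all pairs are summed onto the frozen weight.
        delta = sum((B @ A for B, A in self.adapters), np.zeros_like(self.weight))
        return x @ (self.weight + delta).T


def ema_update(teacher, student, decay=0.99):
    """Return decay * teacher + (1 - decay) * student for each parameter array."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def last_step_distillation_loss(student_out, teacher_out):
    """Assumed MSE between student and teacher outputs at the final step only."""
    return float(np.mean((student_out - teacher_out) ** 2))


layer = LoRAAdapterGroup(np.eye(3))
layer.add_task()  # first reward task
layer.add_task()  # second reward task
x = np.ones((1, 3))
y = layer.forward(x)  # untrained adapters leave the frozen layer's output unchanged
teacher = ema_update([np.zeros(2)], [np.ones(2)], decay=0.9)
loss = last_step_distillation_loss(y, x)
```

Keeping a separate pair per task mirrors the caption's point that parameters for different reward tasks are kept apart. Comparing outputs only at the final step avoids storing and matching every intermediate denoising output, which is the cost the Figure 4 caption attributes to the full-step strategy.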