DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models

Daewon Chae, June Suk Choi, Jinkyu Kim, Kimin Lee

TL;DR

DiffExp tackles slow convergence in reward-based fine-tuning of text-to-image diffusion models caused by insufficient exploration of high-reward samples. It introduces two complementary strategies—dynamic CFG-scale scheduling and random prompt weighting—to promote diverse, high-signal samples during online optimization, and demonstrates improved sample efficiency for DDPO and AlignProp, including on SDXL and challenging prompts like DrawBench. Empirical results show DiffExp yields higher reward scores, better image-text alignment and aesthetics, and strong generalization to unseen prompts, often with ~20% fewer samples. The approach offers a practical, model-agnostic augmentation for reward-driven diffusion fine-tuning with broad applicability to modern high-resolution models.

Abstract

Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
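Strategy (a) can be illustrated with a minimal sketch. The abstract does not state the shape of the schedule, so the step function below (a low scale for the early, noisiest denoising steps and a high scale afterwards) is an assumption, as are the names `cfg_scale_schedule` and `apply_cfg` and the default values `low`, `high`, and `switch_frac`. The guidance formula itself is the standard classifier-free guidance combination of the unconditional and conditional noise predictions.

```python
def cfg_scale_schedule(step, num_steps, low=1.0, high=7.5, switch_frac=0.5):
    """Return the CFG scale for denoising step `step` (0 is the first, noisiest step).

    Assumption: a two-level step schedule. The low scale is used for the first
    `switch_frac` fraction of steps, the high scale for the remaining steps.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    return low if step < switch_frac * num_steps else high


def apply_cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy usage with hypothetical two-element "noise predictions".
num_steps = 10
scales = [cfg_scale_schedule(t, num_steps) for t in range(num_steps)]
guided = apply_cfg([0.0, 1.0], [1.0, 3.0], scales[-1])
```

In a real sampler, `apply_cfg` would operate on tensors from the denoising network; plain lists are used here only to keep the sketch free of dependencies.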

Paper Structure

This paper contains 31 sections, 9 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: (a) Images generated during reward optimization with our proposed method, DiffExp, given the prompt "a dolphin riding a bike," which is often challenging for existing reward fine-tuning approaches, including our baseline DDPO. (b) Corresponding reward curves against the number of reward queries, showing that our method improves sample efficiency during the reward optimization process by capturing good reward signals for reward fine-tuning.
  • Figure 2: An overview of our proposed method, DiffExp, which consists of two main steps: (a) random prompt weighting and (b) dynamic scheduling of the CFG (classifier-free guidance) scale. In (a), the word embeddings of the given prompt are each assigned a different random weight and then consumed by the image generation process, increasing the diversity of generated images. In (b), the CFG scale of the denoising process is dynamically scheduled so that the model generates images that are both high-quality and diverse, which is often difficult with a constant CFG scale.
  • Figure 3: Comparison of the generated images with different CFG scale scheduling strategies: (a) constantly high CFG scale, (b) constantly low CFG scale, and (c) dynamically scheduled CFG scale. We observe that a model often shows increased sample diversity with a low CFG scale but suffers from degraded image quality, which is generally the opposite with a high CFG scale. Instead, dynamically scheduling the CFG scale balances sample diversity and image quality.
  • Figure 4: Reward curves for training prompts. At each checkpoint, we generate 10 images per seen prompt and use the average of their reward scores. Our sampling method is employed only during fine-tuning, not for plotting this curve.
  • Figure 5: Generated samples from baselines and ours with (a) Aesthetic and (b) PickScore reward models. Notably, ours generates images with comparably high aesthetic quality (see (a)) and produces images with better image-text alignment given the prompts (see (b)). Note that images in the same column are generated with the same random seed.
  • ...and 14 more figures
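
The random prompt weighting described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the uniform sampling range, the per-token granularity, and the names `random_prompt_weights` and `weight_embeddings` are hypothetical, and the embeddings are toy lists rather than text-encoder outputs.

```python
import random


def random_prompt_weights(num_tokens, low=0.5, high=1.5, rng=None):
    """Sample one random weight per token.

    Assumption: weights are drawn independently and uniformly from [low, high].
    """
    if rng is None:
        rng = random.Random()
    return [rng.uniform(low, high) for _ in range(num_tokens)]


def weight_embeddings(embeddings, weights):
    """Scale each token embedding (a list of floats) by its own weight."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per token embedding")
    return [[w * x for x in emb] for emb, w in zip(embeddings, weights)]


# Toy usage: three hypothetical 2-dimensional token embeddings.
toy_embeddings = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
weights = random_prompt_weights(len(toy_embeddings), rng=random.Random(0))
weighted = weight_embeddings(toy_embeddings, weights)
```

Drawing a fresh set of weights for each generated sample would give each rollout a slightly different emphasis over the prompt, which is the source of added diversity that the caption describes.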