DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models
Daewon Chae, June Suk Choi, Jinkyu Kim, Kimin Lee
TL;DR
DiffExp tackles slow convergence in reward-based fine-tuning of text-to-image diffusion models, which stems from insufficient exploration of high-reward samples. It introduces two complementary strategies, dynamic CFG-scale scheduling and random prompt weighting, to produce diverse samples with strong reward signals during online optimization, and it improves the sample efficiency of DDPO and AlignProp, including on SDXL and on challenging prompt sets such as DrawBench. Empirically, DiffExp yields higher reward scores, better image-text alignment and aesthetics, and strong generalization to unseen prompts, often with ~20% fewer samples. The approach is a practical, model-agnostic augmentation for reward-driven diffusion fine-tuning that also applies to modern high-resolution models.
Abstract
Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
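To make the two strategies concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The classifier-free guidance combination rule is the standard one. Everything else is an assumption made for illustration: the two-phase schedule in `dynamic_cfg_scale`, the uniform sampling range in `random_phrase_weights`, and all function and parameter names are hypothetical, since the abstract does not specify the schedule or the weight distribution.

```python
import random


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move the unconditional noise
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def dynamic_cfg_scale(step, num_steps, low=1.0, high=7.5, switch_frac=0.5):
    """Hypothetical schedule (an assumption, not taken from the paper):
    use a low scale for the first `switch_frac` of the denoising steps
    and a high scale for the remaining steps."""
    return low if step < switch_frac * num_steps else high


def random_phrase_weights(phrases, low=0.5, high=1.5, rng=None):
    """Assign each prompt phrase an independent random weight.
    The uniform range [low, high] is an illustrative assumption."""
    if rng is None:
        rng = random.Random()
    return {phrase: rng.uniform(low, high) for phrase in phrases}
```

In an online fine-tuning loop, one would call `dynamic_cfg_scale` at every denoising step to pick the guidance scale passed to `cfg_combine`, and call `random_phrase_weights` once per sampled prompt so that different rollouts emphasize different phrases. How the weights are applied to the text embeddings depends on the text encoder and is left out of this sketch.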
