Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, Ling Pan
TL;DR
The paper proposes PAG, a GFlowNet-based (Generative Flow Network) framework for prompt adaptation in text-to-image diffusion that samples prompts from an unnormalized reward distribution to achieve both high aesthetic quality and diversity. It identifies mode collapse and a progressive loss of neural plasticity (dormant neurons) during naive GFlowNet fine-tuning, and mitigates them through flow reactivation, reward-prioritized sampling, and progressive reward decomposition. The method is validated on multiple prompt datasets and diffusion models, showing robustness across reward definitions and strong zero-shot transfer to unseen diffusion models, with ablations confirming the necessity of its components. Because it adapts prompts rather than the parameters of the diffusion model, the approach offers a practical pathway to diverse, high-quality image generation guided by black-box rewards, with broad applicability to different text-to-image systems.
Abstract
Recent advances in text-to-image diffusion models have achieved impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in similar postfixes and deterministic behavior. To this end, we introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we find that a naive application of GFlowNets suffers from mode collapse, and we uncover a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded by inefficient credit assignment in sequential prompt generation. To address these challenges, we develop a systematic approach in PAG with flow reactivation, reward-prioritized sampling, and reward decomposition for prompt adaptation. Extensive experiments validate that PAG successfully learns to sample effective and diverse prompts for text-to-image generation. We also show that PAG exhibits strong robustness across various reward functions and transferability to different text-to-image models.
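The shift from reward maximization to sampling from an unnormalized density can be illustrated with a toy sketch. This is not the PAG implementation: the candidate postfixes, their reward values, and the helper names (`maximize`, `target_probs`, `sample`) are all hypothetical, and the sketch assumes a small enumerable candidate set so that the normalizing constant can be computed directly, which is not possible in the real prompt space.

```python
import random
from collections import Counter

# Hypothetical candidate postfixes with made-up reward values (illustration only).
REWARDS = {
    "highly detailed, 4k": 0.9,
    "oil painting, soft light": 0.8,
    "studio photo, sharp focus": 0.7,
    "blurry sketch": 0.1,
}

def maximize(rewards):
    """Reward maximization: always return the single highest-reward candidate."""
    return max(rewards, key=rewards.get)

def target_probs(rewards):
    """Normalize the rewards into the distribution p(y) proportional to R(y)."""
    z = sum(rewards.values())  # normalizing constant, tractable only in this toy
    return {k: v / z for k, v in rewards.items()}

def sample(rewards, n, seed=0):
    """Draw n candidates with probability proportional to their reward."""
    rng = random.Random(seed)
    keys = list(rewards)
    return rng.choices(keys, weights=[rewards[k] for k in keys], k=n)

# Maximization collapses onto one postfix; sampling spreads over the good ones.
greedy_counts = Counter(maximize(REWARDS) for _ in range(1000))
sampled_counts = Counter(sample(REWARDS, 1000))

print("maximization:", dict(greedy_counts))
print("sampling:    ", dict(sampled_counts))
```

In this toy, maximization returns the same postfix on every call, while sampling returns each of the high-reward postfixes often and the low-reward one rarely. A GFlowNet aims for this second behavior without ever enumerating the candidates or computing the normalizing constant explicitly.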
