Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation

Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, Ling Pan

TL;DR

The paper proposes PAG, a Generative Flow Networks-based framework for prompt adaptation in text-to-image diffusion that samples prompts from an unnormalized reward distribution to achieve both high aesthetic quality and diversity. It identifies and mitigates mode collapse and progressive neural plasticity loss (dormant neurons) during naive GFlowNet fine-tuning through flow reactivation, reward-prioritized sampling, and progressive reward decomposition. The method is validated on multiple prompt datasets and diffusion models, showing robustness across reward definitions and strong zero-shot transfer to unseen diffusion models, with ablations confirming the necessity of its components. This approach offers a practical, model-parameter-free pathway to achieving diverse, high-quality image generation guided by black-box rewards, with broad applicability to different T2I systems.

Abstract

Recent advances in text-to-image diffusion models have achieved impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in generating similar postfixes and deterministic behaviors. To this end, we introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we identify that a naive application of GFlowNets suffers from mode collapse, and we uncover a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded by inefficient credit assignment in sequential prompt generation. To address this critical challenge, we develop a systematic approach in PAG with flow reactivation, reward-prioritized sampling, and reward decomposition for prompt adaptation. Extensive experiments validate that PAG successfully learns to sample effective and diverse prompts for text-to-image generation. We also show that PAG exhibits strong robustness across various reward functions and transferability to different text-to-image models.
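
The shift from reward maximization to sampling from an unnormalized density can be illustrated with a toy sketch. The candidate prompts and log-reward values below are invented for illustration and do not come from the paper, and the exact normalization stands in for what a trained GFlowNet policy would only approximate. A reward-maximizing policy returns the same prompt every time, while sampling with probability proportional to reward still favors strong prompts but keeps several of them in play.

```python
import math
import random

# Hypothetical candidate prompts with made-up log-rewards (not from the paper).
LOG_REWARDS = {
    "a cat, oil painting": 2.0,
    "a cat, studio photo": 1.9,
    "a cat, watercolor": 1.5,
    "a cat": 0.0,
}


def reward_maximizing_choice(log_rewards):
    """Return the single highest-reward prompt; the result is deterministic."""
    return max(log_rewards, key=log_rewards.get)


def target_distribution(log_rewards):
    """Normalize R(x) = exp(log R(x)) into p(x) = R(x) / Z.

    Subtracting the maximum log-reward first keeps exp() numerically stable
    and does not change the normalized probabilities.
    """
    max_lr = max(log_rewards.values())
    weights = {x: math.exp(lr - max_lr) for x, lr in log_rewards.items()}
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def sample_proportional(log_rewards, n, seed=0):
    """Draw n prompts with probability proportional to their reward."""
    rng = random.Random(seed)
    probs = target_distribution(log_rewards)
    prompts = list(probs)
    return rng.choices(prompts, weights=[probs[p] for p in prompts], k=n)


if __name__ == "__main__":
    print("argmax:", reward_maximizing_choice(LOG_REWARDS))
    samples = sample_proportional(LOG_REWARDS, n=200)
    print("distinct prompts sampled:", len(set(samples)))
```

In the real setting the space of prompts is far too large to enumerate, so the normalizing constant Z cannot be computed directly; that is the gap a learned sampling policy is meant to fill.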

Paper Structure

This paper contains 44 sections, 14 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: Comparison of adapted prompts and their corresponding images for Promptist (Hao et al., 2024), which is based on reward-maximizing RL, and our method, PAG. While Promptist leads to mode collapse in the prompt space and converges to similar outputs, PAG achieves high image quality while maintaining generation diversity.
  • Figure 2: A high-level illustration of PAG. Given an initial prompt, the LM generates adapted prompts. We then generate images from these prompts and obtain a reward. Using these observations, we fine-tune the LM as a GFlowNet policy so that it generates prompts with probability proportional to their reward.
  • Figure 3: Mode collapse issue in prompt adaptation with a naive application of GFlowNets. The proportion of dormant neurons steadily increases (left), while the diversity of generated prompts significantly decreases over training iterations (right).
  • Figure 4: Reward and diversity of prompts generated by each method with different initial prompt datasets.
  • Figure 5: Images generated from optimized prompts using Stable Diffusion v1.4 (with the same seed, to isolate the effect of the prompts). Our method generates diverse and highly aesthetic images based on adapted prompts of high quality and diversity.
  • ...and 9 more figures
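
Figure 3 tracks the proportion of dormant neurons during training. As a rough sketch of how such a proportion can be measured, the snippet below follows the common definition from the dormant-neuron literature: a neuron's score is its mean absolute activation divided by the layer-wide average of that quantity, and the neuron counts as dormant when the score is at most a threshold `tau`. Whether the paper uses exactly this score and threshold is an assumption here, and the function name, the default `tau=0.1`, and the input layout are all hypothetical.

```python
def dormant_ratio(activations, tau=0.1):
    """Fraction of neurons in one layer whose normalized activation is <= tau.

    activations: list of rows, one row per input example, each row holding
    one activation value per neuron in the layer (assumed layout).
    """
    num_inputs = len(activations)
    num_neurons = len(activations[0])

    # Mean absolute activation of each neuron across the input batch.
    mean_abs = [
        sum(abs(row[j]) for row in activations) / num_inputs
        for j in range(num_neurons)
    ]

    # Normalize by the layer-wide average so scores are comparable across layers.
    layer_mean = sum(mean_abs) / num_neurons
    if layer_mean == 0.0:
        return 1.0  # every neuron is silent on this batch

    scores = [m / layer_mean for m in mean_abs]
    return sum(1 for s in scores if s <= tau) / num_neurons


if __name__ == "__main__":
    # Two inputs, four neurons: the second and fourth neurons barely fire.
    batch = [
        [1.0, 0.0, 2.0, 0.01],
        [3.0, 0.0, 2.0, 0.01],
    ]
    print(dormant_ratio(batch))
```

Logging this ratio per layer over training iterations would give a curve comparable in spirit to the left panel of Figure 3.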