Dynamic Prompt Optimizing for Text-to-Image Generation

Wenyi Mo; Tianyu Zhang; Yalong Bai; Bing Su; Ji-Rong Wen; Qing Yang

TL;DR

This work tackles prompt sensitivity in diffusion-based text-to-image generation by introducing Prompt Auto-Editing (PAE), a two-stage framework that converts plain prompts into Dynamic Fine-Control Prompts (DF-Prompts), which modulate each word's influence across denoising steps. In Stage 1, a plain-prompt refinement model $\mathcal{E}_{\mathrm{ReP}}$ is trained autoregressively on automatically filtered prompt–image data and produces refined prompts $\mathbf{s}^{\mathrm{ReP}}$. In Stage 2, a policy model $\mathcal{E}_{\mathrm{DFP}}$, initialized from $\mathcal{E}_{\mathrm{ReP}}$, is optimized with online PPO. It predicts triples $\langle x_i, \tau_i, w_i\rangle$ (token, effect range, weight) that form $A^{\mathrm{DFP}}$ and yield $\mathbf{s}^{\mathrm{DFP}} = \mathbf{s} \oplus A^{\mathrm{DFP}}$. Training is guided by a reward that combines CLIP alignment, aesthetic quality, and human preference, with KL regularization. Across Lexica.art, DiffusionDB, and COCO, PAE improves human-preference metrics and aesthetic scores, and qualitative results show richer textures and styles without sacrificing semantic fidelity. The approach automates fine-grained control over image generation for users who want high-quality, semantically faithful outputs from diffusion models.
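To make the triple notation concrete, here is a minimal Python sketch of how a DF-Prompt could be represented and how a modifier's weight could vary over denoising steps. It is an illustration, not the authors' implementation. The `Modifier` class, the `[token:start-end:weight]` rendering, and the choice to drop a modifier (weight 0) outside its effect range are all assumptions made for this example. The effect range is assumed to be a pair of fractions of the denoising schedule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Modifier:
    """One DF-Prompt triple <x_i, tau_i, w_i> (hypothetical representation)."""
    token: str                          # x_i: the modifier word or phrase
    effect_range: Tuple[float, float]   # tau_i: assumed (start, end) fractions of the schedule
    weight: float                       # w_i: strength while the modifier is active


def compose(plain_prompt: str, modifiers: List[Modifier]) -> str:
    """Render s_DFP = s (+) A_DFP using a hypothetical [token:start-end:weight] syntax."""
    rendered = [
        f"[{m.token}:{m.effect_range[0]:g}-{m.effect_range[1]:g}:{m.weight:g}]"
        for m in modifiers
    ]
    return ", ".join([plain_prompt] + rendered)


def weight_at_step(m: Modifier, step: int, total_steps: int) -> float:
    """Weight applied to a modifier at a 0-indexed denoising step.

    Assumption: the modifier contributes nothing outside its effect range.
    """
    progress = step / total_steps
    start, end = m.effect_range
    return m.weight if start <= progress < end else 0.0


if __name__ == "__main__":
    mods = [
        Modifier("anime", (0.0, 1.0), 1.5),      # stronger weight on every step
        Modifier("detailed", (0.0, 0.15), 1.0),  # only in the first 15% of steps
    ]
    print(compose("a portrait of a girl", mods))
    print([weight_at_step(mods[1], t, 50) for t in (0, 5, 10)])
```

The two example modifiers mirror the settings described in the Figure 1 caption: a weight of 1.5 on "anime", and "detailed" applied only during the first 15% of denoising steps.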

Abstract

Text-to-image generative models, specifically those based on diffusion models like Imagen and Stable Diffusion, have made substantial advancements. Recently, there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However, the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps, which requires significant manual intervention. To address this, we introduce the Prompt Auto-Editing (PAE) method. Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE.
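The abstract names three reward signals (aesthetic score, semantic consistency, user preference) but this page does not give the exact formula. The sketch below shows one plausible way such signals could be combined, purely as an illustration. The weighted sum of improvements over the plain-prompt image, the weights `w_aes`, `w_clip`, `w_pref`, the coefficient `beta`, and the `Scores` container are all assumptions, not the paper's definition. The scores themselves would come from external models that are not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Scores for one generated image (all produced by external models, not shown)."""
    aesthetic: float        # aesthetic predictor output
    clip_alignment: float   # assumed: CLIP similarity to the ORIGINAL plain prompt
    preference: float       # human-preference model output


def combined_reward(plain: Scores, edited: Scores, kl_penalty: float,
                    w_aes: float = 1.0, w_clip: float = 1.0,
                    w_pref: float = 1.0, beta: float = 0.1) -> float:
    """Hypothetical reward: weighted gain of the edited prompt's image over the
    plain prompt's image, minus a KL term that keeps the policy near its
    initialization."""
    gain = (
        w_aes * (edited.aesthetic - plain.aesthetic)
        + w_clip * (edited.clip_alignment - plain.clip_alignment)
        + w_pref * (edited.preference - plain.preference)
    )
    return gain - beta * kl_penalty


if __name__ == "__main__":
    plain = Scores(aesthetic=5.0, clip_alignment=0.25, preference=0.5)
    edited = Scores(aesthetic=6.0, clip_alignment=0.25, preference=0.75)
    print(combined_reward(plain, edited, kl_penalty=2.0, beta=0.5))
```

Measuring alignment against the original prompt rather than the edited one is a deliberate choice in this sketch: it penalizes edits that look better but drift away from what the user asked for.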

Paper Structure (13 sections, 4 equations, 12 figures, 10 tables)

Introduction
Related work
Method
Definitions of Dynamic Fine-control Prompt
Overview of PAE
Finetuning for Plain Prompt Refinement
RL for DF-Prompt Generation
Experiments
Experimental Setup
Implementation Details
Evaluation and Analysis
Ablation Study
Conclusion

Figures (12)

Figure 1: Generation results with the same seed using dynamic fine-control prompt (one plain token is extended into a triple of $\left\langle \text{token}, \text{effect range}, \text{weight} \right\rangle$). It can be seen that (a) increasing the weight of anime to 1.5 can amplify the sense of anime; (b) applying the word detailed in the first 15% denoising timesteps can generate more natural texture details than applying it in all timesteps.
Figure 2: The training process of PAE. (Stage 1) We select the training prompts based on a confidence score $\mathcal{S}$, as defined in the paper's confidence-score equation, then fine-tune a pre-trained language model. The result is $\mathcal{E}_\mathrm{ReP}$, a model that produces refined prompts. (Stage 2) We initialize the policy model $\mathcal{E}_\mathrm{DFP}$ using $\mathcal{E}_\mathrm{ReP}$ and add two linear heads to it. These heads, along with the one predicting word tokens, use the same intermediate representation of the model for their predictions. We then transform these predictions into DF-prompts. These DF-prompts modify the text injection mode of the diffusion model $\mathcal{M}$, which in turn affects the output images. During the online exploration, we use the original plain prompt $\mathbf{s}$, the optimized DF-prompt ${\mathbf{s}}^\mathrm{DFP}$, and their respective images $\mathbf{I}$ and ${\mathbf{I}}^\mathrm{DFP}$ to compute the reward $R$. Finally, we update the policy model by minimizing the loss function defined in the paper's online objective.
Figure 3: Generated images using Stable Diffusion v1.4 with short prompts, Promptist (Hao et al., 2022), and our method. In each column, the images are generated using the same random seed. Our method shows the ability to moderately expand the semantic content, such as "in a scenic environment", "with gorgeous hair face illustration", "on a ship deck" and "for 50 years." These expansions stimulate users' imagination while enhancing the comprehensiveness and aesthetic quality of the image.
Figure 4: Our method generates DF-Prompts, which yield images with more detailed textures and richer backgrounds than the refined prompts, for a better visual effect. In each column, the images are generated using the same random seed.
Figure 5: (a) The 15 most frequently generated modifiers. (b) to (d) The frequency of different combinations of settings.
...and 7 more figures
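
The Figure 2 caption describes a policy with a token head plus two extra linear heads whose predictions are turned into DF-prompts. A minimal sketch of that last conversion step follows, assuming each extra head scores a small set of discrete candidates per token. The candidate lists `RANGE_CHOICES` and `WEIGHT_CHOICES`, the function names, and the use of greedy argmax decoding are hypothetical; the actual candidate sets and decoding rule are not given on this page, and an RL policy would typically sample from these scores during exploration rather than take the argmax.

```python
from typing import List, Sequence, Tuple

# Hypothetical discrete candidates for the two extra heads.
RANGE_CHOICES: List[Tuple[float, float]] = [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0)]
WEIGHT_CHOICES: List[float] = [0.75, 1.0, 1.25, 1.5]

Triple = Tuple[str, Tuple[float, float], float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def decode_triples(tokens: Sequence[str],
                   range_logits: Sequence[Sequence[float]],
                   weight_logits: Sequence[Sequence[float]]) -> List[Triple]:
    """Greedily map per-token head scores to <token, effect range, weight> triples."""
    if not (len(tokens) == len(range_logits) == len(weight_logits)):
        raise ValueError("tokens and head outputs must have the same length")
    return [
        (tok, RANGE_CHOICES[argmax(r)], WEIGHT_CHOICES[argmax(w)])
        for tok, r, w in zip(tokens, range_logits, weight_logits)
    ]


if __name__ == "__main__":
    triples = decode_triples(
        tokens=["anime", "detailed"],
        range_logits=[[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]],
        weight_logits=[[0.0, 0.1, 0.2, 3.0], [0.1, 2.0, 0.3, 0.0]],
    )
    for t in triples:
        print(t)
```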
