Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation

Hongji Yang, Yucheng Zhou, Wencheng Han, Jianbing Shen

TL;DR

The paper tackles the challenge of crafting effective prompts for text-to-image generation by introducing a self-rewarding framework that uses a single LVLM as both prompt solver and evaluator. It constructs a data-efficient pipeline, pre-trains via supervised rewriting and evaluation, and then iteratively improves prompts through reinforcement learning guided by AI-generated rewards (aesthetic and alignment scores). Key contributions include formalizing RL from AI Feedback with Direct Preference Optimization and demonstrating state-of-the-art performance on Beautiful-Prompt and DiffusionDB with reduced data requirements. The work extends LVLM capabilities to not only rewrite prompts but also judge image quality, enabling fully autonomous self-improvement in prompt optimization. Practically, this approach reduces reliance on human annotations and shows promise for more accessible, high-quality text-to-image generation.
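
For reference, the standard Direct Preference Optimization objective is written out below. This is the generic textbook form, not necessarily the exact variant used in the paper; the mapping of symbols onto this setting is an assumption made here for illustration: $x$ is the user prompt, $y_w$ and $y_l$ are the preferred and rejected rewritten prompts, $\pi_\theta$ is the LVLM being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model (for example the SFT checkpoint), $\sigma$ is the logistic function, and $\beta$ is a temperature hyperparameter.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Minimizing this loss raises the likelihood of $y_w$ relative to $y_l$ while keeping the policy close to the reference model, which is why preference pairs alone can drive the update without a separately trained reward model.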

Abstract

Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.

Paper Structure

This paper contains 26 sections, 10 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The motivation of our prompt optimization pipeline. During training, the large model continuously improves its prompt rewriting and image quality evaluation capabilities through self-play, without any external sources or models. The figure also shows images generated from the user prompts and from the rewritten prompts; the images from the rewritten prompts have higher aesthetic quality.
  • Figure 2: Four types of prompt optimization framework. They differ mainly in how the reward is generated. (a) The reward function (or metric) is pre-defined for reinforcement learning, typically as mathematical equations over text or images, as in [hao2024optimizing]. (b) A feedback model is trained on a large annotated dataset (typically manual scores) and then used for RL, for example with PPO, as in [cao2023beautifulprompt; rosenman2023neuroprompts]. (c) Rewards are generated by a fixed external LVLM, so the upper bound of evaluation depends on the external model. (d) The rewriter and the reward model share the same weights, achieving self-improvement by iteratively generating answers and judging them.
  • Figure 3: The overall framework of our prompt optimization method. It involves five steps, arranged from left to right: (a) the Prompt Rewriter samples multiple candidates $\textbf{y}$; (b) the Diffusion Model generates images from the candidates; (c) the Image Evaluator acts as the image evaluation model and generates image evaluation responses $\textbf{R}$; (d) the Response Judge acts as the judge model, judging the evaluator's responses to produce response evaluations $\textbf{E}$; and (e) Optimization uses the responses from the evaluator and the judge to update the LVLM.
  • Figure 4: Human evaluation results. Human raters prefer the results of $\mathop{\mathrm{LVLM}}\limits_{DPO_2}$ over those of the User Prompt and $\mathop{\mathrm{LVLM}}\limits_{SFT}$.
  • Figure 5: Generated prompt length (bar graph) and aesthetic score improvement (line graph) compared with raw prompt.
  • ...and 3 more figures
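
The five steps in the Figure 3 caption can be sketched as a small data-collection loop. This is a toy illustration only, not the authors' implementation: every function below is a hypothetical stand-in (no LVLM or diffusion model is called), and the 1 to 10 score range, the random aesthetic score, and the summed reward are assumptions chosen to keep the example runnable.

```python
import random


def rewrite(user_prompt, n, rng):
    """(a) Hypothetical rewriter stub: sample n candidate prompts y."""
    styles = ["highly detailed", "soft lighting", "oil painting", "wide angle"]
    return [f"{user_prompt}, {rng.choice(styles)}" for _ in range(n)]


def generate_image(prompt):
    """(b) Hypothetical diffusion stub: return a placeholder image record."""
    return {"source_prompt": prompt}


def evaluate_image(image, user_prompt, rng):
    """(c) Hypothetical evaluator stub: produce a response R with two scores.

    The aesthetic score is random here; a real evaluator would inspect pixels.
    """
    aligned = user_prompt in image["source_prompt"]
    return {"aesthetic": rng.uniform(1, 10), "alignment": 10.0 if aligned else 1.0}


def judge_response(response):
    """(d) Hypothetical judge stub: evaluation E is 1.0 for a well-formed response."""
    in_range = all(1 <= response[key] <= 10 for key in ("aesthetic", "alignment"))
    return 1.0 if in_range else 0.0


def build_preference_pair(user_prompt, n=4, seed=0):
    """Run steps (a)-(d) and return the data that step (e) would train on."""
    rng = random.Random(seed)
    scored = []
    for candidate in rewrite(user_prompt, n, rng):
        image = generate_image(candidate)
        response = evaluate_image(image, user_prompt, rng)
        confidence = judge_response(response)
        # Assumed reward: judge-weighted sum of the two evaluator scores.
        reward = confidence * (response["aesthetic"] + response["alignment"])
        scored.append((reward, candidate))
    scored.sort(key=lambda item: item[0])
    worst, best = scored[0], scored[-1]
    return {
        "prompt": user_prompt,
        "chosen": best[1],
        "rejected": worst[1],
        "chosen_reward": best[0],
        "rejected_reward": worst[0],
    }


if __name__ == "__main__":
    print(build_preference_pair("a cat on a sofa"))
```

Step (e) is not shown: in a real system the returned chosen/rejected pair would feed a preference-optimization update of the shared model, after which the loop repeats with the updated weights.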