Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, Feng Yang

TL;DR

Parrot reframes text-to-image fine-tuning as a multi-objective optimization problem and employs batch-wise non-dominated selection to approximate the Pareto frontier across four quality rewards. It jointly optimizes a prompt expansion network and a diffusion-based T2I model, with reward-specific prompts enabling inference-time trade-off control and an original-prompt-centered guidance mechanism to preserve fidelity. Empirical results, including user studies, show Parrot outperforms baselines across aesthetics, human preference, text-image alignment, and sentiment, validating the effectiveness of Pareto-aware training for multi-reward diffusion models. The approach offers scalable, controllable improvements for multi-criterion image generation and highlights the importance of Pareto-aware optimization in T2I systems.

Abstract

Recent works have demonstrated that using reinforcement learning (RL) with multiple quality rewards can improve the quality of generated images in text-to-image (T2I) generation. However, manually adjusting reward weights is challenging and may cause over-optimization of certain metrics. To solve this, we propose Parrot, which treats the problem as multi-objective optimization and introduces an effective multi-reward optimization strategy to approximate Pareto optimality. Using batch-wise Pareto-optimal selection, Parrot automatically identifies the optimal trade-off among different rewards. We use this novel multi-reward optimization algorithm to jointly optimize the T2I model and a prompt expansion network, which significantly improves image quality and also allows the trade-off among rewards to be controlled with a reward-related prompt during inference. Furthermore, we introduce original prompt-centered guidance at inference time, ensuring fidelity to the user input after prompt expansion. Extensive experiments and a user study validate the superiority of Parrot over several baselines across various quality criteria, including aesthetics, human preference, text-image alignment, and image sentiment.
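The batch-wise Pareto-optimal selection mentioned in the abstract can be illustrated with a minimal sketch. It assumes each sampled image has one score per reward and that higher is better; the function names and the example scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every reward and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_indices(rewards: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the samples in a batch that no other sample dominates."""
    return [
        i
        for i, r_i in enumerate(rewards)
        if not any(dominates(r_j, r_i) for j, r_j in enumerate(rewards) if j != i)
    ]


# Hypothetical batch: 4 images, each scored on 4 rewards (higher is better).
batch_rewards = [
    [0.9, 0.2, 0.5, 0.4],
    [0.5, 0.8, 0.5, 0.4],  # dominated by sample 3
    [0.4, 0.1, 0.5, 0.3],  # dominated by sample 0
    [0.5, 0.8, 0.6, 0.4],
]
pareto_set = non_dominated_indices(batch_rewards)
print(pareto_set)
```

In this sketch only the non-dominated samples would contribute to the update, so no per-reward weights have to be chosen by hand. The pairwise check is quadratic in the batch size, which is acceptable for small batches.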

Paper Structure (9 sections, 5 equations, 3 figures)

Figures (3)

  • Figure 1: Parrot visual examples. Parrot consistently improves the quality of generated images across multiple criteria: aesthetics, human preference, text-image alignment, and image sentiment. Each column shows generated images using the same seed.
  • Figure 2: Overview of Parrot. During training, $N$ images are sampled from the T2I model using the expanded prompt from the prompt expansion network. Multiple quality rewards are calculated for each image, and the Pareto-optimal set is identified using the non-dominated sorting algorithm. These optimal images are then used to perform a joint policy gradient update of the parameters of the T2I model and the prompt expansion network. During inference, both the original prompt and the expanded prompt are provided to the T2I model, enabling better faithfulness while adding detail.
  • Figure 3: Comparison of Parrot and diffusion-based RL baselines. From left to right, we provide results of Stable Diffusion 1.5 (Rombach et al., 2022) (1st column), DPOK (Fan et al., 2023) with the weighted sum (2nd column), Promptist (Hao et al., 2022) (3rd column), and Parrot (4th column).
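The inference step in Figure 2, where both the original and the expanded prompt are given to the T2I model, can be pictured with a small sketch. The combination rule below is an assumption made for illustration: a classifier-free-guidance-style blend of two conditional noise predictions. It is not the paper's equation, and `two_prompt_guidance`, `scale`, and `orig_weight` are hypothetical names.

```python
from typing import List, Sequence


def two_prompt_guidance(
    eps_uncond: Sequence[float],
    eps_orig: Sequence[float],
    eps_expanded: Sequence[float],
    scale: float = 7.5,
    orig_weight: float = 0.5,
) -> List[float]:
    """Blend two conditional noise predictions around the unconditional one.

    Assumed form (illustrative only):
        eps = eps_uncond + scale * (w * (eps_orig - eps_uncond)
                                    + (1 - w) * (eps_expanded - eps_uncond))
    """
    guided = []
    for u, o, e in zip(eps_uncond, eps_orig, eps_expanded):
        direction = orig_weight * (o - u) + (1.0 - orig_weight) * (e - u)
        guided.append(u + scale * direction)
    return guided


# Toy noise predictions for a 2-element latent (hypothetical values).
eps_u = [0.0, 1.0]
eps_o = [1.0, 1.0]
eps_e = [2.0, 3.0]
blended = two_prompt_guidance(eps_u, eps_o, eps_e, scale=2.0, orig_weight=0.5)
orig_only = two_prompt_guidance(eps_u, eps_o, eps_e, scale=2.0, orig_weight=1.0)
```

With `orig_weight=1.0` the sketch reduces to ordinary classifier-free guidance on the original prompt alone. Lower values let the expanded prompt contribute more detail at the cost of drifting from the user input.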