Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Fanyue Wei; Wei Zeng; Zhenyang Li; Dawei Yin; Lixin Duan; Wen Li

TL;DR

This paper addresses a weakness of personalized text-to-image generation: diffusion models often fail to preserve the structure of the reference subject. It proposes a reinforcement learning framework based on the deterministic policy gradient (DPG) that treats the diffusion denoiser as a policy and learns a reward model to supervise personalization. Two components build on this framework: a novel “look forward” mechanism that aligns the final image with the reference structure, and a complex reward (e.g., DINO) that captures personalized features. The method substantially improves visual fidelity while maintaining text alignment on the DreamBooth and Custom Diffusion benchmarks, which illustrates the value of flexible reward design. The framework can integrate diverse supervision signals into diffusion-based T2I personalization and could extend to additional rewards and tasks; it also raises privacy and misuse concerns for personalized image synthesis.

Abstract

Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based approaches typically adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated and the reference images. To this end, we design a novel reinforcement learning framework that uses the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models and improve the quality of the generated images. Experimental results on personalized text-to-image generation benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text alignment. Our code is available at: https://github.com/wfanyue/DPG-T2I-Personalization.
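To make the deterministic policy gradient idea concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a scalar policy `a = theta * s` in place of the diffusion denoiser, and a hypothetical stepwise reward `black_box_reward` in place of a metric that cannot be differentiated directly. A critic that is linear in the action around the policy output is fitted from reward queries only, and the policy is then updated through the critic's action gradient. All function names, the target mapping `2 * s`, and the hyperparameters are illustrative assumptions.

```python
import random


def black_box_reward(state, action):
    """Hypothetical stepwise reward; its gradient w.r.t. action is zero almost everywhere."""
    return -round(abs(action - 2.0 * state), 1)


def policy(theta, state):
    """Deterministic toy policy: a scalar linear map from state to action."""
    return theta * state


def fit_critic(theta, states, noise_scale, rng):
    """Least-squares fit of w in Q(s, a) = (a - theta * s) * s * w + V(s).

    Only reward values are queried; the reward's own gradient is never used.
    """
    num, den = 0.0, 0.0
    for s in states:
        eps = rng.uniform(-noise_scale, noise_scale)  # exploration noise
        base = policy(theta, s)
        # Regression target: reward gained by perturbing the policy action.
        advantage = black_box_reward(s, base + eps) - black_box_reward(s, base)
        x = eps * s  # critic feature (a - mu(s)) * dmu/dtheta
        num += x * advantage
        den += x * x
    return num / den if den > 0.0 else 0.0


def train(steps=200, batch_size=256, lr=0.1, noise_scale=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        w = fit_critic(theta, states, noise_scale, rng)
        # Deterministic policy gradient: mean of dmu/dtheta * dQ/da = s * (s * w).
        grad = sum(s * s * w for s in states) / batch_size
        theta += lr * grad
    return theta


if __name__ == "__main__":
    print("learned theta:", train())
```

In this toy setting the reward is maximized when `theta` is near 2, and the update reaches that region even though the reward is piecewise constant. This mirrors, in a much simplified form, why a learned critic lets non-differentiable objectives supervise a generator.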

Paper Structure (16 sections, 11 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Diffusion Models for Image Generation
Personalized Text-to-Image Generation
Reinforcement Learning for Text-to-image Generation
Method
Preliminaries
DPG framework for T2I Personalization
Learning to "Look Forward"
Learning Complex Reward
Experiments
Experimental Setup
Qualitative Results
Quantitative Results
Ablation Studies
...and 1 more section

Figures (6)

Figure 1: Our proposed framework utilizes the DPG algorithm to capture visual consistency and supervises the generation model with flexible objectives, differentiable or even non-differentiable.
Figure 2: During the denoising process, in the early timesteps ($t \approx T$), the diffusion model attempts to represent the outline and structure of the subject, whereas in the later steps ($t \approx 0$), the model focuses on the visual details.
Figure 3: Our proposed framework of DPG equipped with "looking forward" can further introduce more flexible supervision with a learnable reward model for the personalized generation model (e.g., Stable Diffusion).
Figure 4: Reference images alongside the images generated by Custom Diffusion, DreamBooth, and our method. Given the challenging textual prompts, the images generated by our method best preserve the personalized attributes, including color, expression, and texture.
Figure 5: The convergence of the Q-function is illustrated in the subfigures. Subfigure (a) presents the training loss of the Q-function for the reconstruction reward, while subfigure (b) relates to the DINO reward.
...and 1 more figure
