Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning
Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li
TL;DR
This paper tackles a gap in personalized text-to-image generation: diffusion models struggle to preserve structural fidelity to reference subjects. It proposes a reinforcement learning framework based on deterministic policy gradient (DPG) that treats the diffusion denoiser as a policy and learns a reward model to supervise personalization. The framework incorporates a novel “look forward” mechanism to align final images with the reference structure, and a complex reward (e.g., DINO) to capture personalized features. The method shows substantial improvements in visual fidelity while maintaining text alignment on DreamBooth and Custom Diffusion benchmarks, demonstrating the versatility of flexible reward design. The approach offers a scalable platform for integrating diverse supervision signals into diffusion-based T2I personalization, with potential extensions to additional rewards and tasks. It also raises privacy and misuse considerations for personalized image synthesis.
Abstract
Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based approaches typically adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated and the reference images. To this end, we design a novel reinforcement learning framework that uses the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models and improve the quality of the generated images. Experimental results on personalized text-to-image generation benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text alignment. Our code is available at: https://github.com/wfanyue/DPG-T2I-Personalization.
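To make the core idea concrete, the following is a minimal toy sketch of a deterministic policy gradient update, not the paper's implementation. It assumes a stateless scalar "policy" whose action is simply its parameter theta, and a stand-in reward that is only ever queried for values, never differentiated. The names black_box_reward, fit_critic_slope, and train_policy, the target value 3.0, the probe offsets, and the learning rate are all hypothetical choices made for illustration. No diffusion model or DINO reward is involved.

```python
def black_box_reward(action: float) -> float:
    """Hypothetical reward: only its values are queried, never its gradient."""
    return -abs(action - 3.0)


def fit_critic_slope(theta: float, offsets: tuple) -> float:
    """Fit a local linear critic Q(a) = b + g * a by least squares and return g."""
    actions = [theta + d for d in offsets]
    rewards = [black_box_reward(a) for a in actions]
    mean_a = sum(actions) / len(actions)
    mean_r = sum(rewards) / len(rewards)
    cov = sum((a - mean_a) * (r - mean_r) for a, r in zip(actions, rewards))
    var = sum((a - mean_a) ** 2 for a in actions)
    return cov / var


def train_policy(theta: float = 0.0, steps: int = 60, lr: float = 0.1) -> float:
    """Update the policy parameter using the critic's action gradient."""
    offsets = (-0.5, -0.25, 0.25, 0.5)  # fixed probes, so the run is deterministic
    for _ in range(steps):
        dq_da = fit_critic_slope(theta, offsets)  # critic gradient w.r.t. the action
        da_dtheta = 1.0                           # the policy is a = theta
        theta += lr * dq_da * da_dtheta           # chain rule: dQ/da * da/dtheta
    return theta


final_theta = train_policy()
print(final_theta)
```

The policy improves only through the learned critic's slope, which is the reason a reward that cannot be differentiated can still supervise training in this style of method. In the paper's setting the policy would be the diffusion denoiser and the critic a learned reward model; this toy replaces both with scalars.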
