Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li

TL;DR

This paper addresses a weakness of diffusion-based personalized text-to-image generation: the models often fail to preserve the structure of the reference subject. It proposes a reinforcement learning framework based on the deterministic policy gradient (DPG) that treats the diffusion denoiser as a policy and learns a reward model to supervise personalization. The framework adds a “look forward” mechanism that aligns the final image with the reference structure, and a more complex reward (e.g., DINO) that captures personalized features. On the DreamBooth and Custom Diffusion benchmarks, the method substantially improves visual fidelity while maintaining text alignment, which illustrates the value of flexible reward design. The framework can incorporate further supervision signals, rewards, and tasks, and it also raises privacy and misuse considerations for personalized image synthesis.
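
To make the idea of a feature-based reward concrete, here is a minimal sketch, not the authors' implementation. It assumes that image features (for example, DINO embeddings) have already been extracted into plain lists of floats. The function names `cosine_similarity` and `similarity_reward` are hypothetical, and averaging cosine similarity over the reference set is an assumption about how such a reward might be scored.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def similarity_reward(generated_feat, reference_feats):
    """Mean cosine similarity between one generated-image feature vector
    and every reference feature vector (assumed reward definition)."""
    if not reference_feats:
        raise ValueError("need at least one reference feature vector")
    sims = [cosine_similarity(generated_feat, ref) for ref in reference_feats]
    return sum(sims) / len(sims)
```

A reward of this kind only needs to return a score for a finished image, so it can be used as a supervision signal even when the feature extractor is treated as a black box.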

Abstract

Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based generation models, the visual structure and details of the object are often unexpectedly changed during the diffusion process. One major reason is that these diffusion-based approaches typically adopt a simple reconstruction objective during training, which can hardly enforce appropriate structural consistency between the generated and the reference images. To this end, in this paper, we design a novel reinforcement learning framework that uses the deterministic policy gradient method for personalized text-to-image generation, with which various objectives, differentiable or even non-differentiable, can be easily incorporated to supervise the diffusion models and improve the quality of the generated images. Experimental results on personalized text-to-image generation benchmark datasets demonstrate that our proposed approach outperforms existing state-of-the-art methods by a large margin on visual fidelity while maintaining text alignment. Our code is available at: https://github.com/wfanyue/DPG-T2I-Personalization.
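
The deterministic policy gradient idea can be illustrated with a toy sketch. This is a hedged illustration under strong assumptions, not the paper's method. The policy here is a single scalar `theta` rather than a diffusion denoiser, and the critic is a fixed hand-written quadratic, Q(s, a) = -(a - 2s)^2, standing in for a learned Q-function. All names are hypothetical.

```python
def policy(theta, state):
    """Deterministic policy: maps a state directly to an action (no sampling)."""
    return theta * state


def critic_grad_wrt_action(state, action):
    """dQ/da for the stand-in critic Q(s, a) = -(a - 2s)^2.
    In a real setup the critic would be learned from observed rewards."""
    return -2.0 * (action - 2.0 * state)


def dpg_gradient(theta, states):
    """Deterministic policy gradient:
    mean over states of dQ/da (at a = policy(s)) times d(policy)/d(theta)."""
    total = 0.0
    for s in states:
        a = policy(theta, s)
        dq_da = critic_grad_wrt_action(s, a)
        da_dtheta = s  # derivative of theta * s with respect to theta
        total += dq_da * da_dtheta
    return total / len(states)


def train(theta=0.0, states=(-1.0, -0.5, 0.5, 1.0), lr=0.4, steps=20):
    """Gradient ascent on the critic's value of the policy's own actions."""
    for _ in range(steps):
        theta += lr * dpg_gradient(theta, states)
    return theta
```

In this sketch the update only differentiates the critic with respect to the action, never the reward itself. That is the property that lets a reward be non-differentiable, as long as a critic can be fitted to it.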

Paper Structure

This paper contains 16 sections, 11 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Our proposed framework utilizes the DPG algorithm to capture the visual consistency and supervises the generation model with flexible objectives, differentiable or even non-differentiable.
  • Figure 2: During the denoising process, in the early timesteps ($t \approx T$), the diffusion model attempts to represent the outline and structure of the subject, whereas in the later steps ($t \approx 0$), the model focuses on the visual details.
  • Figure 3: Our proposed framework of DPG equipped with "looking forward" can further introduce more flexible supervision with a learnable reward model for the personalized generation model (e.g., Stable Diffusion).
  • Figure 4: Reference images alongside the images generated by Custom Diffusion, DreamBooth, and our method. Given the challenging textual prompts, the images generated by our method best preserve the personalized attributes, including color, expression, and texture.
  • Figure 5: Convergence of the Q-function. Subfigure (a) shows the training loss of the Q-function for the reconstruction reward, while subfigure (b) shows it for the DINO reward.
  • ...and 1 more figure
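
The captions above do not spell out how the "looking forward" step is computed. One standard way to estimate the final image from an intermediate noisy sample is the DDPM identity x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t). The sketch below implements that identity on plain lists. Whether the paper uses exactly this estimate is an assumption here, and the function names are hypothetical.

```python
import math


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate the clean sample from x_t and a noise prediction
    by inverting the forward-diffusion formula."""
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [(x - scale_noise * e) / scale_signal for x, e in zip(x_t, eps_pred)]
```

With a perfect noise prediction the estimate recovers the clean sample exactly. At early timesteps (t ≈ T) the noise prediction is less reliable, so the estimate mainly reflects coarse structure, which is consistent with the behavior described for Figure 2.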