Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Yanting Miao, William Loh, Suraj Kothawade, Pascal Poupart, Abdullah Rashwan, Yeqing Li

TL;DR

The paper addresses training efficiency and overfitting in subject-driven text-to-image generation with diffusion models. It introduces the $\lambda$-Harmonic reward, which provides a reliable reward signal and enables early stopping, and combines it with the Bradley-Terry preference model to obtain the preference labels used by Reward Preference Optimization (RPO), a method that fine-tunes only the U-Net. RPO needs only 3% of the negative samples used by DreamBooth and fewer gradient steps, and it reports a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench. Ablations on the reward, the validation weight, and the learning losses indicate that the approach trains efficiently while keeping strong text-to-image alignment and subject fidelity.

Abstract

Text-to-image generative models have recently attracted considerable interest, enabling the synthesis of high-quality images from textual prompts. However, these models often lack the capability to generate specific subjects from given reference images or to synthesize novel renditions under varying conditions. Methods like DreamBooth and Subject-driven Text-to-Image (SuTI) have made significant progress in this area. Yet both approaches primarily focus on enhancing similarity to the reference images and require expensive setups, often overlooking the need for efficient training and for avoiding overfitting to the reference images. In this work, we present the $\lambda$-Harmonic reward function, which provides a reliable reward signal and enables early stopping for faster training and effective regularization. Combined with the Bradley-Terry preference model, the $\lambda$-Harmonic reward function also provides preference labels for subject-driven generation tasks. We propose Reward Preference Optimization (RPO), which offers a simpler setup (requiring only $3\%$ of the negative samples used by DreamBooth) and fewer gradient steps for fine-tuning. Unlike most existing methods, our approach does not require training a text encoder or optimizing text embeddings, and it achieves text-image alignment by fine-tuning only the U-Net component. Empirically, $\lambda$-Harmonic proves to be a reliable approach for model selection in subject-driven generation tasks. Based on preference labels and early stopping validation from the $\lambda$-Harmonic reward function, our algorithm achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench.
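
To make the two ingredients named in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the reward formula, so the sketch assumes that the $\lambda$-Harmonic reward is a weighted harmonic mean of an image-similarity score and a text-alignment score, with $\lambda$ weighting image similarity. The function names (`harmonic_reward`, `bradley_terry_prob`, `sample_preference`) are hypothetical. The Bradley-Terry probability itself is the standard form $P(a \succ b) = e^{r_a} / (e^{r_a} + e^{r_b})$.

```python
import math
import random


def harmonic_reward(image_sim: float, text_sim: float, lam: float) -> float:
    """Weighted harmonic mean of two alignment scores.

    Assumption (not stated in the abstract): lam weights image similarity
    and (1 - lam) weights text alignment; both scores are positive.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if image_sim <= 0.0 or text_sim <= 0.0:
        return 0.0  # a harmonic mean collapses when either score vanishes
    return 1.0 / (lam / image_sim + (1.0 - lam) / text_sim)


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Standard Bradley-Terry probability that item a is preferred over item b."""
    # exp(r_a) / (exp(r_a) + exp(r_b)) rewritten as a sigmoid of the difference
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def sample_preference(reward_a: float, reward_b: float, rng: random.Random) -> int:
    """Sample a binary preference label: 1 if a is preferred, 0 otherwise."""
    return 1 if rng.random() < bradley_terry_prob(reward_a, reward_b) else 0
```

Because a harmonic mean is dominated by the smaller of its inputs, an image that copies the reference but ignores the prompt (high image similarity, low text alignment) receives a low reward under this assumed form, which is the behavior a regularizing reward would need.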

Paper Structure

This paper contains 42 sections, 16 equations, 17 figures, and 8 tables.

Figures (17)

  • Figure 1: We illustrate the $\lambda$-Harmonic reward function applied to the subject-driven generation task. Leveraging preference labels produced by the $\lambda$-Harmonic reward function, alongside a few reference images, our preference-based algorithm efficiently generates unseen scenes that are faithful to both the reference images and the textual prompts.
  • Figure 2: Overview of the fine-tuning phase for RPO. First, the base diffusion model generates a few images based on novel training prompts. Second, we compute the rewards for both reference and generated images using Equation (\ref{eq:method_reward_fn}). Then, preference labels are sampled according to the preference distribution, as defined in Equation (\ref{eq:method_preference_model}). Finally, the diffusion model is trained by minimizing both the similarity loss (Equation (\ref{eq:method_similar_loss})) and the preference loss (Equation (\ref{eq:method_preference_loss})).
  • Figure 3: Qualitative comparison with other subject-driven text-to-image methods, adapted from \cite{chen2024subject}.
  • Figure 4: Changes in the $0.3$-Harmonic reward value during the RPO training process.
  • Figure 5: Different values of $\lambda_{\text{val}}$ lead to different results. A small $\lambda_{\text{val}}$ assigns a higher weight to text-to-image alignment and leads to more diverse generations, whereas a large $\lambda_{\text{val}}$ may cause overfitting.
  • ...and 12 more figures
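
The captions of Figures 4 and 5 describe using a validation reward, weighted by $\lambda_{\text{val}}$, for early stopping and model selection. The sketch below shows one way such a selection rule could look. It is a hedged illustration, not the paper's procedure: it assumes the reward is a weighted harmonic mean in which $\lambda_{\text{val}}$ weights image similarity, the helper names (`harmonic`, `select_checkpoint`) are hypothetical, and the scores in `history` are made-up numbers chosen for illustration, not results from the paper.

```python
from typing import List, Tuple


def harmonic(image_sim: float, text_sim: float, lam: float) -> float:
    """Assumed reward form: weighted harmonic mean, lam weighting image similarity."""
    return 1.0 / (lam / image_sim + (1.0 - lam) / text_sim)


def select_checkpoint(history: List[Tuple[int, float, float]], lam_val: float) -> int:
    """Return the training step whose validation reward is highest.

    Each history entry is (step, image_sim, text_sim) measured on validation prompts.
    """
    best = max(history, key=lambda entry: harmonic(entry[1], entry[2], lam_val))
    return best[0]


# Hypothetical trajectory: image similarity rises with training while
# text alignment decays, the usual signature of overfitting to references.
history = [
    (200, 0.70, 0.33),
    (400, 0.80, 0.31),
    (600, 0.86, 0.27),
    (800, 0.90, 0.22),
]

early = select_checkpoint(history, lam_val=0.3)  # favors text alignment
late = select_checkpoint(history, lam_val=0.9)   # favors image similarity
```

On this toy trajectory, the small $\lambda_{\text{val}}$ stops at an earlier checkpoint than the large one, which matches the qualitative trade-off stated in the Figure 5 caption.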