DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization

Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov

TL;DR

This work addresses the fidelity–prompt alignment trade-off in personalized diffusion by adapting Direct Preference Optimization (DPO): better–worse pairs are built automatically from the model's own outputs and scored with external quality metrics, which avoids manual labeling. A novel angle-based filtering and multi-step training scheme directs updates toward desired regions of the trade-off space, enabling controllable emphasis on concept fidelity, prompt adherence, or a balanced mix. Empirical results on SD2 and SDXL backbones show improvements in both Image Similarity (IS) and Text Similarity (TS) that move beyond the baseline Pareto frontier, and a user study confirms the gains. The approach offers a scalable, automated, and tunable framework for personalized diffusion that can be deployed with modest additional computational cost relative to standard fine-tuning pipelines.

Abstract

Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at https://github.com/ControlGenAI/DreamBoothDPO.

Paper Structure

This paper contains 32 sections, 7 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: (a) Optimizing IS ($\lambda = 0$) or TS ($\lambda = 1$) individually improves the target metric but drastically degrades the other one. (b) The weighted combination stabilizes training but remains oversensitive to the weighting coefficient. (c) Multistep training can lead to significant improvements; however, it lacks effective directional control.
  • Figure 2: Pairs of images with different IS/TS balances. The prompts for the columns from left to right are: "a $V^{\star}$ sitting beneath table and chairs", "a $V^{\star}$ is sitting underneath a seat on a bus", "a $V^{\star}$ wearing a traditional sari", "a $V^{\star}$ in a serene Zen garden with koi ponds", "a $V^{\star}$ sitting on a rock area at water's edge of a lake", "a $V^{\star}$ laying on the floor chewing on a stick".
  • Figure 3: An outline of the proposed method. First, we fine-tune the personalized model and generate a diverse set of images, capturing the model's output variability. Then, the images are scored and used to create a paired dataset for DPO training. The process can be repeated to form a multi-step training procedure.
  • Figure 4: (a) The distribution of the weighted score function for different samples exhibits a unimodal behavior, failing to capture the High TS & High IS region. (b) The distribution of angles shows that filtering by threshold fails to remove pairs from both modes, while angle filtering can separate the required region. (c) Depiction of possible pairs for one selected (red) sample. While threshold filtering captures harmful samples from High TS & Low IS and Low TS & High IS regions, angle filtering selects a small fraction of pairs with high positive signals.
  • Figure 5: (a) Angle-based filtering allows for finer directional control. (b) Reducing the number of images per prompt negatively affects performance. (c) The 2-step $N=1000$ setup (large triangle) improves on the performance of the 1-step setup and has a lower computational cost than the $N=4000$ setup.
  • ...and 12 more figures
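
The contrast between threshold filtering and angle filtering described in the captions of Figures 1 and 4 can be illustrated with a small sketch. It is not the authors' implementation: the exact angle definition, the tolerance, and the margin are not given in this section, so the function names (`weighted_score`, `threshold_pairs`, `angle_pairs`), the parameters (`target_deg`, `tol_deg`, `margin`, `min_norm`), and the toy scores are all assumptions made for illustration. The sketch only assumes that each generated image has a (TS, IS) score pair, that $\lambda = 0$ scores by IS and $\lambda = 1$ by TS as in Figure 1, and that a pair's direction is the angle of its improvement vector in the TS/IS plane.

```python
import math
from itertools import permutations


def weighted_score(ts, is_, lam):
    """Scalar score: lam = 0 uses IS only, lam = 1 uses TS only."""
    return (1.0 - lam) * is_ + lam * ts


def threshold_pairs(scores, lam=0.5, margin=0.02):
    """Keep (better, worse) index pairs whose weighted-score gap exceeds margin.

    `margin` is a hypothetical threshold chosen for this toy example.
    """
    pairs = []
    for i, j in permutations(range(len(scores)), 2):
        gap = weighted_score(*scores[i], lam) - weighted_score(*scores[j], lam)
        if gap > margin:
            pairs.append((i, j))
    return pairs


def angle_pairs(scores, target_deg=45.0, tol_deg=30.0, min_norm=1e-3):
    """Keep pairs whose improvement vector (dTS, dIS) points near target_deg.

    Assumed convention: the angle is measured from the +TS axis, so 45 degrees
    means both metrics improve by similar amounts.
    """
    pairs = []
    for i, j in permutations(range(len(scores)), 2):
        d_ts = scores[i][0] - scores[j][0]
        d_is = scores[i][1] - scores[j][1]
        if math.hypot(d_ts, d_is) < min_norm:
            continue  # nearly identical scores carry no usable direction
        angle = math.degrees(math.atan2(d_is, d_ts))
        # Smallest absolute difference between two angles, in [0, 180].
        diff = abs((angle - target_deg + 180.0) % 360.0 - 180.0)
        if diff <= tol_deg:
            pairs.append((i, j))
    return pairs


# Toy (TS, IS) scores for four images of one prompt (made-up numbers).
scores = [(0.30, 0.80), (0.20, 0.70), (0.35, 0.60), (0.22, 0.85)]
print("threshold:", threshold_pairs(scores))
print("angle:    ", angle_pairs(scores))
```

On these toy scores the threshold filter keeps several pairs, including (0, 2), where the "better" image has a lower TS than the "worse" one, while the angle filter keeps only (0, 1), where both metrics improve. Under the assumed convention, moving `target_deg` toward 0 would favor prompt adherence and moving it toward 90 would favor concept fidelity.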