Personalized Preference Fine-tuning of Diffusion Models

Meihua Dang, Anikait Singh, Linqi Zhou, Stefano Ermon, Jiaming Song

TL;DR

This work tackles personalized preference alignment for text-to-image diffusion by learning per-user reward embeddings through a vision-language model and conditioning the diffusion process on them with cross-attention. It introduces PPD, a multi-reward objective that jointly optimizes several user-specific rewards and allows inference-time interpolation between them. Evaluations on Pick-a-Pic show that PPD achieves strong alignment with real users from few-shot data alone, including 83% judge-agreement accuracy and an overall win rate of around 81% across seen and unseen users. The approach enables scalable personalization without training a separate model per user and generalizes robustly to new users and rewards.

Abstract

RLHF techniques like DPO can significantly improve the generation quality of text-to-image diffusion models. However, these methods optimize for a single reward that aligns model generation with population-level preferences, neglecting the nuances of individual users' beliefs or values. This lack of personalization limits the efficacy of these models. To bridge this gap, we introduce PPD, a multi-reward optimization objective that aligns diffusion models with personalized preferences. With PPD, a diffusion model learns the individual preferences of a population of users in a few-shot way, enabling generalization to unseen users. Specifically, our approach (1) leverages a vision-language model (VLM) to extract personal preference embeddings from a small set of pairwise preference examples, and then (2) incorporates the embeddings into diffusion models through cross-attention. Conditioning on user embeddings, the text-to-image models are fine-tuned with the DPO objective, simultaneously optimizing for alignment with the preferences of multiple users. Empirical results demonstrate that our method effectively optimizes for multiple reward functions and can interpolate between them during inference. In real-world user scenarios, with as few as four preference examples from a new user, our approach achieves an average win rate of 76% over Stable Cascade, generating images that more accurately reflect specific user preferences.
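
To make the DPO step in the abstract more concrete, here is a minimal sketch of a DPO-style preference loss computed from per-sample denoising errors. It follows the general shape of the Diffusion-DPO loss and is not taken from the paper: the function name `dpo_style_loss`, its arguments, and the default `beta` are illustrative assumptions. In PPD both the fine-tuned and the reference errors would come from networks that receive the prompt, and the fine-tuned one would also receive the user embedding; here they are passed in as plain numbers.

```python
import numpy as np


def dpo_style_loss(err_w, err_w_ref, err_l, err_l_ref, beta=1.0):
    """DPO-style loss from denoising errors (hypothetical helper).

    err_w / err_l: squared denoising error of the fine-tuned model on the
        preferred (w) and rejected (l) image of each pair.
    err_w_ref / err_l_ref: the same errors under the frozen reference model.
    beta: assumed temperature controlling deviation from the reference.
    """
    err_w, err_w_ref, err_l, err_l_ref = (
        np.asarray(a, dtype=float) for a in (err_w, err_w_ref, err_l, err_l_ref)
    )
    # Negative margin means the fine-tuned model improved more on the
    # preferred image than on the rejected one, relative to the reference.
    margin = (err_w - err_w_ref) - (err_l - err_l_ref)
    # -log(sigmoid(-beta * margin)) == log(1 + exp(beta * margin)),
    # computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, beta * margin)))
```

When the fine-tuned model matches the reference on both images the loss equals log 2, and it drops below that value once the model fits the preferred image better than the reference does.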

Paper Structure (36 sections, 17 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: The overall architecture of PPD. In Stage 1, user embeddings are generated from few-shot preference examples using a VLM. In Stage 2, we fine-tune diffusion models on the preference datasets, with the user embeddings added as conditioning through cross-attention.
  • Figure 2: Top-K accuracy of user classification. We train a user classifier on frozen VLM embeddings of few-shot preference examples for 300 users. This classifier significantly outperforms a random-chance baseline.
  • Figure 3: Automatic win rate evaluation with reward functions. We compare against Stable Cascade, Diffusion-DPO, and SFT.
  • Figure 4: PPD is able to interpolate among three distinct objectives during inference. (a) Generated images conditioned on (b) various weights, with three axes representing CLIP, Aesthetic, and HPS; (c) reward scores for each image. The score for each objective increases as its respective weight increases and decreases otherwise. For each row from left to right, the CLIP score increases alongside its weight, while the Aesthetic score decreases as its weight decreases. From bottom to top, the HPS score increases with its weight.
  • Figure 5: Qualitative Analysis of Images Generated by PPD and Baselines. Compared to Diffusion-DPO, PPD achieves closer alignment with the generated user profile, highlighted in green. The caption-augmented method captures user profile details; however, it often leads to unintended image alterations, causing the image to disregard the caption itself, as indicated in red.
  • ...and 4 more figures
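
The top-K accuracy metric named in Figure 2 can be sketched as follows. This is a generic illustration of the metric, not the authors' evaluation code; `top_k_accuracy` and its arguments are hypothetical names, and the classifier scores are assumed to be given as a matrix with one row per example and one column per user.

```python
import numpy as np


def top_k_accuracy(scores, labels, k):
    """Fraction of examples whose true user is among the k highest scores.

    scores: array of shape (n_examples, n_users), higher means more likely.
    labels: array of shape (n_examples,), the true user index per example.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k highest-scoring users for every example.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # An example counts as a hit if its true user appears in that set.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

For comparison, a uniform random guesser over N users reaches a top-K accuracy of K/N in expectation, so with 300 users the chance level is K/300.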