PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

Fei Deng, Qifei Wang, Wei Wei, Matthias Grundmann, Tingbo Hou

TL;DR

Proximal Reward Difference Prediction (PRDP) enables stable black-box reward finetuning of diffusion models, for the first time on large-scale prompt datasets with over 100K prompts. The paper also proves that a diffusion model achieving perfect reward difference prediction is exactly the maximizer of the RL objective.

Abstract

Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However, in the vision domain, existing RL-based reward finetuning methods are limited by their instability in large-scale training, rendering them incapable of generalizing to complex, unseen prompts. In this paper, we propose Proximal Reward Difference Prediction (PRDP), enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically, the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments, we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore, through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset, PRDP achieves superior generation quality on a diverse set of complex, unseen prompts whereas RL-based methods completely fail.
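
The regression objective described in the abstract can be sketched for a single image pair as follows. This is a minimal illustration, not the paper's implementation. It assumes that the predicted reward difference is `beta` times the difference of the two trajectories' log-likelihood ratios between the finetuned and reference models, which is the form suggested by the standard KL-regularized optimum. The function name `rdp_loss`, its arguments, and the use of plain Python lists of per-step log-probabilities are all hypothetical.

```python
from typing import Sequence


def rdp_loss(
    logp_theta_a: Sequence[float],
    logp_ref_a: Sequence[float],
    logp_theta_b: Sequence[float],
    logp_ref_b: Sequence[float],
    reward_a: float,
    reward_b: float,
    beta: float = 1.0,
) -> float:
    """Squared error between predicted and observed reward difference.

    Each logp_* argument holds the per-step log-probabilities of one
    denoising trajectory under the finetuned (theta) or reference model.
    ASSUMPTION: the prediction is beta * (log-ratio of A - log-ratio of B).
    """
    # Log-likelihood ratio of each full trajectory: sum over denoising steps.
    log_ratio_a = sum(logp_theta_a) - sum(logp_ref_a)
    log_ratio_b = sum(logp_theta_b) - sum(logp_ref_b)

    # Reward difference implied by the model, and the one reported
    # by the black-box reward function.
    predicted_diff = beta * (log_ratio_a - log_ratio_b)
    target_diff = reward_a - reward_b

    return (predicted_diff - target_diff) ** 2
```

Because only reward values enter the target, the reward model is treated as a black box: no gradient flows through it. When the finetuned model equals the reference model, both log-ratios are zero and the loss reduces to the squared reward difference.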

Paper Structure

This paper contains 23 sections, 3 theorems, 32 equations, 15 figures, 5 tables, and 1 algorithm.

Key Result

Lemma A.1

Given two diffusion models $\pi_\theta, \pi_\mathrm{ref}$, a prompt distribution $p(\mathbf{c})$, a reward function $r(\mathbf{x}_0, \mathbf{c})$, and a constant $\beta > 0$, we have: where $\bar{\mathbf{x}} \coloneqq \mathbf{x}_{0:T}$ is the full denoising trajectory, and $\pi_\theta, \pi_\mathrm{ref}$ are defined as:
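
For orientation, the quantities named in the lemma usually take the following standard form in KL-regularized reward finetuning. This is a sketch under that assumption, not the lemma's verbatim statement. The initial noise distribution $p(\mathbf{x}_T)$ and the normalizer $Z(\mathbf{c})$ are symbols introduced here for illustration. A diffusion model defines a distribution over full denoising trajectories by chaining its per-step transitions:

$$
\pi_\theta(\bar{\mathbf{x}} \mid \mathbf{c}) = p(\mathbf{x}_T) \prod_{t=1}^{T} \pi_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c}),
\qquad
\pi_\mathrm{ref}(\bar{\mathbf{x}} \mid \mathbf{c}) = p(\mathbf{x}_T) \prod_{t=1}^{T} \pi_\mathrm{ref}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c}).
$$

A KL-regularized RL objective over these trajectory distributions is commonly written as

$$
\max_{\pi_\theta} \; \mathbb{E}_{\mathbf{c} \sim p(\mathbf{c})} \Big[ \mathbb{E}_{\bar{\mathbf{x}} \sim \pi_\theta(\cdot \mid \mathbf{c})} \big[ r(\mathbf{x}_0, \mathbf{c}) \big] - \beta \, \mathrm{KL}\big[ \pi_\theta(\bar{\mathbf{x}} \mid \mathbf{c}) \,\|\, \pi_\mathrm{ref}(\bar{\mathbf{x}} \mid \mathbf{c}) \big] \Big],
$$

and its maximizer has the well-known closed form

$$
\pi^{*}(\bar{\mathbf{x}} \mid \mathbf{c}) = \frac{1}{Z(\mathbf{c})} \, \pi_\mathrm{ref}(\bar{\mathbf{x}} \mid \mathbf{c}) \exp\!\Big( \tfrac{1}{\beta} \, r(\mathbf{x}_0, \mathbf{c}) \Big),
\qquad
Z(\mathbf{c}) = \sum_{\bar{\mathbf{x}}} \pi_\mathrm{ref}(\bar{\mathbf{x}} \mid \mathbf{c}) \exp\!\Big( \tfrac{1}{\beta} \, r(\mathbf{x}_0, \mathbf{c}) \Big),
$$

where the sum becomes an integral for continuous trajectories. Taking logarithms gives $r(\mathbf{x}_0, \mathbf{c}) = \beta \log \frac{\pi^{*}(\bar{\mathbf{x}} \mid \mathbf{c})}{\pi_\mathrm{ref}(\bar{\mathbf{x}} \mid \mathbf{c})} + \beta \log Z(\mathbf{c})$, so the intractable $Z(\mathbf{c})$ cancels when the rewards of two trajectories for the same prompt are subtracted.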

Figures (15)

  • Figure 1: Generation samples on complex, unseen prompts. Our proposed method, PRDP, achieves stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets, leading to superior generation quality on complex, unseen prompts. Here, PRDP is finetuned from Stable Diffusion v1.4 on the training set prompts of the Pick-a-Pic v1 dataset, using a weighted combination of rewards with weights PickScore $= 10$, HPSv2 $= 2$, Aesthetic $= 0.05$. The images within each column are generated using the same random seed.
  • Figure 2: PRDP framework. PRDP mitigates the instability of policy gradient methods by converting the RLHF objective to an equivalent supervised regression objective. Specifically, given a text prompt, PRDP samples two images, and tasks the diffusion model with predicting the reward difference of these two images from their denoising trajectories. The diffusion model is updated by stochastic gradient descent on the MSE loss that measures the prediction error. We prove that the MSE loss and the RLHF objective have the same optimal solution.
  • Figure 3: Effect of proximal updates. We show generation samples during the PRDP training process. Here, we use the small-scale setup described in the experimental setup section and HPSv2 as the reward model. All samples use the same prompt "A painting of a deer" and the same random seed. (Left) Without proximal updates, training is quite unstable, and the generation quickly becomes meaningless noise. (Right) With proximal updates, the training stability is remarkably improved.
  • Figure 4: Generation samples from small-scale training. DDPO and PRDP are finetuned from Stable Diffusion v1.4 on $45$ prompts consisting of common animal names, with HPSv2 (Left) and PickScore (Right) as the reward model. Samples within each column use the same random seed. The prompt template is "A painting of a $\langle$animal$\rangle$", where the $\langle$animal$\rangle$ is listed on top of each column. All prompts are seen during training. Both DDPO and PRDP significantly improve the generation quality, with PRDP being slightly better.
  • Figure 5: Generation samples from large-scale training. DDPO and PRDP are finetuned from Stable Diffusion v1.4 on over $100$K prompts from the training set of HPDv2, with HPSv2 (Left) and PickScore (Right) as the reward model. Samples within each column are generated from the prompt shown on top, using the same random seed. All prompts are unseen during training. PRDP significantly improves the generation quality over Stable Diffusion, whereas DDPO fails to generate reasonable results.
  • ...and 10 more figures
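
The weighted reward combination mentioned in the Figure 1 caption can be illustrated with a short sketch. The three weights are taken from the caption. Everything else is an assumption made for illustration: that the combination is a plain weighted sum, the function name `combined_reward`, the dictionary keys, and the example scores.

```python
from typing import Mapping

# Weights quoted in the Figure 1 caption.
REWARD_WEIGHTS = {"pickscore": 10.0, "hpsv2": 2.0, "aesthetic": 0.05}


def combined_reward(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = REWARD_WEIGHTS,
) -> float:
    """Return the weighted sum of individual reward-model scores.

    ASSUMPTION: the combination is linear; `scores` maps each
    reward-model name to its scalar score for one image and prompt.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical scores for a single generated image.
example = {"pickscore": 0.2, "hpsv2": 0.25, "aesthetic": 6.0}
total = combined_reward(example)
```

The unequal weights plausibly compensate for the different numeric ranges of the individual reward models, so that no single score dominates the sum.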

Theorems & Definitions (6)

  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • proof