Score as Action: Fine-Tuning Diffusion Generative Models by Continuous-time Reinforcement Learning

Hanyang Zhao, Haoxian Chen, Ji Zhang, David D. Yao, Wenpin Tang

TL;DR

The paper tackles the discretization limitations of reinforcement learning when fine-tuning diffusion models for human-aligned generation. It introduces a continuous-time RL framework that treats score functions as actions, with a KL-regularized objective and a specialized value-network design to exploit diffusion structure. A scalable continuous-time policy optimization algorithm, inspired by TRPO/PPO, is derived and validated on both small-step CIFAR-10 diffusion models and a large-scale Stable Diffusion v1.5 setup, showing faster convergence and better sample quality than discrete-time baselines. The results demonstrate the practical viability of continuous-time RL for diffusion-model fine-tuning and suggest broad potential for improved RLHF integrations and diffusion-solver compatibility.

Abstract

Reinforcement learning from human feedback (RLHF), which aligns a diffusion model with the input prompt, has become a crucial step in building reliable generative AI models. Most works in this area use a discrete-time formulation, which is prone to discretization errors and is often not applicable to models with higher-order or black-box solvers. The objective of this study is to develop a disciplined approach to fine-tuning diffusion models using continuous-time RL, formulated as a stochastic control problem with a reward function that aligns the end result (terminal state) with the input prompt. The key idea is to treat score matching as controls or actions, thereby making connections to policy optimization and regularization in continuous-time RL. To carry out this idea, we lay out a new policy optimization framework for continuous-time RL and illustrate its potential to enlarge the design space of value networks by leveraging the structural properties of diffusion models. We validate the advantages of our method through experiments on downstream tasks of fine-tuning large-scale Text2Image models of Stable Diffusion v1.5.
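To make the "score as action" idea concrete, here is a minimal one-dimensional toy sketch. It is not the paper's algorithm. It assumes an Ornstein-Uhlenbeck forward process, Gaussian data (so the score is available in closed form), and an Euler-Maruyama rollout of the reverse-time SDE in which the score plays the role of the action. The running penalty is the standard Girsanov-type path KL between two SDEs that share a diffusion coefficient; its exact form may differ from the expression in Theorem 3.1. The names `gaussian_score`, `rollout`, `T`, and `BETA` and all numerical values are hypothetical.

```python
import math
import random

T = 3.0      # time horizon (assumed for this toy example)
BETA = 0.1   # weight of the KL penalty (assumed)


def gaussian_score(x, t, mu, s2=0.25):
    """Closed-form score of the forward marginal at time t.

    Forward process: dX = -X dt + sqrt(2) dW with X_0 ~ N(mu, s2),
    so X_t ~ N(mu * exp(-t), s2 * exp(-2t) + 1 - exp(-2t)).
    """
    mean = mu * math.exp(-t)
    var = s2 * math.exp(-2.0 * t) + 1.0 - math.exp(-2.0 * t)
    return -(x - mean) / var


def rollout(mu_theta, mu_pre, reward, n_steps=200, rng=random):
    """Simulate one reverse-time trajectory with the score as the action.

    Returns (regularized objective, terminal state, path-KL estimate).
    """
    dt = T / n_steps
    y = rng.gauss(0.0, 1.0)  # start from the approximate prior N(0, 1)
    kl = 0.0
    for k in range(n_steps):
        t_fwd = T - k * dt                              # forward time
        action = gaussian_score(y, t_fwd, mu_theta)     # fine-tuned score
        action_pre = gaussian_score(y, t_fwd, mu_pre)   # pre-trained score
        # Girsanov running cost (g^2 / 2) * |a - a_pre|^2 with g^2 = 2.
        kl += (action - action_pre) ** 2 * dt
        # Reverse SDE: dY = [Y + 2 * score] dt + sqrt(2) dW.
        y += (y + 2.0 * action) * dt + math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
    return reward(y) - BETA * kl, y, kl


if __name__ == "__main__":
    rng = random.Random(0)
    objective, y_T, kl = rollout(1.0, 0.0, lambda y: -(y - 1.0) ** 2, rng=rng)
    print(objective, y_T, kl)
```

In this sketch the reward depends only on the terminal state, while the penalty accumulates along the whole trajectory, which mirrors the terminal-reward, KL-regularized stochastic control formulation described in the abstract. Making the step count `n_steps` a free parameter of the rollout, rather than part of the problem definition, is the design choice that a continuous-time view is meant to allow.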

Paper Structure

This paper contains 23 sections, 6 theorems, 88 equations, 10 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

For any given $c$, the KL divergence between $p^{\theta}$ and $p^{\theta_{pre}}$ is:

Figures (10)

  • Figure 1: Reward curve of model checkpoints sampled under different discretization steps (25, 50, 100). After training Stable Diffusion v1.4 on a fixed prompt for 60 training steps with DDPO using 50 discretization steps, the average ImageReward score of images generated by the resulting checkpoints with 50 discretization steps increases by 0.046, while the average score of images generated with 100 discretization steps increases by less than 0.016.
  • Figure 2: Pretraining Value Function with Different Architecture.
  • Figure 3: Training curves of DxMI and continuous-time RL.
  • Figure 4: DxMI samples at the $6000$-th step
  • Figure 5: Continuous-time RL samples at the $6000$-th step
  • ...and 5 more figures

Theorems & Definitions (6)

  • Theorem 3.1
  • Theorem 4.1
  • Lemma 4.2
  • Theorem 4.4
  • Theorem 2.1
  • Lemma 2.2: Theorem 5 of Jia and Zhou [jia2022policy_gradient] when $R\equiv 0$