Diffusion Reinforcement Learning via Centered Reward Distillation

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

Abstract

Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors, such as fine-grained prompt fidelity, compositional correctness, and text rendering, are weakly specified by score- or flow-matching pretraining objectives. Reinforcement learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization and built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pretrained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
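
The cancellation mentioned in the abstract can be sketched with the standard KL-regularized argument. This is an illustrative derivation, not the paper's own equations: the symbols $\beta$ (KL strength), $p_{\mathrm{ref}}$ (reference model), $Z(c)$ (per-prompt normalizer), and $\rho_i$ (log-density ratio) are assumptions introduced here, and the paper's exact objective may differ.

```latex
% Assumed objective: KL-regularized reward maximization for a prompt c
\max_{p}\; \mathbb{E}_{x \sim p(\cdot \mid c)}\big[r(c,x)\big]
  - \beta\, \mathrm{KL}\big(p(\cdot \mid c)\,\|\,p_{\mathrm{ref}}(\cdot \mid c)\big)

% Its well-known optimum, with an intractable normalizer Z(c)
p^{*}(x \mid c) = \frac{p_{\mathrm{ref}}(x \mid c)\,\exp\!\big(r(c,x)/\beta\big)}{Z(c)},
\qquad
Z(c) = \mathbb{E}_{x \sim p_{\mathrm{ref}}(\cdot \mid c)}\big[\exp\!\big(r(c,x)/\beta\big)\big]

% Rearranged: the reward equals a scaled log-ratio plus a prompt-only offset
r(c,x_i) = \beta\,\rho_i + \beta \log Z(c),
\qquad
\rho_i = \log \frac{p^{*}(x_i \mid c)}{p_{\mathrm{ref}}(x_i \mid c)}

% Centering over K samples of the same prompt removes the offset
r(c,x_i) - \frac{1}{K}\sum_{j=1}^{K} r(c,x_j)
  = \beta\Big(\rho_i - \frac{1}{K}\sum_{j=1}^{K} \rho_j\Big)
```

Because $\beta \log Z(c)$ depends only on the prompt, it is identical for all $K$ samples and drops out of the centered difference, which is why no estimate of $Z(c)$ is needed.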

Paper Structure (64 sections, 36 equations, 14 figures, 4 tables, 1 algorithm)

This paper contains 64 sections, 36 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: Qualitative results produced by our RL fine-tuned SD3.5M [esser2024scaling] model with GenEval [ghosh2023geneval] reward (top) and OCR [chen2023textdiffuser] reward (bottom).
  • Figure 2: For each prompt, $K$ samples $\{x_i\}_{i=1}^K$ are generated from the sampling model $p_{\mathrm{samp}}$ ($p_s$). A reward model produces external rewards $r(c,x_i)$, and implicit model rewards $\widehat{R}_{\theta,i}$ are estimated via diffusion ELBO differences between the current model $p_\theta$ and a moving reference $p_{\mathrm{old}}$ ($p_o$). Within-prompt centering yields $\Delta_{r,w}^i$ and $\Delta_{\widehat{R},w}^i$, cancelling the prompt-dependent normalizer and enabling a well-posed matching objective. An initial KL penalty with respect to the fixed CFG-guided pretrained model $p_\phi^{\mathrm{CFG}}$ is imposed to prevent reward hacking.
  • Figure 3: Visual comparison between benchmarks and our models.
  • Figure 4: Ablations on the decay rate $\eta_{\mathrm{old}}$ of the slow old model and on the initial KL strength $\beta_{\mathrm{init}}$.
  • Figure 5: Visual comparison corresponding to the ablations in Figure \ref{fig:Ablation}.
  • ...and 9 more figures
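
The within-prompt centering step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the function name `centered_reward_matching_loss`, the squared-error form of the matching loss, and the choice to scale log-ratio estimates by `beta` are hypothetical and are not taken from the paper, which states only that centered external and implicit rewards are matched.

```python
import numpy as np


def centered_reward_matching_loss(rewards, log_ratios, beta=1.0):
    """Squared-error match between centered external and implicit rewards.

    rewards:    array of shape (num_prompts, K), external rewards r(c, x_i).
    log_ratios: array of shape (num_prompts, K), hypothetical estimates of
                the log-density ratio between current and reference model.
    beta:       assumed KL strength that scales the implicit reward.
    """
    r = np.asarray(rewards, dtype=float)
    implicit = beta * np.asarray(log_ratios, dtype=float)
    if r.shape != implicit.shape or r.ndim != 2:
        raise ValueError("expected two arrays of shape (num_prompts, K)")

    # Center within each prompt (row): any per-prompt constant, such as a
    # normalizer term, is removed by subtracting the row mean.
    delta_r = r - r.mean(axis=1, keepdims=True)
    delta_implicit = implicit - implicit.mean(axis=1, keepdims=True)

    # Assumed matching loss: mean squared difference of centered quantities.
    return float(np.mean((delta_implicit - delta_r) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(size=(3, 4))      # 3 prompts, K = 4 samples each
    log_ratios = rng.normal(size=(3, 4))
    offsets = np.array([[1.0], [-2.0], [3.5]])  # one constant per prompt

    base = centered_reward_matching_loss(rewards, log_ratios, beta=0.5)
    shifted = centered_reward_matching_loss(rewards, log_ratios + offsets, beta=0.5)
    print(base, shifted)
```

The two printed values agree up to floating-point error, since a per-prompt offset added to the log-ratios is removed by the row-wise centering. This is the property that makes the objective independent of the prompt-dependent normalizer.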