DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu

TL;DR

DiffusionNFT introduces a forward-process online reinforcement learning framework for diffusion models that leverages flow matching and negative-aware fine-tuning. By contrasting positive and negative generations, it yields a likelihood-free, off-policy training objective that integrates reinforcement signals directly into the diffusion objective. The approach eliminates reliance on likelihood estimation and reverse-process RL, enabling solver-agnostic data collection and CFG-free optimization. Empirically, DiffusionNFT is up to $25\times$ more efficient than FlowGRPO and rapidly improves multiple reward signals, including GenEval, without CFG. This work offers a principled path toward unifying supervised and reinforcement learning in diffusion modeling, with practical impact on efficient, reward-driven image generation.

Abstract

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
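
The claim that training needs only clean images, not sampling trajectories, can be illustrated with a minimal NumPy sketch of how a forward-process flow-matching training pair is built. The linear interpolation path and the velocity target `eps - x0` are an assumed rectified-flow convention, not something the abstract specifies, and `forward_flow_pair` is a hypothetical helper name.

```python
import numpy as np

def forward_flow_pair(x0, rng):
    """Noise clean samples x0 along a linear path (assumed convention).

    Assumption: x_t = (1 - t) * x0 + t * eps, with velocity target eps - x0.
    Only the clean sample x0 is needed; no solver trajectory is stored.
    """
    batch = x0.shape[0]
    # One time value per sample, shaped to broadcast over the remaining axes.
    t = rng.uniform(0.0, 1.0, size=(batch,) + (1,) * (x0.ndim - 1))
    eps = rng.standard_normal(x0.shape)
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0
    return x_t, t, v_target

rng = np.random.default_rng(0)
# Stand-in for clean images produced by any black-box solver.
x0 = rng.standard_normal((4, 3, 8, 8))
x_t, t, v_target = forward_flow_pair(x0, rng)
```

Because the pair depends only on `x0`, the sampler that produced the images can be swapped freely, which is the sense in which data collection is solver-agnostic.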

Paper Structure

This paper contains 25 sections, 6 theorems, 56 equations, 13 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

Consider diffusion models ${\bm{v}}^+$, ${\bm{v}}^-$, and ${\bm{v}}^\text{old}$ for the policy triplet $\pi^+$, $\pi^-$, and $\pi^\text{old}$. The directional differences between these models are proportional, with a scalar coefficient $0\leq\alpha({\bm{x}}_t)\leq1$.
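
The proportionality in Theorem 3.1 can be made concrete under one assumption that is not stated in the text above: that the old policy's velocity is a pointwise convex combination of the positive and negative velocities, with weight $\alpha({\bm{x}}_t)$. Under that assumption, a short rearrangement gives a relation of the following shape (a sketch, not necessarily the paper's exact statement):

```latex
\begin{aligned}
{\bm{v}}^\text{old}({\bm{x}}_t)
  &= \alpha({\bm{x}}_t)\,{\bm{v}}^+({\bm{x}}_t)
   + \bigl[1-\alpha({\bm{x}}_t)\bigr]\,{\bm{v}}^-({\bm{x}}_t) \\
\Longrightarrow\quad
\alpha({\bm{x}}_t)\,\bigl[{\bm{v}}^+({\bm{x}}_t)-{\bm{v}}^\text{old}({\bm{x}}_t)\bigr]
  &= \bigl[1-\alpha({\bm{x}}_t)\bigr]\,\bigl[{\bm{v}}^\text{old}({\bm{x}}_t)-{\bm{v}}^-({\bm{x}}_t)\bigr]
\end{aligned}
```

Read this way, moving from ${\bm{v}}^\text{old}$ toward ${\bm{v}}^+$ and moving away from ${\bm{v}}^-$ point along the same direction, differing only by a non-negative scale.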

Figures (13)

  • Figure 1: Performance of DiffusionNFT. (a) Head-to-head comparison with FlowGRPO on the GenEval task. (b) By employing multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested, while being fully CFG-free.
  • Figure 2: Comparison between Forward-Process RL (NFT) and Reverse-Process RL (GRPO). NFT allows using any solvers and does not require storing the whole sampling trajectory for optimization.
  • Figure 3: Improvement Direction.
  • Figure 4: DiffusionNFT jointly optimizes two dual diffusion objectives, on both positive ($r=1$) and negative ($r=0$) branches. Rather than training two independent models ${\bm{v}}_\theta^+$ and ${\bm{v}}_\theta^-$, it adopts an implicit parameterization technique that directly optimizes a single target policy ${\bm{v}}_\theta$.
  • Figure 5: Qualitative Comparison. The prompts are taken from GenEval, OCR and DrawBench respectively, where we compare the corresponding FlowGRPO model with our model.
  • ...and 8 more figures
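
The dual objective described in the Figure 4 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the interpolation form of the implicit positive and negative velocities, the hyperparameter `beta`, and the name `nft_loss` are all assumptions introduced here for illustration.

```python
import numpy as np

def nft_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Reward-weighted dual flow-matching loss (hypothetical sketch).

    v_theta:  velocity predicted by the single trainable policy
    v_old:    velocity predicted by the frozen data-collection policy
    v_target: flow-matching regression target
    r:        per-sample reward in [0, 1], shape (batch,)
    """
    # Assumed implicit parameterization: both branches are expressed
    # through v_theta, so no separate positive/negative model is trained.
    v_pos = (1.0 - beta) * v_old + beta * v_theta
    v_neg = (1.0 + beta) * v_old - beta * v_theta
    axes = tuple(range(1, v_target.ndim))
    err_pos = np.mean((v_pos - v_target) ** 2, axis=axes)
    err_neg = np.mean((v_neg - v_target) ** 2, axis=axes)
    # r = 1 selects the positive branch, r = 0 the negative branch.
    return float(np.mean(r * err_pos + (1.0 - r) * err_neg))

# Usage with toy tensors: a fully positive batch reduces to plain regression.
v_theta = np.ones((2, 3))
v_old = np.zeros((2, 3))
v_target = np.ones((2, 3))
loss_positive = nft_loss(v_theta, v_old, v_target, r=np.ones(2))
loss_negative = nft_loss(v_theta, v_old, v_target, r=np.zeros(2))
```

With `beta=1.0`, the positive branch is ordinary flow-matching regression on `v_theta`, while the negative branch regresses the reflection of `v_theta` about `v_old`, which pushes the trained policy away from low-reward samples.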

Theorems & Definitions (10)

  • Theorem 3.1: Improvement Direction
  • Theorem 3.2: Policy Optimization
  • Lemma A.1: Distribution Split
  • proof
  • Lemma A.2: Posterior Split
  • proof
  • Theorem A.3: Improvement Direction
  • proof
  • Theorem A.4: Reinforcement Guidance Optimization
  • proof