TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Yihong Luo; Tianyang Hu; Weijian Luo; Jing Tang

TL;DR

TDM-R1 is a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM), achieving state-of-the-art reinforcement learning performance on both in-domain and out-of-domain metrics.

Abstract

While few-step generative models have enabled powerful image and video generation at significantly lower cost, a general reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, which excludes the majority of important real-world reward signals, e.g., non-differentiable rewards such as binary human like/dislike feedback and object counts. To properly incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel RL paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models with generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful RL paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE (number of function evaluations) and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1

Paper Structure

This paper contains 20 sections, 27 equations, 10 figures, 3 tables, and 1 algorithm.

Introduction
Preliminaries
Method
Accurate Intermediate Reward Estimation via Deterministic Trajectories
Surrogate Reward Learning
Few-Step Generator Learning
Experiments
Experimental Setup
Main Results
Ablation Study
Related Works
Conclusion
Derivation
Derivation of \ref{eq:kl_dgpo}
Derivation of \ref{eq:tdm_r1_grad}
...and 5 more sections

Figures (10)

Figure 1: Samples generated by TDM-R1 using only 4 NFEs, obtained by reinforcing the recent powerful Z-Image model [zimage].
Figure 2: TDM-R1 rapidly boosts the GenEval score of few-step TDM, notably outperforming both its many-step base model and GPT-4o, without sacrificing out-of-domain metrics.
Figure 3: Qualitative comparisons of TDM-R1 against competing methods.
Figure 4: Comparison of training performance and speed between TDM-R1 and potential baselines.
Figure 5: Comparison of TDM-R1 with a direct combination of TDM and an RL loss.
...and 5 more figures
