Step-level Reward for Free in RL-based T2I Diffusion Model Fine-tuning
Xinyao Liao, Wei Wei, Xiaoye Qu, Yu Cheng
TL;DR
This work tackles reward sparsity in RL-based fine-tuning of text-to-image diffusion models by introducing Contribution-based Credit Assignment (CoCA). CoCA estimates each denoising step's contribution to the final image using cosine similarity between intermediate latent representations and the final latent, then redistributes the terminal reward across steps with fixed-window smoothing and two-stage normalization to produce dense, informative step-level rewards. The authors prove that this reward shaping is potential-based and invariant with respect to the optimal policy, ensuring theoretical consistency with the original objective. Empirically, CoCA achieves 1.25x–2x improvements in sample efficiency across four human-preference reward functions and generalizes better to unseen rewards and prompts, without adding auxiliary networks. These results suggest that adaptive, contribution-aware credit assignment can substantially enhance RL-guided T2I diffusion fine-tuning and facilitate more efficient, controllable image synthesis.
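For context, the policy-invariance claim rests on the standard form of potential-based reward shaping (Ng, Harada, and Russell, 1999), recalled below. The symbols $\Phi$ (potential), $\gamma$ (discount), $s_t$ (state at step $t$), and $r'_t$ (shaped reward) are generic notation assumed here; this summary does not state how CoCA defines its potential.

```latex
% Shaped reward: original reward plus a difference of potentials.
r'_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)

% Over a T-step trajectory the shaping terms telescope:
\sum_{t=0}^{T-1} \gamma^{t}\bigl(\gamma\,\Phi(s_{t+1}) - \Phi(s_t)\bigr)
  = \gamma^{T}\,\Phi(s_T) - \Phi(s_0)
```

When the terminal potential is fixed (for example, zero), the shaped and original returns differ only by a quantity that does not depend on the actions taken, so the ranking of policies, and hence the optimal policy, is unchanged.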
Abstract
Recent advances in text-to-image (T2I) diffusion model fine-tuning leverage reinforcement learning (RL) to align generated images with learnable reward functions. Existing approaches reformulate denoising as a Markov decision process for RL-driven optimization. However, they suffer from reward sparsity, receiving only a single delayed reward per generated trajectory. This flaw hinders precise step-level attribution of denoising actions and undermines training efficiency. To address this, we propose a simple yet effective credit assignment framework that dynamically distributes dense rewards across denoising steps. Specifically, we track changes in cosine similarity between intermediate and final images to quantify each step's contribution to progressively reducing the distance to the final image. Our approach avoids additional auxiliary neural networks for step-level preference modeling and instead uses reward shaping to highlight denoising phases that have a greater impact on image quality. Our method achieves 1.25 to 2 times higher sample efficiency and better generalization across four human preference reward functions, without compromising the original optimal policy.
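The credit assignment idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `moving_average`, `step_rewards`), the window size, and the clip-then-normalize rule are assumptions made for illustration, and the paper's two-stage normalization is not reproduced here. Latents are represented as flat lists of floats.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two flattened latents (sequences of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def moving_average(values, window):
    """Centered fixed-window mean; the window is truncated at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def step_rewards(latents, terminal_reward, window=3):
    """Split one terminal reward across the T denoising steps.

    latents: T+1 flattened latents, ordered from initial noise to final latent.
    Returns a list of T step-level rewards that sum to terminal_reward.
    """
    if len(latents) < 2:
        raise ValueError("need at least two latents (one denoising step)")
    final = latents[-1]
    sims = [cosine_similarity(z, final) for z in latents]
    # Contribution of step t: how much it raised similarity to the final latent.
    contrib = [sims[t + 1] - sims[t] for t in range(len(sims) - 1)]
    smoothed = moving_average(contrib, window)
    # Assumption (not from the paper): clip negative contributions to zero
    # and normalize so the weights sum to 1.
    clipped = [max(c, 0.0) for c in smoothed]
    total = sum(clipped)
    if total <= 0.0:
        # No step moved toward the final latent: fall back to uniform weights.
        weights = [1.0 / len(clipped)] * len(clipped)
    else:
        weights = [c / total for c in clipped]
    return [w * terminal_reward for w in weights]
```

Calling `step_rewards(latents, terminal_reward)` on a stored denoising trajectory yields one reward per step, with larger values for steps that moved the latent more toward the final result. Because the weights sum to 1, the per-step rewards add up to the original terminal reward.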
