A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
Shentao Yang, Tianqi Chen, Mingyuan Zhou
TL;DR
This work reframes the alignment of text-to-image diffusion models with user preference as a dense reward learning problem over the diffusion reverse chain. By introducing temporal discounting with $\gamma<1$, the method emphasizes the early steps that shape high-level image structure, and it derives a tractable Bradley–Terry–style loss that avoids explicit trajectory rewards. The approach is implemented in an off-policy, LoRA-enabled setup and is shown to outperform trajectory-level baselines on single- and multi-prompt tasks, both quantitatively (ImageReward, Aesthetic) and qualitatively (trajectory analyses, human win rates). The results suggest that dense, per-step guidance aligns generations with user preferences more efficiently and effectively, with potential extensions to real human feedback and broader diffusion-based modalities.
Abstract
Aligning text-to-image (T2I) diffusion models with preference has been gaining increasing research attention. While prior works directly optimize T2I models from preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take a finer dense-reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives to break their temporal symmetry and suit the T2I generation hierarchy. In experiments on single- and multiple-prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach.
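To make the effect of temporal discounting concrete, the following is a minimal sketch of a discounted, DPO-style pairwise loss. It is an illustration under stated assumptions, not the authors' exact objective: the function name `discounted_preference_loss`, the arguments `logratio_win` and `logratio_lose` (per-step log-ratios of the trained policy to a reference model along the preferred and dispreferred reverse chains), the scale `beta`, and the convention that index 0 is the first (noisiest) reverse step are all hypothetical choices made for this example.

```python
import math


def discounted_preference_loss(logratio_win, logratio_lose, gamma=0.9, beta=1.0):
    """Bradley-Terry-style loss on a temporally discounted log-ratio margin.

    Illustrative sketch only. Each input is a list of per-step values
    log pi_theta(x_{t-1} | x_t, c) - log pi_ref(x_{t-1} | x_t, c),
    ordered from the first reverse step (index 0) to the last (assumption).
    """
    if len(logratio_win) != len(logratio_lose):
        raise ValueError("both trajectories must have the same number of steps")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")

    # Weight step t by gamma**t, so that gamma < 1 emphasizes early steps.
    margin = sum(
        (gamma ** t) * (w - l)
        for t, (w, l) in enumerate(zip(logratio_win, logratio_lose))
    )

    # -log(sigmoid(beta * margin)) written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# The same per-step advantage placed early versus late in the chain.
early = discounted_preference_loss([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], gamma=0.5)
late = discounted_preference_loss([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5)
print(early < late)
```

With `gamma=1.0` the two cases give the same loss, which is the temporal symmetry the abstract refers to. With `gamma<1` an advantage at an early step lowers the loss more than the same advantage at a late step.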
