A Dense Reward View on Aligning Text-to-Image Diffusion with Preference

Shentao Yang, Tianqi Chen, Mingyuan Zhou

TL;DR

This work reframes text-to-image diffusion alignment with user preference as a dense reward learning problem over the diffusion reverse chain. By introducing temporal discounting with $\gamma<1$, the method emphasizes early steps that shape high-level image structure, and derives a tractable Bradley–Terry–style loss that avoids explicit trajectory rewards. The approach is implemented in an off-policy, LoRA-enabled setup and is shown to outperform trajectory-level baselines on single and multi-prompt tasks, both quantitatively (ImageReward, Aesthetic) and qualitatively (trajectory analyses, human win rates). The results suggest that dense, per-step guidance aligns generations more efficiently and effectively with user preferences, with potential extensions to real human feedback and broader diffusion-based modalities.

Abstract

Aligning text-to-image diffusion models (T2I) with preference has been gaining increasing research attention. While prior works exist on directly optimizing T2I with preference data, these methods are developed under the bandit assumption of a latent reward on the entire diffusion reverse chain, ignoring the sequential nature of the generation process. This may harm the efficacy and efficiency of preference alignment. In this paper, we take a finer, dense-reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into DPO-style explicit-reward-free objectives, to break the temporal symmetry therein and suit the T2I generation hierarchy. In experiments on single and multiple prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further investigations are conducted to illustrate the insight of our approach.
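To make the role of temporal discounting concrete, the following is a minimal sketch of a discounted, Bradley–Terry-style preference loss. It is not the paper's exact objective: the function name `discounted_preference_loss`, the scale `beta`, and the convention that index 0 is the first (noisiest) step of the reverse chain are all assumptions made for illustration. The inputs are hypothetical per-step log-ratios $\log \pi_\theta / \pi_{\mathrm{ref}}$ for a preferred and a dispreferred trajectory.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discounted_preference_loss(logratio_win, logratio_lose, gamma=0.9, beta=1.0):
    """Illustrative discounted Bradley-Terry loss (assumed form, not the paper's).

    logratio_win / logratio_lose: per-step log(pi_theta / pi_ref) along the
    reverse chain of the preferred / dispreferred trajectory; index 0 is
    assumed to be the first reverse step.
    gamma: discount factor; gamma < 1 up-weights early steps, gamma = 1
    recovers a temporally symmetric (trajectory-level) sum.
    beta: hypothetical scale on the preference margin.
    """
    if len(logratio_win) != len(logratio_lose):
        raise ValueError("trajectories must have the same number of steps")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")

    # Discounted sum of per-step log-ratio differences.
    margin = beta * sum(
        (gamma ** t) * (w - l)
        for t, (w, l) in enumerate(zip(logratio_win, logratio_lose))
    )
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Same per-step advantage, placed at the first vs. the last step.
    early = discounted_preference_loss([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], gamma=0.5)
    late = discounted_preference_loss([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5)
    print(f"early-step advantage loss: {early:.4f}")
    print(f"late-step advantage loss:  {late:.4f}")
```

With `gamma=1.0` the two calls in the usage example give the same loss, which is the temporal symmetry the abstract refers to. With `gamma<1` an advantage at an early step lowers the loss more than the same advantage at a late step.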

Paper Structure (32 sections, 6 theorems, 44 equations, 24 figures, 7 tables, 1 algorithm)


Key Result

Lemma 2.1

The optimal (regularized) policy in Eq. (eq:optimal_policy_main) under the reward-shaped MDP $\mathcal{M}'$ is the same as that in the original MDP $\mathcal{M}$.

Figures (24)

  • Figure 1: ImageReward scores for the seen prompts in the single prompt experiments. "Orig." denotes the original SD1.5. "SFT" is the supervised fine-tuned model. "Traj." denotes the classical DPO-style objective discussed in Section \ref{sec:connect_with_dpo}, i.e., assuming a trajectory-level reward. All results we produce are averages over $100$ samples. The horizontal line indicates the best baseline result.
  • Figure 2: Aesthetic scores for the seen prompts in the single prompt experiments. Number reporting and abbreviations follow Fig. \ref{fig:single_imagerew}.
  • Figure 3: Generated images in the single prompt experiment for both seen and unseen prompts (Table \ref{table:four_single_prompts}). Each comparison is generated from the same random seed. "Traj. Rew." denotes the classical DPO-style objective that assumes a trajectory-level reward (Section \ref{sec:connect_with_dpo}).
  • Figure 4: Generated images in the multiple prompt experiment from our method and baselines, with prompts. "DL" denotes Dreamlike Photoreal 2.0, the best baseline from the HPSv2 paper. "Traj. Rew." is the classical DPO-style objective that assumes a trajectory-level reward.
  • Figure 5: Generation trajectories from our method and the baselines on the prompt "A green colored rabbit." in the single prompt experiment, corresponding to the images in Fig. \ref{fig:single_prompt_generated_images}. Shown are the $\hat{\bm{x}}_0$ predicted from the latents at the specified steps of the reverse chain.
  • ...and 19 more figures

Theorems & Definitions (21)

  • Remark 2.1: Practical Rationality of $e(\tau)$
  • Definition 2.1: Reward Shaping
  • Lemma 2.1: Invariance of Optimal Policy under Reward Shaping
  • Definition 2.1
  • Remark 2.1
  • Theorem 2.2
  • Remark 2.2
  • Remark 2.2
  • Definition 2.1: Reward Shaping
  • Lemma 2.1: Invariance of Optimal Policy under Reward Shaping
  • ...and 11 more