Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo

TL;DR

This work introduces DiNa-LRM, a diffusion-native latent reward model that learns human preferences directly on noisy diffusion states. By extending the Thurstone framework with diffusion-noise-dependent uncertainty and building a latent-space reward head on a pretrained diffusion backbone, the method enables test-time noise ensembling for robust scoring while avoiding pixel-space rewards. Across image-alignment benchmarks, DiNa-LRM outperforms diffusion-based baselines and approaches state-of-the-art VLM rewards at a fraction of the cost, while also improving preference optimization dynamics in post-training alignment. This diffusion-native approach offers a practical, scalable alternative to VLM rewards, reducing memory and compute overhead and mitigating latent-to-pixel mismatches during reward-guided alignment.

Abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
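The inference-time noise ensembling mentioned above can be illustrated with a minimal sketch: a clean latent is perturbed at several noise levels, each noisy state is scored by a timestep-conditioned reward function, and the scores are averaged. Everything named here is an assumption for illustration, not the authors' implementation: `add_noise` uses a rectified-flow style interpolation, `toy_reward` is a hypothetical stand-in for the learned reward head, and plain averaging is only one possible aggregation.

```python
import random
from typing import Callable, List, Sequence

Latent = Sequence[float]


def add_noise(x0: Latent, t: float, rng: random.Random) -> List[float]:
    """Perturb a clean latent to noise level t in [0, 1].

    Assumed form (rectified-flow interpolation): x_t = (1 - t) * x0 + t * eps.
    """
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in x0]


def ensemble_reward(
    reward_fn: Callable[[Latent, float], float],
    x0: Latent,
    timesteps: Sequence[float],
    seed: int = 0,
) -> float:
    """Average the reward over independently noised copies of x0."""
    rng = random.Random(seed)
    scores = [reward_fn(add_noise(x0, t, rng), t) for t in timesteps]
    return sum(scores) / len(scores)


def toy_reward(x_t: Latent, t: float) -> float:
    """Hypothetical stand-in for a timestep-conditioned reward model."""
    return sum(x_t) / len(x_t)


if __name__ == "__main__":
    latent = [0.5, -0.2, 1.0, 0.3]
    # More timesteps means more reward evaluations and a lower-variance score.
    print(ensemble_reward(toy_reward, latent, timesteps=[0.2, 0.4, 0.6]))
```

Increasing the number of sampled timesteps trades extra reward evaluations for a steadier score, which is the sense in which ensembling acts as a test-time scaling knob.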

Paper Structure

This paper contains 50 sections, 18 equations, 7 figures, 8 tables, 2 algorithms.

Figures (7)

  • Figure 1: Overview of DiNa-LRM. Left: Diffusion-native Preference Learning. During training, clean preference pairs $(x_0^+, x_0^-, c)$ are perturbed to noisy states $(x_t^+, x_t^-)$ and evaluated by a time-conditioned reward model $r_\theta$. We employ a noise-calibrated Thurstone likelihood where comparison variance scales with the diffusion noise level $\sigma_t$, and optimize via a fidelity loss $\mathcal{L}$. Right: Latent Reward Architecture. Multi-layer visual and text features extracted from a latent diffusion backbone are FiLM-modulated by timestep embeddings $t_{emb}$. These features are aggregated through a gated Q-Former and an MLP to produce a scalar reward score.
  • Figure 2: Effect of Inference-time Noise Level. Uniform sampling performs consistently well across a wide range of $t$, with accuracy peaking at mid-range timesteps.
  • Figure 3: Training Curves (ReFL on SD3.5-M). We optimize with either HPSv3 or DiNa-LRM (Ours) as the proxy reward. We report the optimized proxy score (right) and an external held-out golden metric (PickScore; left). DiNa-LRM improves the proxy score faster while the golden metric increases in tandem.
  • Figure 4: Efficiency Analysis. Peak VRAM (top) and per-step TFLOPs (bottom) are reported for a single ReFL optimization step.
  • Figure 5: Training Curves (Flow-GRPO-Fast on SD3.5-M). We optimize with DiNa-LRM as the proxy reward. We report the optimized proxy score (right) and an external held-out golden metric (PickScore; left).
  • ...and 2 more figures
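
The training objective summarized in the Figure 1 caption can be sketched in a few lines. This is a hedged illustration, not the paper's exact formulation: the caption only states that the comparison variance scales with $\sigma_t$ and that a fidelity loss is used. The specific variance form below (`base_std**2 + sigma_t**2` per item) and the parameter `base_std` are assumptions, and the fidelity loss is written in its commonly used form $1 - \sqrt{p\,\hat{p}} - \sqrt{(1-p)(1-\hat{p})}$.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def thurstone_prob(r_pos: float, r_neg: float, sigma_t: float,
                   base_std: float = 1.0) -> float:
    """Probability that the preferred sample wins under a Thurstone model.

    Assumption: each noisy score has variance base_std**2 + sigma_t**2, so the
    score difference has twice that variance. Noisier states therefore yield
    less confident comparisons.
    """
    std = math.sqrt(2.0 * (base_std ** 2 + sigma_t ** 2))
    return normal_cdf((r_pos - r_neg) / std)


def fidelity_loss(p_pred: float, p_true: float = 1.0) -> float:
    """Fidelity loss between predicted and target preference probabilities.

    A small epsilon inside the square roots is often added for gradient
    stability; it is omitted here for clarity.
    """
    return (1.0
            - math.sqrt(p_pred * p_true)
            - math.sqrt((1.0 - p_pred) * (1.0 - p_true)))


if __name__ == "__main__":
    # Same reward gap, two noise levels: the noisier comparison is less certain.
    for sigma_t in (0.1, 1.0):
        p = thurstone_prob(r_pos=1.0, r_neg=0.0, sigma_t=sigma_t)
        print(sigma_t, p, fidelity_loss(p))
```

Under these assumptions, a fixed reward gap produces a preference probability closer to 0.5 at high noise, so heavily noised pairs are not penalized as if they were clean comparisons.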