Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
TL;DR
This work introduces DiNa-LRM, a diffusion-native latent reward model that learns human preferences directly on noisy diffusion states. By extending the Thurstone framework with diffusion-noise-dependent uncertainty and building a latent-space reward head on a pretrained diffusion backbone, the method enables test-time noise ensembling for robust scoring while avoiding pixel-space rewards. Across image-alignment benchmarks, DiNa-LRM outperforms diffusion-based baselines and approaches state-of-the-art VLM rewards at a fraction of the cost, while also improving preference optimization dynamics in post-training alignment. This diffusion-native approach offers a practical, scalable alternative to VLM rewards, reducing memory and compute overhead and mitigating latent-to-pixel mismatches during reward-guided alignment.
Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward providers, leveraging their rich multimodal priors to guide alignment. However, their computation and memory costs can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization experiments, we demonstrate that DiNa-LRM improves training dynamics, enabling faster and more resource-efficient model alignment.
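To make the two central ideas more concrete, the following is a minimal, illustrative sketch and not the paper's implementation. It assumes the standard Thurstone form, in which the preference probability is a normal CDF of the reward difference scaled by the combined uncertainty, and it lets that uncertainty depend on the noise level t. The linear schedule in `noise_dependent_sigma`, the flow-matching-style interpolation used to noise the latent, the toy `reward_fn`, and all function names are hypothetical stand-ins. The abstract does not specify any of these details.

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF, computed with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def noise_dependent_sigma(t, sigma_min=0.5, sigma_max=2.0):
    """Hypothetical schedule: uncertainty grows linearly with noise level t in [0, 1]."""
    return sigma_min + (sigma_max - sigma_min) * t


def thurstone_preference_prob(r_a, r_b, sigma_a, sigma_b):
    """P(a preferred over b) when each score carries independent Gaussian noise."""
    scale = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return normal_cdf((r_a - r_b) / scale)


def thurstone_nll(r_win, r_lose, t_win, t_lose, eps=1e-12):
    """Negative log-likelihood of one labeled pair scored at noise levels t_win, t_lose."""
    p = thurstone_preference_prob(
        r_win, r_lose, noise_dependent_sigma(t_win), noise_dependent_sigma(t_lose)
    )
    return -math.log(max(p, eps))


def ensemble_reward(latent, reward_fn, timesteps, samples_per_t, rng):
    """Average reward_fn over several independently noised copies of a clean latent."""
    scores = []
    for t in timesteps:
        for _ in range(samples_per_t):
            # Assumed noising rule: linear interpolation between latent and Gaussian noise.
            noisy = [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in latent]
            scores.append(reward_fn(noisy, t))
    return sum(scores) / len(scores)


def toy_reward_fn(noisy_latent, t):
    """Toy stand-in for a timestep-conditioned reward head (ignores t)."""
    return sum(noisy_latent) / len(noisy_latent)
```

In this sketch, a pair scored at a high noise level is pulled toward a preference probability of 0.5, so it contributes a weaker training signal than the same pair scored at a low noise level. Averaging over more timesteps or samples in `ensemble_reward` trades extra forward passes for a lower-variance score, which is the sense in which ensembling acts as test-time scaling here.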
