Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
TL;DR
Step-level preference optimization for diffusion models typically relies on pixel-space reward models, which struggle with noisy images at intermediate timesteps and require complex transformations from latent space into pixel space. The paper proposes a Latent Reward Model (LRM) that repurposes a pre-trained diffusion model to predict step-level preferences directly from noisy latent images, together with Multi-Preference Consistent Filtering (MPCF) and Latent Preference Optimization (LPO), so that the whole pipeline operates in latent space. Experiments show that LPO improves general, aesthetic, and text-image alignment preference scores while training 2.5-28x faster than existing preference optimization methods. The work offers a practical, efficient route to preference optimization for diffusion-based image generation by using the diffusion model's own ability to process noisy latents.
Abstract
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models struggle to handle noisy images at different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
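To make the idea of a step-level preference concrete, the following minimal sketch shows how two reward scores for a pair of noisy latents at the same timestep could be turned into a preference probability and a training loss. It uses the standard Bradley-Terry pairwise formulation, which is an assumption for illustration: the abstract does not state the exact objective used by LRM or LPO. The function names and the example scores are hypothetical, and the scores stand in for the outputs of a latent reward model.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that latent A is preferred over latent B.

    reward_a and reward_b are assumed to be scalar scores that a
    (hypothetical) latent reward model assigns to two noisy latents
    at the same timestep for the same prompt.
    """
    margin = reward_a - reward_b
    # Numerically stable sigmoid of the reward margin.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the preferred latent under Bradley-Terry.

    Equals log(1 + exp(-margin)), computed in a stable form.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Made-up scores for two noisy latents at one timestep.
    score_preferred, score_rejected = 1.3, 0.4
    print(preference_probability(score_preferred, score_rejected))
    print(pairwise_preference_loss(score_preferred, score_rejected))
```

The loss shrinks as the preferred latent's score exceeds the rejected one's by a larger margin, and it equals log 2 when the two scores are identical, that is, when the model expresses no preference.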
