Pixel-wise RL on Diffusion Models: Reinforcement Learning from Rich Feedback
Mo Kordzanganeh, Danial Keshvary, Nariman Arian
TL;DR
The paper addresses the sparse, global reward signal that makes prior RLHF methods for latent diffusion models sample-inefficient. It introduces Pixel-wise Policy Optimisation (PXPO), a pixel-level extension of DDPO that replaces the single final-image reward with per-pixel rewards and per-pixel conditional likelihoods, formalised as $r(x_0,c) = \sum_{i,j} r(x_0^{i,j},c)$ and $\nabla_\theta \mathcal{J}_{\text{PXPO}} = \mathbb{E}[ \sum_{i,j} r(x_0^{i,j},c) \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1}^{i,j}|x_t,c) ]$. Per-pixel credit is enforced through Kronecker deltas, so the reward for one pixel weights only that pixel's likelihood and there is no cross-talk between pixels. This gives richer, more scalable guidance without training a reward model. Empirically, PXPO shows improvements in colour-based pixel control, AI-based segmentation-driven feedback, and single-image human-guided refinement, with faster and more targeted alignment to user intent. Overall, PXPO offers a practical path to fine-grained alignment of diffusion-based image generators by using pixel-wise feedback directly in the DDIM RLHF loop.
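The difference between a global reward and per-pixel credit can be illustrated with a small NumPy sketch. It rests on assumptions that are not from the paper: each denoising step is an isotropic Gaussian with a fixed standard deviation, so the log-likelihood factorises over pixels; the gradient is taken with respect to the predicted mean rather than the network parameters $\theta$; and all function names and toy shapes are hypothetical.

```python
import numpy as np


def per_pixel_log_prob(x_prev, mean, sigma):
    """Log-density of every pixel of x_prev under N(mean, sigma^2).

    Assumes an isotropic Gaussian denoising step (illustrative choice).
    """
    return -0.5 * ((x_prev - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))


def score_wrt_mean(x_prev, mean, sigma):
    """Derivative of the per-pixel log-density with respect to the mean."""
    return (x_prev - mean) / sigma ** 2


def ddpo_mean_gradient(x_prev, mean, sigma, pixel_rewards):
    """DDPO-style weighting: one scalar reward scales every pixel's score."""
    return pixel_rewards.sum() * score_wrt_mean(x_prev, mean, sigma)


def pxpo_mean_gradient(x_prev, mean, sigma, pixel_rewards):
    """PXPO-style weighting: pixel (i, j)'s reward scales only pixel (i, j)'s score."""
    return pixel_rewards[None, :, :] * score_wrt_mean(x_prev, mean, sigma)


# Toy trajectory: T denoising steps over an H x W image (hypothetical sizes).
rng = np.random.default_rng(0)
T, H, W = 4, 3, 3
sigma = 0.5
mean = rng.normal(size=(T, H, W))                    # stand-in for the model's predicted means
x_prev = mean + sigma * rng.normal(size=(T, H, W))   # sampled x_{t-1} at each step

# Feedback touches a single pixel; every other pixel gets zero reward.
pixel_rewards = np.zeros((H, W))
pixel_rewards[1, 2] = 1.0

g_ddpo = ddpo_mean_gradient(x_prev, mean, sigma, pixel_rewards)
g_pxpo = pxpo_mean_gradient(x_prev, mean, sigma, pixel_rewards)
```

In this toy setting `g_pxpo` is zero at every pixel that received no feedback, whereas `g_ddpo` spreads the same scalar across the whole image, which is the cross-talk that the per-pixel formulation is meant to remove.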
Abstract
Latent diffusion models are the state of the art in synthetic image generation. To align these models with human preferences, training them using reinforcement learning on human feedback is crucial. Black et al. (2024) introduced denoising diffusion policy optimisation (DDPO), which accounts for the iterative denoising nature of the generation by modelling it as a Markov chain with a final reward. Because the reward is a single value that scores the model's performance on the entire image, the model has to navigate a very sparse reward landscape and so requires a large number of samples. In this work, we extend DDPO by presenting the Pixel-wise Policy Optimisation (PXPO) algorithm, which can take feedback for each pixel, providing a more nuanced reward to the model.
