Pixel-wise RL on Diffusion Models: Reinforcement Learning from Rich Feedback

Mo Kordzanganeh, Danial Keshvary, Nariman Arian

TL;DR

The paper tackles the problem of aligning latent diffusion models with human preferences efficiently by addressing the sparse, global reward signal used in prior RLHF methods. It introduces Pixel-wise Policy Optimisation (PXPO), a pixel-level extension of DDPO that replaces a single final-image reward with per-pixel rewards and per-pixel conditional likelihoods, formalised as $r(x_0,c) = \sum_{i,j} r(x_0^{i,j},c)$ and $\nabla_\theta \mathcal{J}_{\text{PXPO}} = \mathbb{E}[ \sum_{i,j} r(x_0^{i,j},c) \sum_{t=0}^{T} \nabla_\theta \log p_\theta(x_{t-1}^{i,j}|x_t,c) ]$. This approach eliminates cross-talk between pixels by enforcing per-pixel credit through Kronecker deltas, enabling richer and more scalable guidance without training a reward model. Empirically, PXPO shows improvements in colour-based pixel control, AI-based segmentation-driven feedback, and single-image human-guided refinements, demonstrating faster and more targeted alignment with user intents. Overall, PXPO offers a practical path for fine-grained alignment of diffusion-based image generators by leveraging pixel-wise feedback directly in the DDIM RLHF loop.
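The difference between the two weightings in the formula above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes each denoising transition is an isotropic Gaussian with a fixed `sigma`, and it takes the gradient with respect to the predicted per-pixel mean (the quantity that would be backpropagated into $\theta$) rather than $\theta$ itself. The function names, shapes, and reward values are hypothetical.

```python
import numpy as np

def gaussian_logp_grad_mean(x_prev, mean, sigma):
    """Gradient of a per-pixel Gaussian log-likelihood w.r.t. its mean."""
    return (x_prev - mean) / sigma**2

def ddpo_weighting(reward_map, logp_grads):
    # Scalar reward: the summed reward scales every pixel's gradient.
    return reward_map.sum() * logp_grads.sum(axis=0)

def pxpo_weighting(reward_map, logp_grads):
    # Pixel-wise reward: each pixel's reward scales only its own gradient.
    return reward_map * logp_grads.sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 4, 3, 3            # denoising steps and a toy latent size (assumed)
sigma = 0.5                  # fixed transition std (assumed)
means = rng.normal(size=(T, H, W))
x_prev = means + sigma * rng.normal(size=(T, H, W))
grads = gaussian_logp_grad_mean(x_prev, means, sigma)   # shape (T, H, W)

# Toy feedback: three pixels receive a reward, the rest receive zero.
reward_map = np.zeros((H, W))
reward_map[0, 0] = 2.0
reward_map[0, 1] = 2.0
reward_map[2, 2] = -2.0

g_ddpo = ddpo_weighting(reward_map, grads)
g_pxpo = pxpo_weighting(reward_map, grads)

print("pixels updated under scalar reward:", np.count_nonzero(g_ddpo))
print("pixels updated under pixel-wise reward:", np.count_nonzero(g_pxpo))
```

Under the scalar weighting every pixel is pushed by the same summed reward, whereas under the pixel-wise weighting pixels with zero reward receive no update, which is the per-pixel credit assignment the TL;DR describes.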

Abstract

Latent diffusion models are the state of the art for synthetic image generation. To align these models with human preferences, training them using reinforcement learning on human feedback is crucial. Black et al. (2024) introduced denoising diffusion policy optimisation (DDPO), which accounts for the iterative denoising nature of the generation by modelling it as a Markov chain with a final reward. As the reward is a single value that determines the model's performance on the entire image, the model has to navigate a very sparse reward landscape and so requires a large sample count. In this work, we extend DDPO by presenting the Pixel-wise Policy Optimisation (PXPO) algorithm, which can take feedback for each pixel, providing a more nuanced reward to the model.

Paper Structure (14 sections, 8 equations, 7 figures)

Figures (7)

  • Figure 1: The result of training the PXPO on a single image. The original image was created using the prompt: nature landscape, and then two approaches were followed: (top) reducing the trees, increasing the lake; and (bottom) reducing the lake, increasing the trees. At each step, a human participant identified the trees and lake and provided feedback accordingly. The figure above shows the human feedback for both cases for the first step, where red indicates a $-2$, green a $+2$, and black a reward of $0$ for the corresponding pixel. After only 15 steps, the same image was aligned dramatically differently with the user's two objectives. This is also evident in the improvement of the mean reward (taken over all pixels) for each task.
  • Figure 2: The PXPO algorithm pipeline. The procedure begins by sampling an initial latent noise from $\mathcal{N}(0,1)$. At each denoising step, we keep the gradients of the pixel-wise log probabilities. The pixel-wise feedback is then collected from the black-box feedback function, which in this example is the blue channel of the image. It is then downsampled using interpolation to match the size of the latent images, which turns it into the reward. The downsampled reward is multiplied element-wise by each log-likelihood gradient, and the mean of these gradients across both the time and pixel dimensions is taken to update the model according to Eqn. \ref{eqn:pxpo_grad}.
  • Figure 3: The PXPO algorithm effectively reduced the value of the blue channel, while keeping the context intact. The top image is the original and the downward direction indicates evolution.
  • Figure 4: Reward plot of the colour-based, pixel-wise feedback.
  • Figure 5: PXPO minimising the visible hair in the image. In this example, the SegFormer model detected parts of the man's hat (top image) as hair, so the model received negative rewards in those areas. After some exploration involving removing the hat, the model realised it could put the hat back on in a way that is undetected by the SegFormer model.
  • ...and 2 more figures
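
The reward path described in the Figure 2 caption (feedback, downsampling, element-wise weighting, mean) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the caption only says "interpolation", so block averaging is used here as a stand-in; the negative sign on the blue channel is assumed from Figure 3, where the blue channel is reduced; the downsampling factor of 8 and all function names are hypothetical; and random numbers stand in for the per-pixel log-probabilities that a real denoising loop would produce.

```python
import numpy as np

def blue_channel_feedback(image):
    """Hypothetical black-box feedback: penalise blue. image has shape (H, W, 3)."""
    return -image[..., 2]

def downsample_block_mean(feedback, factor):
    """Shrink (H, W) feedback to latent size by averaging factor x factor blocks."""
    H, W = feedback.shape
    h, w = H // factor, W // factor
    blocks = feedback[:h * factor, :w * factor].reshape(h, factor, w, factor)
    return blocks.mean(axis=(1, 3))

def pxpo_surrogate(pixel_logps, reward):
    """Weight per-pixel log-probs (T, h, w) by the reward (h, w), then average
    over both the time and pixel dimensions."""
    return float((reward[None, :, :] * pixel_logps).mean())

rng = np.random.default_rng(0)
image = rng.uniform(size=(16, 16, 3))        # stand-in for a decoded image
feedback = blue_channel_feedback(image)      # pixel-space feedback, (16, 16)
reward = downsample_block_mean(feedback, 8)  # latent-space reward, (2, 2)

T = 5                                        # number of denoising steps (assumed)
pixel_logps = rng.normal(size=(T, 2, 2))     # stand-in for log p(x_{t-1}^{i,j} | x_t, c)
objective = pxpo_surrogate(pixel_logps, reward)
print(reward.shape, objective)
```

In an autodiff framework, `pixel_logps` would be computed from the model's predicted transition parameters, and differentiating `objective` with respect to the model weights would give the reward-weighted mean of log-likelihood gradients that the caption describes.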