Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser

Zijing Ou; Jacob Si; Junyi Zhu; Ondrej Bohdal; Mete Ozay; Taha Ceritli; Yingzhen Li

TL;DR

Diffusion alignment seeks samples from a reward-tilted target along the denoising trajectory, denoted $p_{\mathrm{tilt}}$. The paper introduces VMPO, which reframes alignment as minimising the variance of log importance weights under a Sequential Monte Carlo view, and proves that the optimum equals $p_{\mathrm{tilt}}$. It shows that, under on-policy sampling, the gradient of the VMPO objective matches that of the KL divergence $\mathrm{KL}(p_\theta \,\|\, p_{\mathrm{tilt}})$. A practical VMPO objective uses a neural baseline $M_\phi(t)$ to estimate $\mathbb{E}_h[\log w_t]$, with two variants, VMPO-R2G and VMPO-Diff. Empirically, VMPO improves reward metrics on Stable Diffusion 1.5/3.5 across several rewards and is more sample-efficient, but it still exhibits reward hacking and reward-diversity trade-offs.
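The role of a baseline that estimates the mean log-weight can be illustrated with a small numerical sketch. It is not the paper's implementation: the scalar `m` is a hypothetical stand-in for the neural baseline $M_\phi(t)$, the log-weights are synthetic, and the squared-deviation loss is an assumption about how such a baseline might enter the objective. The sketch only relies on the standard fact that the constant minimising the mean squared deviation is the mean, at which point the loss equals the variance.

```python
import numpy as np

def baseline_loss(log_w, m):
    """Mean squared deviation of log-weights from a scalar baseline m."""
    return float(np.mean((log_w - m) ** 2))

def fit_baseline(log_w, lr=0.1, steps=500):
    """Fit the scalar baseline by plain gradient descent.

    The scalar m is a hypothetical stand-in for a neural baseline M_phi(t).
    """
    m = 0.0
    for _ in range(steps):
        grad = -2.0 * np.mean(log_w - m)  # derivative of mean((log_w - m)^2) w.r.t. m
        m -= lr * grad
    return m

rng = np.random.default_rng(0)
# Synthetic log importance weights, for illustration only.
log_w = rng.normal(loc=1.5, scale=0.7, size=4096)

m_hat = fit_baseline(log_w)
print("fitted baseline:", m_hat, "sample mean:", log_w.mean())
print("loss at baseline:", baseline_loss(log_w, m_hat), "sample variance:", log_w.var())
```

With the baseline at its optimum, minimising this loss over the sampler's parameters amounts to minimising the variance of the log-weights, without having to compute the mean in closed form.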

Abstract

Diffusion alignment adapts pretrained diffusion models to sample from reward-tilted distributions along the denoising trajectory. This process naturally admits a Sequential Monte Carlo (SMC) interpretation, where the denoising model acts as a proposal and reward guidance induces importance weights. Motivated by this view, we introduce Variance Minimisation Policy Optimisation (VMPO), which formulates diffusion alignment as minimising the variance of log importance weights rather than directly optimising a Kullback-Leibler (KL) based objective. We prove that the variance objective is minimised by the reward-tilted target distribution and that, under on-policy sampling, its gradient coincides with that of standard KL-based alignment. This perspective offers a common lens for understanding diffusion alignment. Under different choices of potential functions and variance minimisation strategies, VMPO recovers various existing methods, while also suggesting new design directions beyond KL.
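The claim that the variance of log importance weights is minimised by the reward-tilted target can be checked on a toy one-dimensional problem. The setup below is an illustrative assumption, not the paper's experiment: a unit-variance Gaussian `N(0, 1)` stands in for the pretrained denoising kernel, the reward is linear, `r(x) = a * x`, and the proposal is `N(mu, 1)`. Under these assumptions the tilted target is `N(a / beta, 1)`, so the log-weight becomes constant, and its variance vanishes, exactly when `mu = a / beta`.

```python
import numpy as np

def log_normal(x, mean):
    """Log-density of a unit-variance Gaussian."""
    return -0.5 * (x - mean) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_weight_variance(mu, x, a=2.0, beta=1.0):
    """Sample variance of log w = log p_ref(x) + r(x)/beta - log p_mu(x).

    p_ref = N(0, 1), r(x) = a * x, p_mu = N(mu, 1) are toy assumptions.
    """
    log_w = log_normal(x, 0.0) + a * x / beta - log_normal(x, mu)
    return float(np.var(log_w))

rng = np.random.default_rng(0)
# Samples from an arbitrary reference sampler h; it need not equal the proposal.
x = rng.normal(loc=0.5, scale=1.0, size=10_000)

mus = np.linspace(0.0, 4.0, 9)  # candidate proposal means, including a / beta = 2
variances = [log_weight_variance(mu, x) for mu in mus]
best_mu = float(mus[int(np.argmin(variances))])
print("proposal mean with smallest log-weight variance:", best_mu)
```

The reference sampler is deliberately off-target here: the variance is zero at the tilted target regardless of where the samples come from, which is what makes the objective usable off-policy in this toy setting.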

Paper Structure (21 sections, 2 theorems, 59 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Background: Diffusion Alignment
VMPO: Variance Minimisation Policy Optimiser
Experiments
Conclusion and Limitation
Diffusion Alignment: a Tale of Two Views
Diffusion Alignment as Policy Optimisation
Diffusion Alignment as Probability Inference
REINFORCE, PPO, and GRPO
Proofs and Derivations
Proof of \ref{prop:optimal-proposal}
Proof of \ref{eq:graidnet-monte-carlo-gradient}
Holistic VMPO Kaleidoscopes
Demystifying VMPO
VMPO Kaleidoscopes
...and 6 more sections

Key Result

Proposition 1

The optimum of $\mathcal{L}^{h}_{\mathrm{Var}}(t;\theta)$ satisfies $p_{\theta^*} = p_{\mathrm{tilt}}(x_{t-1} | x_t)$, where $\theta^* = \mathop{\mathrm{argmin}}\limits_\theta \mathcal{L}^{h}_{\mathrm{Var}}(t;\theta)$. Moreover, $\left. \nabla_\theta \mathcal{L}^{h}_{\mathrm{Var}}(t;\theta) \right|_{h=p_\
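The on-policy gradient relation described in the TL;DR and abstract can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's proof: it assumes $\mathcal{L}^{h}_{\mathrm{Var}}(t;\theta) = \mathrm{Var}_h[\log w_t]$ with $\log w_t = \log \tilde{p}(x_{t-1}|x_t) - \log p_\theta(x_{t-1}|x_t)$, where $\tilde{p}$ (a symbol introduced here) is an unnormalised version of $p_{\mathrm{tilt}}$ and the sampler $h$ is held fixed when differentiating.

```latex
\begin{align*}
\nabla_\theta \mathrm{Var}_h[\log w_t]
  &= 2\,\mathbb{E}_h\!\left[\big(\log w_t - \mathbb{E}_h[\log w_t]\big)\,\nabla_\theta \log w_t\right] \\
  &= -2\,\mathbb{E}_h\!\left[\big(\log w_t - \mathbb{E}_h[\log w_t]\big)\,\nabla_\theta \log p_\theta\right]. \\
\intertext{Setting $h = p_\theta$ and using $\mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta] = 0$, the constant term drops:}
\left.\nabla_\theta \mathrm{Var}_h[\log w_t]\right|_{h=p_\theta}
  &= 2\,\mathbb{E}_{p_\theta}\!\left[\log\frac{p_\theta}{\tilde{p}}\,\nabla_\theta \log p_\theta\right]
   = 2\,\nabla_\theta\,\mathrm{KL}\big(p_\theta \,\|\, p_{\mathrm{tilt}}\big).
\end{align*}
```

The same zero-mean score identity removes the normalising constant of $\tilde{p}$, which is why the unnormalised and normalised targets give the same gradient. Under these assumptions the two gradients agree up to the constant factor 2; whether the paper's objective absorbs that factor (for example through a $\tfrac{1}{2}$ in its definition) cannot be determined from the text shown here.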

Figures (8)

Figure 1: Visualisation of alignment dynamics over the training progress of SD1.5 with HPSv2. The generated images become more faithful to the prompt as the training continues (from left to right).
Figure 2: Illustration of the generated samples of different models.
Figure 3: HPSv2 convergence curves of SD1.5.
Figure 4: ImageReward convergence curves of SD1.5.
Figure 5: OCR accuracy convergence curves of SD3.5-M.
...and 3 more figures

Theorems & Definitions (3)

Proposition 1
Proposition 1
proof
