SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai

TL;DR

This work introduces Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting schemes used in diffusion RL to general monotonic functions and provides a principled justification and practical guidance for negative reweighting.

Abstract

A commonly used family of RL algorithms for diffusion policies applies softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting schemes used in diffusion RL to general monotonic functions. SiMPO revisits diffusion RL through a two-stage measure matching lens. First, we construct a virtual target policy via $f$-divergence-regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.
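To make the contrast between softmax reweighting and other monotonic weighting functions concrete, here is a minimal NumPy sketch. The function names, the batch-mean advantage baseline, the temperature value, and the sign-preserving form of the square weight are all illustrative assumptions, not the paper's definitions. The effective sample size is used only as a rough proxy for how greedy the reweighted target is.

```python
import numpy as np

def exp_weight(adv, temperature=1.0):
    """Softmax-style weight: always positive, so no sample is ever penalized."""
    return np.exp((adv - adv.max()) / temperature)

def linear_weight(adv, signed=False):
    """Linear weight; with signed=True, below-baseline samples get negative weight."""
    return adv if signed else np.maximum(adv, 0.0)

def square_weight(adv, signed=False):
    """Sign-preserving square (assumed form), monotonically increasing in adv."""
    w = np.sign(adv) * adv ** 2
    return w if signed else np.maximum(w, 0.0)

def effective_sample_size(w):
    """(sum w)^2 / sum w^2 for non-negative weights; smaller means greedier."""
    return w.sum() ** 2 / (w ** 2).sum()

q_values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # hypothetical Q estimates
adv = q_values - q_values.mean()                  # assumed baseline: batch mean

ess_exp = effective_sample_size(exp_weight(adv, temperature=0.5))
ess_lin = effective_sample_size(linear_weight(adv))
signed_lin = linear_weight(adv, signed=True)      # contains negative weights
print(ess_exp, ess_lin, signed_lin)
```

In this toy batch the exponential weights concentrate on fewer samples than the clipped linear weights, and only the signed variants assign negative weight to below-average samples.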

Paper Structure

This paper contains 39 sections, 7 theorems, 88 equations, 6 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.5

Assume $g$ is strictly increasing. Then the policy $\pi$ is guaranteed to improve over $\pi_{\rm old}$, i.e., $\mathbb{E}_{\pi}[Q(\bm{s}, \bm{a})]\geqslant \mathbb{E}_{\pi_{\rm old}}[Q(\bm{s}, \bm{a})]$ for all $\bm{s}\in{\mathcal{S}}$.
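The improvement claim can be checked numerically in a simplified setting. The sketch below assumes a single state, a finite action set, and a strictly increasing weighting function that is also positive, so that $\pi \propto \pi_{\rm old}\, g(Q)$ is a valid distribution; the signed case treated in the paper is not covered here. The helper name `reweight_policy` and the three choices of $g$ are hypothetical. In this setting the gap equals $\mathrm{Cov}_{\pi_{\rm old}}(g(Q), Q) / \mathbb{E}_{\pi_{\rm old}}[g(Q)]$, which is non-negative whenever $g$ is increasing.

```python
import numpy as np

def reweight_policy(pi_old, q, g):
    """Return pi proportional to pi_old * g(q); requires g(q) > 0 everywhere."""
    w = g(q)
    if np.any(w <= 0):
        raise ValueError("this sketch only handles positive weights")
    unnormalized = pi_old * w
    return unnormalized / unnormalized.sum()

rng = np.random.default_rng(0)
pi_old = rng.dirichlet(np.ones(6))   # random behavior policy over 6 actions
q = rng.normal(size=6)               # hypothetical Q-values for one state

# Three strictly increasing, positive weighting functions (illustrative choices).
weightings = {
    "exponential": np.exp,
    "shifted linear": lambda x: x - q.min() + 1.0,
    "shifted square": lambda x: (x - q.min() + 1.0) ** 2,
}

improvements = {}
for name, g in weightings.items():
    pi_new = reweight_policy(pi_old, q, g)
    improvements[name] = float(pi_new @ q - pi_old @ q)

print(improvements)
```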

Figures (6)

  • Figure 1: An illustrative comparison between existing advantage-weighted regression (top) and the proposed SiMPO framework (bottom). SiMPO takes a two-stage view: it first constructs a target measure that may be signed or unnormalized, then projects it back onto probability distributions via reweighted conditional flow matching. In particular, negative weights have a "repelling" effect that pushes generated actions away from negative samples, which improves algorithm performance.
  • Figure 2: Bandit problem to answer Q2. Left: Reward functions and initial policy. Right: Regret curves comparing reweighting functions with and without negative reweighting over 200 epochs.
  • Figure 3: Demonstration of the first bandit problem exploring Q2. Left column: Reward functions (blue) showing broad (top) vs. sharp (bottom) optima, with initial policy distributions (histograms). Right column: Regret curves comparing different reweighting schemes over 100 epochs.
  • Figure 4: Reward functions.
  • Figure 5: Performance of different reweighting functions under different reward functions on the Cheetah Run task. Square reweighting performs best with flat rewards, while linear reweighting performs best with steep rewards; both outperform exponential reweighting.
  • ...and 1 more figure
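The "repelling" effect mentioned in the Figure 1 caption can be illustrated with a toy weighted least-squares problem, used here as a stand-in for the reweighted matching objective rather than the paper's actual loss. For scalar actions $a_i$ with weights $w_i$ and $\sum_i w_i > 0$, the minimizer of $\sum_i w_i (c - a_i)^2$ is the weighted mean $\sum_i w_i a_i / \sum_i w_i$. The helper name and the example numbers are assumptions made for illustration.

```python
import numpy as np

def weighted_minimizer(actions, weights):
    """Minimizer over c of sum_i w_i * (c - a_i)^2, valid when sum_i w_i > 0."""
    total = weights.sum()
    if total <= 0:
        raise ValueError("total weight must be positive for a well-posed minimum")
    return float((weights * actions).sum() / total)

actions = np.array([-1.0, 1.0])     # one bad action (-1) and one good action (+1)
signed = np.array([-0.5, 1.0])      # the bad action receives a negative weight
clipped = np.maximum(signed, 0.0)   # conventional scheme: negatives are dropped

c_clipped = weighted_minimizer(actions, clipped)  # 1.0: sits on the good action
c_signed = weighted_minimizer(actions, signed)    # 3.0: pushed away from the bad action
print(c_clipped, c_signed)
```

With clipped weights the solution lands exactly on the good action, while the signed weights move it past the good action in the direction away from the bad one.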

Theorems & Definitions (12)

  • Example 3.1: DPMD [ma2025efficient] as forward KL regularization
  • Example 3.2: QVPO [ding2024diffusion] as $\chi^2$ regularization
  • Example 3.3: Power function reweighting as $\alpha$-divergence regularization
  • Example 3.4: wd1 [tang2025wd1] as an example of negative reweighting
  • Theorem 3.5: Guaranteed policy improvement (proof in the appendix)
  • Remark 3.6: Another interpretation from generation of reweighted distributions via signed measures
  • Theorem 3.1: Theorem 3.5 (guaranteed policy improvement), restated
  • Lemma 4.1: Conditional flow matching fits the marginal velocity [lipman2022flow]
  • Corollary 4.2: The diffusion marginal velocity field is the noisy score function
  • Theorem 4.3: Optimal solution of reweighted conditional flow matching
  • ...and 2 more
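As a rough companion to the flow matching entries above, the following is a minimal sketch of a per-sample reweighted conditional flow matching loss. It assumes a linear interpolation path from Gaussian noise to the action and omits state conditioning; the paper's exact path, conditioning, and weighting may differ. The names `reweighted_cfm_loss` and `zero_velocity` are hypothetical.

```python
import numpy as np

def reweighted_cfm_loss(velocity_fn, actions, weights, rng):
    """Weighted flow matching loss on the path x_t = (1 - t) * noise + t * action.

    weights may be signed; each sample's squared error is scaled by its weight.
    """
    n, d = actions.shape
    noise = rng.standard_normal((n, d))   # x_0 drawn from a standard Gaussian
    t = rng.uniform(size=(n, 1))          # one interpolation time per sample
    x_t = (1.0 - t) * noise + t * actions
    target = actions - noise              # conditional velocity of the linear path
    err = np.sum((velocity_fn(x_t, t) - target) ** 2, axis=1)
    return float(np.mean(weights * err))

def zero_velocity(x_t, t):
    """Placeholder model that predicts zero velocity everywhere."""
    return np.zeros_like(x_t)

actions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical samples
signed_weights = np.array([1.0, 0.5, -0.5])                 # last sample is repelled

loss = reweighted_cfm_loss(zero_velocity, actions, signed_weights,
                           np.random.default_rng(0))
print(loss)
```

In practice `velocity_fn` would be a trainable network, and negative weights make the objective unbounded below unless the total weight stays suitably positive, which is why any clipping or normalization would matter.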