SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

Haitong Ma; Chenxiao Gao; Tianyi Chen; Na Li; Bo Dai

TL;DR

This work introduces Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting schemes used in diffusion RL to general monotonic weighting functions, and that provides a principled justification and practical guidance for negative reweighting.

Abstract

A commonly used family of RL algorithms for diffusion policies applies softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes the reweighting schemes used in diffusion RL to general monotonic weighting functions. SiMPO revisits diffusion RL through a two-stage measure matching lens. First, we construct a virtual target policy by $f$-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.
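The two-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `reweight` and `reweighted_cfm_loss`, the mean-centering of Q-values, the temperature parameter, and the sign-preserving form of the "square" weighting are all assumptions made for illustration. Stage one maps Q-values to (possibly negative) sample weights with a monotone function; stage two uses those weights in a per-sample flow matching regression loss.

```python
import numpy as np

def reweight(q, kind="exp", temperature=1.0, allow_negative=False):
    """Map Q-values to sample weights with a monotonically increasing function.

    Hypothetical helper: centering, temperature, and the weighting forms
    are illustrative assumptions, not taken from the paper.
    """
    q = np.asarray(q, dtype=float)
    z = (q - q.mean()) / temperature      # below-average samples get z < 0
    if kind == "exp":                      # softmax-style weights, always positive
        w = np.exp(z - z.max())
    elif kind == "linear":
        w = z
    elif kind == "square":                 # sign-preserving square stays monotone
        w = np.sign(z) * z ** 2
    else:
        raise ValueError(f"unknown weighting kind: {kind}")
    if not allow_negative:
        w = np.maximum(w, 0.0)             # clip to a non-negative target measure
    return w

def reweighted_cfm_loss(v_pred, v_target, weights):
    """Weighted squared error between predicted and target velocities.

    Samples with negative weight contribute a term that decreases as the
    prediction moves away from their target velocity.
    """
    per_sample = np.sum((np.asarray(v_pred) - np.asarray(v_target)) ** 2, axis=-1)
    return float(np.mean(np.asarray(weights) * per_sample))
```

With `allow_negative=True`, below-average samples receive negative weights, which is the setting in which the abstract describes a repelling effect; with the default clipping, the sketch reduces to ordinary non-negative reweighting.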

Paper Structure (39 sections, 7 theorems, 88 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Preliminaries
Reinforcement Learning
Diffusion and Flow Models
$f$-divergence and Extension to Signed Measure
Extension to $p$ being a signed measure (csiszar2002mem; broniatowski2006minimization)
A Unified Framework for Reweighting with Measure Matching
A Two-Stage Framework for Diffusion Policy Optimization
Unifying Existing Reweighting Schemes via $f$-Divergences
Negative Reweighting by Extending $f$-divergence Definition to Signed Measures
Geometric Interpretation of Negative Reweighted Flow Matching
SiMPO: A Practical Algorithm
Normalization constraints
Experiments
Exploration-Exploitation Trade-off in Bandit
...and 24 more sections

Key Result

Theorem 3.5

Assume $g$ is strictly increasing. Then the policy $\pi$ is guaranteed to improve over $\pi_{\rm old}$, i.e., $\mathbb{E}_{\pi}[Q(\bm{s}, \bm{a})]\geqslant \mathbb{E}_{\pi_{\rm old}}[Q(\bm{s}, \bm{a})]$ for all $\bm{s}\in{\mathcal{S}}$.
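A small numerical check can make the statement concrete. The sketch below assumes, for a single state with finitely many actions, that the new policy has the normalized reweighted form $\pi(\bm{a}) \propto \pi_{\rm old}(\bm{a})\, g(Q(\bm{a}))$ with a positive, strictly increasing $g$. This form is an assumption for illustration; the summary here does not spell out the construction of $\pi$, and the sketch does not cover the signed-measure case. The helper names are hypothetical.

```python
import numpy as np

def reweighted_policy(pi_old, q, g):
    """Assumed target form: pi(a) proportional to pi_old(a) * g(Q(a)), g > 0."""
    w = pi_old * g(q)
    return w / w.sum()

def expected_q(pi, q):
    """Expected Q-value under a discrete policy."""
    return float(pi @ q)

rng = np.random.default_rng(0)
q = rng.normal(size=5)                 # Q-values for 5 actions in one state
pi_old = rng.dirichlet(np.ones(5))     # random behavior policy

# Positive, strictly increasing weighting functions on the range of q.
shift = 1.0 - q.min()
weightings = {
    "exp": np.exp,
    "linear": lambda x: x + shift,
    "square": lambda x: (x + shift) ** 2,
}

for name, g in weightings.items():
    pi_new = reweighted_policy(pi_old, q, g)
    # Improvement: E_pi[Q] >= E_pi_old[Q], up to floating-point tolerance.
    assert expected_q(pi_new, q) >= expected_q(pi_old, q) - 1e-12
```

Under this assumed form, the gap $\mathbb{E}_{\pi}[Q] - \mathbb{E}_{\pi_{\rm old}}[Q]$ equals the covariance of $Q$ and $g(Q)$ under $\pi_{\rm old}$ divided by $\mathbb{E}_{\pi_{\rm old}}[g(Q)]$, which is non-negative whenever $g$ is increasing and positive.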

Figures (6)

Figure 1: An illustrative comparison between existing advantage-weighted regression (top) and the proposed SiMPO framework (bottom). SiMPO takes a two-stage view: it first constructs a target measure that may be signed or unnormalized, then projects it back onto probability distributions via reweighted conditional flow matching. In particular, negative weights have a "repelling" effect that pushes generated actions away from negative samples, improving algorithm performance.
Figure 2: Bandit problem to answer Q2. Left: Reward functions and initial policy. Right: Regret curves comparing different reweighting functions, with and without negative reweighting, over 200 epochs.
Figure 3: Demonstration of the first bandit problem exploring Q2. Left column: Reward functions (blue) showing broad (top) vs. sharp (bottom) optima, with initial policy distributions (histograms). Right column: Regret curves comparing different reweighting schemes over 100 epochs.
Figure 4: Reward functions.
Figure 5: Performance of different reweighting functions under different reward functions on the Cheetah Run task. Square reweighting achieves the best performance with flat rewards, while linear reweighting achieves the best performance with steep rewards, both outperforming exponential reweighting.
...and 1 more figure

Theorems & Definitions (12)

Example 3.1: DPMD (ma2025efficient) as forward KL-regularization
Example 3.2: QVPO (ding2024diffusion) as $\chi^2$ regularization
Example 3.3: Power function reweighting as $\alpha$-divergence regularization
Example 3.4: wd1 (tang2025wd1) as an example of negative reweighting
Theorem 3.5: Guaranteed policy improvement, proof in Appendix \ref{sec.apdx.policy_improvement}
Remark 3.6: Another interpretation from generation of reweighted distributions via signed measures
Theorem 3.1: Theorem 3.5 (guaranteed policy improvement) restated
Lemma 4.1: Conditional flow matching fits marginal velocity (lipman2022flow).
Corollary 4.2: Diffusion marginal velocity field is the noisy score functions.
Theorem 4.3: Optimal solution of the reweighted conditional flow matching.
...and 2 more
