Learning a Diffusion Model Policy from Rewards via Q-Score Matching

Michael Psenka; Alejandro Escontrela; Pieter Abbeel; Yi Ma

TL;DR

Diffusion-model policies offer expressive, sample-efficient representations for continuous control, but standard training often relies on behavior-cloning terms. The authors introduce Q-score matching (QSM), a theory-grounded method that aligns the score of the diffusion policy with the action gradient of the Q-function, enabling off-policy optimization that differentiates only through the denoising model. Theoretical results show that, under both deterministic and stochastic dynamics, the optimal score Ψ aligns with ∇_a Q^Ψ, so correcting any misalignment improves the policy. Empirical results demonstrate competitive performance and multimodal behavior. The work advances diffusion-model RL by exploiting score structure for efficient, explorative policy learning, and the code is publicly available for replication.
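To make the alignment idea concrete, here is a minimal sketch of a score-to-gradient regression on a one-dimensional toy problem. It is not the authors' implementation: the critic `Q(a) = -(a - TARGET)**2` is a fixed, known function (in the paper the Q-function is learned), the score model is a hypothetical two-parameter linear function, and `ALPHA`, `TARGET`, `qsm_step`, and `train` are names assumed for illustration.

```python
import random

ALPHA = 2.0    # assumed scale applied to the Q action gradient
TARGET = 0.5   # assumed location of the best action in this toy problem


def grad_a_q(a):
    """Action gradient of the toy critic Q(a) = -(a - TARGET)**2."""
    return -2.0 * (a - TARGET)


def score(theta, a):
    """Toy score model Psi_theta(a) = theta[0] + theta[1] * a."""
    return theta[0] + theta[1] * a


def qsm_step(theta, actions, lr=0.1):
    """One gradient step on the mean of (Psi_theta(a) - ALPHA * grad_a Q(a))**2."""
    g0 = g1 = 0.0
    for a in actions:
        err = score(theta, a) - ALPHA * grad_a_q(a)
        g0 += 2.0 * err        # derivative of err**2 w.r.t. theta[0]
        g1 += 2.0 * err * a    # derivative of err**2 w.r.t. theta[1]
    n = len(actions)
    return [theta[0] - lr * g0 / n, theta[1] - lr * g1 / n]


def train(steps=2000, batch=32, seed=0):
    """Fit the score model on actions drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(steps):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(batch)]
        theta = qsm_step(theta, actions)
    return theta


if __name__ == "__main__":
    print(train())  # parameters of the fitted score model
```

In this toy setting the regression target is `ALPHA * grad_a_q(a) = 2 - 4a`, which the linear model can represent exactly, so the fitted parameters approach `[2, -4]`. Only the score model is differentiated; the critic gradient is treated as a fixed target.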

Abstract

Diffusion models have become a popular choice for representing actor policies in behavior cloning and offline reinforcement learning. This is due to their natural ability to optimize an expressive class of distributions over a continuous space. However, previous works fail to exploit the score-based structure of diffusion models, and instead utilize a simple behavior cloning term to train the actor, limiting their ability in the actor-critic setting. In this paper, we present a theoretical framework linking the structure of diffusion model policies to a learned Q-function by relating the score of the policy to the action gradient of the Q-function. We focus on off-policy reinforcement learning and propose a new policy update method from this theory, which we denote Q-score matching. Notably, this algorithm only needs to differentiate through the denoising model rather than the entire diffusion model evaluation, and policies that converge under Q-score matching are implicitly multi-modal and explorative in continuous domains. We conduct experiments in simulated environments to demonstrate the viability of our proposed method and compare to popular baselines. Source code is available from the project website: https://michaelpsenka.io/qsm.
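The claim that a score aligned with the Q action gradient yields multi-modal sampling can be illustrated with a small sketch. The code below runs plain Langevin dynamics with the score set to `ALPHA * grad_a Q(a)` for a toy critic with two equally good actions. This is an illustrative stand-in, not the paper's sampler: the critic, the constants `ALPHA` and `STEP`, the clipping range, and the function names are all assumptions.

```python
import math
import random

ALPHA = 10.0   # assumed inverse temperature; larger values sharpen the modes
STEP = 0.01    # assumed Langevin step size


def grad_a_q(a):
    """Action gradient of the toy critic Q(a) = -(a**2 - 1)**2, maximal at a = -1 and a = 1."""
    return -4.0 * a * (a * a - 1.0)


def sample_action(rng, n_steps=500):
    """Run one Langevin chain whose score is ALPHA * grad_a Q(a)."""
    a = rng.uniform(-1.5, 1.5)
    for _ in range(n_steps):
        drift = ALPHA * grad_a_q(a)
        a += STEP * drift + math.sqrt(2.0 * STEP) * rng.gauss(0.0, 1.0)
        a = max(-2.0, min(2.0, a))  # assumed bounded action range
    return a


def sample_batch(n=200, seed=0):
    """Draw n independent actions from the toy policy."""
    rng = random.Random(seed)
    return [sample_action(rng) for _ in range(n)]


if __name__ == "__main__":
    samples = sample_batch()
    left = sum(1 for a in samples if a < 0)
    print(f"{left} samples near -1, {len(samples) - left} samples near +1")
```

Because the stationary density of these dynamics is proportional to `exp(ALPHA * Q(a))`, samples gather around both maximizers rather than collapsing to one, which is the behavior a unimodal Gaussian policy cannot reproduce.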

Paper Structure (24 sections, 4 theorems, 34 equations, 9 figures, 1 algorithm)

Introduction
Related work
Diffusion models in reinforcement learning
Stochastic optimal control
Problem formulation
Notation
Definitions
Time discretization
Policy gradient for diffusion policies
Policy optimization via matching the score to the Q-function
Non-stochastic setting
Stochastic setting
Pedagogical reduction in gridworld
Experiments
Continuous Control Evaluation
...and 9 more sections

Key Result

Theorem 4.1

Consider the following joint deterministic dynamics governing the state $s(t) \in \mathbb{R}^s$ and action $a(t) \in \mathbb{R}^a$: […] where $\|\Psi(s, a)\|_2 \le C$ for all $(s, a)$, and $\Psi$ is Lipschitz with respect to $\|\cdot\|_2$. Denote by $s(t, s_0, a_0)$ the resulting state $s(t)$ from initial conditions $s_0, a_0$. Let $r: \mathbb{R}^s \to [0$ […]

Figures (9)

Figure 1: A visual description of Theorem 4.1 and Theorem 4.3, and the implied update rule for a policy $\pi$ parameterized by a diffusion model. The left image depicts a randomly initialized score $\Psi^0$, and the right image depicts the result $\Psi^1$ after one step of QSM. If there is any discrepancy between the score $\nabla_a \log(\pi(a|s))$ (orange vector, denoted $\Psi$ in the paper and optimized directly) and the action gradient $\nabla_a Q(s, a)$ (blue vector), we can forcefully align the score to the $Q$ action gradient to strictly increase the $Q$ value at $(s, a)$.
Figure 2: Pedagogical simulation of our algorithm's reduction to a simple single-goal gridworld setting. The top row is a visualization of two iterates of $\pi(a|s) \leftarrow e^{\alpha Q^\pi(s, a)} / \sum_{a'} e^{\alpha Q^\pi(s, a')}$, for $\alpha = 2$. The color of each square is the expected reward starting from that square, and we use the local maximizing direction to define discrete gradients: $\nabla_a Q(s, a) \coloneqq a^* Q(s, a^*)$, where $a^* \coloneqq \operatorname*{arg\,max}_{a'} Q(s, a')$, and similarly for $\nabla_a \log(\pi(a|s))$. The bottom row shows the effect of the parameter $\alpha$ on the entropy of the converged distribution $\pi^*(a|s)$. To the left is the learned policy with $\alpha = 1$, and to the right the learned policy with $\alpha = 10$.
Figure 3: Experimental results across a suite of eight continuous control tasks. QSM matches and sometimes outperforms TD3 and SAC performance on the tasks evaluated, particularly in samples needed to reach high rewards. Even though QSM trains on expressive diffusion models, it matches the sample efficiency of explicit Gaussian and tanh-parameterized models.
Figure 4: QSM can learn multi-modal policies. Samples from the policy are shown for the first state of a toy cartpole swingup task, where -1 and 1 are the initial actions of the two optimal trajectories.
Figure 5: Demonstration of multimodal action distributions due to QSM. Displayed are sampled actions from a successfully QSM-trained model and a successfully Diffusion-QL-trained model. Each figure displays 1,000 sampled actions at the given time step from the displayed initial condition of quadruped_walk, projected down to $\mathbb{R}^2$ using UMAP (McInnes et al., 2018). The two models are identical (including the sampling procedure) except for the method used to train the denoising submodel. This demonstrates that the diversity in sampling from QSM comes not from the diffusion model architecture, but from the training methodology itself.
...and 4 more figures
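The softmax update in the Figure 2 caption, $\pi(a|s) \leftarrow e^{\alpha Q^\pi(s, a)} / \sum_{a'} e^{\alpha Q^\pi(s, a')}$, and the effect of $\alpha$ on entropy can be sketched for a single gridworld state. The Q-values below are hypothetical numbers chosen for illustration, and `boltzmann_policy` and `entropy` are assumed helper names, not functions from the authors' code.

```python
import math


def boltzmann_policy(q_values, alpha):
    """Return pi(a|s) proportional to exp(alpha * Q(s, a)) over a discrete action set."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp(alpha * (q - m)) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical Q(s, a) for the four moves (up, down, left, right) at one state.
Q_VALUES = [0.2, 0.5, 0.9, 0.4]

if __name__ == "__main__":
    for alpha in (1.0, 2.0, 10.0):
        pi = boltzmann_policy(Q_VALUES, alpha)
        print(alpha, [round(p, 3) for p in pi], round(entropy(pi), 3))
```

Raising `alpha` concentrates probability on the highest-valued action and lowers the entropy, which matches the caption's contrast between the $\alpha = 1$ and $\alpha = 10$ policies.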

Theorems & Definitions (8)

Theorem 4.1: Optimality condition, deterministic setting
Corollary 4.2
Theorem 4.3: Optimality condition, stochastic setting
Theorem 2.1
proof
proof
proof
Definition 2.2
