Policy Representation via Diffusion Probability Model for Reinforcement Learning

Long Yang; Zhixiong Huang; Fenghao Lei; Yucun Zhong; Yiming Yang; Cong Fang; Shiting Wen; Binbin Zhou; Zhouchen Lin

TL;DR

This work introduces policy representation via diffusion probability models to overcome the expressiveness limits of unimodal policies in reinforcement learning. It develops a theoretical foundation in which the diffusion policy is defined by forward and reverse SDEs, with score-based guidance and an exponential-integrator discretization, and proves a finite-time convergence bound under mild conditions. To apply this to online model-free RL, the authors propose DIPO, which improves the policy through action gradients and learns the diffusion policy from experience with a denoising-score-matching loss. Empirical results on MuJoCo show DIPO achieving strong performance with faster initial gains and robust exploration, supported by state-visitation visualizations and by ablations against VAE and MLP baselines and over the reverse length K. Overall, the paper provides both theoretical and practical foundations for diffusion-based policy representations in online RL, demonstrating their potential for multimodal, exploratory policies in continuous control tasks.
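The two training ingredients named in the summary, an action-gradient improvement step and a denoising-score-matching loss, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`action_gradient_step`, `perturb`, `dsm_loss`), the noise-prediction parametrization, the uniform sampling of diffusion time, and the Ornstein-Uhlenbeck perturbation schedule are all assumptions made for the example.

```python
import numpy as np

def action_gradient_step(actions, grad_q, eta):
    """Move stored actions along the critic's action gradient: a <- a + eta * dQ/da.

    `grad_q` is a hypothetical callable returning dQ/da for a batch of actions.
    """
    return actions + eta * grad_q(actions)

def perturb(actions, t, noise):
    """Assumed forward OU perturbation: a_t = e^{-t} a_0 + sqrt(1 - e^{-2t}) z."""
    return np.exp(-t) * actions + np.sqrt(1.0 - np.exp(-2.0 * t)) * noise

def dsm_loss(eps_model, states, actions, rng, T=1.0):
    """Denoising score matching in noise-prediction form.

    The model is asked to predict the injected noise z; under the assumed
    schedule the conditional score is -z / sqrt(1 - e^{-2t}).
    """
    n = actions.shape[0]
    t = rng.uniform(1e-3, T, size=(n, 1))      # one diffusion time per sample
    z = rng.standard_normal(actions.shape)     # injected Gaussian noise
    a_t = perturb(actions, t, z)
    pred = eps_model(a_t, states, t)
    return float(np.mean(np.sum((pred - z) ** 2, axis=-1)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = np.array([1.0, -1.0])             # toy optimum of a quadratic critic
    actions = rng.standard_normal((256, 2))
    states = np.zeros((256, 3))
    # Toy critic Q(s, a) = -||a - target||^2, so dQ/da = -2 (a - target).
    improved = action_gradient_step(actions, lambda a: -2.0 * (a - target), eta=0.25)
    # Placeholder predictor that always outputs zeros (stands in for a network).
    loss = dsm_loss(lambda a_t, s, t: np.zeros_like(a_t), states, improved, rng)
    print("loss with a zero predictor:", loss)
```

In a real training loop the improved actions would replace the stored ones, and a neural network would take the place of the zero predictor and be trained to minimize this loss.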

Abstract

Popular reinforcement learning (RL) algorithms tend to produce a unimodal policy distribution, which limits the expressiveness of complicated policies and weakens exploration. The diffusion probability model is a powerful tool for learning complicated multimodal distributions and has shown promising applications to RL. In this paper, we formally build a theoretical foundation for policy representation via the diffusion probability model and provide practical implementations of diffusion policy for online model-free RL. Concretely, we characterize diffusion policy as a stochastic process, which is a new approach to representing a policy. We then present a convergence guarantee for diffusion policy, which provides a theory for understanding its multimodality. Furthermore, we propose DIPO, an implementation for model-free online RL with DIffusion POlicy. To the best of our knowledge, DIPO is the first algorithm to solve model-free online RL problems with the diffusion model. Finally, extensive empirical results show the effectiveness and superiority of DIPO on the standard continuous-control MuJoCo benchmark.

Paper Structure (71 sections, 17 theorems, 180 equations, 20 figures, 3 tables, 4 algorithms)

Introduction
Our Main Work
Paper Organization
Reinforcement Learning
Motivation: A View from Policy Representation
Policy Representation for Reinforcement Learning
Policy Representation via Value Function
Policy Representation via Parametric Function
Policy Representation via Stochastic Process
Diffusion Model is Powerful to Policy Representation
Diffusion Policy
Stochastic Dynamics of Diffusion Policy
Exponential Integrator Discretization for Diffusion Policy
Convergence Analysis of Diffusion Policy
DIPO: Implementation of Diffusion Policy for Model-Free Online RL
...and 56 more sections

Key Result

Theorem 4.3

For a given state $\mathbf{s}$, let $\{{\color{red}{\bar{\pi}}}_{t}(\cdot|\mathbf{s})\}_{t=0:T}$ and $\{{\color{orange}\tilde{\pi}}_{t}(\cdot|\mathbf{s})\}_{t=0:T}$ be the distributions along the flow (def:diffusion-policy-sde-forward-process-01) and (def:diffusion-policy-sde-reverse-process) corres

Figures (20)

Figure 1: Diffusion Policy: Policy Representation via Stochastic Process. For a given state $\mathbf{s}$, the forward stochastic process $\{{\color{red}{\bar{\mathbf{a}}}}_{t}|\mathbf{s}\}$ maps the input ${\color{red}{\bar{\mathbf{a}}}}_{0}=:\mathbf{a}\sim\pi(\cdot|\mathbf{s})$ to noise; we then recover the input via the stochastic process $\{{\color{orange}\tilde{\mathbf{a}}}_{t}|\mathbf{s}\}$, which follows the reversed SDE, provided we know the score function $\bm{\nabla} \log p_{t}(\cdot)$, where $p_{t}(\cdot)$ is the probability distribution of the forward process, i.e., $p_{t}(\cdot)={\color{red}{\bar{\pi}}}_{t}(\cdot|\mathbf{s})$.
Figure 2: Standard Training Framework for Model-free Online RL.
Figure 3: Framework of DIPO: Implementation for Model-free Online RL with DIffusion POlicy.
Figure 4: Unimodal Distribution vs Multimodal Distribution.
Figure 5: Policy representation comparison of different policies on multimodal environment.
...and 15 more figures
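The forward and reverse construction described in the Figure 1 caption can be illustrated with a one-dimensional toy case. The sketch below assumes an Ornstein-Uhlenbeck forward process, $d\bar{a}_t = -\bar{a}_t\,dt + \sqrt{2}\,dw_t$, and a Gaussian target $\mathcal{N}(\mu, \sigma^2)$ so that the score of every forward marginal is available in closed form. The update rule is a generic exponential-integrator step derived under those assumptions (the score is frozen over each step and the remaining linear SDE is solved exactly); it is not claimed to match the paper's update term by term, and the names `gaussian_score` and `reverse_sample` are hypothetical.

```python
import numpy as np

def gaussian_score(x, t, mu, sigma):
    """Score of the forward OU marginal p_t when p_0 = N(mu, sigma^2).

    Under the assumed forward process, p_t = N(mu e^{-t}, sigma^2 e^{-2t} + 1 - e^{-2t}).
    """
    mean_t = mu * np.exp(-t)
    var_t = sigma**2 * np.exp(-2.0 * t) + 1.0 - np.exp(-2.0 * t)
    return -(x - mean_t) / var_t

def reverse_sample(n, mu, sigma, T=5.0, K=500, seed=0):
    """Draw n samples by integrating the reverse SDE
    dx = (x + 2 * score_{T - tau}(x)) d tau + sqrt(2) dw
    with K exponential-integrator steps of size h = T / K.
    """
    rng = np.random.default_rng(seed)
    h = T / K
    x = rng.standard_normal(n)            # start from the N(0, 1) prior
    for k in range(K):
        t = T - k * h                     # forward time matching reverse step k
        s = gaussian_score(x, t, mu, sigma)
        z = rng.standard_normal(n)
        # Exact solution of dx = (x + 2 s) d tau + sqrt(2) dw with s held fixed.
        x = np.exp(h) * x + 2.0 * (np.exp(h) - 1.0) * s + np.sqrt(np.exp(2.0 * h) - 1.0) * z
    return x

if __name__ == "__main__":
    samples = reverse_sample(20000, mu=2.0, sigma=0.5)
    print("sample mean:", samples.mean(), "sample std:", samples.std())
```

With a learned score network in place of `gaussian_score`, the same loop would act as an action sampler conditioned on the state; here the closed-form score only serves to check that the reverse process recovers the target distribution.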

Theorems & Definitions (34)

Theorem 4.3: Finite-time Analysis of Diffusion Policy
Proposition B.1
proof
Proposition B.2
proof
Proposition B.3: donsker1983asymptotic
Proposition B.4
proof
Proposition B.5
proof
...and 24 more
