Diffusion Policy through Conditional Proximal Policy Optimization

Ben Liu, Shunpeng Yang, Hua Chen

TL;DR

This work proposes a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability, and can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies.

Abstract

Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multimodal behaviors, enabling more diverse and flexible action generation than the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge remains: computing the action log-likelihood under the diffusion model is difficult. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process of the diffusion model, which can be inefficient in both memory and computation. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, a paradigm distinct from previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
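
To make concrete why a closed-form Gaussian probability is convenient in an on-policy update, the following is a minimal sketch of the standard PPO clipped surrogate term computed from diagonal Gaussian log-densities. It is not the paper's implementation: the function names (`gaussian_log_prob`, `ppo_clipped_term`), the diagonal-covariance assumption, and the default `clip_eps=0.2` are illustrative choices, and how the method conditions its Gaussian kernel is not shown here.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian N(mean, diag(std^2)) at `action`."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def ppo_clipped_term(action, advantage, old_mean, old_std,
                     new_mean, new_std, clip_eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    Assumption: both the old and the new policy are diagonal Gaussians,
    so the probability ratio needs only two closed-form log-densities.
    """
    log_ratio = (gaussian_log_prob(action, new_mean, new_std)
                 - gaussian_log_prob(action, old_mean, old_std))
    ratio = math.exp(log_ratio)
    # Clip the ratio to [1 - eps, 1 + eps], then take the pessimistic bound.
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Each evaluation costs only a handful of arithmetic operations per action dimension and requires no backpropagation through a denoising chain, which is the kind of saving the abstract refers to.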

Paper Structure (21 sections, 25 equations, 11 figures, 14 tables, 1 algorithm)

Figures (11)

  • Figure 1: Overview of the proposed method. We align reinforcement learning policy iteration with the diffusion generative process through a novel policy parameterization. Unlike standard diffusion, where the probability density function is updated via a pre-defined SDE Euler–Maruyama step, our method employs conditional PPO to determine the Gaussian kernel for policy updates.
  • Figure 2: Left. Multi-Goal environments, where different trajectories starting from the origin under the diffusion policy are illustrated. Contour lines in the figure are drawn based on the distance cost. Right. The positions after taking the first action from different saddle points are shown, visualizing the distribution of the policy $\pi(a|s)$. The diffusion policy exhibits multimodal behavior, whereas the Gaussian policy collapses to near-zero movement due to the averaging effect of opposite goals.
  • Figure 3: Rewards on Playground FingerSpin.
  • Figure 4: Training rewards across eight environments in IsaacLab. Results show the mean and standard deviation over 5 runs with different seeds (higher is better). For better visualization, we have smoothed the reward data. The proposed method outperforms Gaussian PPO in most tasks.
  • Figure 5: Ablation study on score-based regularization.
  • ...and 6 more figures