GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning

Shutong Ding, Ke Hu, Shan Zhong, Haoyang Luo, Weinan Zhang, Jingya Wang, Jun Wang, Ye Shi

TL;DR

GenPO bridges diffusion-based policies with on-policy RL by enabling exact state-action likelihoods through exact diffusion inversion and a doubled dummy action scheme. It provides unbiased entropy and KL estimates to support entropy regularization and adaptive learning rates within PPO, and it introduces a compression term to stabilize training in the expanded action space. Empirically, GenPO achieves state-of-the-art performance across eight IsaacLab robotic benchmarks, showcasing improved sample efficiency and convergence over prior diffusion-based and standard RL methods. This work unlocks scalable on-policy diffusion policy training for large GPU-parallel simulators and real-world robotics, with explicit avenues for improving computational efficiency in future work.
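Once exact log-likelihoods are available, the entropy and KL terms mentioned above can be estimated from samples and fed into a PPO-style learning-rate schedule. The following is a minimal sketch, not the authors' implementation: the function names, the thresholds (2x and 0.5x the target KL), the factor 1.5, and the learning-rate bounds are all assumptions, modeled on a heuristic commonly seen in PPO codebases.

```python
import numpy as np

def estimate_entropy(logp):
    """Monte Carlo entropy estimate: H(pi) ~ -mean(log pi(a|s)) for a ~ pi."""
    return float(-np.mean(np.asarray(logp, dtype=float)))

def estimate_kl(logp_old, logp_new):
    """Monte Carlo estimate of KL(pi_old || pi_new) using actions sampled from pi_old."""
    diff = np.asarray(logp_old, dtype=float) - np.asarray(logp_new, dtype=float)
    return float(np.mean(diff))

def adapt_learning_rate(lr, kl, target_kl=0.01, factor=1.5, lr_min=1e-5, lr_max=1e-2):
    """Hypothetical KL-adaptive rule: shrink lr when the policy moved too far, grow it when it barely moved."""
    if kl > 2.0 * target_kl:
        return max(lr_min, lr / factor)
    if 0.0 < kl < 0.5 * target_kl:
        return min(lr_max, lr * factor)
    return lr

# Usage: log-likelihoods of the same sampled actions under the old and updated policy.
logp_old = np.array([-1.2, -0.8, -1.5])
logp_new = np.array([-1.3, -0.9, -1.4])
kl = estimate_kl(logp_old, logp_new)
lr = adapt_learning_rate(1e-3, kl)
```

Both estimators depend only on per-sample log-likelihoods, which is why the exact likelihood computation matters: with a biased or approximate likelihood, the entropy bonus and the KL trigger would be biased too.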

Abstract

Recent advances in reinforcement learning (RL) have demonstrated the powerful exploration capabilities and multimodality of generative diffusion-based policies. While substantial progress has been made in offline RL and off-policy RL settings, integrating diffusion policies into on-policy frameworks like PPO remains underexplored. This gap is particularly significant given the widespread use of large-scale parallel GPU-accelerated simulators, such as IsaacLab, which are optimized for on-policy RL algorithms and enable rapid training of complex robotic tasks. A key challenge lies in computing state-action log-likelihoods under diffusion policies, which is straightforward for Gaussian policies but intractable for flow-based models due to irreversible forward-reverse processes and discretization errors (e.g., Euler-Maruyama approximations). To bridge this gap, we propose GenPO, a generative policy optimization framework that leverages exact diffusion inversion to construct invertible action mappings. GenPO introduces a novel doubled dummy action mechanism that enables invertibility via alternating updates, resolving log-likelihood computation barriers. We further use the action log-likelihood for unbiased entropy and KL divergence estimation, enabling KL-adaptive learning rates and entropy regularization in on-policy updates. Extensive experiments on eight IsaacLab benchmarks, including legged locomotion (Ant, Humanoid, Anymal-D, Unitree H1, Go2), dexterous manipulation (Shadow Hand), aerial control (Quadcopter), and robotic arm tasks (Franka), demonstrate GenPO's superiority over existing RL baselines. Notably, GenPO is the first method to successfully integrate diffusion policies into on-policy RL, unlocking their potential for large-scale parallelized training and real-world robotic deployment.
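To illustrate why alternating updates on a doubled action make the mapping exactly invertible, here is a minimal sketch under stated assumptions. It is not GenPO's actual update rule: the additive coupling form, the `velocity` stand-in for a learned network, the weight matrices `Wx` and `Wy`, and the step count are all hypothetical. Each half of the doubled variable is updated using only the other half, so every step can be undone exactly by subtracting in reverse order, with no discretization error.

```python
import numpy as np

def velocity(z, state, t, W):
    """Hypothetical stand-in for a learned network conditioned on state and time."""
    return np.tanh(W @ np.concatenate([z, state, [t]]))

def forward(x, y, state, Wx, Wy, steps=5):
    """Map noise (x, y) to a doubled action by alternating additive updates."""
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        x = x + h * velocity(y, state, t, Wx)  # x moves using y only
        y = y + h * velocity(x, state, t, Wy)  # y moves using the new x only
    return x, y

def inverse(x, y, state, Wx, Wy, steps=5):
    """Undo forward() exactly by reversing the order and sign of each update."""
    h = 1.0 / steps
    for k in reversed(range(steps)):
        t = k * h
        y = y - h * velocity(x, state, t, Wy)
        x = x - h * velocity(y, state, t, Wx)
    return x, y

def log_prob(x, y, state, Wx, Wy, steps=5):
    """Log-density of the doubled action under a standard normal base distribution.

    Each additive update is a shear with unit Jacobian determinant, so the
    change-of-variables correction is zero in this simplified sketch.
    """
    x0, y0 = inverse(x, y, state, Wx, Wy, steps)
    z = np.concatenate([x0, y0])
    return float(-0.5 * (z @ z) - 0.5 * z.size * np.log(2.0 * np.pi))

# Usage with random weights (action dimension 3, state dimension 4).
rng = np.random.default_rng(0)
d, m = 3, 4
Wx = rng.normal(size=(d, d + m + 1))
Wy = rng.normal(size=(d, d + m + 1))
state = rng.normal(size=m)
x0, y0 = rng.normal(size=d), rng.normal(size=d)
ax, ay = forward(x0, y0, state, Wx, Wy)
rx, ry = inverse(ax, ay, state, Wx, Wy)
```

A single-variable Euler step `x = x + h * f(x)` cannot be inverted in closed form because the update depends on the value being overwritten; splitting the variable in two removes that dependence, which is the role the doubled dummy action plays in this sketch.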

Paper Structure

This paper contains 28 sections, 17 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Existing diffusion-based reinforcement learning algorithms mainly target off-policy (middle) and offline (right) RL. In off-policy RL, the gradient of the Q function can generally be used to update the diffusion policy; in offline RL, the agent is trained on offline data. For on-policy algorithms (left), the challenge remains that the log-likelihood of a diffusion policy cannot be obtained.
  • Figure 2: Forward and reverse processes of GenPO. The forward process samples an action given a state; the reverse process computes the probability density of a given state-action pair. Notably, the forward and reverse processes are invertible.
  • Figure 3: Learning curves across 8 IsaacLab benchmarks. Results are averaged over 5 runs. The x-axis denotes training epochs, and the y-axis shows average episodic return with one standard deviation shaded.
  • Figure 4: Ablation study results. (a) Effect of varying the compression loss coefficient $\nu$ on training stability and final performance. (b) Impact of entropy and learning rate adaptation on exploration and convergence. (c) Performance under different mixing coefficients $p$ in flow policies.
  • Figure 5: Visualizations of the eight IsaacLab benchmarks; images from https://isaac-sim.github.io/IsaacLab/main/source/overview/environments.html. From (a) to (h): Isaac-Ant-v0, Isaac-Humanoid-v0, Isaac-Lift-Cube-Franka-v0, Isaac-Quadcopter-Direct-v0, Isaac-Velocity-Flat-Anymal-D-v0, Isaac-Velocity-Rough-Unitree-Go2-v0, Isaac-Velocity-Rough-H1-v0, and Isaac-Repose-Cube-Shadow-Direct-v0.
  • ...and 7 more figures

Theorems & Definitions (1)

  • proof