Reparameterization Proximal Policy Optimization
Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang
TL;DR
Reparameterization Policy Gradient (RPG) achieves high sample efficiency by backpropagating through differentiable dynamics, but it under-utilizes the expensive dynamics Jacobians it computes and is prone to training instability. This paper introduces Reparameterization Proximal Policy Optimization (RPO), which shows that RPG under sample reuse optimizes a PPO-style surrogate objective, and which stabilizes updates with a gradient clipping mechanism tailored to RPG and explicit KL regularization. RPO reuses action gradients across on- and off-policy updates and demonstrates state-of-the-art performance and stable training on five differentiable tasks, with faster training and improved final performance. The results highlight the practical value of principled sample reuse for RPG-based policies in robotics and differentiable simulators.
Abstract
By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no principled framework for it exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently achieves or exceeds state-of-the-art performance across diverse tasks.
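To make the phrase "backpropagating through differentiable dynamics" concrete, the following is a minimal sketch of a reparameterization gradient computed by Backpropagation Through Time on a toy one-dimensional problem. Everything in it is an illustrative assumption rather than the paper's setup: the linear policy, the additive dynamics, the quadratic reward, and the function name `rollout_return_and_grad` are all hypothetical. It shows plain RPG only and does not implement RPO's surrogate objective, clipping, or KL regularization.

```python
import random


def rollout_return_and_grad(theta, s0, noises, sigma=0.1):
    """Return (J, dJ/dtheta) for a toy 1-D task with fixed noise samples.

    Assumed toy setup (not taken from the paper):
      policy:   a_t = theta * s_t + sigma * eps_t   (reparameterized action)
      dynamics: s_{t+1} = s_t + a_t                 (differentiable)
      reward:   r_t = -s_{t+1} ** 2
    """
    # Forward pass: keep every state, the backward pass needs them.
    states = [s0]
    total = 0.0
    for eps in noises:
        s = states[-1]
        a = theta * s + sigma * eps
        s_next = s + a
        total += -(s_next ** 2)
        states.append(s_next)

    # Backward pass (BPTT). g_future is the return gradient flowing into
    # s_{t+1} from all later time steps.
    grad_theta = 0.0
    g_future = 0.0
    for t in reversed(range(len(noises))):
        s = states[t]
        s_next = states[t + 1]
        # Total dJ/ds_{t+1}: immediate reward term plus future terms.
        g_next = -2.0 * s_next + g_future
        # s_{t+1} = (1 + theta) * s_t + sigma * eps_t, so the direct
        # partial derivative with respect to theta is s_t.
        grad_theta += g_next * s
        # The dynamics Jacobian ds_{t+1}/ds_t = 1 + theta carries the
        # gradient one step further back in time.
        g_future = g_next * (1.0 + theta)
    return total, grad_theta


if __name__ == "__main__":
    random.seed(0)
    noises = [random.gauss(0.0, 1.0) for _ in range(5)]
    J, g = rollout_return_and_grad(theta=-0.5, s0=1.0, noises=noises)
    print("return:", J, "gradient:", g)
```

Because the noise samples are held fixed, the return is a deterministic, differentiable function of `theta`, which is what allows the gradient to pass through the dynamics. The factor `1.0 + theta` plays the role of the dynamics Jacobian in this toy case; in a real differentiable simulator that Jacobian is the expensive quantity that sample reuse aims to exploit more fully.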
