Reparameterization Proximal Policy Optimization

Hai Zhong, Xun Wang, Zhuoran Li, Longbo Huang

TL;DR

Reparameterization Policy Gradient (RPG) offers high sample efficiency by backpropagating through differentiable dynamics, but suffers from underutilized Jacobians and instability. This paper introduces Reparameterization Proximal Policy Optimization (RPO), which binds RPG to a PPO-like surrogate under principled sample reuse, and stabilizes updates with a tailored gradient clipping mechanism and explicit KL regularization. RPO reuses action-gradients across on- and off-policy updates, enabling efficient learning, and demonstrates state-of-the-art performance and robust stability on five differentiable tasks. The results highlight the practical impact of principled sample reuse for RPG-based policies in robotics and differentiable simulators, achieving faster training times and improved final performance. Overall, RPO provides a principled framework that combines sample efficiency, stability, and strong performance for RPG in continuous control domains.

Abstract

By leveraging differentiable dynamics, Reparameterization Policy Gradient (RPG) achieves high sample efficiency. However, current approaches are hindered by two critical limitations: the under-utilization of computationally expensive dynamics Jacobians and inherent training instability. While sample reuse offers a remedy for under-utilization, no prior principled framework exists, and naive attempts risk exacerbating instability. To address these challenges, we propose Reparameterization Proximal Policy Optimization (RPO). We first establish that under sample reuse, RPG naturally optimizes a PPO-style surrogate objective via Backpropagation Through Time, providing a unified framework for both on- and off-policy updates. To further ensure stability, RPO integrates a clipped policy gradient mechanism tailored for RPG and employs explicit Kullback-Leibler divergence regularization. Experimental results demonstrate that RPO maintains superior sample efficiency and consistently outperforms or achieves state-of-the-art performance across diverse tasks.
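
The explicit KL regularization mentioned above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian policy and a fixed penalty weight `beta`. Both are illustrative assumptions: the abstract does not specify the policy family, the direction of the KL term, or how the penalty is weighted. The function names are hypothetical.

```python
import math


def diag_gaussian_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) between two diagonal Gaussians.

    Each argument is a list with one entry per action dimension.
    """
    kl = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def regularized_loss(surrogate, kl, beta=0.1):
    """Loss to minimise: negative surrogate plus a KL penalty.

    `beta` is a hypothetical penalty weight, not a value from the paper.
    """
    return -surrogate + beta * kl


# Old policy N(0, 1) versus a candidate policy whose mean has shifted to 1.
kl = diag_gaussian_kl([0.0], [1.0], [1.0], [1.0])
loss = regularized_loss(surrogate=2.0, kl=kl, beta=0.1)
```

The penalty grows with the distance between the updated policy and the behavior policy, which is the quantity whose spikes Figure 1 associates with performance drops.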

Paper Structure

This paper contains 44 sections, 1 theorem, 25 equations, 14 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Let the state $s_k$ be sampled from the state distribution of the behavior policy, $s_k \sim p(s_k=s|\pi_{\theta_{\text{old}}})$. Let the reparameterization noise $\epsilon$ be sampled from the standard Normal distribution, $\epsilon \sim p_{\text{std}}(\epsilon)$, and define the regenerated noise a where the action $a_k = f_{\theta_{\text{old}}}(\epsilon;s_k) = f_{\theta'}(\epsilon_{\text{reg}};s
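
As a worked illustration of the regenerated noise, suppose the policy uses the common Gaussian reparameterization below. This form is an assumption; the excerpt does not state how $f_\theta$ is parameterized, and $\mu_\theta$, $\sigma_\theta$ are hypothetical names for its mean and standard deviation.

$$f_{\theta}(\epsilon; s) = \mu_{\theta}(s) + \sigma_{\theta}(s)\,\epsilon$$

Requiring the new policy $\theta'$ to reproduce the same action $a_k$ at the same state $s_k$ then gives the regenerated noise in closed form:

$$\epsilon_{\text{reg}} = \frac{a_k - \mu_{\theta'}(s_k)}{\sigma_{\theta'}(s_k)} = \frac{\mu_{\theta_{\text{old}}}(s_k) + \sigma_{\theta_{\text{old}}}(s_k)\,\epsilon - \mu_{\theta'}(s_k)}{\sigma_{\theta'}(s_k)}$$

Under this assumption, $\epsilon_{\text{reg}} = \epsilon$ whenever $\theta' = \theta_{\text{old}}$, so the on-policy update is recovered as a special case.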

Figures (14)

  • Figure 1: An example of SAPO's training instability in the Humanoid task. Large spikes in KL divergence (both smoothed and raw curves are shown on the right) correspond to sudden performance drops. Additional examples of training instability for both SAPO and SHAC are provided in the appendix.
  • Figure 2: Computing the reparameterization policy gradient of the surrogate objective involves three steps: (a) Action-gradients are computed from rollouts via a single backward pass and cached. (b) These gradients are used directly for the initial, on-policy update. (c) For subsequent off-policy updates, the cached action-gradients are importance-weighted by $\rho(\theta')$ and reused, enabling stable sample reuse. Note that we only plot the trajectory for $H$ steps for illustration purposes.
  • Figure 3: Training performance comparison of RPO, SAPO, SHAC, and PPO. Each plot shows the mean episode return over environment steps, with the shaded region representing the standard deviation. All curves are smoothed with a 100-episode moving average.
  • Figure 4: Ablation study of RPO's components. The plot shows training curves for three variants: RPO without KL regularization, RPO with only one policy update epoch (i.e., no sample reuse), and RPO without the clipping mechanism.
  • Figure 5: Comparison of wall-clock training time. RPO ($10$M environment steps) completes training in $\sim 81$ minutes, compared to $\sim 313$ minutes for SHAC ($40$M environment steps). This highlights that the slight computational overhead of sample reuse is far outweighed by the substantial boost in overall training efficiency.
  • ...and 9 more figures
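
The three steps in the Figure 2 caption can be sketched on a one-dimensional toy problem. Everything here is an illustrative assumption rather than the paper's implementation: the linear-mean Gaussian policy `a = theta * s + STD * eps`, the quadratic reward whose derivative stands in for the backward pass through the simulator, the single-step likelihood ratio used for $\rho(\theta')$, and all function names. Clipping and KL regularization are omitted.

```python
import math

STD = 0.5  # fixed policy standard deviation (assumption)


def log_prob(a, mean, std=STD):
    """Log-density of a Gaussian policy at action a."""
    return -0.5 * ((a - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def cache_action_gradients(theta_old, states, noises, target=1.0):
    """Step (a): roll out the behavior policy once and cache dR/da per sample."""
    cache = []
    for s, eps in zip(states, noises):
        a = theta_old * s + STD * eps      # reparameterized action
        action_grad = -2.0 * (a - target)  # analytic dR/da for the toy reward R(a) = -(a - target)^2
        cache.append((s, a, action_grad))
    return cache


def policy_gradient(theta_new, theta_old, cache):
    """Steps (b) and (c): weight cached action-gradients by rho and chain through df/dtheta."""
    total = 0.0
    for s, a, action_grad in cache:
        rho = math.exp(log_prob(a, theta_new * s) - log_prob(a, theta_old * s))
        total += rho * action_grad * s     # df/dtheta = s for this linear-mean policy
    return total / len(cache)


theta_old = 0.2
cache = cache_action_gradients(theta_old, states=[1.0, 2.0], noises=[0.1, -0.3])

theta = theta_old
for epoch in range(3):                     # several update epochs reuse the same cache
    theta += 0.05 * policy_gradient(theta, theta_old, cache)
```

In the first epoch `theta` equals `theta_old`, so every weight is exactly 1 and the cached gradients are used directly. Later epochs reweight the same cache without recomputing the reward derivative, which is the expensive quantity in a differentiable simulator.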

Theorems & Definitions (2)

  • Proposition 1
  • proof