ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

Hanyong Wang, Menglong Yang

TL;DR

ExO-PPO introduces a principled extension of PPO that incorporates off-policy data through an Extended Off-Policy Improvement Lower Bound and a novel Extended Ratio Objective with an Exponentially Decaying Edge. By maintaining a replay buffer of the last $M$ policies and sampling trajectories as cohesive units, the method improves sample efficiency while restraining policy drift via a wider but controlled gradient space. The approach is evaluated on Atari and MuJoCo tasks, showing enhanced performance and stability relative to PPO and other variants, and demonstrates applicability to both online and offline settings. The work advances practical reinforcement learning by enabling stable, efficient learning from diverse data sources with minimal changes to existing PPO workflows.

Abstract

Deep reinforcement learning has solved a wide range of tasks, yet the construction of the policy gradient and the training dynamics make deep reinforcement learning models hard to tune. As one of the most successful deep reinforcement learning algorithms, Proximal Policy Optimization (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. Off-policy methods, on the other hand, make fuller use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, we propose a new PPO variant that combines the stability guarantee of conservative on-policy iteration with more efficient off-policy data utilization. Specifically, we first derive an extended off-policy improvement lower bound from an expectation form of the generalized policy improvement lower bound. Second, we extend the clipping mechanism with segmented exponential functions to obtain a suitable surrogate objective function. Third, the trajectories generated by the past $M$ policies are organized in the replay buffer for off-policy training. We refer to this method as Extended Off-policy Proximal Policy Optimization (ExO-PPO). Compared with PPO and other state-of-the-art variants, ExO-PPO demonstrates improved performance with balanced sample efficiency and stability on varied tasks in empirical experiments.
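
The replay buffer described in the abstract, which holds the trajectories of the past $M$ policies as cohesive units, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the class name `PolicyReplayBuffer`, its methods, the uniform sampling over stored trajectories, and the toy `(state, action, reward)` tuples are all assumptions made for the example.

```python
import random
from collections import deque


class PolicyReplayBuffer:
    """Hypothetical buffer: keeps trajectories of the M most recent policies."""

    def __init__(self, num_policies):
        # deque(maxlen=M) drops the oldest policy's whole group on overflow.
        self.groups = deque(maxlen=num_policies)

    def add_policy_data(self, policy_id, trajectories):
        # All trajectories collected under one policy enter and leave together.
        self.groups.append((policy_id, list(trajectories)))

    def policy_ids(self):
        return [pid for pid, _ in self.groups]

    def sample(self, batch_size, rng=random):
        # Assumed scheme: draw whole trajectories uniformly across stored policies.
        pool = [(pid, traj) for pid, trajs in self.groups for traj in trajs]
        return rng.sample(pool, min(batch_size, len(pool)))


buffer = PolicyReplayBuffer(num_policies=3)
for policy_id in range(5):
    # Toy data: two trajectories of four (state, action, reward) tuples each.
    trajs = [[(policy_id, step, 1.0) for step in range(4)] for _ in range(2)]
    buffer.add_policy_data(policy_id, trajs)

print(buffer.policy_ids())  # [2, 3, 4]
```

Evicting by policy rather than by individual transition means the data in the buffer never comes from more than $M$ distinct behavior policies, which is the property the method relies on to bound distribution shift.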

Paper Structure

This paper contains 30 sections, 5 theorems, 21 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

For the current policy $\pi_t$, every training policy $\pi$ satisfies a policy improvement lower bound in which $C^{\pi,\pi_t}=\max_{s\in S} \bigl| \mathbb{E}_{a\sim \pi(\cdot|s)}[A^{\pi_t}(s,a)] \bigr|$ and $\mathrm{TV}(\pi,\pi_t)(s)$ denotes the total variation distance between the distributions $\pi(\cdot|s)$ and $\pi_t(\cdot|s)$.
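
The inequality itself is not reproduced above. For orientation, the policy improvement lower bound with these definitions of $C^{\pi,\pi_t}$ and $\mathrm{TV}$ is commonly stated in the constrained and generalized policy optimization literature in the form below. The expected return $J$, the discount factor $\gamma$, and the discounted state visitation distribution $d^{\pi_t}$ are notation assumed here, and the paper's exact statement may differ:

$$
J(\pi) - J(\pi_t) \;\ge\; \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^{\pi_t},\, a\sim \pi_t(\cdot|s)}\!\left[ \frac{\pi(a|s)}{\pi_t(a|s)}\, A^{\pi_t}(s,a) \right] \;-\; \frac{2\gamma\, C^{\pi,\pi_t}}{(1-\gamma)^2}\, \mathbb{E}_{s\sim d^{\pi_t}}\!\left[ \mathrm{TV}(\pi,\pi_t)(s) \right]
$$

For a discrete action space, $\mathrm{TV}(\pi,\pi_t)(s) = \tfrac{1}{2}\sum_{a} \bigl| \pi(a|s) - \pi_t(a|s) \bigr|$. The first term is the surrogate objective being maximized, and the second is the penalty that grows as $\pi$ moves away from $\pi_t$.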

Figures (9)

  • Figure 1: Intuitive comparison between DRL sampling-and-training patterns. (a): The on-policy pattern samples from the current policy, which usually requires $N$ parallel online interactions to generate enough samples for training. (b): The off-policy pattern stores individual state-action transitions from distinct periods in a limited-size, queue-like replay buffer. Reusing past samples improves sample efficiency, but the shift between policies becomes a thorny issue. (c): Rather than limiting the number of individual entries in the replay buffer, our Extended Off-policy sampling pattern treats the trajectories generated by a given policy as a cohesive unit. This can improve sample efficiency while restricting the distribution shift to that of a finite number $M$ of prior policies.
  • Figure 2: Left: The optimization objective functions $L$ of the stochastic policy gradient versus the probability ratio $r={\pi}/{\pi_{t-i}}$ when $\hat{A}>0$. Right: The corresponding gradients $\nabla_r L$ of these objective functions with respect to $r$. Compared with the original PPO objective (blue dash-dot line) and the Scopic objective [TheSuffi] with $\tau = 2$ (purple dashed line), our proposed ExO-PPO objectives with different $\alpha$ (the cluster of solid lines) are smooth curves with continuous first-order derivatives. The point of central symmetry $(1,\hat{A})$ is the on-policy point of the optimization.
  • Figure 3: Average performance throughout training across Atari games. Shading denotes the range over multiple runs, and solid lines show the mean return of each algorithm. ExO-PPO (the blue line) outperforms the other algorithms on most tasks in both the early and final stages.
  • Figure 4: Average performance throughout training across continuous MuJoCo tasks.
  • Figure 5: Absolute logarithm of probability ratio throughout training across Atari games. Red vertical dashed lines represent the absolute logarithm of the probability ratio $y=0.2$, which is an indicator of whether the surrogate objective restrains the disparity between training updates.
  • ...and 4 more figures
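
The quantities that the captions of Figures 2 and 5 refer to can be made concrete with a short sketch. It covers only the standard PPO clipped objective (the baseline curve in Figure 2), its derivative with respect to $r$, and the absolute log-ratio tracked in Figure 5. The ExO-PPO objective itself is not defined in this summary and is therefore not implemented here; the function names and the default `epsilon=0.2` are assumptions made for the example.

```python
import math


def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Standard PPO surrogate for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_gradient(ratio, advantage, epsilon=0.2):
    """Derivative of the clipped surrogate w.r.t. r; zero where clipping is active."""
    if advantage > 0 and ratio > 1.0 + epsilon:
        return 0.0
    if advantage < 0 and ratio < 1.0 - epsilon:
        return 0.0
    return advantage


def abs_log_ratio(new_prob, old_prob):
    """|log(pi / pi_old)|, the drift measure of the kind plotted in Figure 5."""
    return abs(math.log(new_prob) - math.log(old_prob))


# With a positive advantage, the gradient vanishes once r exceeds 1 + epsilon.
for r in (0.5, 1.0, 1.5):
    print(r, ppo_clip_objective(r, 1.0), ppo_clip_gradient(r, 1.0))

# A sample whose action probability rose from 0.25 to 0.5 has |log r| = log 2.
print(abs_log_ratio(0.5, 0.25))
```

The abrupt drop of the PPO gradient to zero outside the clipping range is the discontinuity that the smooth ExO-PPO curves in Figure 2 are contrasted with. An absolute log-ratio of 0.2 corresponds to $r$ between roughly $e^{-0.2}\approx 0.82$ and $e^{0.2}\approx 1.22$, close to the usual PPO clipping range of $[0.8, 1.2]$.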

Theorems & Definitions (10)

  • Lemma 1: Policy Improvement Lower Bound
  • Lemma 2: Policy Improvement Lower Bound Between Any Reference Policy
  • proof
  • Theorem 1: Extended Off-Policy Improvement Lower Bound
  • proof
  • Definition 1: Extended Ratio with Exponentially Decaying Edge
  • Lemma 3
  • Lemma 4
  • proof
  • proof