Beyond the Boundaries of Proximal Policy Optimization

Charlie B. Tan, Edan Toledo, Benjamin Ellis, Jakob N. Foerster, Ferenc Huszár

TL;DR

This work offers an alternative perspective on PPO, decomposing it into an inner loop that estimates update vectors and an outer loop that applies them using gradient ascent with unity learning rate. It then proposes a framework in which these update vectors are applied using an arbitrary gradient-based optimizer.

Abstract

Proximal policy optimization (PPO) is a widely used algorithm for on-policy reinforcement learning. This work offers an alternative perspective on PPO, in which it is decomposed into the inner-loop estimation of update vectors and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight, we propose outer proximal policy optimization (outer-PPO), a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular, we consider non-unity learning rates and momentum applied to the outer loop, and a momentum bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumanji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumanji, given the same hyperparameter tuning budget.
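
As a rough illustration of the outer loop described above, the update can be written out using the notation of the Figure 1 caption ($\bm{\theta}_k$, $\bm{\theta}_k^*$, $\bm{g}^{O}_k$, $\sigma$). The momentum buffer $\bm{m}_k$ and coefficient $\mu$ are assumed notation, and the heavy-ball form shown is one common choice; the paper's exact parameterization may differ.

$$
\bm{g}^{O}_k = \bm{\theta}_k^* - \bm{\theta}_k, \qquad
\bm{m}_k = \mu\,\bm{m}_{k-1} + \bm{g}^{O}_k, \qquad
\bm{\theta}_{k+1} = \bm{\theta}_k + \sigma\,\bm{m}_k, \qquad \bm{m}_{-1} = \bm{0}.
$$

Setting $\mu = 0$ and $\sigma = 1$ gives $\bm{\theta}_{k+1} = \bm{\theta}_k^*$, which is standard PPO. Setting $\mu = 0$ with $\sigma \neq 1$ gives the non-unity learning rate variant.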

Paper Structure

This paper contains 39 sections, 4 equations, 24 figures, 4 tables, 6 algorithms.

Figures (24)

  • Figure 1: Diagram of outer-PPO estimating and applying the outer gradient as an update. (i) Transitions are collected with policy $\pi(\bm{\theta}_k)$, defining a surrogate objective and corresponding 'trust-region' (shaded) surrounding $\bm{\theta}_k$; inner-loop optimization of the surrogate objective (blue dashed) yields $\bm{\theta}_k^*$. (ii) Outer-PPO computes the outer gradient as $\bm{g}^{O}_k \gets \bm{\theta}_k^*-\bm{\theta}_k$. (iii) Outer-PPO updates the behavior parameters using an arbitrary gradient-based optimizer applied to the outer gradient to give $\bm{\theta}_{k+1}$, in this case gradient ascent with a learning rate $\sigma > 1$. Standard PPO can be understood as directly taking $\bm{\theta}_{k+1} \gets \bm{\theta}_k^*$, or as a special case of outer-PPO corresponding to gradient ascent with learning rate $\sigma = 1$.
  • Figure 2: Comparison of Nesterov-PPO and biased initialization.
  • Figure 3: Aggregate point estimates for Brax (upper), Jumanji (center), and MinAtar (lower). Optimal hyperparameters per-environment are used. Normalized to task min/max across all experiments.
  • Figure 4: Probability of improvement for Brax (left), Jumanji (center), and MinAtar (right). Optimal hyperparameters per-environment are used. Normalized to task min/max across all experiments.
  • Figure 5: Outer-LR hyperparameter sensitivity. Mean normalized return across the Brax (left), Jumanji (center), MinAtar (right) tasks as a function of outer learning rate $\sigma$. Mean of 4 seeds plotted with standard error shaded. Normalized to task min/max across all experiments.
  • ...and 19 more figures
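
The three steps in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the authors' implementation: `inner_loop` is a hypothetical stand-in for PPO's surrogate optimization, replaced here by a few plain gradient-ascent steps on a made-up quadratic objective, and all function names and default values are illustrative.

```python
from typing import Callable, List

Vector = List[float]


def inner_loop(theta_k: Vector,
               surrogate_grad: Callable[[Vector], Vector],
               steps: int = 4,
               inner_lr: float = 0.1) -> Vector:
    """Step (i): hypothetical stand-in for PPO's inner-loop optimization.

    Starts from theta_k and returns theta_k_star. Real PPO would run
    minibatch epochs on the clipped surrogate built from collected data.
    """
    theta = list(theta_k)
    for _ in range(steps):
        grad = surrogate_grad(theta)
        theta = [t + inner_lr * g for t, g in zip(theta, grad)]
    return theta


def outer_gradient(theta_k: Vector, theta_star: Vector) -> Vector:
    """Step (ii): outer gradient g_k = theta_k_star - theta_k."""
    return [s - t for s, t in zip(theta_star, theta_k)]


def outer_update(theta_k: Vector, outer_grad: Vector,
                 sigma: float = 1.0) -> Vector:
    """Step (iii): gradient ascent on the outer gradient with rate sigma."""
    return [t + sigma * g for t, g in zip(theta_k, outer_grad)]


def toy_surrogate_grad(theta: Vector) -> Vector:
    """Gradient of the toy objective -(theta - 3)^2, per coordinate (assumed)."""
    return [-2.0 * (t - 3.0) for t in theta]


theta_k = [0.0, 1.0]
theta_star = inner_loop(theta_k, toy_surrogate_grad)
g_outer = outer_gradient(theta_k, theta_star)

# sigma = 1 reproduces standard PPO: theta_{k+1} equals theta_k_star.
theta_ppo = outer_update(theta_k, g_outer, sigma=1.0)

# sigma > 1 steps further along the same direction, as drawn in Figure 1.
theta_outer = outer_update(theta_k, g_outer, sigma=1.5)
```

Because the outer step only sees the difference between the parameters before and after the inner loop, the plain ascent rule in `outer_update` could be swapped for any gradient-based optimizer (for example one with momentum) without changing the inner loop.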