Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

Michael Beukman; Khimya Khetarpal; Zeyu Zheng; Will Dabney; Jakob Foerster; Michael Dennis; Clare Lyle

TL;DR

This work conceptually models PPO as a stochastic optimization process. The model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise, and the paper investigates how hyperparameter choices influence both the step size and the update noise.

Abstract

Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. Recall that PPO alternates between sampling rollouts online from several parallel environments using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. In this model, the step size is controlled by the regularization strength towards the previous policy, and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that doing so incorrectly can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
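The two remedies named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's experimental setup: it assumes a one-dimensional quadratic loss f(x) = 0.5 * x^2 with additive Gaussian gradient noise, and the names `run_noisy_sgd` and `predicted_plateau` are hypothetical. Under those assumptions the long-run expected loss has the closed form step_size * noise_std^2 / (2 * batch_size * (2 - step_size)), so the plateau drops either when the step size is reduced or when more samples are averaged per update.

```python
import random


def run_noisy_sgd(step_size, batch_size, steps=20000, noise_std=1.0, x0=5.0, seed=0):
    """Minimise f(x) = 0.5 * x**2 from noisy gradients.

    Returns the mean loss over the second half of training, i.e. the plateau level.
    """
    rng = random.Random(seed)
    x = x0
    losses = []
    for _ in range(steps):
        # Averaging batch_size samples shrinks the noise std by sqrt(batch_size).
        noise = rng.gauss(0.0, noise_std / batch_size ** 0.5)
        grad_estimate = x + noise  # true gradient of 0.5 * x**2 is x
        x -= step_size * grad_estimate
        losses.append(0.5 * x * x)
    tail = losses[steps // 2:]
    return sum(tail) / len(tail)


def predicted_plateau(step_size, batch_size, noise_std=1.0):
    """Stationary expected loss of the recursion above (valid for 0 < step_size < 2)."""
    return step_size * noise_std ** 2 / (2.0 * batch_size * (2.0 - step_size))
```

For example, `predicted_plateau(0.5, 1)` evaluates to 1/6, while `predicted_plateau(0.05, 1)` and `predicted_plateau(0.5, 16)` are both more than ten times smaller; `run_noisy_sgd` with the same arguments can be used to check that the simulated plateaus follow the same ordering.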

Paper Structure (27 sections, 2 equations, 20 figures, 4 tables)

Introduction
Background
Proximal Policy Optimization
PPO-EWMA
PPO as a Stochastic Optimization Process
Learning Dynamics Under Excessive Step Size
Decoupling the Inner and Outer Loops
Understanding PPO's Outer Loop
Regularization
Optimization Epochs
Rollout Batch Size
Choosing an Appropriate Step Size
A Reliable Recipe for Scaling Parallelization in PPO
Robotics Results
Batch Size Scaling Enables Open-Ended Learning
...and 12 more sections

Figures (20)

Figure 1: Comparing the behavior in (a) PPO and (b) a simple convex optimization problem with stochastic gradients. In (a) having too large of an outer step size (in particular, having a center of mass of the proximal policy being too low) leads to a suboptimal plateau, with the same behavior occurring in (b). Solve rate corresponds to the policy's average success rate over all 512 morphologies. For all figures, we plot the mean and shade the 95% CI over 5 seeds unless otherwise noted.
Figure 2: (a) In stochastic optimization, the update magnitude is consistently large when the step size is too large, despite a stagnating loss. (b,c) Showing that PPO shares similar dynamics.
Figure 3: (a) Loading checkpoints and retraining with a different COM recovers the performance of the most recent regularization strength. The legend indicates the center of mass, and the number in brackets indicates the starting COM. (b) The same phenomenon occurs in stochastic optimization.
Figure 4: Tuning the learning rate cannot counteract a poor outer step size. Here we sweep over whether or not to anneal LR for each run, and show the best result per learning rate.
Figure 5: Weak regularization, corresponding to either (a) too low of a COM or (b) too large of a clipping $\epsilon$ can lead to premature plateaus.
...and 15 more figures
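The stochastic-optimization behavior described in the captions of Figures 2(a) and 3(b) can be reproduced in a toy setting. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses noisy SGD on f(x) = 0.5 * x^2 with unit-variance gradient noise, and `noisy_sgd_phase` is a hypothetical helper. It runs three phases from a shared trajectory (large step size, then a small step size resumed from the checkpoint, then the large step size again) and records the mean loss and mean update magnitude in each.

```python
import random


def noisy_sgd_phase(x, step_size, steps, rng, noise_std=1.0):
    """Run noisy SGD on f(x) = 0.5 * x**2 starting from x.

    Returns (final x, mean loss, mean absolute update) over the phase.
    """
    total_loss = 0.0
    total_update = 0.0
    for _ in range(steps):
        update = -step_size * (x + rng.gauss(0.0, noise_std))
        x += update
        total_loss += 0.5 * x * x
        total_update += abs(update)
    return x, total_loss / steps, total_update / steps


rng = random.Random(0)

# Phase 1: large step size. The first call is a burn-in to discard the initial transient.
x, _, _ = noisy_sgd_phase(5.0, 0.5, 2000, rng)
x, loss_large, update_large = noisy_sgd_phase(x, 0.5, 20000, rng)

# Phase 2: resume from the checkpoint with a much smaller step size.
checkpoint = x
x, _, _ = noisy_sgd_phase(checkpoint, 0.02, 2000, rng)
x, loss_small, update_small = noisy_sgd_phase(x, 0.02, 20000, rng)

# Phase 3: switch back to the large step size.
x, _, _ = noisy_sgd_phase(x, 0.5, 2000, rng)
x, loss_back, update_back = noisy_sgd_phase(x, 0.5, 20000, rng)
```

In this toy, the plateau level depends on the most recent step size rather than on the history: `loss_small` falls well below `loss_large`, and `loss_back` returns to roughly the level of `loss_large`. The update magnitude under the large step size (`update_large`) also stays far above `update_small` even though the loss is no longer improving.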
