Complexity-Regularized Proximal Policy Optimization

Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi

TL;DR

This paper introduces Complexity-Regularized Proximal Policy Optimization (CR-PPO), a modification of PPO that is significantly more robust to hyperparameter selection than entropy-regularized PPO. CR-PPO achieves consistent performance across orders of magnitude of regularization coefficients and remains harmless when regularization is unnecessary, reducing the need for expensive hyperparameter tuning.

Abstract

Policy gradient methods usually rely on entropy regularization to prevent premature convergence. However, maximizing entropy indiscriminately pushes the policy towards a uniform distribution, often overriding the reward signal if not optimally tuned. We propose replacing the standard entropy term with a self-regulating complexity term, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. Unlike pure entropy, which favors maximal disorder, this complexity measure is zero for both fully deterministic and perfectly uniform distributions, and is strictly positive only for distributions that exhibit a meaningful interplay between order and randomness. These properties ensure the policy maintains beneficial stochasticity while reducing regularization pressure when the policy is highly uncertain, allowing learning to focus on reward optimization. We introduce Complexity-Regularized Proximal Policy Optimization (CR-PPO), a modification of PPO that leverages this dynamic. We empirically demonstrate that CR-PPO is significantly more robust to hyperparameter selection than entropy-regularized PPO, achieving consistent performance across orders of magnitude of regularization coefficients and remaining harmless when regularization is unnecessary, thereby reducing the need for expensive hyperparameter tuning.
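The complexity term described in the abstract can be illustrated with a minimal sketch for a discrete action distribution. The abstract only says that disequilibrium "quantifies the distance from the uniform distribution"; the squared Euclidean distance used below is an assumption, as are the function names and the choice of natural-log entropy, and the paper's exact definition and normalization may differ.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0.0)


def disequilibrium(p):
    """Distance from the uniform distribution.

    ASSUMPTION: squared Euclidean distance; the abstract does not fix the form.
    """
    n = len(p)
    return sum((pi - 1.0 / n) ** 2 for pi in p)


def complexity(p):
    """Complexity as the product of entropy and disequilibrium."""
    return shannon_entropy(p) * disequilibrium(p)


if __name__ == "__main__":
    examples = {
        "deterministic": [1.0, 0.0, 0.0, 0.0],  # entropy is zero
        "uniform": [0.25, 0.25, 0.25, 0.25],    # disequilibrium is zero
        "intermediate": [0.7, 0.1, 0.1, 0.1],   # both factors are positive
    }
    for name, probs in examples.items():
        print(name, shannon_entropy(probs), disequilibrium(probs), complexity(probs))
```

Under these assumptions the product vanishes at both extremes, since one factor is zero in each case, and is positive for any distribution in between. This matches the self-regulating behavior the abstract attributes to the complexity term.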

Paper Structure

This paper contains 18 sections, 12 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Disequilibrium, entropy, and complexity in a two-dimensional space. The bullet points indicate their maxima.
  • Figure 2: CartPole (top) and CarRacing (bottom) mean return for different $c_{reg}$ values of CR-PPO (left) and PPOwEnt (center), and their aggregated average (right). The mean and standard error are shown across 3 seeds.
  • Figure 3: CoinRun (top) and AirRaid (bottom) mean return for different $c_{reg}$ values of CR-PPO (left) and PPOwEnt (center), and their aggregated average (right). The mean and standard error are shown across 3 seeds.
  • Figure 4: Asteroids (top) and RiverRaid (bottom) mean return for different $c_{reg}$ values of CR-PPO (left) and PPOwEnt (center), and their aggregated average (right). The mean and standard error are shown across 3 seeds.
  • Figure 5: Schematic representation of the CARTerpillar environment with 4 carts. On the left, a flattened rendering with all the relevant physical symbols. On the right, the full dynamics of the environment, where the effect of newly introduced dampers and springs is highlighted in red.
  • ...and 8 more figures