Complexity-Regularized Proximal Policy Optimization

Luca Serfilippi; Giorgio Franceschelli; Antonio Corradi; Mirco Musolesi

TL;DR

We introduce Complexity-Regularized Proximal Policy Optimization (CR-PPO), a modification of PPO that is significantly more robust to hyperparameter selection than entropy-regularized PPO. It achieves consistent performance across orders of magnitude of regularization coefficients and remains harmless when regularization is unnecessary, thereby reducing the need for expensive hyperparameter tuning.

Abstract

Policy gradient methods usually rely on entropy regularization to prevent premature convergence. However, maximizing entropy indiscriminately pushes the policy towards a uniform distribution, often overriding the reward signal if the coefficient is not optimally tuned. We propose replacing the standard entropy term with a self-regulating complexity term, defined as the product of Shannon entropy and disequilibrium, where the latter quantifies the distance from the uniform distribution. Unlike pure entropy, which favors maximal disorder, this complexity measure is zero for both fully deterministic and perfectly uniform distributions; it is strictly positive only for distributions that exhibit a meaningful interplay between order and randomness. These properties ensure that the policy maintains beneficial stochasticity while the regularization pressure is reduced when the policy is highly uncertain, allowing learning to focus on reward optimization. We introduce Complexity-Regularized Proximal Policy Optimization (CR-PPO), a modification of PPO that leverages this dynamic. We empirically demonstrate that CR-PPO is significantly more robust to hyperparameter selection than entropy-regularized PPO, achieving consistent performance across orders of magnitude of regularization coefficients and remaining harmless when regularization is unnecessary, thereby reducing the need for expensive hyperparameter tuning.
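
The complexity term described in the abstract can be illustrated with a minimal sketch for a discrete action distribution. The abstract only states that disequilibrium "quantifies the distance from the uniform distribution"; the squared Euclidean distance used below is an assumption, as are the function names `shannon_entropy`, `disequilibrium`, and `complexity`. This is not the authors' implementation.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def disequilibrium(probs):
    """Squared Euclidean distance from the uniform distribution.

    Assumption: the abstract does not give the exact distance; this
    is one common choice for illustration only.
    """
    n = len(probs)
    return sum((p - 1.0 / n) ** 2 for p in probs)


def complexity(probs):
    """Complexity as the product of entropy and disequilibrium."""
    return shannon_entropy(probs) * disequilibrium(probs)


if __name__ == "__main__":
    for name, dist in [
        ("deterministic", [1.0, 0.0, 0.0, 0.0]),
        ("uniform", [0.25, 0.25, 0.25, 0.25]),
        ("intermediate", [0.7, 0.1, 0.1, 0.1]),
    ]:
        print(name, complexity(dist))
```

Under this assumed definition, the deterministic distribution has zero entropy and the uniform distribution has zero disequilibrium, so the product vanishes at both extremes and is positive in between, matching the behavior the abstract describes.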
