Proximal Policy Optimization with Adaptive Exploration
Andrei Lixandru
TL;DR
Proximal Policy Optimization with Adaptive Exploration (axPPO) addresses the exploration-exploitation tradeoff in reinforcement learning by making the entropy bonus in PPO dynamic, driven by the agent's recent performance. It defines $G_{recent}$ as a normalized moving average of past returns and uses it to scale the entropy term in the PPO objective. In experiments on CartPole-v1, axPPO achieves competitive or superior returns across a range of entropy coefficients, which indicates robustness to the initial exploration level. These results suggest that performance-driven adaptive exploration can improve learning efficiency, and they motivate broader testing in richer domains.
Abstract
This paper introduces Proximal Policy Optimization with Adaptive Exploration (axPPO), a novel learning algorithm that addresses the exploration-exploitation tradeoff in reinforcement learning. The proposed adaptive exploration framework dynamically adjusts the exploration magnitude during training based on the agent's recent performance. The method outperforms standard PPO in learning efficiency, particularly when significant exploratory behavior is needed at the beginning of the learning process.
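To make the mechanism concrete, the following is a minimal Python sketch of how a quantity like $G_{recent}$ could be tracked and applied to the entropy coefficient. It is an illustration under stated assumptions, not the paper's implementation: the class and function names, the window size, the min-max normalization, the value used before any episode has finished, and the choice to multiply the coefficient directly by $G_{recent}$ (rather than by some other function of it) are all hypothetical choices that the text above does not specify.

```python
from collections import deque


class RecentReturnTracker:
    """Moving average of episode returns, normalized to [0, 1].

    Hypothetical helper: min-max normalization against fixed return
    bounds is an assumption, not something stated in the text.
    """

    def __init__(self, window, return_min, return_max, initial=1.0):
        if return_max <= return_min:
            raise ValueError("return_max must be greater than return_min")
        self.returns = deque(maxlen=window)  # keeps only the last `window` returns
        self.return_min = return_min
        self.return_max = return_max
        self.initial = initial  # assumed value before any episode has finished

    def update(self, episode_return):
        """Record the return of a finished episode."""
        self.returns.append(episode_return)

    def g_recent(self):
        """Normalized moving average of the recorded returns."""
        if not self.returns:
            return self.initial
        mean_return = sum(self.returns) / len(self.returns)
        g = (mean_return - self.return_min) / (self.return_max - self.return_min)
        return min(1.0, max(0.0, g))  # clip to [0, 1]


def adaptive_entropy_coef(base_coef, g_recent):
    """Scale the entropy coefficient by G_recent.

    Assumption: plain multiplication. The direction and exact form of
    the scaling are not given in the text above.
    """
    return base_coef * g_recent


# Example with CartPole-v1-style bounds (episodes are capped at 500 steps).
tracker = RecentReturnTracker(window=3, return_min=0.0, return_max=500.0)
for episode_return in (100.0, 200.0, 300.0):
    tracker.update(episode_return)

g = tracker.g_recent()
coef = adaptive_entropy_coef(0.01, g)
```

In a PPO training loop, a value such as `coef` would take the place of the fixed entropy coefficient when the loss for the next update is assembled, and `tracker.update` would be called each time an episode ends.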
