Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space
Xinyu Zhang, Aishik Deb, Klaus Mueller
TL;DR
The paper tackles the inefficiency of on-policy reinforcement learning caused by surrogate gradient drift and high variance. It introduces ExploRLer, an iteration-level parameter-space exploration pipeline that applies the Empty-Space Search Algorithm (ESA) around multiple anchors. Every few iterations, ExploRLer generates candidate policies by zeroth-order (gradient-free) search and evaluates them online, correcting gradient bias without increasing the number of gradient updates. This yields higher final returns and faster convergence on a range of continuous control tasks. The paper formalizes three levels of training granularity (batch, epoch, and iteration), provides default ESA parameters, demonstrates robustness across PPO and TRPO, and discusses potential extensions to off-policy settings and improvements in evaluation efficiency. Overall, the approach offers a practical route to more robust and data-efficient on-policy RL by exploiting unexplored regions of the parameter space rather than solely refining local gradient directions. The work also highlights the limitations of surrogate objectives and provides a framework for leveraging iteration-level structure to improve policy optimization.
Abstract
Policy-gradient methods such as Proximal Policy Optimization (PPO) typically update the policy along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher-performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that integrates with on-policy algorithms such as PPO and TRPO and systematically probes the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.
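To make the idea of iteration-level exploration concrete, the following is a minimal sketch, not the authors' implementation. It assumes that policy checkpoints from one iteration are available as flat parameter vectors (the "anchors") and that a scoring function returning an estimated return can be called on any parameter vector. The names `propose_candidates`, `explore_step`, `evaluate`, and `noise_scale` are hypothetical, and the sampling rule (random convex combinations of anchors plus Gaussian noise) is a simple stand-in for ESA, whose details are not given in this section.

```python
import random
from typing import Callable, List, Sequence

Params = List[float]


def propose_candidates(
    anchors: Sequence[Params],
    n_candidates: int,
    noise_scale: float,
    rng: random.Random,
) -> List[Params]:
    """Sample gradient-free candidates in the neighborhood of the anchors.

    Stand-in for ESA (assumption): each candidate is a random convex
    combination of the anchors, perturbed by isotropic Gaussian noise.
    """
    dim = len(anchors[0])
    candidates = []
    for _ in range(n_candidates):
        # Random positive weights, normalized so that they sum to 1.
        raw = [rng.random() + 1e-12 for _ in anchors]
        total = sum(raw)
        weights = [w / total for w in raw]
        point = [
            sum(w * a[i] for w, a in zip(weights, anchors))
            + rng.gauss(0.0, noise_scale)
            for i in range(dim)
        ]
        candidates.append(point)
    return candidates


def explore_step(
    anchors: Sequence[Params],
    evaluate: Callable[[Params], float],
    n_candidates: int = 16,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> Params:
    """Return the best-scoring parameter vector among anchors and candidates.

    No gradients are computed here; only `evaluate` is called. Keeping the
    anchors in the pool means the result never scores below the best anchor
    under the same (deterministic) `evaluate`.
    """
    rng = random.Random(seed)
    pool = [list(a) for a in anchors]
    pool += propose_candidates(anchors, n_candidates, noise_scale, rng)
    return max(pool, key=evaluate)


if __name__ == "__main__":
    # Toy stand-in for an online return estimate: peak at (0.4, 0.4).
    def toy_return(p: Params) -> float:
        return -((p[0] - 0.4) ** 2 + (p[1] - 0.4) ** 2)

    checkpoints = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    best = explore_step(checkpoints, toy_return, n_candidates=32, noise_scale=0.05)
    print("selected parameters:", best)
    print("toy return:", toy_return(best))
```

In an actual on-policy loop, `evaluate` would roll out the candidate policy in the environment, so its result is noisy and costs environment interaction; the sketch ignores both and only illustrates how a selection step can sit between gradient iterations without adding gradient updates.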
