Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space

Xinyu Zhang, Aishik Deb, Klaus Mueller

TL;DR

The paper tackles the inefficiency of on-policy reinforcement learning caused by surrogate gradient drift and high variance. It introduces ExploRLer, an iteration-level parameter space exploration pipeline that applies the Empty-Space Search Algorithm (ESA) around multiple anchors. By generating zeroth-order (gradient-free) candidate policies every few iterations and evaluating them online, ExploRLer corrects gradient bias without increasing per-update gradient computation, yielding higher final returns and faster convergence on a range of continuous control tasks. The paper formalizes three levels of training granularity (batch, epoch, and iteration), provides default ESA parameters, demonstrates robustness across PPO and TRPO, and discusses potential extensions to off-policy settings and more efficient evaluation. Overall, the approach offers a practical route to more robust and data-efficient on-policy RL by exploiting unexplored regions of the parameter space rather than solely refining local gradient directions. The work also highlights the limitations of surrogate objectives and provides a framework for leveraging iteration-level structure to improve policy optimization.

Abstract

Policy-gradient methods such as Proximal Policy Optimization (PPO) are typically updated along a single stochastic gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. Building on this insight, we visualize the parameter space spanned by policy checkpoints within an iteration and reveal that higher performing solutions often lie in nearby unexplored regions. To exploit this opportunity, we introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO, systematically probing the unexplored neighborhoods of surrogate on-policy gradient updates. Without increasing the number of gradient updates, ExploRLer achieves significant improvements over baselines in complex continuous control environments. Our results demonstrate that iteration-level exploration provides a practical and effective way to strengthen on-policy reinforcement learning and offer a fresh perspective on the limitations of the surrogate objective.
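
As a rough illustration of the iteration-level exploration described in the abstract, the sketch below samples gradient-free candidates around a few anchor policies and keeps whichever one evaluates best. Everything named here is hypothetical: `toy_return` stands in for online rollouts, the anchors stand in for epoch checkpoints from one iteration, and plain Gaussian perturbation is used as a simple substitute for the paper's Empty-Space Search, whose details are not given in this summary.

```python
import random


def toy_return(theta):
    """Hypothetical stand-in for an online rollout; peaks when every coordinate is 1.0."""
    return -sum((x - 1.0) ** 2 for x in theta)


def explore_around_anchors(anchors, evaluate, n_candidates=8, radius=0.1, seed=0):
    """Sample gradient-free candidates near each anchor and return the best one.

    Assumption: Gaussian perturbation replaces the paper's ESA in this sketch.
    """
    rng = random.Random(seed)
    best_theta, best_value = None, float("-inf")
    for anchor in anchors:
        # The anchor itself competes, so the result is never worse than the best anchor.
        pool = [list(anchor)]
        for _ in range(n_candidates):
            pool.append([x + rng.gauss(0.0, radius) for x in anchor])
        for theta in pool:
            value = evaluate(theta)
            if value > best_value:
                best_theta, best_value = theta, value
    return best_theta, best_value


# Anchors play the role of epoch checkpoints saved during one on-policy iteration.
anchors = [[0.0, 0.0], [0.2, 0.1], [0.4, 0.3]]
best_theta, best_value = explore_around_anchors(anchors, toy_return)
best_anchor_value = max(toy_return(a) for a in anchors)
print(best_value >= best_anchor_value)
```

No gradients are computed in the selection step: candidates are only evaluated, which mirrors the claim that exploration adds no gradient updates. In a real setting the evaluation cost (extra rollouts) would be the main expense.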

Paper Structure

This paper contains 42 sections, 6 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Visualization of policy value distribution in the local parameter space. Red dots denote 10 epoch checkpoints from a PPO iteration, fitted with a Gaussian to sample 100 candidate policies. Each candidate is evaluated over 1,000 episodes, and the average return is projected onto a PCA plane to form a contour map. Results are shown at iteration 1 (start), 1,000 (midpoint), and 3,500 (end). One panel group shows the parameter space visualization for MuJoCo Ant and the other for MuJoCo Walker2d. More results can be found in the visualization appendix.
  • Figure 2: Comparison of training curves on MuJoCo environments across 3M steps. The solid and dashed lines show the average performance across 4 random seeds, with the shaded region indicating ±1 standard deviation, and with a smoothing window of length 100.
  • Figure 3: Training curves of Ablation Study for 3M steps. The solid and dashed lines show the average performance across 4 random seeds, with the shaded region indicating ±1 standard deviation, and with a smoothing window of length 100.
  • Figure 4: SAC panel: training curves for Humanoid-v5 using SAC with and without ExploRLer integration. The solid and dashed lines show the average performance across 4 random seeds, with shaded regions indicating ±1 standard deviation, and with a smoothing window of length 100. FQE panel: the orange line is the regular PPO algorithm; the green line is ExploRLer-P with fully online evaluation; the blue line is ExploRLer with a combination of FQE and online evaluation. All experiments start from a model pretrained for 1 million steps.
  • Figure 5: Local parameter space visualizations for Hopper, Humanoid, and HalfCheetah, showing the distribution of checkpoints within the parameter space and revealing adjacent empty regions that can host higher-value candidate policies.
  • ...and 2 more figures
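
The procedure in the Figure 1 caption (fit a Gaussian to checkpoints, sample candidates, evaluate them, project onto a PCA plane) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the Gaussian is taken to be diagonal, PCA is fitted on the sampled candidates, `toy_episode_return` is a hypothetical replacement for environment rollouts, and 20 episodes per candidate are used instead of 1,000 to keep the example fast. None of these choices are confirmed by the summary.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_diagonal_gaussian(checkpoints):
    """Per-coordinate mean and sample standard deviation (diagonal covariance is an assumption)."""
    n, dim = len(checkpoints), len(checkpoints[0])
    mean = [sum(c[j] for c in checkpoints) / n for j in range(dim)]
    std = [(sum((c[j] - mean[j]) ** 2 for c in checkpoints) / (n - 1)) ** 0.5
           for j in range(dim)]
    return mean, std


def principal_plane(points, rng, iters=300):
    """Center of the points and their two leading principal directions (power iteration)."""
    dim = len(points[0])
    center = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    centered = [[p[j] - center[j] for j in range(dim)] for p in points]
    directions = []
    for _component in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _step in range(iters):
            # Multiply by the (unnormalized) covariance: w = X^T (X v).
            scores = [dot(row, v) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(dim)]
            # Deflation: remove anything along directions already found.
            for d in directions:
                proj = dot(w, d)
                w = [x - proj * y for x, y in zip(w, d)]
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        directions.append(v)
    return center, directions


def toy_episode_return(theta, rng):
    """Hypothetical noisy return; a real run would roll out the policy in the environment."""
    return -sum(x * x for x in theta) + rng.gauss(0.0, 0.01)


rng = random.Random(0)
dim = 5
# Ten vectors standing in for the 10 epoch checkpoints of one PPO iteration.
checkpoints = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(10)]
mean, std = fit_diagonal_gaussian(checkpoints)
candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(100)]
center, (d1, d2) = principal_plane(candidates, rng)

contour_points = []  # (x, y, average return) triples, ready for a contour plot
for theta in candidates:
    avg = sum(toy_episode_return(theta, rng) for _ in range(20)) / 20
    offset = [x - c for x, c in zip(theta, center)]
    contour_points.append((dot(offset, d1), dot(offset, d2), avg))
```

The resulting `contour_points` could be passed to any plotting library to draw a contour map like the one the caption describes; real policy networks have far more parameters than the five used here, so a numerical library would be the practical choice for the PCA step.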