On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Nicholas E. Corrado; Josiah P. Hanna

TL;DR

This work tackles the sampling-error bottleneck in on-policy policy-gradient RL by introducing PROPS, an adaptive data-collection method that increases the probability of under-sampled actions to align the collected data with the on-policy distribution $d_{\pi_{\theta}}$. PROPS combines a PPO-like clipped surrogate for stabilizing behavior-policy updates with a KL-regularization term to keep the data-collection policy near the current target policy, and it employs a finite buffer to reuse past data. Empirically, PROPS reduces sampling error faster than on-policy sampling and the prior ROS method, delivering improved data efficiency on GridWorld and continuous MuJoCo tasks, often achieving comparable or better returns with substantially fewer environment interactions. The work discusses limitations and future directions, including extending theory to continuous MDPs and focusing sampling on actions with high gradient impact to further enhance efficiency.
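The combination of a clipped surrogate and a KL penalty described above can be sketched for a discrete action space. This is only one plausible reading of the summary, not the authors' exact objective: it assumes a PPO-style clipped surrogate in which every action already in the buffer is given a fixed advantage of -1, and it assumes the KL term is taken from the target policy to the behavior policy. The names `clipped_behavior_objective`, `props_style_objective`, `eps`, and `kl_coef` are hypothetical.

```python
import numpy as np

def clipped_behavior_objective(ratio, eps=0.2):
    """PPO-style clipped surrogate with a fixed advantage of -1 (an assumption).

    ratio = pi_phi(a|s) / pi_theta(a|s) for actions already in the buffer.
    Maximizing this value lowers the probability of observed actions, but the
    value becomes constant (zero gradient) once ratio falls below 1 - eps.
    """
    ratio = np.asarray(ratio, dtype=float)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(-ratio, -clipped)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def props_style_objective(ratios, pi_theta, pi_phi, eps=0.2, kl_coef=0.1):
    """Objective to maximize w.r.t. the behavior policy (hypothetical form).

    The KL direction, KL(pi_theta || pi_phi), is an assumption.
    """
    surrogate = float(np.mean(clipped_behavior_objective(ratios, eps)))
    return surrogate - kl_coef * kl_divergence(pi_theta, pi_phi)
```

Under these assumptions, the clipping limits how far the behavior policy moves away from observed actions in a single update, while the KL term penalizes any remaining drift from the target policy.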

Abstract

On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.
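To make the notion of sampling error concrete, the following minimal sketch draws i.i.d. actions from a fixed policy in a single state and measures how far the empirical action frequencies are from the policy itself. The single-state setting, the example policy, and the use of the KL divergence from the empirical policy to the true policy as the error metric are all assumptions made for illustration; `empirical_policy` and `sampling_error` are hypothetical helper names.

```python
import numpy as np

def empirical_policy(actions, n_actions):
    """Fraction of times each action appears in the collected data."""
    counts = np.bincount(actions, minlength=n_actions)
    return counts / counts.sum()

def sampling_error(pi, actions):
    """KL(pi_D || pi) between the empirical policy pi_D and the policy pi."""
    pi = np.asarray(pi, dtype=float)
    pi_d = empirical_policy(np.asarray(actions), len(pi))
    mask = pi_d > 0  # actions never sampled contribute zero to this KL
    return float(np.sum(pi_d[mask] * np.log(pi_d[mask] / pi[mask])))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pi = np.array([0.5, 0.3, 0.2])  # example policy (assumption)
    for n in (10, 100, 1000):
        actions = rng.choice(len(pi), size=n, p=pi)  # on-policy i.i.d. sampling
        print(n, sampling_error(pi, actions))
```

The error is exactly zero only when the observed action frequencies equal the policy's probabilities, which i.i.d. sampling guarantees only in the limit of infinite data.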

Paper Structure (25 sections, 3 theorems, 22 equations, 17 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Reinforcement Learning
On-Policy Policy Gradient Algorithms
Correcting Sampling Error in Reinforcement Learning
Proximal Robust On-Policy Sampling for Policy Gradient Algorithms
Robust On-Policy Sampling
Proximal Robust On-Policy Sampling
Experiments
Sampling Error Metrics
Correcting Sampling Error for a Fixed Target Policy
Correcting Sampling Error During RL Training
Discussion
Conclusion
...and 10 more sections

Key Result

Proposition 0

Assume that data is collected with an adaptive behavior policy that always takes the most under-sampled action in each state $s$ w.r.t. $\pi$, i.e., $a \leftarrow \arg\max_{a'} (\pi(a'|s) - {\pi_{\mathcal{D}}}(a'|s))$, where ${\pi_{\mathcal{D}}}$ is the empirical policy after $m$ state-action pairs
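The selection rule in the proposition can be sketched for a single state with a discrete action space. This is an illustrative simulation of the rule $a \leftarrow \arg\max_{a'} (\pi(a'|s) - \pi_{\mathcal{D}}(a'|s))$ only; it does not reproduce the proposition's conclusion. The single-state restriction, the convention that the empirical policy is all zeros before any data is collected, and the function name `greedy_undersampled_actions` are assumptions.

```python
import numpy as np

def greedy_undersampled_actions(pi, m):
    """Collect m actions, always taking the most under-sampled action w.r.t. pi.

    Returns the action sequence and the empirical policy after m samples.
    """
    pi = np.asarray(pi, dtype=float)
    counts = np.zeros(len(pi))
    actions = []
    for t in range(m):
        # Empirical policy over the t samples collected so far
        # (treated as all zeros before any data exists: an assumption).
        pi_d = counts / t if t > 0 else np.zeros_like(pi)
        a = int(np.argmax(pi - pi_d))  # largest gap between pi and pi_D
        counts[a] += 1
        actions.append(a)
    return actions, counts / m

if __name__ == "__main__":
    actions, pi_d = greedy_undersampled_actions([0.5, 0.3, 0.2], m=10)
    print(actions)
    print(pi_d)
```

Unlike i.i.d. sampling, this rule is deterministic given the data collected so far, so each new action is chosen specifically to shrink the gap between the empirical policy and $\pi$.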

Figures (17)

Figure 1: An overview of PROPS. Rather than collecting data ${\mathcal{D}}$ via on-policy sampling from the agent's current policy $\pi_{\bm{\theta}}$, we collect data with a separate data collection policy $\pi_{\bm{\phi}}$ that we continually adapt to reduce sampling error in ${\mathcal{D}}$ with respect to the agent's current policy.
Figure 2: (a) A GridWorld task in which the agent receives a reward of $+1$ upon reaching the bottom right corner (the optimal goal), a reward of $+0.5$ upon reaching the top left corner (the suboptimal goal), and a reward of $-0.01$ otherwise. The agent always starts in the center of the grid. Under an initially uniform policy, the agent visits both goals with equal probability, and thus the true policy gradient increases the probability of reaching the optimal goal. However, sampling error can yield an empirical gradient that increases the probability of reaching the suboptimal goal and cause the agent to converge suboptimally. To converge optimally, the agent must have low sampling error. (b, c) PROPS reduces sampling error and achieves more accurate gradients faster than on-policy sampling. Solid curves denote means over 50 seeds. Shaded regions denote 95% bootstrap confidence belts.
Figure 4: Mean normalized return and performance profiles aggregated over all six MuJoCo tasks. We compute normalized returns as $\frac{R_\text{max} - R_t}{R_\text{max}}$, where $R_\text{max}$ is the maximum return achieved by any algorithm in a particular task, and $R_t$ is the return at timestep $t$. Solid curves denote the mean over 50 seeds per task (300 seeds total). Shaded regions denote 95% bootstrap confidence belts.
Figure 5: GridWorld RL experiments over 50 seeds.
Figure 6: Sampling error throughout RL training. Solid curves denote the mean over 5 seeds. Shaded regions denote 95% confidence belts.
...and 12 more figures

Theorems & Definitions (6)

Proposition 0
Proposition 0
proof
proof
Proposition 0
proof
