On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling
Nicholas E. Corrado, Josiah P. Hanna
TL;DR
This work tackles the sampling-error bottleneck in on-policy policy-gradient RL by introducing PROPS, an adaptive data-collection method that increases the probability of under-sampled actions to align the collected data with the on-policy distribution $d_{π_{θ}}$. PROPS combines a PPO-like clipped surrogate for stabilizing behavior-policy updates with a KL-regularization term to keep the data-collection policy near the current target policy, and it employs a finite buffer to reuse past data. Empirically, PROPS reduces sampling error faster than on-policy sampling and the prior ROS method, delivering improved data efficiency on GridWorld and continuous MuJoCo tasks, often achieving comparable or better returns with substantially fewer environment interactions. The work discusses limitations and future directions, including extending theory to continuous MDPs and focusing sampling on actions with high gradient impact to further enhance efficiency.
Abstract
On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that PROPS (1) decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.
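To make the notion of sampling error concrete, the following is a minimal toy sketch, not PROPS itself. It assumes a single-state problem with a hypothetical three-action policy `pi` and compares i.i.d. on-policy sampling against a simple greedy rule that always takes the currently most under-sampled action. The greedy rule, the total-variation error measure, and all function names are assumptions made for illustration; they stand in for the learned behavior policy described above.

```python
import random


def total_variation(counts, pi):
    """Total variation distance between empirical action frequencies and pi."""
    n = sum(counts)
    return 0.5 * sum(abs(c / n - p) for c, p in zip(counts, pi))


def sample_on_policy(pi, n, rng):
    """Draw n i.i.d. actions from pi and return the per-action counts."""
    counts = [0] * len(pi)
    for a in rng.choices(range(len(pi)), weights=pi, k=n):
        counts[a] += 1
    return counts


def sample_adaptive(pi, n):
    """Toy adaptive sampler (illustrative assumption, not the PROPS update):
    at each step, take the action whose count lags furthest behind its
    expected count under pi."""
    counts = [0] * len(pi)
    for t in range(n):
        # Expected count after this step minus the count observed so far.
        deficits = [p * (t + 1) - c for p, c in zip(pi, counts)]
        a = max(range(len(pi)), key=deficits.__getitem__)
        counts[a] += 1
    return counts


pi = [0.5, 0.3, 0.2]  # hypothetical target policy over three actions
n = 100               # number of sampled actions per batch
rng = random.Random(0)

# Average the i.i.d. sampling error over many independent batches.
iid_errors = [total_variation(sample_on_policy(pi, n, rng), pi) for _ in range(200)]
mean_iid_error = sum(iid_errors) / len(iid_errors)
adaptive_error = total_variation(sample_adaptive(pi, n), pi)

print(f"mean i.i.d. sampling error: {mean_iid_error:.4f}")
print(f"adaptive sampling error:    {adaptive_error:.4f}")
```

Under these assumptions, the greedy sampler keeps every action count within a constant number of samples of `n * pi[a]`, so its error shrinks on the order of 1/n, whereas the error of i.i.d. sampling shrinks only on the order of 1/sqrt(n). This is the gap that adaptive, off-policy data collection aims to exploit.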
