Transductive Off-policy Proximal Policy Optimization
Yaozhong Gan, Renye Yan, Xiaoyang Tan, Zhe Wu, Junliang Xing
TL;DR
ToPPO addresses the sample inefficiency of on-policy PPO by reusing off-policy data through a transductive surrogate bound built on $A^{\mu}$, the advantage of the policy $\mu$ that generated the data, rather than on $A^{\pi_k}$, the advantage of the current policy. It derives a policy-improvement lower bound and casts each update as a constrained optimization with PPO-like clipping, which enables monotonic improvement without storing many old-policy networks. The approach is validated on MuJoCo and Atari, where ToPPO shows better sample efficiency and stability than PPO and is competitive with other baselines; it is also shown to mitigate the bias that off-policy advantage estimation introduces in prior methods. ToPPO thus offers both theoretical guarantees and practical guidance for reusing off-policy data in policy optimization, particularly in continuous control.
Abstract
Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.
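To make the idea of optimizing a clipped surrogate on off-policy data concrete, the following is a minimal sketch of a generic PPO-style clipped objective in which the probability ratio and the advantage are both taken with respect to the data-generating policy $\mu$. It is an illustration under stated assumptions, not the exact ToPPO objective: the function name `clipped_surrogate`, its arguments, and the clipping range `clip_eps=0.2` are hypothetical, and the additional constraint that the paper uses to guarantee monotonic improvement is not reproduced here.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_behavior, adv_behavior, clip_eps=0.2):
    """Generic PPO-style clipped surrogate evaluated on off-policy samples.

    Assumptions (not taken from the paper):
      logp_new      -- log pi(a|s) under the candidate policy
      logp_behavior -- log mu(a|s) under the policy that collected the data
      adv_behavior  -- estimates of A^mu(s, a) for the same samples
      clip_eps      -- clipping range, set to the common PPO default
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    adv_behavior = np.asarray(adv_behavior, dtype=float)

    # Importance ratio pi(a|s) / mu(a|s), computed in log space for stability.
    ratio = np.exp(logp_new - logp_behavior)

    # Unclipped and clipped surrogate terms, as in standard PPO.
    unclipped = ratio * adv_behavior
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_behavior

    # The elementwise minimum gives a pessimistic estimate of the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the candidate policy equals the behavior policy, every ratio is 1 and the surrogate reduces to the mean advantage. When the ratio moves outside the clipping range in the direction the advantage favors, the clipped term caps the objective, which removes the incentive to push the policy further from the data-generating one.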
