Reflective Policy Optimization
Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing
TL;DR
Reflective Policy Optimization (RPO) extends on-policy reinforcement learning by using both past and future state-action information from a trajectory to guide the policy update at the current state. It formulates a generalized surrogate objective with multi-step terms and applies PPO-style clipping. The authors' analysis shows that this yields monotonic policy improvement while contracting the solution space, which speeds up convergence and improves sample efficiency. They provide theoretical lower bounds, validate the approach on MuJoCo continuous-control tasks and Atari games, and release a public codebase. The work indicates that reflective information from adjacent state-action pairs within a trajectory can improve the data efficiency of on-policy RL without relying on off-policy data.
Abstract
On-policy reinforcement learning methods, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that combines past and future state-action information for policy optimization. This approach lets the agent introspect and modify its actions in the current state. Theoretical analysis confirms that RPO monotonically improves policy performance and contracts the solution space, which speeds up convergence. Empirical results on two reinforcement learning benchmarks demonstrate RPO's feasibility and efficacy, including superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.
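To make the idea of a multi-step surrogate with PPO-style clipping concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the two-step form, the function names `clipped_term` and `reflective_surrogate`, the mixing weight `beta`, and the clip ranges `eps` and `eps_next` are all hypothetical choices made for this example. The sketch also assumes the inputs come from a single trajectory with no episode boundary inside it.

```python
import numpy as np


def clipped_term(ratio, adv, eps):
    """Standard PPO clipped surrogate, evaluated per time step."""
    return np.minimum(ratio * adv, np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)


def reflective_surrogate(logp_new, logp_old, adv, eps=0.2, eps_next=0.1, beta=0.3):
    """Illustrative two-step surrogate (an assumption, not the paper's exact objective).

    logp_new, logp_old: log-probabilities of the taken actions under the
        new and old policies, shape (T,), from one uninterrupted trajectory.
    adv: advantage estimates under the old policy, shape (T,).
    beta, eps, eps_next: hypothetical weight and clip ranges.
    """
    ratio = np.exp(logp_new - logp_old)

    # One-step term: the usual PPO clipped objective.
    first = clipped_term(ratio, adv, eps).mean()

    # Two-step term: the ratio at step t is paired with the ratio and
    # advantage at step t+1, so the next state-action pair also informs
    # the update at the current state.
    second = 0.0
    if ratio.size > 1:
        r_t, r_next, a_next = ratio[:-1], ratio[1:], adv[1:]
        unclipped = r_t * r_next * a_next
        clipped = (
            np.clip(r_t, 1.0 - eps, 1.0 + eps)
            * np.clip(r_next, 1.0 - eps_next, 1.0 + eps_next)
            * a_next
        )
        second = np.minimum(unclipped, clipped).mean()

    return first + beta * second


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logp_old = rng.normal(-1.0, 0.1, size=8)
    logp_new = logp_old + rng.normal(0.0, 0.05, size=8)
    adv = rng.normal(0.0, 1.0, size=8)
    print(reflective_surrogate(logp_new, logp_old, adv))
```

Setting `beta=0` reduces the sketch to the ordinary PPO clipped objective, which makes it easy to compare the two in an experiment. The objective and hyperparameters actually used by RPO are in the repository linked above.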
