Generalized Policy Improvement Algorithms with Theoretically Supported Sample Reuse
James Queeney, Ioannis Ch. Paschalidis, Christos G. Cassandras
TL;DR
The paper develops Generalized Policy Improvement (GPI) algorithms that merge on-policy policy-improvement guarantees with principled sample reuse from recent policies, addressing the trade-off between performance guarantees and data efficiency in deep RL. By introducing a generalized trust-region framework and an optimal mixture over past policies, the authors derive GePPO, GeTRPO, and GeVMPO, which reuse data while preserving approximate policy improvement. Theoretical results show how to set the GPI trust-region parameter and mixture weights to bound risk and maintain guarantees, and empirical results on the DeepMind Control Suite demonstrate substantial gains over on-policy baselines, especially in sparse-reward tasks. The work offers a practical pathway to more data-efficient, reliable policy learning in real-world control scenarios, with future directions toward extending sample reuse to more aggressive off-policy settings.
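To make the idea of reusing samples from recent policies inside a trust region more concrete, here is a minimal NumPy sketch of a PPO-style clipped surrogate with a shifted clipping range. This summary does not state the paper's exact objective, so the form below is an assumption made for illustration: the clipping range is centered on the ratio between the current policy and the older policy that generated each sample, and each sample is weighted by the mixture weight of that older policy. The function name, argument names, and default `eps` are all hypothetical.

```python
import numpy as np


def generalized_clipped_surrogate(logp_new, logp_cur, logp_behavior,
                                  advantages, weights, eps=0.1):
    """Illustrative clipped surrogate with sample reuse (assumed form).

    logp_new      : log-prob of each action under the candidate policy
    logp_cur      : log-prob under the current policy pi_k
    logp_behavior : log-prob under the past policy that collected the sample
    advantages    : advantage estimates with respect to pi_k
    weights       : per-sample mixture weight of the collecting policy
    eps           : half-width of the clipping range (hypothetical default)
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_cur = np.asarray(logp_cur, dtype=float)
    logp_behavior = np.asarray(logp_behavior, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Importance ratio of the candidate policy to the collecting policy.
    ratio = np.exp(logp_new - logp_behavior)
    # Center of the clipping range: current policy over collecting policy.
    center = np.exp(logp_cur - logp_behavior)
    # Keep the ratio within eps of the center rather than within eps of 1.
    clipped = np.clip(ratio, center - eps, center + eps)

    # Pessimistic (element-wise minimum) surrogate, as in PPO.
    per_sample = np.minimum(ratio * advantages, clipped * advantages)
    # Weighted average over samples drawn from the mixture of past policies.
    return np.sum(weights * per_sample) / np.sum(weights)
```

In this sketch, when every sample comes from the current policy (`logp_behavior == logp_cur`), the center is 1 and the expression reduces to the standard PPO clipped objective.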
Abstract
We develop a new class of model-free deep reinforcement learning algorithms for data-driven, learning-based control. Our Generalized Policy Improvement algorithms combine the policy improvement guarantees of on-policy methods with the efficiency of sample reuse, addressing a trade-off between two important deployment requirements for real-world control: (i) practical performance guarantees and (ii) data efficiency. We demonstrate the benefits of this new class of algorithms through extensive experimental analysis on a broad range of simulated control tasks.
