Provably Efficient Exploration in Policy Optimization
Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang
TL;DR
This paper addresses how to build provable exploration into policy optimization. It proposes OPPO, an optimistic variant of PPO that augments its action-value estimates with an uncertainty bonus and updates the policy with a KL-regularized step. In episodic MDPs with linear function approximation, unknown transitions, and adversarial rewards under full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret, where $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps; the bound does not depend on the size of the state-action space. To the authors' knowledge, OPPO is the first provably efficient policy optimization algorithm that explores, connecting theoretical guarantees with practical policy-gradient methods.
Abstract
While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
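The two ingredients named above, an uncertainty bonus on the action-value estimate and a KL-regularized policy update, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ucb_bonus` and `kl_regularized_update`, the elliptical form of the bonus, and the parameters `beta` and `alpha` are assumptions chosen for illustration. The sketch relies only on the standard fact that maximizing expected value minus a scaled KL penalty to the previous policy has a closed-form solution proportional to `pi_old * exp(alpha * Q)`.

```python
import numpy as np

def ucb_bonus(phi, Lambda, beta):
    """Hypothetical uncertainty bonus: beta * sqrt(phi^T Lambda^{-1} phi) per row.

    phi:    (n, d) array of feature vectors, one per state-action pair.
    Lambda: (d, d) positive-definite regularized Gram matrix of past features.
    beta:   scale of the bonus (an assumed tuning parameter).
    """
    solved = np.linalg.solve(Lambda, phi.T)            # shape (d, n)
    quad = np.einsum("nd,dn->n", phi, solved)          # phi_i^T Lambda^{-1} phi_i
    return beta * np.sqrt(quad)

def kl_regularized_update(pi_old, q_values, alpha):
    """Closed-form KL-regularized step at one state.

    Returns pi_new(a) proportional to pi_old(a) * exp(alpha * Q(a)).
    Assumes pi_old is strictly positive so that its logarithm is finite.
    """
    logits = np.log(pi_old) + alpha * q_values
    logits = logits - logits.max()                     # numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

# Toy usage: three actions at one state, two-dimensional features.
phi = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
Lambda = np.eye(2)                                     # no data yet: identity
q_hat = np.array([0.2, 0.1, 0.0])                      # made-up value estimates
q_optimistic = q_hat + ucb_bonus(phi, Lambda, beta=0.5)
pi_old = np.full(3, 1.0 / 3.0)
pi_new = kl_regularized_update(pi_old, q_optimistic, alpha=1.0)
```

In this toy example the second action has the largest bonus because its feature vector is longest under the identity Gram matrix, so the updated policy shifts probability toward it even though its estimated value is not the highest. A smaller `alpha` keeps the new policy closer to the old one.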
