Provably Efficient Exploration in Policy Optimization

Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang

TL;DR

This paper addresses the challenge of introducing provable exploration into policy optimization. It proposes OPPO, an optimistic variant of PPO, which augments the action-value estimates with an uncertainty bonus and uses a KL-regularized policy update to balance exploration and robustness. In the episodic linear MDP setting with unknown dynamics and adversarial rewards, OPPO achieves a regret of $\tilde{O}(\sqrt{d^2 H^3 T})$ (up to logarithmic factors), independent of the size of the state-action space, and remains robust to adversarial reward sequences. This work establishes the first provably efficient policy-optimization algorithm that explicitly incorporates exploration, bridging theoretical guarantees with practical policy-gradient methods and offering potential impact for sample-efficient reinforcement learning in uncertain environments.

Abstract

While policy-based reinforcement learning (RL) achieves tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
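
The two ingredients named above, an uncertainty bonus on the action-value estimate and a KL-regularized policy update, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the standard closed form of a KL-regularized linear objective, $\pi_{\text{new}}(a) \propto \pi_{\text{old}}(a)\exp(\alpha Q(a))$, and the elliptical bonus $\beta\sqrt{\phi^\top \Lambda^{-1}\phi}$ that is common in linear-function-approximation analyses. The function names, the clipping range, and the array shapes are hypothetical choices made for illustration.

```python
import numpy as np


def kl_regularized_update(pi_old, q_values, alpha):
    """One-state policy update: pi_new(a) proportional to pi_old(a) * exp(alpha * Q(a)).

    Assumes pi_old is strictly positive so that its logarithm is finite.
    """
    logits = np.log(pi_old) + alpha * q_values
    logits = logits - logits.max()  # shift for numerical stability; cancels on normalization
    weights = np.exp(logits)
    return weights / weights.sum()


def elliptical_bonus(phi, Lambda, beta):
    """Uncertainty bonus beta * sqrt(phi^T Lambda^{-1} phi) (assumed form)."""
    return beta * np.sqrt(phi @ np.linalg.solve(Lambda, phi))


def optimistic_q(phi, w, Lambda, beta, cap):
    """Linear estimate plus bonus, clipped to [0, cap] (clipping range is an assumption)."""
    raw = phi @ w + elliptical_bonus(phi, Lambda, beta)
    return float(np.clip(raw, 0.0, cap))


if __name__ == "__main__":
    pi = np.ones(3) / 3                      # uniform policy over three actions
    q = np.array([0.0, 1.0, 2.0])            # hypothetical optimistic action values
    print(kl_regularized_update(pi, q, alpha=1.0))

    phi = np.array([3.0, 4.0])               # hypothetical feature vector
    print(elliptical_bonus(phi, np.eye(2), beta=0.5))
```

In this sketch a larger $\alpha$ moves the policy further toward high-value actions in a single update, while the bonus shrinks as $\Lambda$ grows, that is, as more data accumulates along the direction $\phi$.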

Paper Structure

This paper contains 23 sections, 11 theorems, 74 equations, 2 algorithms.

Key Result

Theorem 3.1

Let $\alpha=\sqrt{2\log{|\mathcal{A}|}/(HT)}$ in Line 6 of the OPPO algorithm, $\lambda=1$ in Line 12, and $\beta=C\sqrt{dH^2\cdot\log(dT/\zeta)}$ in Line 15, where $C>1$ is an absolute constant and $\zeta\in (0,1]$. Under the paper's assumptions, the regret of OPPO is at most $C'\sqrt{d^2 H^3 T}$ up to logarithmic factors with probability at least $1-\zeta$, where $C' > 0$ is an absolute constant.
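
As a quick numerical illustration of the parameter choices in Theorem 3.1, the sketch below evaluates $\alpha$, $\lambda$, and $\beta$ for given problem sizes. The function name and argument names are hypothetical. The theorem only states that $C>1$ is an absolute constant, so the default `C=2.0` is a placeholder, not a value from the paper.

```python
import math


def oppo_hyperparameters(num_actions, d, H, T, zeta, C=2.0):
    """Evaluate the parameter choices stated in Theorem 3.1.

    num_actions: |A|, the number of actions
    d: feature dimension
    H: episode horizon
    T: total number of steps
    zeta: failure probability, in (0, 1]
    C: absolute constant > 1 (placeholder value; not specified by the theorem)
    """
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta must lie in (0, 1]")
    if C <= 1.0:
        raise ValueError("C must be greater than 1")

    alpha = math.sqrt(2.0 * math.log(num_actions) / (H * T))
    lam = 1.0
    beta = C * math.sqrt(d * H**2 * math.log(d * T / zeta))
    return alpha, lam, beta


if __name__ == "__main__":
    alpha, lam, beta = oppo_hyperparameters(num_actions=4, d=10, H=20, T=20000, zeta=0.05)
    print(f"alpha={alpha:.6f}, lambda={lam}, beta={beta:.2f}")
```

Note that $\alpha$ decays as $1/\sqrt{HT}$, so a longer interaction budget yields more conservative policy updates, while $\beta$ grows only logarithmically in $T$.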

Theorems & Definitions (12)

  • Theorem 3.1: Total Regret
  • Lemma 3.2: Performance Difference
  • Lemma 3.3: One-Step Descent
  • Definition 4.1: Filtration
  • Lemma 4.2: Regret Decomposition
  • Lemma 4.3: Upper Confidence Bound
  • Lemma D.1
  • Lemma D.2: Concentration of Self-Normalized Process (abbasi2011improved)
  • Lemma D.3: Elliptical Potential Lemma (dani2008stochastic, rusmevichientong2010linearly, chu2011contextual, abbasi2011improved, jin2019provably)
  • Corollary E.1: Regret in the Tabular Setting
  • ...and 2 more