Strongly-polynomial time and validation analysis of policy gradient methods

Caleb Ju; Guanghui Lan

TL;DR

This work introduces the advantage gap function as a principled termination certificate for reinforcement learning and develops distribution-free convergence theory for policy mirror descent (PMD). By designing a scheduled (and, in one variant, geometrically growing) step-size, the authors prove distribution-free linear convergence for PMD in deterministic MDPs and show a strongly-polynomial time PMD for unregularized MDPs with Euclidean Bregman distance. The stochastic extension yields distribution-free sublinear convergence with both online and offline validation certificates, and a last-iterate convergence result that provides practical guarantees for the policy actually being learned. Validation analysis (online/offline certificates) enables computable monitors of optimality and termination criteria, bridging RL with convex optimization duality concepts and offering certificates of near-optimality for learned policies.
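
As a rough illustration of the update summarized above, the following is a minimal sketch of PMD with the Euclidean Bregman distance, where each step reduces to a Euclidean projection onto the probability simplex. The toy MDP, the helper names (`project_simplex`, `evaluate`, `pmd`), the cost-minimization convention, and the geometric schedule `eta0 * growth**k` are all assumptions made for illustration; this is not the step-size rule analyzed in the paper.

```python
# Hypothetical 2-state, 2-action MDP (cost minimization is assumed here).
GAMMA = 0.9
COST = [[1.0, 2.0], [0.5, 0.0]]        # COST[s][a]: immediate cost
P = [[[0.9, 0.1], [0.2, 0.8]],         # P[s][a][t]: transition probability
     [[0.5, 0.5], [0.0, 1.0]]]
N_S, N_A = 2, 2


def q_from_v(v):
    """Q(s, a) = c(s, a) + gamma * sum_t P(t | s, a) * v(t)."""
    return [[COST[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_S))
             for a in range(N_A)] for s in range(N_S)]


def evaluate(policy, iters=1000):
    """Iterative policy evaluation; returns (Q, V) for a stochastic policy."""
    v = [0.0] * N_S
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return q_from_v(v), v


def project_simplex(y):
    """Euclidean projection of y onto the probability simplex (sort-based)."""
    theta, running = 0.0, 0.0
    for i, u in enumerate(sorted(y, reverse=True), start=1):
        running += u
        t = (running - 1.0) / i
        if u - t > 0:
            theta = t                  # keep the threshold of the largest valid i
    return [max(yi - theta, 0.0) for yi in y]


def pmd(num_iters=20, eta0=1.0, growth=2.0):
    """PMD with Euclidean distance: pi(s) <- proj(pi(s) - eta * Q(s, .))."""
    policy = [[1.0 / N_A] * N_A for _ in range(N_S)]
    eta = eta0
    for _ in range(num_iters):
        q, _ = evaluate(policy)
        policy = [project_simplex([policy[s][a] - eta * q[s][a]
                                   for a in range(N_A)]) for s in range(N_S)]
        eta *= growth                  # illustrative geometric schedule (assumption)
    return policy


final_policy = pmd()
_, final_values = evaluate(final_policy)
```

Calling `pmd()` returns a policy as a list of per-state action distributions; `evaluate` can then be used to inspect its value at each state.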

Abstract

This paper proposes a novel termination criterion, termed the advantage gap function, for finite state and action Markov decision processes (MDP) and reinforcement learning (RL). By incorporating this advantage gap function into the design of step size rules and deriving a new linear rate of convergence that is independent of the stationary state distribution of the optimal policy, we demonstrate that policy gradient methods can solve MDPs in strongly-polynomial time. To the best of our knowledge, this is the first time that such strong convergence properties have been established for policy gradient methods. Moreover, in the stochastic setting, where only stochastic estimates of policy gradients are available, we show that the advantage gap function provides close approximations of the optimality gap for each individual state and exhibits a sublinear rate of convergence at every state. The advantage gap function can be easily estimated in the stochastic case, and when coupled with easily computable upper bounds on policy values, they provide a convenient way to validate the solutions generated by policy gradient methods. Therefore, our developments offer a principled and computable measure of optimality for RL, whereas current practice tends to rely on algorithm-to-algorithm or baseline comparisons with no certificate of optimality.
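
To make the idea of a per-state termination certificate concrete, here is a minimal sketch on a toy problem. It assumes an unregularized MDP in the cost-minimization convention and takes the gap at state $s$ to be $V^{\pi}(s) - \min_a Q^{\pi}(s,a)$; the paper's exact definition may differ (for example, by including regularization terms). Under this assumption, standard Bellman-operator arguments give $g^{\pi}(s) \le V^{\pi}(s) - V^{*}(s) \le \max_{s'} g^{\pi}(s') / (1-\gamma)$, which the code checks numerically. The MDP data and all function names are hypothetical.

```python
# Hypothetical 2-state, 2-action MDP (cost minimization is assumed here).
GAMMA = 0.9
COST = [[1.0, 2.0], [0.5, 0.0]]        # COST[s][a]: immediate cost
P = [[[0.9, 0.1], [0.2, 0.8]],         # P[s][a][t]: transition probability
     [[0.5, 0.5], [0.0, 1.0]]]
N_S, N_A = 2, 2


def q_from_v(v):
    """Q(s, a) = c(s, a) + gamma * sum_t P(t | s, a) * v(t)."""
    return [[COST[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_S))
             for a in range(N_A)] for s in range(N_S)]


def policy_values(policy, iters=2000):
    """Iterative policy evaluation for a stochastic policy."""
    v = [0.0] * N_S
    for _ in range(iters):
        q = q_from_v(v)
        v = [sum(policy[s][a] * q[s][a] for a in range(N_A)) for s in range(N_S)]
    return v


def optimal_values(iters=2000):
    """Value iteration; used only as ground truth for the comparison."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [min(row) for row in q_from_v(v)]
    return v


def advantage_gap(policy):
    """Per-state gap V(s) - min_a Q(s, a); needs no knowledge of V*."""
    v = policy_values(policy)
    q = q_from_v(v)
    return [v[s] - min(q[s]) for s in range(N_S)], v


uniform = [[0.5, 0.5], [0.5, 0.5]]
gap, v_pi = advantage_gap(uniform)
v_star = optimal_values()
subopt = [v_pi[s] - v_star[s] for s in range(N_S)]   # true optimality gap
upper = max(gap) / (1.0 - GAMMA)                     # computable upper bound

greedy = [[0.0, 1.0], [0.0, 1.0]]                    # optimal for this toy MDP
gap_greedy, _ = advantage_gap(greedy)
```

The point of the sketch is that `gap` and `upper` are computed from the current policy alone, while `subopt` requires the optimal values, which are normally unknown.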

Paper Structure (24 sections, 26 theorems, 56 equations, 1 figure, 3 tables, 1 algorithm)

Introduction
Notation
Markov decision process, a gap function, and connections to (non-)linear programming
Performance difference and advantage function
Advantage gap function and distribution-free convergence
Convex programming and duality theory of (regularized) RL
Distribution-free convergence for PMD and strongly-polynomial runtime
Basic PMD method
Distribution-free linear convergence for PMD
A strongly-polynomial time PMD
Distribution-free convergence for stochastic PMD
Basic stochastic policy mirror descent
Validation analysis and last-iterate convergence of SPMD
Online accuracy certificates
Last-iterate convergence
...and 9 more sections

Key Result

Lemma 2.1

Let $\pi$ and $\pi'$ be two feasible policies. Then we have [equation missing in source], where for a given $p \in \Delta_{\vert \mathcal{A} \vert}$, the advantage function is defined as [equation missing in source].

Figures (1)

Figure 1: Mean and confidence interval for estimates of the average value function $k^{-1}\sum_{t=0}^{k-1}f_\rho(\pi_t)$ and the optimal value $f_\rho(\pi^*)$, where $f_\rho(\pi) := \mathbb{E}_{s \sim \rho}V^{\pi}(s)$ and $\rho$ is the uniform distribution over states. Experiments are repeated over 10 seeds on the same environment. For the top right plot, the worst-case lower bound is not shown since it is smaller than the minimum of -200.

Theorems & Definitions (45)

Lemma 2.1
Proposition 2.2
Proof 1
Proposition 2.3
Proof 2
Lemma 2.4
Proof 3
Lemma 2.5
Lemma 2.6
Proof 4
...and 35 more
