On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

Navdeep Kumar; Yashaswini Murthy; Itai Shufaro; Kfir Y. Levy; R. Srikant; Shie Mannor

TL;DR

This work establishes the first finite-time global convergence analysis for policy gradient in infinite-horizon average-reward MDPs with ergodic dynamics and finite state and action spaces. By introducing a projection-based technique that renders the average-reward value function unique and Lipschitz, the authors prove smoothness of the average-reward objective and derive a finite-time convergence bound for projected policy gradient, achieving a sublinear rate with explicit dependence on MDP complexity. The main result shows that the optimality gap shrinks as $\rho^*-\rho^{\pi_{k+1}} \leq \max\left(\frac{128 L_2^\Pi C_{PL}^2}{k},\; 2^{-k/2}\left(\rho^*-\rho^{\pi_0}\right)\right)$, which yields $O(\log T)$ regret over $T$ iterations and clarifies how MDP-structure constants influence convergence speed. They also extend the framework to discounted MDPs, obtaining stronger bounds expressed via the problem-structure constant $L_2^\Pi$, and provide simulations illustrating the practical effect of MDP complexity on convergence.
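As a rough illustration of the projected policy gradient iteration described above, the following NumPy sketch runs it on a small randomly generated tabular MDP. It is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a direct (tabular) policy parameterization, a gradient of the form $d^\pi(s)\,Q^\pi(s,a)$, a Poisson-equation solution normalized so that $(d^\pi)^\top v^\pi = 0$, and a fixed step size `eta`. The helper names (`random_ergodic_mdp`, `project_simplex`, `average_reward_and_q`, `projected_policy_gradient`) are hypothetical and chosen only for this example.

```python
import numpy as np

def random_ergodic_mdp(S, A, seed=0):
    """Hypothetical helper: random MDP with strictly positive transitions (hence ergodic)."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S)) + 0.1          # P[s, a, s'] > 0 for every triple
    P /= P.sum(axis=2, keepdims=True)        # normalize each row into a distribution
    r = rng.random((S, A))                   # rewards in [0, 1)
    return P, r

def project_simplex(x):
    """Euclidean projection of a vector onto the probability simplex (sort-based)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, x.size + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]
    tau = css[idx] / (idx + 1)
    return np.maximum(x - tau, 0.0)

def average_reward_and_q(P, r, pi):
    """Return average reward rho, stationary distribution d, and Q-values for policy pi."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)    # state-to-state transition matrix under pi
    r_pi = np.sum(pi * r, axis=1)            # expected one-step reward under pi
    # Stationary distribution: solve d^T (I - P_pi) = 0 with sum(d) = 1.
    M = np.vstack([(np.eye(S) - P_pi).T, np.ones((1, S))])
    b = np.concatenate([np.zeros(S), [1.0]])
    d, *_ = np.linalg.lstsq(M, b, rcond=None)
    rho = d @ r_pi
    # Poisson equation (I - P_pi) v = r_pi - rho, made unique by requiring d^T v = 0.
    v = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    q = r - rho + P @ v                      # Q(s, a) = r(s, a) - rho + sum_s' P(s'|s, a) v(s')
    return rho, d, q

def projected_policy_gradient(P, r, eta=0.1, iters=300):
    """Gradient ascent on the average reward, projecting each state's action row onto the simplex."""
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)            # start from the uniform policy
    history = []
    for _ in range(iters):
        rho, d, q = average_reward_and_q(P, r, pi)
        history.append(rho)
        grad = d[:, None] * q                # assumed gradient under direct parameterization
        pi = np.apply_along_axis(project_simplex, 1, pi + eta * grad)
    return pi, np.array(history)

if __name__ == "__main__":
    P, r = random_ergodic_mdp(S=4, A=3, seed=0)
    pi, history = projected_policy_gradient(P, r)
    print("initial average reward:", history[0])
    print("final average reward:  ", history[-1])
```

The step size and iteration count here are arbitrary choices for a toy problem; the paper's guarantee is stated in terms of the constants $L_2^\Pi$ and $C_{PL}$, which this sketch does not compute.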

Abstract

We present the first finite-time global convergence analysis of policy gradient in the context of infinite-horizon average-reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations. Prior work on performance bounds for discounted-reward MDPs cannot be extended to average-reward MDPs because the bounds grow in proportion to the fifth power of the effective horizon. Thus, our primary contribution is in proving that the policy gradient algorithm converges for average-reward MDPs and in obtaining finite-time performance guarantees. In contrast to the existing discounted-reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted-reward MDPs. We also present simulations to empirically evaluate the performance of the average-reward policy gradient algorithm.
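To see why an $O(1/T)$ rate gives $O(\log T)$ regret, a short worked bound may help. It assumes that regret is measured as the cumulative optimality gap $\sum_{k=1}^{T}\left(\rho^*-\rho^{\pi_k}\right)$ and that the gap at iteration $k$ is at most $C/k$ for some constant $C$. Both the regret definition and the symbol $C$ are assumptions made for illustration, not notation taken from the paper.

$$
\sum_{k=1}^{T}\left(\rho^*-\rho^{\pi_k}\right) \;\leq\; \sum_{k=1}^{T}\frac{C}{k} \;\leq\; C\left(1+\int_{1}^{T}\frac{dx}{x}\right) \;=\; C\left(1+\ln T\right) \;=\; O\left(\log T\right).
$$

The middle inequality is the standard comparison of the harmonic sum with the integral of $1/x$.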

Paper Structure (27 sections, 25 theorems, 84 equations, 5 figures, 2 tables)

Introduction
Related Work
Contributions
Preliminaries
Average Reward MDP Formulation
Relationship to discounted reward MDPs
Main Results
Key Ideas and Proof outline
Smoothness of average reward
Convergence of policy gradient
Extension to Discounted Reward MDPs
Simulations
Appendix
Smoothness of Average Reward
Proof of Lemma 1
...and 12 more sections

Key Result

Theorem 1

Let $\rho^{\pi_k}$ be the average reward corresponding to the policy iterates $\pi_k$, obtained through the policy gradient update avg:pg. Let $\rho^*$ represent the optimal average reward, that is, $\rho^*=\max_{\pi\in\Pi}\rho^\pi$. There exist constants $L_2^\Pi$ and $C_{PL}$ which are determined

Figures (5)

Figure 1: Improvement in average reward as a function of MDP complexity
Figure 2: Convex Projection
Figure 3: Convergence as a function of $C_p$
Figure 4: Variation of $C_p$ with state and action space cardinality.
Figure 5: Improvement in average reward as a function of MDP complexity

Theorems & Definitions (41)

Theorem 1
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Lemma 7
Lemma 8
Lemma 9
...and 31 more
