Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes

Qinbo Bai, Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

This work advances reinforcement learning for infinite-horizon average-reward MDPs by introducing a policy gradient method with general parameterization that does not rely on linear structure. The proposed PG_MAG algorithm uses epoch-based gradient estimation with carefully separated sub-trajectories to control estimator variance and bias, achieving global convergence to near-optimal average reward. The authors derive a sublinear regret bound of $\tilde{O}(T^{3/4})$ (up to a term governed by the policy-class approximation error $\sqrt{\epsilon_{\text{bias}}}$) under ergodic dynamics, supported by a sequence of lemmas and proofs for the convergence and regret guarantees. The results bridge the gap between general-parameter PG methods and regret analysis in the average-reward setting, offering a foundation for scalable, model-free RL in non-tabular, non-linear policy spaces, with practical implications for long-horizon decision-making.

Abstract

In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of the general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}({T}^{3/4})$ regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret-bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios.
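
For orientation, regret in the average-reward setting is commonly defined as the cumulative gap between the optimal average reward and the rewards actually collected. The symbols below ($J^*$ for the optimal average reward, $r(s_t, a_t)$ for the reward at step $t$) are assumed notation and are not quoted from the paper:

$$\mathrm{Reg}_T = \sum_{t=0}^{T-1} \left( J^* - r(s_t, a_t) \right).$$

Under this definition, a bound of $\tilde{\mathcal{O}}(T^{3/4})$ implies that the per-step regret $\mathrm{Reg}_T / T$ decays as $\tilde{\mathcal{O}}(T^{-1/4})$ when the approximation-error term is negligible, which is what "sublinear" means here.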

Paper Structure

This paper contains 20 sections, 14 theorems, 89 equations, 1 table, and 2 algorithms.

Key Result

Lemma 1

The gradient of the long-term average reward can be expressed as follows.
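
For reference, the standard policy gradient theorem for the average-reward setting takes the form below. The notation ($d^{\pi_\theta}$ for the stationary state distribution, $A^{\pi_\theta}$ for the differential advantage function) is an assumption and may differ from the paper's:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)} \left[ A^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right].$$

A minimal sketch that checks this identity numerically on a small random ergodic MDP with a tabular softmax policy. The MDP, its sizes, and all function names are hypothetical, chosen only for illustration. This is an exact, model-based check, not the paper's sample-based estimator.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: pi[s, a] = exp(theta[s, a]) / sum_b exp(theta[s, b])."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def stationary_distribution(P_pi):
    """Solve d^T P_pi = d^T with sum(d) = 1 via least squares."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def average_reward(theta, P, r):
    """Long-term average reward J(theta) under the stationary distribution."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = stationary_distribution(P_pi)
    return float(d @ (pi * r).sum(axis=1))

def policy_gradient(theta, P, r):
    """Gradient from the policy gradient theorem, specialised to softmax."""
    pi = softmax_policy(theta)
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = stationary_distribution(P_pi)
    r_pi = (pi * r).sum(axis=1)
    J = d @ r_pi
    n = len(d)
    # Differential value function: (I - P_pi) V = r_pi - J, normalised so d^T V = 0.
    V = np.linalg.solve(np.eye(n) - P_pi + np.outer(np.ones(n), d), r_pi - J)
    Q = r - J + P @ V            # differential action-value function, shape (S, A)
    adv = Q - V[:, None]         # differential advantage function
    # For softmax, E[A * grad log pi] reduces to d(s) * pi(a|s) * A(s, a).
    return d[:, None] * pi * adv

def finite_difference_gradient(theta, P, r, eps=1e-5):
    """Central finite differences on J(theta), used as an independent check."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        step = np.zeros_like(theta)
        step[idx] = eps
        grad[idx] = (average_reward(theta + step, P, r)
                     - average_reward(theta - step, P, r)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
S, A = 4, 3                                  # hypothetical problem size
P = rng.random((S, A, S)) + 0.1              # strictly positive => ergodic
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
theta = rng.normal(size=(S, A))

g_pg = policy_gradient(theta, P, r)
g_fd = finite_difference_gradient(theta, P, r)
max_gap = float(np.abs(g_pg - g_fd).max())
print("max |theorem - finite difference| =", max_gap)
```

The two gradients should agree up to finite-difference error. In a model-free setting the expectation would instead be estimated from sampled trajectories, which is where the variance and bias control mentioned in the TL;DR becomes relevant.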

Theorems & Definitions (26)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Lemma 2
  • Remark 1
  • Lemma 3
  • Remark 2
  • Remark 3
  • Lemma 4
  • Lemma 5
  • ...and 16 more