Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes
Qinbo Bai, Washim Uddin Mondal, Vaneet Aggarwal
TL;DR
This work studies reinforcement learning for infinite-horizon average-reward MDPs and introduces a policy gradient method with general parameterization that does not rely on a linear MDP structure. The proposed PG_MAG algorithm uses epoch-based gradient estimation, with sub-trajectories separated from one another to control the bias and variance of the estimator, and converges globally to a near-optimal average reward. Under ergodic dynamics, the authors derive a sublinear regret bound of $\tilde{\mathcal{O}}(T^{3/4})$, up to a term depending on $\sqrt{\epsilon_{\text{bias}}}$, and support the convergence and regret guarantees with a set of lemmas and proofs. The results connect general-parameterization policy gradient methods with regret analysis in the average-reward setting, providing a basis for model-free RL with non-tabular, non-linear policy classes in long-horizon decision-making.
Abstract
In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Unlike existing works in this setting, our approach uses a policy gradient algorithm with general parameterization and does not assume a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}({T}^{3/4})$ regret. This paper is the first to derive a regret bound for a general parameterized policy gradient algorithm in the average reward setting.
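
To make the sublinear-regret claim concrete, the following sketch uses the standard definition of regret for average-reward MDPs. This definition is an assumption here, since the abstract does not state it: $J^*$ denotes the optimal average reward and $r(s_t, a_t)$ the reward collected at step $t$. Setting aside any $\epsilon_{\text{bias}}$-dependent term, a $\tilde{\mathcal{O}}(T^{3/4})$ bound means the per-step regret bound vanishes as $T$ grows:

$$
\mathrm{Reg}_T = \sum_{t=0}^{T-1} \bigl(J^* - r(s_t, a_t)\bigr), \qquad \frac{\mathrm{Reg}_T}{T} = \tilde{\mathcal{O}}\bigl(T^{-1/4}\bigr) \xrightarrow{\;T \to \infty\;} 0.
$$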
