Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs
Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal
TL;DR
The paper advances policy-gradient methods for infinite-horizon average-reward MDPs with general policy parametrization by introducing an analysis based on an auxiliary function and two variance-reduced algorithms. The IGT-based method achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret, while the Hessian-based method attains the optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret, improving upon the prior $\tilde{\mathcal{O}}(T^{3/4})$ and matching the lower bound. A key contribution is proving approximate $L$-smoothness of the average-reward function via the auxiliary function $\bar{J}$, and constructing unbiased gradient and Hessian estimators for $\bar{J}$. These results extend efficient, scalable policy-gradient techniques to large state-action spaces without simulators, advancing practical learning in ergodic MDPs and offering a path to parameter-free extensions.
Abstract
We present two Policy Gradient-based algorithms with general parametrization for infinite-horizon average-reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction and ensures an expected regret of order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. Both results significantly improve on the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret, and the second matches the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed rather than proved.
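To put the three rates side by side, the following worked comparison may help. It assumes the regret definition that is standard in the average-reward setting (not quoted from the abstract itself), where $J^*$ denotes the optimal average reward and $r(s_t, a_t)$ the reward collected at step $t$; these symbols are introduced here for illustration only.

$$
\mathrm{Reg}_T = \sum_{t=0}^{T-1} \bigl( J^* - r(s_t, a_t) \bigr)
$$

Dividing each bound by $T$ gives the implied per-step gap to the optimal average reward, ignoring logarithmic factors:

$$
\frac{T^{3/4}}{T} = T^{-1/4}, \qquad \frac{T^{2/3}}{T} = T^{-1/3}, \qquad \frac{\sqrt{T}}{T} = T^{-1/2}.
$$

For example, at $T = 10^6$ these evaluate to roughly $0.032$, $0.01$, and $0.001$ respectively, up to the constants and logarithmic terms hidden by $\tilde{\mathcal{O}}(\cdot)$.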
