Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Swetha Ganesh; Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

The paper advances policy-gradient methods for infinite-horizon average-reward MDPs with general policy parametrization by introducing an auxiliary-function-based analysis and two variance-reduced algorithms. The Implicit Gradient Transport (IGT)-based method achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret, while the Hessian-based method attains the optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret, improving upon the prior $\tilde{\mathcal{O}}(T^{3/4})$ bound and matching the lower bound up to logarithmic factors. A key contribution is a proof that the average-reward function is approximately $L$-smooth, obtained via the auxiliary function $\bar{J}$, together with the construction of unbiased gradient and Hessian estimators for $\bar{J}$. These results extend efficient, scalable policy-gradient techniques to large state-action spaces without simulators, advancing practical learning in ergodic MDPs and offering a path to parameter-free extensions.
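For orientation, the regret quantity behind these rates can be written as below. This is a sketch of the definition commonly used in the average-reward literature, not a quotation from the paper: the symbols $r$, $s_t$, $a_t$, $J^*$ and $\mathrm{Reg}_T$ are assumptions introduced here, with $J^*$ denoting the optimal long-run average reward.

```latex
% Assumed notation: r is the reward function, (s_t, a_t) the state-action pair
% visited at time t, and J^* the optimal long-run average reward.
\[
  \mathrm{Reg}_T \;=\; \sum_{t=0}^{T-1} \bigl( J^* - r(s_t, a_t) \bigr),
  \qquad
  \sqrt{T} \;\ll\; T^{2/3} \;\ll\; T^{3/4} \quad \text{as } T \to \infty .
\]
```

Under this reading, a smaller exponent of $T$ means the per-step gap to the optimal average reward shrinks faster, which is why $\tilde{\mathcal{O}}(\sqrt{T})$ improves on both $\tilde{\mathcal{O}}(T^{2/3})$ and $\tilde{\mathcal{O}}(T^{3/4})$.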

Abstract

We present two Policy Gradient-based algorithms with general parametrization for infinite-horizon average-reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction and ensures an expected regret of order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve on the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret and achieve the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed rather than proved.
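To make the variance-reduction idea concrete, here is a minimal sketch of an Implicit Gradient Transport update on a toy one-dimensional objective. It is not the paper's algorithm: the objective $f(\theta) = -\tfrac{1}{2}\theta^2$, the averaging weight `eta = 2 / (k + 1)`, the constant step size, the plain (un-normalized) ascent step, and the names `noisy_grad` and `igt_ascent` are all assumptions chosen for illustration, and no MDP or policy is involved.

```python
import random


def noisy_grad(theta, rng, noise_std=1.0):
    """Noisy gradient of the toy objective f(theta) = -0.5 * theta**2."""
    return -theta + rng.gauss(0.0, noise_std)


def igt_ascent(theta0, num_iters, step=0.1, seed=0):
    """Gradient ascent driven by an Implicit Gradient Transport estimate.

    Hypothetical schedule and step size; illustrative only.
    """
    rng = random.Random(seed)
    theta_prev = theta0
    theta = theta0
    d = 0.0  # running gradient estimate
    for k in range(1, num_iters + 1):
        eta = 2.0 / (k + 1)  # assumed averaging weight; equals 1 at k = 1
        # Query the gradient at an extrapolated point so that the older
        # gradients stored in the running average are implicitly
        # "transported" to the current iterate.
        shifted = theta + (1.0 - eta) / eta * (theta - theta_prev)
        d = (1.0 - eta) * d + eta * noisy_grad(shifted, rng)
        # Plain ascent step along the averaged direction.
        theta_prev, theta = theta, theta + step * d
    return theta


theta_final = igt_ascent(theta0=5.0, num_iters=500)
print(abs(theta_final))
```

On this quadratic toy problem the gradient is linear, so the extrapolation removes the staleness of past gradients exactly and only the averaged noise remains; for general objectives the correction is approximate and relies on smoothness of the gradient.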

Paper Structure (22 sections, 24 theorems, 174 equations, 1 table, 3 algorithms)

INTRODUCTION
Related Works
Technical Novelty and Contributions
SETUP
PROPOSED ALGORITHMS
MAIN RESULTS
Construction and Properties of the Auxiliary Function
CONCLUSION
Details of the Hessian estimator
Proof Outline
Proof of Lemma \ref{lemma:advatge_estimate}
Proof of Lemma \ref{lemma:advatge_estimate}(a)
Proof of Lemma \ref{lemma:advatge_estimate}(b)
Proof of Lemma \ref{lem:grad+hess-est-prop}
Proof of Lemma \ref{lem:grad+hess-est-prop}(a)
...and 7 more sections

Key Result

Theorem 1

Let $\{\theta_k\}_{k=1}^{K}$ be the outputs of Algorithm \ref{alg:PG_IGT_Avg}. If Assumptions \ref{assump:ergodic_mdp}, \ref{assump:score_func_bounds}, \ref{assump:function_approx_error} and \ref{assump:FND_policy} hold, $\nabla_{\theta} J(\theta)$ is $L_h$-smooth, $\gamma_k=\frac{6G}{\mu(k+2)}$ and $\eta_k = \left(\frac{2}{k+2}

Theorems & Definitions (32)

Definition 1
Definition 2
Remark 1
Remark 2
Theorem 1: Regret bound for Algorithm \ref{alg:PG_IGT_Avg}
Theorem 2: Regret bound for Algorithm \ref{alg:PG_Hessian_Avg}
Theorem 3
Lemma 1
Lemma 2
Lemma 3
...and 22 more
