Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

The paper advances policy-gradient methods for infinite-horizon average-reward MDPs with general policy parametrization by introducing an analysis based on an auxiliary function and two variance-reduced algorithms. The method based on Implicit Gradient Transport (IGT) achieves $\tilde{\mathcal{O}}(T^{2/3})$ regret, while the Hessian-based method attains the optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret, improving upon the prior $\tilde{\mathcal{O}}(T^{3/4})$ bound and matching the lower bound. A key contribution is a proof that the average-reward function is approximately $L$-smooth, obtained via the auxiliary function $\bar{J}$, together with the construction of unbiased gradient and Hessian estimators for $\bar{J}$. These results extend efficient, scalable policy-gradient techniques to large state-action spaces without simulators, advancing practical learning in ergodic MDPs and offering a path to parameter-free extensions.
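To make the gap between the three rates concrete, the short sketch below evaluates only the polynomial parts $T^{3/4}$, $T^{2/3}$ and $T^{1/2}$ for a few horizons. It is an illustration under a simplifying assumption: the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}(\cdot)$ are dropped, so the numbers show relative growth and are not actual regret values. The function name `regret_rates` is hypothetical.

```python
def regret_rates(T):
    """Polynomial parts of the three regret rates.

    Assumption: constants and log factors hidden by the O-tilde
    notation are ignored, so only the growth in T is compared.
    """
    return {
        "prior T^(3/4)": T ** 0.75,
        "IGT T^(2/3)": T ** (2.0 / 3.0),
        "Hessian T^(1/2)": T ** 0.5,
    }


# Print the rounded rates for a few horizons to compare their growth.
for T in (10**3, 10**6, 10**9):
    rates = regret_rates(T)
    print(T, {name: round(value) for name, value in rates.items()})
```

The ordering of the three values is the same at every horizon, and the gap between them widens as $T$ grows.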

Abstract

We present two Policy Gradient-based algorithms with general parametrization for infinite-horizon average-reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction and ensures an expected regret of order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of order $\tilde{\mathcal{O}}(\sqrt{T})$. Both improve on the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret, and the second matches the theoretical lower bound. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works assumed.
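The general idea behind Implicit Gradient Transport can be sketched on a toy problem. The block below is a generic IGT-style momentum update, not the paper's algorithm: the gradient is queried at an extrapolated point and folded into a running average, which reduces the variance of the search direction. Everything specific here is an assumption made for illustration: the scalar quadratic objective, the constant `THETA_STAR`, the averaging weight `eta = 1 / (k + 1)`, the constant step size, and the names `noisy_grad` and `igt_ascent`.

```python
import random

THETA_STAR = 2.0  # maximizer of the toy objective (hypothetical value)


def noisy_grad(theta, rng):
    """Noisy gradient of the toy objective J(theta) = -0.5 * (theta - THETA_STAR)**2."""
    return -(theta - THETA_STAR) + rng.gauss(0.0, 1.0)


def igt_ascent(grad_estimate, theta0, num_iters, step_size, rng):
    """Gradient ascent with an IGT-style averaged direction (illustrative sketch)."""
    theta_prev = float(theta0)
    theta = float(theta0)
    d = 0.0  # running average of transported gradient estimates
    for k in range(num_iters):
        eta = 1.0 / (k + 1)  # averaging weight; an illustrative choice
        # Query the gradient at an extrapolated point so that past
        # estimates are implicitly "transported" to the current iterate.
        lookahead = theta + ((1.0 - eta) / eta) * (theta - theta_prev)
        g = grad_estimate(lookahead, rng)
        d = (1.0 - eta) * d + eta * g
        theta_prev, theta = theta, theta + step_size * d
    return theta


rng = random.Random(0)
theta_final = igt_ascent(noisy_grad, theta0=10.0, num_iters=500, step_size=0.1, rng=rng)
print(abs(theta_final - THETA_STAR))
```

For a quadratic objective the extrapolation makes the averaged direction equal to the exact gradient at the current iterate plus an average of the noise terms, which is why the toy example was chosen; for general objectives this holds only approximately.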

Paper Structure

This paper contains 22 sections, 24 theorems, 174 equations, 1 table, 3 algorithms.

Key Result

Theorem 1

Let $\{\theta_k\}_{k=1}^{K}$ be the outputs of Algorithm \ref{alg:PG_IGT_Avg}. If Assumptions \ref{assump:ergodic_mdp}, \ref{assump:score_func_bounds}, \ref{assump:function_approx_error} and \ref{assump:FND_policy} hold, $\nabla_{\theta} J(\theta)$ is $L_h$-smooth, $\gamma_k=\frac{6G}{\mu(k+2)}$ and $\eta_k = \left(\frac{2}{k+2}\right.$ ...
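The step-size schedule $\gamma_k=\frac{6G}{\mu(k+2)}$ quoted in the theorem can be evaluated directly. The sketch below does only that: $G$ and $\mu$ are the constants named in the theorem, whose definitions come from the paper's assumptions and are not reproduced here, so the numeric values are placeholders. The schedule for $\eta_k$ is not implemented because its statement is cut off above. The function name `gamma_schedule` is hypothetical.

```python
def gamma_schedule(k, G, mu):
    """Step size gamma_k = 6G / (mu * (k + 2)) as quoted in the theorem.

    G and mu are problem-dependent constants; mu is assumed positive here.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    return 6.0 * G / (mu * (k + 2))


# Placeholder constants, chosen only to make the example run.
G_PLACEHOLDER = 1.0
MU_PLACEHOLDER = 2.0

steps = [gamma_schedule(k, G_PLACEHOLDER, MU_PLACEHOLDER) for k in range(1, 6)]
print(steps)
```

The schedule decays like $1/k$, so early iterations take comparatively large steps and later ones progressively smaller steps.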

Theorems & Definitions (32)

  • Definition 1
  • Definition 2
  • Remark 1
  • Remark 2
  • Theorem 1: Regret bound for Algorithm \ref{alg:PG_IGT_Avg}
  • Theorem 2: Regret bound for Algorithm \ref{alg:PG_Hessian_Avg}
  • Theorem 3
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • ...and 22 more