A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

Swetha Ganesh, Washim Uddin Mondal, Vaneet Aggarwal

TL;DR

This work tackles average-reward reinforcement learning with general policy parametrization by introducing MLMC-NAC, a model-free actor-critic algorithm that uses Multi-Level Monte Carlo gradient estimators to compute the natural policy gradient (NPG) and critic updates without relying on known mixing or hitting times. The method updates the policy parameter as $\theta_{k+1}=\theta_k+\alpha\omega_k$, where $\omega_k$ is obtained via a refined NPG subroutine, and runs an outer loop of $K=\Theta(\sqrt{T})$ epochs with an inner loop of $H=\Theta(\sqrt{T}/\log T)$ steps. The authors prove a global convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ for the average reward objective $J(\theta)$, with a bound that scales with $\sqrt{\epsilon_{\mathrm{app}}}$ and $\sqrt{\epsilon_{\mathrm{bias}}}$ and depends on the mixing time only polylogarithmically, while remaining independent of the state-space size. This yields near-optimal performance for large or continuous state spaces and removes the practical barrier of needing to know mixing or hitting times, as prior analyses did. The results rely on a novel decomposition of errors into bias and second-order NPG terms and on a general linear-recursion analysis underpinning the MLMC gradient estimators, enabling sharper global guarantees for average-reward actor-critic methods.
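
To make the Multi-Level Monte Carlo idea concrete, the sketch below shows the standard truncated MLMC construction: draw a geometric level, collect that many consecutive samples, and correct a one-sample estimate by a scaled difference between a fine and a coarse average. It is an illustration under assumptions, not the paper's estimator: the names `mlmc_estimate`, `sample_fn`, and `t_max` are hypothetical, and the exact form used in MLMC-NAC may differ.

```python
import numpy as np


def mlmc_estimate(sample_fn, t_max, rng):
    """Truncated MLMC estimate of the mean of a sampled quantity.

    sample_fn(n) is a hypothetical callback returning n consecutive
    samples stacked along axis 0 (for example, per-step gradient samples).
    t_max caps how many samples a single call may draw.
    """
    # Geometric level: P(level = j) = 2**-j for j = 1, 2, ...
    level = int(rng.geometric(0.5))
    n = 2 ** level if 2 ** level <= t_max else 1

    samples = np.asarray(sample_fn(n), dtype=float)
    base = samples[0]  # level-0 estimate built from a single sample
    if n == 1:
        # Level exceeded the cap: fall back to the one-sample estimate.
        return base

    fine = samples.mean(axis=0)              # average over 2**level samples
    coarse = samples[: n // 2].mean(axis=0)  # average over the first half
    # Scaling by 2**level cancels the probability of having drawn this level.
    return base + (2 ** level) * (fine - coarse)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy i.i.d. source with mean 3.0, standing in for a Markovian sampler.
    draws = [
        mlmc_estimate(lambda n: rng.normal(3.0, 1.0, size=n), 64, rng)
        for _ in range(5000)
    ]
    print(float(np.mean(draws)))
```

Because the level is geometric, the expected number of samples per call grows only logarithmically in `t_max`. This is the usual motivation for the construction: the bias behaves like that of a long average, without fixing a batch length in advance from the mixing time.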

Abstract

This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs), where $T$ is the horizon length, without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, so the result applies even to infinite state spaces.

Paper Structure

This paper contains 18 sections, 12 theorems, 104 equations, 1 table, 1 algorithm.

Key Result

Theorem 1

Consider Algorithm 1 with $K=\Theta(\sqrt{T})$ and $H=\Theta(\sqrt{T}/\log T)$. Let Assumptions assump:ergodic_mdp through assump:FND_policy hold, and let $J$ be $L$-smooth. Then there exists a choice of parameters such that, for sufficiently large $T$, the gap to $J^*$ is bounded by $\tilde{\mathcal{O}}(T^{-1/2})$ plus terms that scale with $\sqrt{\epsilon_{\mathrm{app}}}$ and $\sqrt{\epsilon_{\mathrm{bias}}}$, where $J^*$ is the optimal value of $J(\cdot)$.
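
As a rough worked example of the loop sizes in the theorem, the snippet below evaluates $K=\Theta(\sqrt{T})$ and $H=\Theta(\sqrt{T}/\log T)$ for a concrete horizon. The constants hidden by $\Theta(\cdot)$ are not given here, so the helper `loop_sizes` and its arguments `c_outer` and `c_inner` are hypothetical placeholders that default to 1.

```python
import math


def loop_sizes(T, c_outer=1.0, c_inner=1.0):
    """Return (K, H) for horizon T, using assumed constants c_outer, c_inner."""
    if T <= 1:
        raise ValueError("T must exceed 1 so that log(T) is positive")
    K = max(1, math.ceil(c_outer * math.sqrt(T)))                # outer epochs
    H = max(1, math.ceil(c_inner * math.sqrt(T) / math.log(T)))  # inner steps
    return K, H


K, H = loop_sizes(10**6)
print(K, H)        # 1000 73
print(K * H)       # 73000, far fewer inner steps than T = 1,000,000 samples
```

With these placeholder constants, $K \cdot H$ is about $T/\log T$, which leaves on the order of $\log T$ samples per inner step. That is consistent with the logarithmic expected cost of a truncated MLMC estimate, though the paper's actual sample accounting may differ.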

Theorems & Definitions (15)

  • Definition 1
  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Lemma 2
  • Theorem 3
  • Lemma 3
  • Theorem 4
  • Lemma 4 (Lemma 4 of bai2023regret)
  • Lemma 5
  • ...and 5 more