A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

Swetha Ganesh; Washim Uddin Mondal; Vaneet Aggarwal

TL;DR

This work tackles average-reward reinforcement learning with general policy parametrization by introducing MLMC-NAC, a model-free actor-critic algorithm that uses Multi-Level Monte Carlo gradient estimators to compute the natural policy gradient and critic updates without relying on known mixing or hitting times. The method maintains a policy parameter update $\theta_{k+1}=\theta_k+\alpha\omega_k$, where $\omega_k$ is obtained via a refined NPG subroutine, and estimates are produced through an outer loop of $K=\Theta(\sqrt{T})$ epochs and an inner loop of $H=\Theta(\sqrt{T}/\log T)$ steps. The authors prove a global convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ for the average reward objective $J(\theta)$, with a bound that scales with $\sqrt{\epsilon_{\mathrm{app}}}$ and $\sqrt{\epsilon_{\mathrm{bias}}}$ and depends on the mixing time only polylogarithmically, while remaining independent of the state-space size. This yields near-optimal performance for large or continuous state spaces and removes practical barriers posed by mixing/hitting-time knowledge in prior analyses. The results rely on a novel decomposition of errors into bias and second-order NPG terms and on a general linear-recursion analysis underpinning the MLMC gradient estimators, enabling sharper global guarantees for average-reward AC methods.
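The loop structure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard truncated MLMC construction (draw a level $Q$ from a Geometric(1/2) distribution and combine averages over $2^Q$ and $2^{Q-1}$ samples), treats all quantities as scalars, and sets the constants hidden by $\Theta(\cdot)$ to 1. The names `mlmc_estimate`, `sample_fn`, `t_max`, `schedule`, and `actor_step` are hypothetical.

```python
import math
import random


def mlmc_estimate(sample_fn, t_max, rng):
    """Truncated MLMC estimate built from per-step estimates.

    sample_fn(n) is a hypothetical callback returning a list of n
    consecutive scalar per-step estimates (e.g. from one trajectory).
    """
    # Draw the level Q ~ Geometric(1/2), i.e. P(Q = q) = 2^{-q}, q >= 1.
    q = 1
    while rng.random() < 0.5:
        q += 1

    # Levels that would need more than t_max samples fall back to one sample.
    n = 2 ** q if 2 ** q <= t_max else 1
    samples = sample_fn(n)
    g0 = samples[0]
    if n == 1:
        return g0

    # Averages over the first 2^Q and the first 2^(Q-1) samples.
    g_hi = sum(samples) / n
    g_lo = sum(samples[: n // 2]) / (n // 2)

    # The 2^Q weight cancels P(Q = q), so the expectation telescopes to the
    # average at the largest admissible level.
    return g0 + (2 ** q) * (g_hi - g_lo)


def schedule(total_steps):
    """K = sqrt(T) epochs and H = sqrt(T) / log(T) inner steps (constants = 1)."""
    k = math.isqrt(total_steps)
    h = max(1, int(k / math.log(total_steps)))
    return k, h


def actor_step(theta, omega, alpha):
    """Policy update theta_{k+1} = theta_k + alpha * omega_k (scalar case)."""
    return theta + alpha * omega
```

Because the estimator only asks for a long trajectory with geometrically small probability, its expected cost stays logarithmic in `t_max`, which is what lets the procedure run without a mixing-time estimate.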

Abstract

This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs), where $T$ is the horizon length, without requiring knowledge of mixing and hitting times. Moreover, the convergence rate does not scale with the size of the state space, so the result applies even to infinite state spaces.

Paper Structure

This paper contains 18 sections, 12 theorems, 104 equations, 1 table, and 1 algorithm.

Introduction
Setup
Proposed Algorithm
Main Results
Proof Outline
Policy update analysis
Analysis of a General Linear Recursion
Analysis of NPG-Finding Subroutine
Critic and Average Reward Analysis
Conclusions
Proof of Lemma \ref{lemma:local_global}
Proof of Theorem \ref{thm_2}
Properties of the MLMC Estimates
Proof of Lemma \ref{lemma:washim_2}
Proof of Lemma \ref{lemma_washim_3}
...and 3 more sections

Key Result

Theorem 1

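Consider Algorithm 1 (MLMC-NAC) with $K=\Theta(\sqrt{T})$ and $H=\Theta(\sqrt{T}/\log T)$. Let the assumptions labeled assump:ergodic_mdp through assump:FND_policy hold, and let $J$ be $L$-smooth. Then there exists a choice of parameters such that, for sufficiently large $T$, the gap between $J^*$, the optimal value of $J(\cdot)$, and the average reward attained by the algorithm is $\tilde{\mathcal{O}}(T^{-1/2})$, up to additive terms that scale with $\sqrt{\epsilon_{\mathrm{app}}}$ and $\sqrt{\epsilon_{\mathrm{bias}}}$.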
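As a rough consistency check on these choices of $K$ and $H$, the following back-of-the-envelope calculation shows why the $\log T$ factor in $H$ is natural. It assumes, which the summary here does not state explicitly, that each inner step draws a trajectory of length $2^{Q}$ with $Q\sim\mathrm{Geometric}(1/2)$, falling back to a single sample whenever $2^{Q}$ exceeds a cap $T_{\max}\le T$. The symbols $Q$ and $T_{\max}$ are illustrative.

$$
\mathbb{E}[\text{samples per inner step}]
= \sum_{q=1}^{\lfloor \log_2 T_{\max} \rfloor} 2^{-q}\cdot 2^{q} \;+\; \Pr\!\left(2^{Q} > T_{\max}\right)\cdot 1
\;\le\; \log_2 T_{\max} + 1 ,
$$

$$
\mathbb{E}[\text{total samples}]
= K \cdot H \cdot \mathcal{O}(\log T)
= \Theta(\sqrt{T}) \cdot \Theta\!\left(\frac{\sqrt{T}}{\log T}\right) \cdot \mathcal{O}(\log T)
= \mathcal{O}(T).
$$

Under this assumption the expected sample budget matches the horizon $T$ up to constants.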

Theorems & Definitions (15)

Definition 1
Theorem 1
Lemma 1
Theorem 2
Lemma 2
Theorem 3
Lemma 3
Theorem 4
Lemma 4 (Lemma 4 of bai2023regret)
Lemma 5
...and 5 more
