Regret Analysis of Average-Reward Unichain MDPs via an Actor-Critic Approach

Swetha Ganesh, Vaneet Aggarwal

TL;DR

This work develops NAC-B, a Natural Actor-Critic with Batching, which achieves order-optimal regret $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, a relaxation of ergodicity that allows transient states and periodicity. It combines a batching scheme with a linear critic to enable scalable learning in large state-action spaces without simulator resets, and it analyzes averages over Markovian samples via Cesàro averaging. The theoretical contribution hinges on two new constants, $C_{\text{hit}}$ and $C_{\text{tar}}$, which bound the time to enter the recurrent class and the time for sample averages to approach the stationary distribution, respectively; these bounds control the bias and variance of the estimates and yield convergence of the value and $Q$-functions. Practically, NAC-B extends policy-gradient guarantees to more general, realistic environments and scales to large problems while retaining provable performance guarantees in the average-reward setting.

Abstract

Actor-Critic methods are widely used for their scalability, yet existing theoretical guarantees for infinite-horizon average-reward Markov Decision Processes (MDPs) often rely on restrictive ergodicity assumptions. We propose NAC-B, a Natural Actor-Critic with Batching, that achieves order-optimal regret of $\tilde{O}(\sqrt{T})$ in infinite-horizon average-reward MDPs under the unichain assumption, which permits both transient states and periodicity. This assumption is among the weakest under which the classic policy gradient theorem remains valid for average-reward settings. NAC-B employs function approximation for both the actor and the critic, enabling scalability to problems with large state and action spaces. The use of batching in our algorithm helps mitigate potential periodicity in the MDP and reduces stochasticity in gradient estimates, and our analysis formalizes these benefits through the introduction of the constants $C_{\text{hit}}$ and $C_{\text{tar}}$, which characterize the rate at which empirical averages over Markovian samples converge to the stationary distribution.
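The role of averaging under periodicity can be illustrated on a toy chain. The sketch below is not taken from the paper: the three-state transition matrix, the helper names `step`, `cesaro_average` and `tv_distance`, and the horizon are all assumptions chosen for illustration. State 0 is transient and states 1 and 2 form a period-2 recurrent class, so the chain is unichain but not ergodic. The distribution at a single time step keeps oscillating, while the time average (Cesàro average) of the distributions approaches the stationary distribution $(0, 0.5, 0.5)$.

```python
# Toy unichain: state 0 is transient, states 1 and 2 alternate (period 2).
# This matrix is a hypothetical example, not an environment from the paper.
P = [
    [0.5, 0.5, 0.0],  # state 0: stay with prob 0.5, else enter the recurrent class
    [0.0, 0.0, 1.0],  # state 1 -> state 2
    [0.0, 1.0, 0.0],  # state 2 -> state 1
]
STATIONARY = [0.0, 0.5, 0.5]  # unique stationary distribution of P


def step(dist, P):
    """Push a state distribution one step forward: dist' = dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def cesaro_average(dist0, P, horizon):
    """Return (distribution at time horizon-1, mean of distributions at t=0..horizon-1)."""
    dist = list(dist0)
    total = [0.0] * len(dist)
    last = dist
    for _ in range(horizon):
        total = [a + d for a, d in zip(total, dist)]
        last = dist
        dist = step(dist, P)
    return last, [a / horizon for a in total]


def tv_distance(p, q):
    """Total variation distance between two distributions on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    last, avg = cesaro_average([1.0, 0.0, 0.0], P, horizon=1000)
    print("TV(last step, stationary):     ", tv_distance(last, STATIONARY))
    print("TV(Cesaro average, stationary):", tv_distance(avg, STATIONARY))
```

In this toy case the last-step distance stays bounded away from zero because of the period-2 oscillation, whereas the averaged distance shrinks as the horizon grows. This is the kind of effect that averaging over a batch of consecutive Markovian samples is meant to exploit.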

Paper Structure

This paper contains 23 sections, 20 theorems, 160 equations, 1 figure, 2 tables, 1 algorithm.

Key Result

Theorem 1

Consider Algorithm 1 (NAC-B) and suppose Assumptions assump_mdp–assump:FND_policy hold. Let $J$ be $L$-smooth and set $K=\Theta(\sqrt{T}/\log T)$, $B=\Theta(\sqrt{T})$ and $H=\Theta(\log T)$. Then, for a suitable choice of learning parameters, the expected regret satisfies a bound of order $\tilde{O}(\sqrt{T})$ that depends on $C\coloneqq C_{\mathrm{tar}}+C_{\mathrm{hit}}$.
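A quick arithmetic check on the parameter choices in Theorem 1: the product $K \cdot B \cdot H$ is $\Theta\big((\sqrt{T}/\log T)\cdot\sqrt{T}\cdot\log T\big) = \Theta(T)$. The sketch below computes the three quantities for a given $T$. The function name `nacb_schedule` and the constants `c_K`, `c_B`, `c_H` (all set to 1) are hypothetical, since the theorem only fixes the orders of growth. Reading the product as the total number of samples consumed is also an assumption about how the algorithm nests these quantities.

```python
import math


def nacb_schedule(T, c_K=1.0, c_B=1.0, c_H=1.0):
    """Return integer (K, B, H) with the orders of growth stated in Theorem 1.

    The leading constants c_K, c_B, c_H are illustrative assumptions;
    the theorem specifies only the Theta(.) rates.
    """
    log_T = math.log(T)
    K = max(1, round(c_K * math.sqrt(T) / log_T))  # K = Theta(sqrt(T) / log T)
    B = max(1, round(c_B * math.sqrt(T)))          # B = Theta(sqrt(T))
    H = max(1, round(c_H * log_T))                 # H = Theta(log T)
    return K, B, H


if __name__ == "__main__":
    for T in (10**4, 10**6, 10**8):
        K, B, H = nacb_schedule(T)
        print(f"T={T:>9}  K={K:>4}  B={B:>6}  H={H:>3}  K*B*H/T={K * B * H / T:.3f}")
```

With unit constants the ratio $K B H / T$ stays close to 1 across these horizons, deviating only through integer rounding.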

Figures (1)

  • Figure 1: Natural Actor-Critic with Batching

Theorems & Definitions (36)

  • Definition 1
  • Remark 1
  • Definition 2
  • Definition 3
  • Theorem 1: Main Result
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 2
  • Theorem 3
  • ...and 26 more