Optimal Non-Asymptotic Rates of Value Iteration for Average-Reward Markov Decision Processes

Jongmin Lee, Ernest K. Ryu

TL;DR

This work develops a non-asymptotic convergence-rate theory for value-iteration-type methods on average-reward MDPs, covering the multichain, weakly communicating, and unichain regimes. It proves an $\mathcal{O}(1/k)$ rate on the Bellman error for Anchored VI and provides a span-based complexity lower bound that matches this upper bound up to a constant factor, establishing optimality in the weakly communicating and unichain settings. The paper also analyzes Relaxed VI and Relative Value Iteration, shows that sublinear rates carry over to these variants, and demonstrates exact optimality of standard VI for normalized iterates. Together, these results indicate that convergence in undiscounted average-reward MDPs is fundamentally sublinear, and they supply finite-time guarantees that can inform algorithm design in average-reward reinforcement learning.

Abstract

While there is an extensive body of research on the analysis of Value Iteration (VI) for discounted cumulative-reward MDPs, prior work on analyzing VI for (undiscounted) average-reward MDPs has been limited, and most prior results focus on asymptotic rates in terms of Bellman error. In this work, we conduct refined non-asymptotic analyses of average-reward MDPs, obtaining a collection of convergence results that advance our understanding of the setup. Among our new results, most notable are the $\mathcal{O}(1/k)$-rates of Anchored Value Iteration on the Bellman error under the multichain setup and the span-based complexity lower bound that matches the $\mathcal{O}(1/k)$ upper bound up to a constant factor of $8$ in the weakly communicating and unichain setups.
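To make the $\mathcal{O}(1/k)$ claim concrete, here is a minimal sketch of an anchored value iteration on a toy MDP. Several pieces are assumptions that do not come from the text above: the update form $V^k=\beta_k V^0+(1-\beta_k)TV^{k-1}$, the Halpern-style weights $\beta_k=1/(k+1)$, and the two-state MDP itself (`P`, `R` are made-up numbers). The Bellman error is approximated by the span of $TV^k-V^k$, which is a reasonable stand-in here because the toy MDP is unichain and its optimal gain is therefore a constant vector.

```python
# Toy 2-state, 2-action MDP (hypothetical numbers, chosen only for illustration).
# P[s][a] is the next-state distribution, R[s][a] the one-step reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [
    [1.0, 0.0],
    [2.0, 0.5],
]


def bellman(V):
    """Undiscounted Bellman optimality operator: (TV)(s) = max_a r(s,a) + E[V(s')]."""
    return [
        max(
            R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(R[s]))
        )
        for s in range(len(V))
    ]


def span(x):
    """Span seminorm: max entry minus min entry."""
    return max(x) - min(x)


def anchored_vi(V0, num_iters):
    """Return [span(T V^k - V^k) for k = 0..num_iters].

    Assumed update (not taken from the source text):
        V^k = beta_k * V^0 + (1 - beta_k) * T V^{k-1},  beta_k = 1 / (k + 1).
    """
    V = list(V0)
    errors = []
    for k in range(1, num_iters + 1):
        TV = bellman(V)
        errors.append(span([t - v for t, v in zip(TV, V)]))  # residual at V^{k-1}
        beta = 1.0 / (k + 1)  # assumed anchoring weight
        V = [beta * v0 + (1.0 - beta) * t for v0, t in zip(V0, TV)]
    TV = bellman(V)
    errors.append(span([t - v for t, v in zip(TV, V)]))  # residual at the last iterate
    return errors


errs = anchored_vi([0.0, 0.0], 1000)
for k in (10, 100, 1000):
    # If the decay is roughly 1/k, the product k * error should stay bounded.
    print(k, errs[k], k * errs[k])
```

On this example the printed product `k * errs[k]` can be used to eyeball whether the residual shrinks at roughly a $1/k$ pace; the sketch says nothing about the constants in the paper's bounds.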

Paper Structure

This paper contains 57 sections, 41 theorems, 152 equations, 1 figure, 1 table.

Key Result

Theorem 1

Consider a general (multichain) MDP. Let $(g^{\star},h^{\star})$ be a solution of the modified Bellman equations. For $k>K$, the Bellman and policy errors of Rx-VI with $\lambda_k=1/2$ exhibit the rate where $K=\left(2\left\|r\right\|_{\infty}+4\left\|V^0\right\|_{\infty}+16\left\|V^0-h^{\star}\right\|_{\infty}+2\left\|g^{\star}\right\|_{\infty}\right)/\epsilon$, and $S$ is the
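As a rough illustration of what a relaxed step with $\lambda_k=1/2$ does, the sketch below compares plain VI with an averaged update on a deterministic two-state cycle. The update form $V^{k+1}=(1-\lambda)V^k+\lambda TV^k$ is an assumption (a Krasnoselskii-Mann-style averaging; the source does not spell out the Rx-VI recursion), and the toy chain (`NEXT_STATE`, `REWARD`) is hypothetical. The example does not reproduce the theorem's rate; it only shows that averaging can remove the oscillation that periodicity causes in plain VI.

```python
# Deterministic 2-cycle with a single action (hypothetical toy example):
# state 0 -> state 1 with reward 1, state 1 -> state 0 with reward 0.
NEXT_STATE = [1, 0]
REWARD = [1.0, 0.0]


def bellman_cycle(V):
    """Bellman operator of the single-action cycle: (TV)(s) = r(s) + V(next(s))."""
    return [REWARD[s] + V[NEXT_STATE[s]] for s in range(2)]


def residual_span(V):
    """Span of T V - V, used here as a proxy for the Bellman error."""
    diff = [t - v for t, v in zip(bellman_cycle(V), V)]
    return max(diff) - min(diff)


def run(lam, num_iters):
    """Iterate V <- (1 - lam) * V + lam * T V (assumed form) and return the final residual span."""
    V = [0.0, 0.0]
    for _ in range(num_iters):
        TV = bellman_cycle(V)
        V = [(1.0 - lam) * v + lam * t for v, t in zip(V, TV)]
    return residual_span(V)


plain = run(1.0, 50)    # lam = 1 recovers standard VI
relaxed = run(0.5, 50)  # lam = 1/2, the value used in the theorem statement
print("standard VI residual span:", plain)
print("relaxed VI residual span:", relaxed)
```

With `lam = 1.0` the residual keeps flipping between the two states and its span never shrinks, while `lam = 0.5` damps the flip on this chain.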

Figures (1)

  • Figure 1: Classification of MDPs: Unichain $\subset$ Weakly Communicating $\subset$ Multichain (General)

Theorems & Definitions (74)

  • Theorem 1
  • Corollary 1
  • Proof of Corollary 1
  • Theorem 2
  • Corollary 2
  • Proof of Corollary 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 64 more