Optimal Non-Asymptotic Rates of Value Iteration for Average-Reward Markov Decision Processes

Jongmin Lee; Ernest K. Ryu

TL;DR

This work develops non-asymptotic convergence-rate theory for value-iteration-type methods on average-reward MDPs, covering the multichain, weakly communicating, and unichain regimes. It proves an $\mathcal{O}(1/k)$ rate for Anchored VI on the Bellman error and provides a span-based complexity lower bound that matches the upper bound up to a constant factor, establishing optimality in key settings. The paper also analyzes Relaxed VI and Relative Value Iteration, showing that sublinear rates carry over to these variants, and demonstrates exact optimality of standard VI for normalized iterates. These results highlight the fundamentally sublinear nature of convergence in undiscounted average-reward MDPs and inform the design of algorithms with finite-time guarantees for average-reward reinforcement learning.

Abstract

While there is an extensive body of research on the analysis of Value Iteration (VI) for discounted cumulative-reward MDPs, prior work on analyzing VI for (undiscounted) average-reward MDPs has been limited, and most prior results focus on asymptotic rates in terms of Bellman error. In this work, we conduct refined non-asymptotic analyses of average-reward MDPs, obtaining a collection of convergence results that advance our understanding of the setup. Among our new results, most notable are the $\mathcal{O}(1/k)$-rates of Anchored Value Iteration on the Bellman error under the multichain setup and the span-based complexity lower bound that matches the $\mathcal{O}(1/k)$ upper bound up to a constant factor of $8$ in the weakly communicating and unichain setups.
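To make the contrast between standard and anchored value iteration concrete, the following is a minimal sketch, not the paper's algorithm as stated. It assumes Halpern-style anchoring $V^{k+1} = \beta_{k+1} V^0 + (1-\beta_{k+1}) T V^k$ with weights $\beta_{k+1} = 1/(k+2)$; the exact weights analyzed in the paper are not given in this section and may differ. It also uses the span of $TV - V$ as a proxy for the Bellman error, which is only appropriate when the optimal gain is constant across states. The two-state chain and all function names are hypothetical.

```python
def bellman(V, P, r):
    """Undiscounted Bellman optimality operator:
    (TV)(s) = max_a [ r(s,a) + sum_s' P(s'|s,a) V(s') ]."""
    return [
        max(
            r[s][a] + sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(r[s]))
        )
        for s in range(len(V))
    ]


def span(x):
    """Span seminorm: max entry minus min entry."""
    return max(x) - min(x)


def residual_span(V, P, r):
    """Span of TV - V, used here as a proxy for the Bellman error."""
    TV = bellman(V, P, r)
    return span([t - v for t, v in zip(TV, V)])


def standard_vi(V0, P, r, iters):
    """Plain value iteration: V^{k+1} = T V^k."""
    V = list(V0)
    for _ in range(iters):
        V = bellman(V, P, r)
    return V


def anchored_vi(V0, P, r, iters):
    """Anchored iteration with assumed Halpern weights beta = 1/(k+2):
    V^{k+1} = beta * V^0 + (1 - beta) * T V^k."""
    V = list(V0)
    for k in range(iters):
        beta = 1.0 / (k + 2)  # assumption: classical Halpern schedule
        TV = bellman(V, P, r)
        V = [beta * v0 + (1.0 - beta) * t for v0, t in zip(V0, TV)]
    return V


# Hypothetical periodic chain with one action: 0 -> 1 -> 0, reward 1 in state 0.
# P[s][a] is the next-state distribution, r[s][a] the reward.
P = [[[0.0, 1.0]], [[1.0, 0.0]]]
r = [[1.0], [0.0]]
V0 = [0.0, 0.0]

std_res = residual_span(standard_vi(V0, P, r, 100), P, r)
anc_res = residual_span(anchored_vi(V0, P, r, 100), P, r)
print("standard VI residual span:", std_res)
print("anchored VI residual span:", anc_res)
```

On this toy chain the periodicity keeps the standard VI residual span at 1 for every iteration, while the anchored iterates have residual span at most $1/(k+1)$ after $k$ steps. This is a single illustrative instance and does not by itself establish the rates claimed in the abstract.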
