On Value Iteration Convergence in Connected MDPs

Arsenii Mustafin; Alex Olshevsky; Ioannis Ch. Paschalidis

TL;DR

The paper studies the convergence of Value Iteration (VI) in finite MDPs whose optimal policy induces an irreducible, aperiodic Markov reward process. It shows that three VI variants (classical synchronous without a learning rate, synchronous with a learning rate, and asynchronous with a learning rate) converge geometrically at a rate strictly faster than the standard discount factor $\gamma$, owing to the mixing properties of the induced chain. It derives explicit iteration complexities for both the discounted and the average-reward criteria, with bounds that depend on the mixing parameters $\tau$, $\tau_\alpha$, or $\tau'$ and on the update cadence. These results give a tighter theoretical account of VI performance in connected MDPs and offer guidance for planning and reinforcement learning when faster convergence is desired but the mixing characteristics are unknown.
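To make the three variants concrete, here is a minimal Python sketch of one update step for each. It is an illustration under stated assumptions, not the paper's implementation: the 2-state MDP (`P`, `R`), the learning rate `alpha = 0.5`, and the uniformly random state selection in the asynchronous variant are all hypothetical choices made for this example.

```python
import random

def bellman_backup(V, s, P, R, gamma):
    """Optimal Bellman backup for a single state s."""
    return max(
        R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
        for a in range(len(P[s]))
    )

def sync_vi(V, P, R, gamma, alpha=1.0):
    """One synchronous sweep over all states; alpha=1.0 is classical VI."""
    return [
        (1 - alpha) * V[s] + alpha * bellman_backup(V, s, P, R, gamma)
        for s in range(len(V))
    ]

def async_vi(V, P, R, gamma, alpha, rng):
    """Update a single state per step (assumed here: chosen uniformly at random)."""
    s = rng.randrange(len(V))
    V = list(V)
    V[s] = (1 - alpha) * V[s] + alpha * bellman_backup(V, s, P, R, gamma)
    return V

# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
R = [[1.0, 0.0], [0.0, 2.0]]
gamma = 0.9

def run(step, iters):
    """Apply `step` repeatedly, starting from the zero value function."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = step(V)
    return V

def residual(V):
    """Sup-norm Bellman residual; zero exactly at the optimal value function."""
    return max(abs(bellman_backup(V, s, P, R, gamma) - V[s]) for s in range(len(V)))

rng = random.Random(0)
V_classic = run(lambda V: sync_vi(V, P, R, gamma), 500)
V_lr = run(lambda V: sync_vi(V, P, R, gamma, alpha=0.5), 2000)
V_async = run(lambda V: async_vi(V, P, R, gamma, 0.5, rng), 8000)
```

All three runs should reach the same fixed point, which `residual` can confirm. The sketch only demonstrates that the updates converge; it does not measure the mixing-dependent rates that the paper analyses.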

Abstract

This paper establishes that, for an MDP with a unique optimal policy whose associated transition matrix is ergodic, various versions of the Value Iteration algorithm converge at a geometric rate faster than the discount factor $\gamma$, for both the discounted and the average-reward criteria.

Paper Structure (13 sections, 10 theorems, 38 equations, 1 figure, 1 algorithm)

Introduction
Motivation and contribution
Problem formulation
Main results
Classical algorithm
Synchronous with learning rate
Asynchronous with learning rate
Proofs
Key idea of the proofs
Proof of Theorem \ref{thm:sync_no_lr}
Conclusion
Theorem proofs
Proof of Theorem \ref{thm:sync_w_lr}

Key Result

Lemma 3.3

For every irreducible and aperiodic $n \times n$ stochastic matrix $A$, all entries of the matrix $A^{n^2-2n+2}$ are strictly positive.
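The exponent in the lemma can be checked numerically on a small case. The sketch below uses a hypothetical 3-state stochastic matrix (a 3-cycle with one extra edge, chosen for this illustration and not taken from the paper), for which $n^2-2n+2 = 5$. It checks that the fifth power is entrywise positive and, for this particular matrix, that the fourth power still contains a zero.

```python
def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

def matpow(A, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, A)
    return result

def all_positive(M):
    """True when every entry of M is strictly positive."""
    return all(x > 0 for row in M for x in row)

# Hypothetical example: cycle 0 -> 1 -> 2 -> 0 plus the extra edge 2 -> 1.
# It is irreducible, and aperiodic because it has cycles of lengths 3 and 2.
A = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
n = len(A)
k = n * n - 2 * n + 2  # exponent from the lemma; equals 5 for n = 3

bound_holds = all_positive(matpow(A, k))              # all entries of A^5 are positive
lower_power_fails = not all_positive(matpow(A, k - 1))  # A^4 still has a zero entry
```

For this matrix both flags are true, so the exponent cannot be lowered in this instance. Other irreducible, aperiodic matrices may become entrywise positive at a smaller power.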

Figures (1)

Figure 1: Convergence of the value iteration algorithm on a random MDP with different discount rates. It can be seen that as $\gamma$ approaches one, the convergence rate remains geometric with a rate less than $\gamma$.

Theorems & Definitions (16)

Lemma 3.3
proof
Theorem 3.4
proof
Lemma 3.5
proof
Corollary 3.6
proof
Corollary 3.7
proof
...and 6 more
