On Value Iteration Convergence in Connected MDPs

Arsenii Mustafin, Alex Olshevsky, Ioannis Ch. Paschalidis

TL;DR

The paper studies convergence of Value Iteration (VI) in finite MDPs where the optimal policy induces an irreducible, aperiodic Markov reward process. It shows that three VI variants (classical synchronous without a learning rate, synchronous with a learning rate, and asynchronous with a learning rate) converge geometrically at a rate strictly faster than the standard discount factor $\gamma$, owing to mixing properties. It derives explicit iteration complexities for both discounted and average-reward criteria, with bounds that depend on the mixing parameters $\tau$, $\tau_\alpha$, or $\tau'$ and on the update cadence. These results give a tighter theoretical understanding of VI performance in connected MDPs and offer guidance for planning and reinforcement learning when faster convergence is desired under unknown mixing characteristics.

Abstract

This paper establishes that, in an MDP with a unique optimal policy whose associated transition matrix is ergodic, various versions of the Value Iteration algorithm converge geometrically at a rate faster than the discount factor $\gamma$, for both discounted and average-reward criteria.
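
To make the three variants concrete, here is a minimal sketch of how they might be written. The update forms are assumptions of this sketch rather than the paper's exact definitions: the learning-rate variants use $V \leftarrow (1-\alpha)V + \alpha\,TV$, where $T$ is the Bellman optimality operator, and the asynchronous variant updates one uniformly random state per step. The helper names (`random_mdp`, `bellman`, `sync_vi`, `sync_vi_lr`, `async_vi_lr`) and the small random MDP are hypothetical.

```python
import numpy as np

def random_mdp(n_states=4, n_actions=2, seed=0):
    """Hypothetical dense random MDP; P[a, s, s'] is a transition probability."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_actions, n_states, n_states)) + 1e-3
    P /= P.sum(axis=2, keepdims=True)        # each row sums to one
    R = rng.random((n_actions, n_states))    # reward for taking action a in state s
    return P, R

def bellman(P, R, V, gamma):
    """Bellman optimality operator: (TV)(s) = max_a [R(a, s) + gamma * P(a, s, :) @ V]."""
    return (R + gamma * (P @ V)).max(axis=0)

def sync_vi(P, R, gamma, n_iters):
    """Classical synchronous VI without a learning rate: V <- TV."""
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        V = bellman(P, R, V, gamma)
    return V

def sync_vi_lr(P, R, gamma, alpha, n_iters):
    """Synchronous VI with a learning rate (assumed form): V <- (1 - alpha) V + alpha TV."""
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        V = (1 - alpha) * V + alpha * bellman(P, R, V, gamma)
    return V

def async_vi_lr(P, R, gamma, alpha, n_iters, seed=0):
    """Asynchronous VI with a learning rate (assumed schedule): one random state per step."""
    rng = np.random.default_rng(seed)
    V = np.zeros(P.shape[1])
    for _ in range(n_iters):
        s = rng.integers(P.shape[1])
        target = (R[:, s] + gamma * (P[:, s, :] @ V)).max()
        V[s] = (1 - alpha) * V[s] + alpha * target
    return V

gamma, alpha = 0.9, 0.5
P, R = random_mdp()
V_sync = sync_vi(P, R, gamma, n_iters=500)
V_lr = sync_vi_lr(P, R, gamma, alpha, n_iters=500)
V_async = async_vi_lr(P, R, gamma, alpha, n_iters=20000)
```

On this small example all three iterations settle on the same fixed point of $T$; they differ only in how quickly each update moves toward it.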

Paper Structure

This paper contains 13 sections, 10 theorems, 38 equations, 1 figure, and 1 algorithm.

Key Result

Lemma 3.3

For every irreducible and aperiodic $n \times n$ stochastic matrix $A$, all entries of the matrix $A^{n^2-2n+2}$ are strictly positive.
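
The lemma can be checked numerically on a small chain. The sketch below is illustrative only: the helper `two_cycle_chain` is a hypothetical construction, not taken from the paper. It builds an $n$-state chain whose transition graph contains one cycle of length $n$ and one of length $n-1$, so the chain is irreducible and aperiodic, and then tests positivity of a given matrix power.

```python
import numpy as np

def two_cycle_chain(n):
    """Hypothetical n-state chain: 0 -> 1 -> ... -> n-1, then n-1 -> 0 or n-1 -> 1.

    The cycles of length n and n-1 have coprime lengths, so the chain is
    irreducible and aperiodic.
    """
    A = np.zeros((n, n))
    for i in range(n - 1):
        A[i, i + 1] = 1.0
    A[n - 1, 0] = 0.5
    A[n - 1, 1] = 0.5
    return A

def all_positive_at(A, k):
    """Return True if every entry of A^k is strictly positive."""
    return bool((np.linalg.matrix_power(A, k) > 0).all())

n = 3
A = two_cycle_chain(n)
bound = n**2 - 2 * n + 2                 # exponent from the lemma (5 when n = 3)
print("positive at the bound:", all_positive_at(A, bound))
# For n = 3, one power less still leaves a zero entry:
# there is no walk of length 4 from state 0 back to state 0.
print("positive one step earlier:", all_positive_at(A, bound - 1))
```

For this particular chain with $n = 3$ the exponent in the lemma cannot be lowered, since $A^4$ still contains a zero entry.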

Figures (1)

  • Figure 1: Convergence of the value iteration algorithm on a random MDP with different discount rates. It can be seen that as $\gamma$ approaches one, the convergence rate remains geometric with a rate less than $\gamma$.
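
An experiment in the spirit of Figure 1 can be sketched as follows. Everything specific here is an assumption of the sketch, not the paper's setup: the dense random MDP, the helper names (`random_mdp`, `span`, `empirical_rates`), and the choice to measure progress by the span seminorm (max minus min) of successive iterate differences.

```python
import numpy as np

def random_mdp(n_states=10, n_actions=3, seed=0):
    """Hypothetical dense random MDP: every transition has positive probability."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_actions, n_states, n_states)) + 1e-3
    P /= P.sum(axis=2, keepdims=True)        # each row sums to one
    R = rng.random((n_actions, n_states))    # rewards in [0, 1)
    return P, R

def span(x):
    """Span seminorm: max(x) - min(x); it ignores constant shifts of x."""
    return x.max() - x.min()

def empirical_rates(P, R, gamma, n_iters=8):
    """Per-step ratios span(V_{t+1} - V_t) / span(V_t - V_{t-1}) for synchronous VI."""
    V = np.zeros(P.shape[1])
    diffs = []
    for _ in range(n_iters):
        V_next = (R + gamma * (P @ V)).max(axis=0)   # Bellman optimality update
        diffs.append(span(V_next - V))
        V = V_next
    diffs = np.array(diffs)
    return diffs[1:] / diffs[:-1]

P, R = random_mdp()
for gamma in (0.9, 0.99, 0.999):
    rates = empirical_rates(P, R, gamma)
    print(f"gamma={gamma}: largest per-step span ratio = {rates.max():.3f}")
```

The number of iterations is kept small on purpose: on a dense random MDP the span of the differences shrinks quickly, and after many more steps the ratios would be dominated by floating-point noise.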

Theorems & Definitions (16)

  • Lemma 3.3
  • proof
  • Theorem 3.4
  • proof
  • Lemma 3.5
  • proof
  • Corollary 3.6
  • proof
  • Corollary 3.7
  • proof
  • ...and 6 more