On Value Iteration Convergence in Connected MDPs
Arsenii Mustafin, Alex Olshevsky, Ioannis Ch. Paschalidis
TL;DR
The paper studies the convergence of Value Iteration (VI) in finite MDPs whose optimal policy induces an irreducible, aperiodic Markov reward process. It shows that three VI variants (classical synchronous VI without a learning rate, synchronous VI with a learning rate, and asynchronous VI with a learning rate) converge geometrically at a rate strictly faster than the discount factor $γ$, owing to the mixing properties of the induced chain. It derives explicit iteration complexities for both the discounted and average-reward criteria, with bounds that depend on the mixing parameters $τ$, $τ_α$, or $τ'$ and on the update cadence. These results give a tighter theoretical account of VI performance in connected MDPs and offer guidance for planning and reinforcement learning when faster convergence is desired but the mixing characteristics are unknown.
Abstract
This paper establishes that, in an MDP with a unique optimal policy whose associated transition matrix is ergodic, various versions of the Value Iteration algorithm converge at a geometric rate faster than the discount factor γ, for both discounted and average-reward criteria.
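The convergence claim can be illustrated numerically. The following is a minimal sketch, not the authors' experimental setup: it runs classical synchronous Value Iteration (no learning rate) on a hypothetical random MDP whose transition probabilities are all strictly positive, so every policy induces an irreducible, aperiodic chain. The instance sizes, the random seed, the function names, and the choice to measure the error in the span seminorm (that is, up to an additive constant) are all assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np


def value_iteration(P, r, gamma, num_iters):
    """Classical synchronous value iteration without a learning rate.

    P has shape (A, S, S) with P[a, s, j] = Pr(next state j | state s, action a);
    r has shape (S, A). Returns the iterates V_0, ..., V_num_iters as rows.
    """
    V = np.zeros(P.shape[1])
    iterates = [V]
    for _ in range(num_iters):
        # Q[s, a] = r[s, a] + gamma * sum_j P[a, s, j] * V[j]
        Q = r + gamma * np.einsum("asj,j->sa", P, V)
        V = Q.max(axis=1)  # Bellman optimality update applied to every state
        iterates.append(V)
    return np.array(iterates)


def span(x):
    """Span seminorm: max minus min along the last axis (zero on constant vectors)."""
    return x.max(axis=-1) - x.min(axis=-1)


# Hypothetical toy instance (assumption): 5 states, 3 actions, and strictly
# positive transition probabilities, so the chain under any policy mixes.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99
P = rng.random((n_actions, n_states, n_states)) + 0.05
P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
r = rng.random((n_states, n_actions))

iterates = value_iteration(P, r, gamma, num_iters=5000)
V_star = iterates[-1]            # numerical proxy for the optimal value function
errors = iterates[:31] - V_star  # errors of V_0, ..., V_30

sup_err = np.abs(errors).max(axis=1)  # sup-norm error
span_err = span(errors)               # error up to an additive constant

# Compare both error ratios against the baseline gamma^t.
for t in (0, 5, 10):
    print(f"t={t:2d}  gamma^t={gamma**t:.3f}  "
          f"sup ratio={sup_err[t] / sup_err[0]:.3e}  "
          f"span ratio={span_err[t] / span_err[0]:.3e}")
```

In this sketch the sup-norm error is only guaranteed to shrink by a factor of γ per iteration, because the component of the error along the constant vector is not affected by mixing. The span-seminorm error removes that component, so it is the quantity expected to decay faster than γ^t on a well-mixing instance like this one.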
