Analysis of Value Iteration Through Absolute Probability Sequences

Arsenii Mustafin, Sebastien Colla, Alex Olshevsky, Ioannis Ch. Paschalidis

TL;DR

This work analyzes Value Iteration (VI) for discounted finite MDPs through absolute probability sequences, establishing convergence in a weighted L^2 norm rather than the traditional infinity norm. By decomposing the error into a consensus part and a disagreement part, the authors show that the disagreement Δ_t contracts as Δ_{t+1} = γ_α M_t Δ_t, with γ_α = (1−α) + αγ and a positive contraction factor R_t, under the assumption that the optimal policy induces a strongly connected graph. The results yield the concrete contraction bound ||Δ_T||_{p_T}^2 ≤ (γ_α^2(1−λ))^T ||Δ_0||_{p_0}^2, demonstrating a rate faster than γ that is governed by a generalized Rayleigh quotient and Laplacian spectral properties. This provides a complementary geometric lens on VI behavior and opens avenues for applying absolute-probability techniques to other iterative methods in reinforcement learning.

Abstract

Value Iteration is a widely used algorithm for solving Markov Decision Processes (MDPs). While previous studies have extensively analyzed its convergence properties, they primarily focus on convergence with respect to the infinity norm. In this work, we use absolute probability sequences to develop a new line of analysis and examine the algorithm's convergence in terms of the $L^2$ norm, offering a new perspective on its behavior and performance.
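As a point of reference for the infinity-norm analysis mentioned above, the following is a minimal sketch of Value Iteration on a small random MDP. It is not taken from the paper: the MDP sizes, the random transitions and rewards, and the names `value_iteration`, `sup_norms`, and `l2_norms` are all illustrative assumptions. The sketch tracks the error to the (numerically converged) optimal values in the infinity norm and, for comparison, in the plain unweighted Euclidean norm, which is simpler than the weighted $L^2$ norm studied in the paper.

```python
import numpy as np

def value_iteration(P, r, gamma, num_iters):
    """Run Value Iteration and return every iterate.

    P: array of shape (A, S, S), P[a, s, t] = probability of s -> t under action a.
    r: array of shape (S, A), r[s, a] = reward for action a in state s.
    """
    V = np.zeros(r.shape[0])
    history = [V.copy()]
    for _ in range(num_iters):
        # Q[s, a] = r[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = r + gamma * np.einsum("ast,t->sa", P, V)
        V = Q.max(axis=1)  # Bellman optimality update
        history.append(V.copy())
    return history

# Hypothetical random MDP (assumption, for illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)  # make each row a probability distribution
r = rng.random((S, A))

history = value_iteration(P, r, gamma, 2000)
V_star = history[-1]  # treated as the optimal value vector (numerically converged)

errors = [V - V_star for V in history[:50]]
sup_norms = [np.abs(e).max() for e in errors]      # infinity-norm error
l2_norms = [np.linalg.norm(e) for e in errors]     # unweighted Euclidean error

for t in (0, 10, 20, 30):
    print(t, sup_norms[t], l2_norms[t])
```

The infinity-norm error shrinks by at least a factor of γ per step, which is the classical guarantee. The Euclidean column is only a rough comparison; the paper's analysis uses a time-varying weighted norm built from an absolute probability sequence.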

Paper Structure

This paper contains 7 sections, 6 theorems, 44 equations.

Key Result

Theorem 4.1

Let $e_t$ be a bounded initial error vector that is not at consensus (i.e., $\Delta_t \ne 0$), $\{M_t\}$ a sequence of row-stochastic matrices, $\gamma \in (0,1)$ a discount factor, and $\{p_t\}$ an absolute probability sequence for $\{M_t\}$. If, for any $t$, the matrix $M_t$ has diagonal entries […] where $R_t > 0$ for any $t \geq 0$.
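To make the notion of an absolute probability sequence concrete, here is a small sketch under the standard definition, assumed here rather than quoted from the paper: a sequence of stochastic vectors $\{p_t\}$ with $p_t^\top = p_{t+1}^\top M_t$. Over a finite horizon such a sequence can be built by a backward recursion from any terminal stochastic vector. The uniform terminal vector, the random matrices, and the helper name `absolute_probability_sequence` are illustrative assumptions. The sketch ignores the discount factor and only checks the defining property, namely that the weighted average $p_t^\top x_t$ is preserved along $x_{t+1} = M_t x_t$.

```python
import numpy as np

def absolute_probability_sequence(Ms, p_final):
    """Backward recursion p_t^T = p_{t+1}^T M_t for t = T-1, ..., 0.

    Ms: list of T row-stochastic (n, n) matrices.
    p_final: stochastic vector used as p_T (an arbitrary choice).
    Returns a list [p_0, ..., p_T].
    """
    ps = [np.asarray(p_final, dtype=float)]
    for M in reversed(Ms):
        ps.append(ps[-1] @ M)  # row vector times matrix
    ps.reverse()
    return ps

# Hypothetical random row-stochastic matrices (assumption, for illustration).
rng = np.random.default_rng(1)
n, T = 4, 6
Ms = []
for _ in range(T):
    M = rng.random((n, n))
    Ms.append(M / M.sum(axis=1, keepdims=True))

ps = absolute_probability_sequence(Ms, np.full(n, 1.0 / n))

# Propagate x_{t+1} = M_t x_t and record the weighted average p_t^T x_t.
x = rng.random(n)
weighted_means = [ps[0] @ x]
for t, M in enumerate(Ms):
    x = M @ x
    weighted_means.append(ps[t + 1] @ x)

print(weighted_means)  # all entries agree up to rounding
```

Because the weighted average is invariant, it is a natural candidate for the "consensus" part of a vector, with the remainder playing the role of the disagreement measured in the $p_t$-weighted norm.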

Theorems & Definitions (14)

  • Definition 3.1
  • Definition 3.2
  • Theorem 4.1
  • Corollary 4.2
  • Definition 5.1
  • Lemma 5.2
  • proof
  • Lemma 5.3
  • proof
  • Lemma 5.4
  • ...and 4 more