Geometric Re-Analysis of Classical MDP Solving Algorithms

Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky, Ioannis Ch. Paschalidis

TL;DR

This work studies the convergence of Value Iteration (VI) and Policy Iteration (PI) for finite MDPs through a geometry-based interpretation, introducing a discount-factor transformation that preserves dynamics and yields an effective discount $\gamma_{\rm eff}$. It reveals a rotation component in VI and proves that, when the MRP induced by the optimal policy is irreducible and aperiodic, VI converges at a rate strictly faster than the standard bound $\gamma$, with bounds involving the mixing rate $\tau$. A 2-state MDP analysis shows that PI converges in at most as many iterations as there are actions, and the paper derives improved VI iteration counts in terms of $\tau^{1/N}$, along with simplified geometric proofs. Overall, the paper provides a new analytical framework for VI and PI, offering improved convergence guarantees and guidance for geometry-informed algorithm design in MDPs.
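
The 2-state PI claim can be explored numerically. The following is a minimal sketch, not the paper's construction: it runs textbook Policy Iteration on a randomly generated 2-state MDP and reports how many times the policy changed, so the count can be compared with the number of actions. The helper names (`evaluate_policy`, `q_values`, `policy_iteration`, `random_two_state_mdp`), the random MDP generator, the tie-breaking tolerance, and the convention of counting policy switches are all assumptions made for illustration.

```python
import random

def evaluate_policy(policy, P, R, gamma):
    """Exact policy evaluation for 2 states: solve (I - gamma * P_pi) V = r_pi."""
    (p00, p01), (p10, p11) = (P[s][policy[s]] for s in range(2))
    r0, r1 = (R[s][policy[s]] for s in range(2))
    a, b = 1 - gamma * p00, -gamma * p01
    c, d = -gamma * p10, 1 - gamma * p11
    det = a * d - b * c  # positive whenever gamma < 1
    # Cramer's rule for the 2x2 linear system.
    return [(r0 * d - b * r1) / det, (a * r1 - c * r0) / det]

def q_values(s, V, P, R, gamma):
    """One-step lookahead values of every action available at state s."""
    return [R[s][a] + gamma * (P[s][a][0] * V[0] + P[s][a][1] * V[1])
            for a in range(len(R[s]))]

def policy_iteration(P, R, gamma, tol=1e-12):
    """Textbook PI; returns (policy, values, number of policy switches)."""
    policy = [0, 0]  # arbitrary starting policy (assumption)
    switches = 0
    while True:
        V = evaluate_policy(policy, P, R, gamma)
        improved = []
        for s in range(2):
            q = q_values(s, V, P, R, gamma)
            best = policy[s]  # keep the current action unless strictly beaten
            for a in range(len(q)):
                if q[a] > q[best] + tol:
                    best = a
            improved.append(best)
        if improved == policy:
            return policy, V, switches
        policy, switches = improved, switches + 1

def random_two_state_mdp(n_actions, rng):
    """Hypothetical generator: P[s][a] = [p, 1 - p], rewards uniform in [-1, 1]."""
    P, R = [], []
    for _ in range(2):
        probs = [rng.random() for _ in range(n_actions)]
        P.append([[p, 1.0 - p] for p in probs])
        R.append([rng.uniform(-1.0, 1.0) for _ in range(n_actions)])
    return P, R

P, R = random_two_state_mdp(n_actions=5, rng=random.Random(0))
policy, V, switches = policy_iteration(P, R, gamma=0.9)
print("policy:", policy, "switches:", switches, "total actions:", 2 * 5)
```

The sketch only reports the switch count; the precise statement of the bound and its counting convention are those of the paper's theorem.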

Abstract

We build on a recently introduced geometric interpretation of Markov Decision Processes (MDPs) to analyze classical MDP-solving algorithms: Value Iteration (VI) and Policy Iteration (PI). First, we develop a geometry-based analytical apparatus, including a transformation that modifies the discount factor $γ$, to improve convergence guarantees for these algorithms in several settings. In particular, one of our results identifies a rotation component in the VI method, and as a consequence shows that when a Markov Reward Process (MRP) induced by the optimal policy is irreducible and aperiodic, the asymptotic convergence rate of value iteration is strictly smaller than $γ$.
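
The gap between the standard rate $\gamma$ and the faster asymptotic rate can be seen on a toy example. The sketch below is an illustration under stated assumptions, not the paper's proof or bound: the 2-state MDP (`P`, `R`, `GAMMA`) is hypothetical, and the error is measured both in the sup norm and in the span $\max - \min$, the latter being an assumption about which error measure exposes the faster rate. In this toy MDP the optimal policy picks action 0 in both states, and its transition matrix $[[0.7, 0.3], [0.4, 0.6]]$ is irreducible and aperiodic with second eigenvalue $0.3$.

```python
def bellman_update(V, P, R, gamma):
    """One VI sweep: V'(s) = max_a [ R[s][a] + gamma * sum_j P[s][a][j] * V[j] ]."""
    return [
        max(R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
            for a in range(len(R[s])))
        for s in range(len(V))
    ]

def error_ratios(P, R, gamma, sweeps=12, ref_sweeps=3000):
    """Per-sweep ratios of successive errors, in sup norm and in span."""
    n = len(R)
    # Reference solution: iterate until gamma ** ref_sweeps is negligible.
    V_star = [0.0] * n
    for _ in range(ref_sweeps):
        V_star = bellman_update(V_star, P, R, gamma)
    V, sup_errs, span_errs = [0.0] * n, [], []
    for _ in range(sweeps):
        V = bellman_update(V, P, R, gamma)
        err = [v - w for v, w in zip(V, V_star)]
        sup_errs.append(max(abs(e) for e in err))
        span_errs.append(max(err) - min(err))
    ratios = lambda xs: [b / a for a, b in zip(xs, xs[1:])]
    return ratios(sup_errs), ratios(span_errs)

# Hypothetical toy MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.7, 0.3], [0.9, 0.1]],
     [[0.4, 0.6], [0.1, 0.9]]]
R = [[1.0, 0.0],
     [2.0, 0.0]]
GAMMA = 0.9

sup_ratios, span_ratios = error_ratios(P, R, GAMMA)
print("sup-norm ratio:", sup_ratios[-1])   # approaches GAMMA = 0.9
print("span ratio:    ", span_ratios[-1])  # approaches GAMMA * 0.3 = 0.27
```

The sup-norm error still shrinks by about $\gamma$ per sweep because the error vector tends to a constant vector, while the span of the error, which ignores that constant component, shrinks by about $\gamma \cdot 0.3$ here.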

Paper Structure

This paper contains 19 sections, 10 theorems, 48 equations, 3 figures, 2 algorithms.

Key Result

Theorem 3.1

Transformation $\mathcal{J}_s^{\gamma'}$ preserves $(1)$ the advantage $\textrm{adv}(a,\pi)$ of any action $a$ with respect to any policy $\pi$, and $(2)$ the vector span $\textrm{sp}(V^\pi)$ for any pseudo-policy $V^\pi$.

Figures (3)

  • Figure 1: Illustration of a transformation $\mathcal{J}_s^{\gamma'}$ in the case of a 2-state MDP, where $s=2$ and the discount factor is updated from $\gamma$ to $\gamma'$. Dots $a$ and $b$ on the plot represent actions of the MDP on states $1$ and $2$, respectively; the $x$-axis is equal to $c_1 = \gamma - 1 - c_2$, and the $y$-axis is equal to the reward of an action. Blue and teal lines lie on $c_1 = \bar{c}_1 = 0$, red and yellow lines lie on $c_2 = 0$, and purple and brown lines lie on $\bar{c}_2 = 0$. The distance between blue and magenta lines is $1-\gamma$, and the distance between blue and red lines is $1-\gamma'$. After the transformation, action coefficients related to state 1 remain unchanged ($c^a_1 = \bar{c}^a_1$, $c^b_1 = \bar{c}^b_1$) while those related to state 2 change by $\gamma' - \gamma$. The value on state 2 is equal to the length of the cyan bar divided by $1-\gamma$ before the transformation, and divided by $1-\gamma'$ after the transformation. The value on state 1 is more easily accessed using the value on state 2 as reference. Before the transformation, $V_1^\pi - V_2^\pi$ is equal to the difference of cyan and brown bars divided by $1-\gamma$, while after it $\bar{V}_1^\pi - \bar{V}_2^\pi$ is equal to the difference of cyan and yellow bars divided by $1-\gamma'$. This implies that the actual difference does not change: $V_1^\pi - V_2^\pi = \bar{V}_1^\pi - \bar{V}_2^\pi$.
  • Figure 2: Proof of the 2-state Policy Iteration theorem (label thm:PI_2state). For any set of actions $\mathcal{A}$ and the corresponding set of policies $\mathcal{U}$ formed by them, we identify the policies in $\mathcal{U}$ with the most extreme slopes. Denote the policy with the smallest slope as $\pi_r$ (formed by actions $a$ and $b$) and the policy with the largest slope as $\pi_l$ (formed by actions $c$ and $d$). If we draw two lines parallel to $\pi_l$ and $\pi_r$ through any action (for example, action $b$ as shown in the figure), the area below both lines forms an inefficiency zone: any action $e$ within this zone is inefficient within $\mathcal{U}$ because action $b$ lies above any policy that passes through $e$. Next, we choose a state where the vertical difference between $\pi_r$ and $\pi_l$ increases with the corresponding coefficient (state 2 in the figure). The action that participates in the policy with the lower value at this state (action $c$) falls inside the inefficiency zone of the action that forms the policy with the higher value (action $b$).
  • Figure 3: Illustration of the Value Iteration algorithm dynamics: $V_{t+1}(s) = V_t(s) + \textrm{adv}(a^*,V_{t})$ (figure adapted from mdp_geometry). Graphically, VI can be interpreted as subtracting the length of the brown bar, scaled by $1 - \gamma$, from the value bar. The subtracted length is represented by the yellow bar, while the remaining value is shown as a red bar. Assume that $s$ is the state with the maximum value $V_t(s)$, as depicted in the figure. For $V(s)$ to contract exactly by $\gamma$ (i.e., $V_{t+1}(s) = \gamma V_t(s)$), the optimal action must be chosen as the maximizer and must lie exactly on the self-loop line (or its projection in the multidimensional case). For the state with the minimum value, $s'$, the subtracted values will always be less than $(1 - \gamma) V_t(s')$, unless both conditions are met. Together, these two facts explain the source of the extra convergence in the Value Iteration update: it skews the pseudo-policy $V_t$ toward a horizontal hyperplane at a faster rate than it converges to zero.
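
The advantage form of the VI update in the Figure 3 caption can be checked against the usual Bellman form. The sketch below assumes the standard definition $\textrm{adv}(a, V) = r^a + \gamma \sum_j p^a_j V(j) - V(s)$ for an action $a$ available at state $s$, which is an assumption about the paper's notation, and uses a hypothetical toy MDP in which action 0 at state 0 is a zero-reward self-loop.

```python
def advantage(s, a, V, P, R, gamma):
    """Assumed definition: adv(a, V) = r^a + gamma * sum_j p^a_j V(j) - V(s)."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V)) - V[s]

def vi_step_advantage_form(V, P, R, gamma):
    """V_{t+1}(s) = V_t(s) + adv(a*, V_t), with a* the advantage maximizer at s."""
    return [V[s] + max(advantage(s, a, V, P, R, gamma) for a in range(len(R[s])))
            for s in range(len(V))]

def vi_step_bellman(V, P, R, gamma):
    """Usual form: V_{t+1}(s) = max_a [ r^a + gamma * sum_j p^a_j V_t(j) ]."""
    return [max(R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(R[s])))
            for s in range(len(V))]

# Hypothetical toy MDP; action 0 at state 0 is a self-loop with zero reward.
P = [[[1.0, 0.0], [0.5, 0.5]],
     [[0.2, 0.8], [0.6, 0.4]]]
R = [[0.0, -1.0],
     [-0.5, -0.3]]
GAMMA = 0.9
V = [2.0, 1.0]  # state 0 holds the maximum value

print(vi_step_advantage_form(V, P, R, GAMMA))
print(vi_step_bellman(V, P, R, GAMMA))
```

At state 0 the maximizing action is the self-loop, whose advantage is $-(1-\gamma)V(0)$, so the update contracts that coordinate exactly by $\gamma$. This matches the condition described in the caption for the maximum-value state.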

Theorems & Definitions (18)

  • Theorem 3.1
  • proof
  • Definition 3.2
  • Corollary 3.3
  • Corollary 3.4
  • Theorem 4.1
  • proof
  • Corollary 4.2
  • proof
  • Theorem 5.2
  • ...and 8 more