Geometry-Inspired Unified Framework for Discounted and Average Reward MDPs

Arsenii Mustafin, Xinyi Sheng, Dominik Baumann

TL;DR

The paper addresses the longstanding split between discounted and average-reward MDP analyses by proposing a geometry-inspired, unified framework that extends the existing geometric interpretation of MDPs (mdp_geometry) to the average-reward setting with γ = 1. It introduces new action and policy vectors that keep the geometric interpretation coherent for both reward criteria, and it shows that Value Iteration converges geometrically under a unique unichain optimal policy. The key contributions are the reformulation of VI in the average-reward context, the invertibility and normalization results for unichain MDPs, and a rigorous contraction bound in the span seminorm. This unification gives convergence analyses for both reward criteria a common set of tools.

Abstract

The theoretical analysis of Markov Decision Processes (MDPs) is commonly split into two cases - the average-reward case and the discounted-reward case - which, while sharing similarities, are typically analyzed separately. In this work, we extend a recently introduced geometric interpretation of MDPs for the discounted-reward case to the average-reward case, thereby unifying both. This allows us to extend a major result known for the discounted-reward case to the average-reward case: under a unique and ergodic optimal policy, the Value Iteration algorithm achieves a geometric convergence rate.
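To make the convergence claim concrete, the following is a minimal sketch of undiscounted Value Iteration that tracks the span seminorm of successive differences. It is not the paper's implementation: the function names, the array layout (`P` with shape actions × states × states, `r` with shape actions × states), and the toy two-state MDP are all assumptions chosen for illustration.

```python
import numpy as np


def span(x):
    """Span seminorm: max(x) - min(x)."""
    return float(np.max(x) - np.min(x))


def value_iteration_avg(P, r, tol=1e-10, max_iter=10_000):
    """Undiscounted (gamma = 1) Value Iteration.

    P: transition tensor, shape (A, S, S); P[a, s] is a distribution over next states.
    r: reward matrix, shape (A, S).
    Stops when span(v_{k+1} - v_k) < tol and returns a gain estimate,
    a greedy policy, and the history of spans.
    """
    n_states = P.shape[1]
    v = np.zeros(n_states)
    spans = []
    for _ in range(max_iter):
        q = r + P @ v              # (A, S): one-step lookahead for every action
        v_new = q.max(axis=0)      # Bellman update without discounting
        diff = v_new - v
        spans.append(span(diff))
        v = v_new
        if spans[-1] < tol:
            break
    # The per-step increment brackets the optimal gain; take the midpoint.
    gain = 0.5 * (diff.max() + diff.min())
    policy = q.argmax(axis=0)
    return gain, policy, spans


# Hypothetical two-state, two-action MDP with strictly positive transitions.
P = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[0.9, 0.1], [0.2, 0.8]]])
r = np.array([[1.0, 0.0],
              [0.6, 0.3]])

gain, policy, spans = value_iteration_avg(P, r)
print("estimated gain:", gain)
print("greedy policy:", policy)
print("first spans:", spans[:5])
```

Because the values themselves grow roughly linearly with the gain when γ = 1, the sketch monitors the span of the increments rather than a norm of the values; in this toy example the recorded spans shrink at a geometric rate.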

Paper Structure

This paper contains 13 sections, 7 theorems, 57 equations, 1 figure.

Key Result

Lemma 3.1

In the discounted case, the inner product of action vector $a$ and policy vector $\pi$ is equal to the advantage of $a$ with respect to $\pi$.
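A small numerical check can illustrate the statement of Lemma 3.1. The vector layouts below are assumptions, since this summary does not reproduce the paper's definitions: the action vector is taken to be the reward followed by the discounted transition probabilities, with 1 subtracted from the entry of the action's own state, and the policy vector is taken to be 1 followed by the policy's state values. The MDP numbers and the helper `action_vector` are hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical two-state policy: transition matrix and rewards under the policy.
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 0.0])

# Policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)


def action_vector(state, reward, probs, gamma):
    """Assumed layout: (r, gamma * p_1, ..., gamma * p_n), minus 1 at the own state."""
    a = np.concatenate(([reward], gamma * np.asarray(probs, dtype=float)))
    a[1 + state] -= 1.0
    return a


# Assumed layout of the policy vector: (1, V(1), ..., V(n)).
policy_vector = np.concatenate(([1.0], V))

# An alternative action in state 0 that the policy does not take.
reward, probs = 0.5, [0.1, 0.9]
a_plus = action_vector(0, reward, probs, gamma)

inner = a_plus @ policy_vector
advantage = reward + gamma * np.dot(probs, V) - V[0]   # Q(s, a) - V(s)

# The policy's own action in state 0 should have zero advantage.
own = action_vector(0, r_pi[0], P_pi[0], gamma) @ policy_vector

print("inner product:", inner)
print("advantage:    ", advantage)
print("own action:   ", own)
```

Under these assumed definitions the inner product expands to $r + \gamma \sum_j p_j V(j) - V(s)$, which is the advantage, so the first two printed numbers agree and the third is zero up to rounding.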

Figures (1)

  • Figure 1: Old and new visualizations of two two-state MDPs with the same transition probabilities and rewards but different discount factors. The right panel also illustrates why the geometric interpretation developed for the discounted-reward case does not extend directly to the average-reward case: in the latter, the vertical value lines collapse into a single line, so all states have the same value, and this set of values does not define a unique hyperplane. At the same time, a hyperplane can still be constructed for an ergodic policy, and the geometric picture remains logically coherent in that case. To extend the geometric framework to the average-reward case, we propose measuring the values not on the inner but on the outer edges of the action zones. These new values $v$ can be used in both the average- and discounted-reward cases.

Theorems & Definitions (13)

  • Lemma 3.1
  • proof
  • Lemma 3.2: puterman2014
  • Lemma 3.3
  • proof
  • Lemma 3.4
  • proof
  • Lemma 3.5
  • proof
  • Lemma 4.2
  • ...and 3 more