MDP Geometry, Normalization and Reward Balancing Solvers

Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky, Ioannis Ch. Paschalidis

TL;DR

The paper introduces a geometric interpretation of MDPs, along with a normalization that preserves action advantages, forming a foundation for Reward Balancing (RB) solvers. By mapping policy evaluation and optimality to hyperplane geometry in an action-space, the authors define a transformation L_s^δ that preserves advantages and yields a normal form MDP, making optimal policies readily identifiable. They propose Safe RB-S and its stochastic variant for unknown dynamics, establishing convergence guarantees and favorable sample complexities that improve upon Q-learning in producing epsilon-optimal policies, while remaining parallelizable. The framework unifies value-based and value-free approaches under affine-equivalence, offering both theoretical insight and practical algorithms with strong performance in hierarchical and unknown-dynamics settings.

Abstract

We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms, which we call Reward Balancing, that solve MDPs by iterating through these transformations until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.
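The advantage-preserving adjustment described in the abstract can be checked numerically. The sketch below is an assumption, not the paper's definition of L_s^δ: it uses a standard potential-style reward shift on a hypothetical random MDP (the names `P`, `R`, `evaluate`, and `shift_state` are illustrative), and the paper's transformation may differ in sign or parametrization.

```python
import numpy as np

gamma = 0.9
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2

# Hypothetical random MDP: P[s, a, s'] are transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))


def evaluate(R, policy):
    """Return the values V^pi and the advantages Q^pi - V^pi of a deterministic policy."""
    idx = np.arange(n_states)
    P_pi = P[idx, policy]          # transition matrix under the policy
    r_pi = R[idx, policy]          # reward vector under the policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    return V, Q - V[:, None]


def shift_state(R, s, delta):
    """Assumed shift: move the value of state s by delta, leaving advantages intact."""
    R_new = R - delta * gamma * P[:, :, s]  # every action pays for reaching s
    R_new[s] += delta                       # actions taken in s are compensated
    return R_new


policy = np.array([0, 1, 0])
V_before, A_before = evaluate(R, policy)
V_after, A_after = evaluate(shift_state(R, s=1, delta=0.5), policy)

print("value change:", V_after - V_before)
print("max advantage change:", np.abs(A_after - A_before).max())
```

Under these assumptions only the value of the chosen state moves, and the advantages of all actions under the policy stay the same up to floating-point error. The transition probabilities are never modified, only the rewards.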

Paper Structure

This paper contains 29 sections, 18 theorems, 60 equations, 13 figures, 4 algorithms.

Key Result

Proposition 3.1

For a policy $\pi$ and an action $a \in \pi$, the dot product of the corresponding action vector and policy vector is $0$. In other words, the action vector $a^+$ is orthogonal to the policy vector $V^\pi_+$ if $a \in \pi$.
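The orthogonality in Proposition 3.1 can be illustrated numerically. The excerpt does not define the two vectors, so the sketch below assumes that the action vector is $a^+ = (r_a, c_1, \dots, c_n)$ with $c_i = \gamma p_i$ for $i \ne s$ and $c_s = \gamma p_s - 1$ for an action taken in state $s$, and that the policy vector is $V^\pi_+ = (1, V^\pi(1), \dots, V^\pi(n))$. The 2-state MDP, its numbers, and the helper names are hypothetical.

```python
import numpy as np

gamma = 0.9

# Hypothetical 2-state MDP: name -> (state, reward, transition probabilities).
actions = {
    "a": (0, 1.0, np.array([0.7, 0.3])),
    "c": (0, 0.5, np.array([0.2, 0.8])),
    "b": (1, 2.0, np.array([0.4, 0.6])),
    "d": (1, 0.0, np.array([0.9, 0.1])),
}


def action_vector(state, reward, probs):
    """Assumed form: (r, gamma*p_1, ..., gamma*p_s - 1, ..., gamma*p_n)."""
    coeffs = gamma * probs          # new array, the input is not modified
    coeffs[state] -= 1.0
    return np.concatenate(([reward], coeffs))


def policy_vector(policy):
    """Assumed form: (1, V(1), ..., V(n)), with V solving the Bellman equations."""
    n = len(policy)
    P = np.zeros((n, n))
    r = np.zeros(n)
    for name in policy:
        s, reward, probs = actions[name]
        P[s] = probs
        r[s] = reward
    V = np.linalg.solve(np.eye(n) - gamma * P, r)
    return np.concatenate(([1.0], V))


V_plus = policy_vector(("a", "b"))
dots = {name: float(action_vector(*spec) @ V_plus) for name, spec in actions.items()}

for name, value in dots.items():
    print(f"{name}: {value:+.6f}")
```

With these definitions the dot product equals $r_a + \gamma \sum_i p_i V^\pi(i) - V^\pi(s)$, which is the advantage of the action with respect to $\pi$. It vanishes for the actions $a$ and $b$ that make up the policy, which is the orthogonality stated in the proposition.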

Figures (13)

  • Figure 1: An example of the action space for a 2-state MDP with 3 actions in each state. The vertical axis is the action reward axis, while the two horizontal axes correspond to the first (axis $c_1$) and second (axis $c_2$) coefficients of the action vector (the same example with actions only can be found in the Appendix, Figure \ref{fig:action_space_example}). The figure also illustrates the application of Theorem \ref{thm:selfloop_values}. The shaded area represents the policy hyperplane $\mathcal{H}^\pi$ of the policy $\pi = (a, b)$. The black line connecting actions $a$ and $b$ indicates the intersection of the policy hyperplane and the action constraint hyperplane. The red and cyan bar heights correspond to the values of the policy $\pi$ in States 1 and 2.
  • Figure 2: Two-dimensional plot corresponding to the constrained action space from Figure \ref{fig:action_space_with_policy}. All the information required to analyze the MDP --- action coefficients and policy values --- is presented in the plot. Additionally, the plot includes the advantages of two actions, $c$ and $d$, that do not participate in the policy $\pi$.
  • Figure 3: An example of the action space for a 2-state MDP with 3 actions in each state, described in Appendix \ref{ssec:geom_example}. The vertical axis shows action rewards, and the two horizontal axes correspond to the first (axis $c_1$) and second (axis $c_2$) coefficients of the action vector.
  • Figure 4: Illustration of the Policy Iteration algorithm dynamics in a 2-state MDP.
  • Figure 5: Illustration of the Value Iteration algorithm dynamics.
  • ...and 8 more figures

Theorems & Definitions (21)

  • Proposition 3.1
  • Proposition 3.2
  • Theorem 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Definition 3.6
  • Lemma 4.1
  • Definition 4.2
  • Theorem 4.3
  • Theorem 4.4
  • ...and 11 more