MDP Geometry, Normalization and Reward Balancing Solvers

Arsenii Mustafin; Aleksei Pakharev; Alex Olshevsky; Ioannis Ch. Paschalidis

TL;DR

The paper introduces a geometric interpretation of MDPs, together with a normalization that preserves action advantages, as the foundation for Reward Balancing (RB) solvers. By mapping policy evaluation and optimality to hyperplane geometry in an action space, the authors define a transformation $L_s^\delta$ that preserves advantages and yields a normal form MDP in which optimal policies are readily identifiable. They propose Safe RB-S and a stochastic variant for unknown dynamics, and establish convergence guarantees and sample complexities that improve upon Q-learning for producing $\epsilon$-optimal policies, while remaining parallelizable. The framework unifies value-based and value-free approaches under affine equivalence, offering both theoretical insight and practical algorithms with strong performance in hierarchical and unknown-dynamics settings.
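As a rough illustration of why a normal form makes optimal policies easy to read off, the sketch below normalizes a small MDP so that its optimal values are all zero. The MDP itself, the names `ACTIONS`, `optimal_values`, and `normalize`, and the coefficient convention $c_i = \gamma p_i - \mathbb{1}[i = s]$ are assumptions made for this example. Plain value iteration is used only to obtain the shifts; it is not the paper's RB algorithm.

```python
GAMMA = 0.9
N_STATES = 2

# Hypothetical MDP: each action is (state, reward, transition probabilities).
ACTIONS = [
    (0, 1.0, [0.7, 0.3]),
    (0, 2.0, [0.1, 0.9]),
    (1, 0.5, [0.4, 0.6]),
    (1, 0.0, [0.9, 0.1]),
]


def optimal_values(actions, sweeps=2000):
    """Standard value iteration (a stand-in, not the paper's solver)."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        values = [
            max(r + GAMMA * sum(p * v for p, v in zip(probs, values))
                for (st, r, probs) in actions if st == s)
            for s in range(N_STATES)
        ]
    return values


def normalize(actions, values):
    """Shift every state's value to zero: r -> r + sum_i c_i * V(i).

    Assumed coefficients: c_i = GAMMA * p_i - 1 if i is the action's state,
    and GAMMA * p_i otherwise.
    """
    out = []
    for (st, r, probs) in actions:
        coeffs = [GAMMA * p - (1.0 if i == st else 0.0)
                  for i, p in enumerate(probs)]
        out.append((st, r + sum(c * v for c, v in zip(coeffs, values)), probs))
    return out


v_star = optimal_values(ACTIONS)
normal_form = normalize(ACTIONS, v_star)

# In the normalized MDP the best action per state is simply the max-reward one.
greedy = {
    s: max((a for a in normal_form if a[0] == s), key=lambda a: a[1])
    for s in range(N_STATES)
}
```

After normalization each reward equals the action's advantage with respect to the optimal values, so in this sketch every reward is non-positive and the optimal action in each state is the one whose reward is zero.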

Abstract

We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms, which we call Reward Balancing, that solve MDPs by iterating through these transformations until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.
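The advantage-preserving adjustment can be checked numerically on a toy example. The sketch below is illustrative only: the MDP, the helper names (`coeff`, `shift`, `evaluate`, `advantages`), and the sign convention (shifting the value at state `s` by `delta` via `r -> r - delta * c_s`) are assumptions, not definitions taken from the paper.

```python
GAMMA = 0.9

# Hypothetical MDP: each action is (state, reward, transition probabilities).
ACTIONS = [
    (0, 1.0, [0.7, 0.3]),
    (0, 2.0, [0.1, 0.9]),
    (1, 0.5, [0.4, 0.6]),
    (1, 0.0, [0.9, 0.1]),
]


def coeff(action, i):
    """Assumed coefficient c_i = GAMMA * p_i - [i == action's state]."""
    state, _, probs = action
    return GAMMA * probs[i] - (1.0 if i == state else 0.0)


def shift(actions, s, delta):
    """Assumed transformation: change every reward by -delta * c_s."""
    return [(st, r - delta * coeff((st, r, p), s), p) for (st, r, p) in actions]


def evaluate(actions, policy, sweeps=2000):
    """Iterative policy evaluation; policy[s] indexes into `actions`."""
    values = [0.0] * len(policy)
    for _ in range(sweeps):
        values = [
            actions[k][1] + GAMMA * sum(p * v for p, v in zip(actions[k][2], values))
            for k in policy
        ]
    return values


def advantages(actions, values):
    """Advantage of each action: r + sum_i c_i * V(i)."""
    return [a[1] + sum(coeff(a, i) * v for i, v in enumerate(values))
            for a in actions]


policy = [0, 2]  # action 0 in state 0, action 2 in state 1
before = evaluate(ACTIONS, policy)
shifted = shift(ACTIONS, s=0, delta=-before[0])  # move the value at state 0 to zero
after = evaluate(shifted, policy)
adv_before = advantages(ACTIONS, before)
adv_after = advantages(shifted, after)
```

Under these assumptions, the value at state 0 becomes zero, the value at state 1 is unchanged, and `adv_before` and `adv_after` agree for all four actions, including those outside the policy.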

Paper Structure (29 sections, 18 theorems, 60 equations, 13 figures, 4 algorithms)

INTRODUCTION
Related work
Main Contributions
Basic MDP setting
GEOMETRY OF ACTION SPACE
Optimal policy in action space
MDP transformation
REWARD BALANCING ALGORITHMS
MDPs with unknown dynamics
CONCLUSION
ADDITIONAL REMARKS ON MDP GEOMETRY
Example of the Geometric Interpretation of a Regular MDP
Affine equivalence
Policy Iteration
Value Iteration
...and 14 more sections

Key Result

Proposition 3.1

For a policy $\pi$ and an action $a \in \pi$, the dot product of the corresponding action vector and policy vector is $0$. In other words, the action vector $a^+$ is orthogonal to the policy vector $V^\pi_+$ if $a \in \pi$.
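A small numerical check can make the proposition concrete. The sketch below assumes the action vector has the form $a^+ = (r, c_1, \dots, c_n)$ with $c_i = \gamma p_i - \mathbb{1}[i = s]$ and the policy vector has the form $V^\pi_+ = (1, V^\pi(1), \dots, V^\pi(n))$; these forms, the toy MDP, and the function names are assumptions for illustration, since this summary does not spell them out.

```python
GAMMA = 0.9


def action_vector(state, reward, probs):
    """Assumed form a^+ = (r, c_1, ..., c_n), c_i = GAMMA * p_i - [i == state]."""
    coeffs = [GAMMA * p - (1.0 if i == state else 0.0)
              for i, p in enumerate(probs)]
    return [reward] + coeffs


def evaluate_policy(policy, sweeps=2000):
    """Iterative policy evaluation: V(s) <- r + GAMMA * sum_i p_i * V(i)."""
    values = [0.0] * len(policy)
    for _ in range(sweeps):
        values = [r + GAMMA * sum(p * v for p, v in zip(probs, values))
                  for (r, probs) in policy]
    return values


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# Hypothetical policy: policy[s] = (reward, transition probabilities)
# of the action chosen in state s.
policy = [(1.0, [0.7, 0.3]), (0.5, [0.4, 0.6])]
values = evaluate_policy(policy)
policy_vector = [1.0] + values  # assumed form V^pi_+ = (1, V(1), ..., V(n))

# Dot products for the actions that belong to the policy.
products = [dot(action_vector(s, r, p), policy_vector)
            for s, (r, p) in enumerate(policy)]

# An action outside the policy generally gives a nonzero product.
other = dot(action_vector(0, 2.0, [0.1, 0.9]), policy_vector)
```

With these assumed forms the dot product is $r + \gamma \sum_i p_i V^\pi(i) - V^\pi(s)$, which vanishes for actions in the policy by the Bellman equation, so `products` is zero up to floating-point error while `other` is not.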

Figures (13)

Figure 1: An example of the action space for a 2-state MDP with 3 actions in each state. The vertical axis is the action reward axis, while the two horizontal axes correspond to the first (axis $c_1$) and second (axis $c_2$) coefficients of the action vector (the same example with actions only can be found in the Appendix, Figure \ref{fig:action_space_example}). The figure also illustrates the application of Theorem \ref{thm:selfloop_values}. The shaded area represents the policy hyperplane $\mathcal{H}^\pi$ of the policy $\pi = (a, b)$. The black line connecting actions $a$ and $b$ indicates the intersection of the policy hyperplane and the action constraint hyperplane. The red and cyan bar heights correspond to the values of the policy $\pi$ in States 1 and 2.
Figure 2: Two-dimensional plot corresponding to the constrained action space from Figure \ref{fig:action_space_with_policy}. All the information required to analyze the MDP (action coefficients and policy values) is presented in the plot. Additionally, the plot includes the advantages of two actions, $c$ and $d$, that do not participate in the policy $\pi$.
Figure 3: An example of the action space for a 2-state MDP with 3 actions in each state, described in Appendix \ref{ssec:geom_example}. The vertical axis shows action rewards, and the two horizontal axes correspond to the first (axis $c_1$) and second (axis $c_2$) coefficients of the action vector.
Figure 4: Illustration of the Policy Iteration algorithm dynamics in 2-state MDP.
Figure 5: Illustration of the Value Iteration algorithm dynamics.
...and 8 more figures

Theorems & Definitions (21)

Proposition 3.1
Proposition 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Definition 3.6
Lemma 4.1
Definition 4.2
Theorem 4.3
Theorem 4.4
...and 11 more
