MDP Geometry, Normalization and Reward Balancing Solvers
Arsenii Mustafin, Aleksei Pakharev, Alex Olshevsky, Ioannis Ch. Paschalidis
TL;DR
The paper introduces a geometric interpretation of MDPs, together with a normalization that preserves action advantages, as a foundation for Reward Balancing (RB) solvers. By mapping policy evaluation and optimality to hyperplane geometry in an action space, the authors define a transformation L_s^δ that preserves advantages and yields a normal-form MDP in which optimal policies are readily identifiable. They propose Safe RB-S and a stochastic variant for unknown dynamics, establishing convergence guarantees and sample complexities that improve upon Q-learning for producing ε-optimal policies, while remaining parallelizable. The framework unifies value-based and value-free approaches under affine equivalence, offering both theoretical insight and practical algorithms with strong performance in hierarchical and unknown-dynamics settings.
Abstract
We present a new geometric interpretation of Markov Decision Processes (MDPs) with a natural normalization procedure that allows us to adjust the value function at each state without altering the advantage of any action with respect to any policy. This advantage-preserving transformation of the MDP motivates a class of algorithms, which we call Reward Balancing, that solve MDPs by iterating through these transformations until an approximately optimal policy can be trivially found. We provide a convergence analysis of several algorithms in this class, in particular showing that for MDPs with unknown transition probabilities we can improve upon state-of-the-art sample complexity results.
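To make the advantage-preserving idea concrete, the sketch below shifts the value at a single state by adjusting rewards and leaves the transition probabilities untouched. It is an illustration only: this summary does not state the exact form of L_s^δ, so the reward update used here (a standard potential-based shift with potential δ at one state and 0 elsewhere) is an assumption, as are the toy MDP, the sign convention, and every function name.

```python
# Hypothetical 2-state, 2-action MDP; all numbers are illustrative.
GAMMA = 0.9
# P[s][a] is the next-state distribution, R[s][a] the immediate reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.5, 2.0]]


def evaluate(policy, P, R, gamma=GAMMA, iters=2000):
    """Iterative policy evaluation for a deterministic policy (one action per state)."""
    v = [0.0] * len(P)
    for _ in range(iters):
        v = [
            R[s][policy[s]]
            + gamma * sum(p * v[t] for t, p in enumerate(P[s][policy[s]]))
            for s in range(len(P))
        ]
    return v


def advantages(policy, P, R, gamma=GAMMA):
    """Advantage of every (state, action) pair with respect to `policy`."""
    v = evaluate(policy, P, R, gamma)
    return [
        [
            R[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a])) - v[s]
            for a in range(len(P[s]))
        ]
        for s in range(len(P))
    ]


def shift_state_value(P, R, state, delta, gamma=GAMMA):
    """Assumed form of the transformation: lower every policy's value at
    `state` by `delta` through a potential-based reward adjustment."""
    return [
        [
            R[s][a]
            + gamma * P[s][a][state] * delta      # discounted chance of entering `state`
            - (delta if s == state else 0.0)      # actions taken at `state` itself
            for a in range(len(P[s]))
        ]
        for s in range(len(P))
    ]


# Shift the value at state 0 by 3.0 and compare against the original MDP.
R_shifted = shift_state_value(P, R, state=0, delta=3.0)
policy = (0, 1)
before = advantages(policy, P, R)
after = advantages(policy, P, R_shifted)
```

Under these assumptions, `before` and `after` agree up to floating-point error for any deterministic policy, while the policy's value at state 0 drops by 3.0 and its value at state 1 is unchanged. Repeating such shifts state by state is the kind of iteration the abstract describes.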
