Convergence of Fast Policy Iteration in Markov Games and Robust MDPs
Keith Badger, Jefferson Huang, Marek Petrik
TL;DR
This paper addresses the challenge of efficiently solving Markov games (MGs) and robust MDPs (RMDPs), both of which require computing saddle-point policies. It first shows that the Filar-Tolwinski (FT) algorithm may fail to converge, even in small problems, which contradicts the convergence proof in the original FT paper. It then introduces Residual Conditioned Policy Iteration (RCPI), which augments FT with a residual-based correction mechanism and is guaranteed to converge to a saddle point; in practice, its convergence rate is comparable to or better than that of value iteration. Empirically, RCPI outperforms other convergent methods by several orders of magnitude across a range of MG and RMDP domains, and an open-source implementation is available for researchers and practitioners.
Abstract
Markov games and robust MDPs are closely related models that involve computing a pair of saddle point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT, but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
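To make the saddle-point structure concrete, the sketch below shows plain robust value iteration, the kind of convergent baseline that policy-iteration methods are usually compared against. It is not FT or RCPI, neither of which is specified in this summary. It assumes a discounted RMDP whose uncertainty set is a finite collection of transition models, with the adversary choosing a model independently for each state-action pair; the function name, array layout, and parameters are all illustrative assumptions.

```python
import numpy as np


def robust_value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Baseline robust value iteration (illustrative sketch, not FT or RCPI).

    Assumed layout (hypothetical, chosen for this example):
      P     : array of shape (K, S, A, S), K candidate transition models
      r     : array of shape (S, A), rewards
      gamma : discount factor in [0, 1)
    The adversary picks the worst model per (state, action) pair.
    """
    S = r.shape[0]
    v = np.zeros(S)
    for _ in range(max_iter):
        # q[k, s, a] = r[s, a] + gamma * sum_t P[k, s, a, t] * v[t]
        q = r[None, :, :] + gamma * (P @ v)
        # Inner minimization (adversary), then outer maximization (agent).
        v_new = q.min(axis=0).max(axis=1)
        done = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if done:
            break
    # Greedy robust policy with respect to the final value estimate.
    q_robust = (r[None, :, :] + gamma * (P @ v)).min(axis=0)
    policy = q_robust.argmax(axis=1)
    return v, policy
```

Each sweep applies the max-min Bellman operator once, so the error shrinks by roughly a factor of gamma per iteration. That slow rate when gamma is close to 1 is the usual motivation for policy-iteration-style methods such as those discussed in the abstract.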
