Convergence of Fast Policy Iteration in Markov Games and Robust MDPs

Keith Badger, Jefferson Huang, Marek Petrik

TL;DR

This paper addresses the challenge of efficiently solving Markov games (MGs) and robust MDPs (RMDPs), both of which require computing a pair of saddle-point policies. It first shows that the Filar-Tolwinski (FT) algorithm may fail to converge, even in small problems, exposing a gap in the original convergence proof. It then introduces Residual Conditioned Policy Iteration (RCPI), a convergent algorithm that augments FT with a residual-based correction mechanism to guarantee convergence to the saddle-point value, with convergence rates comparable to or better than value iteration in practice. Empirically, RCPI significantly outperforms existing convergent methods across a range of MG and RMDP domains, and an open-source implementation is available.

Abstract

Markov games and robust MDPs are closely related models that involve computing a pair of saddle point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT, but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
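For orientation, the value-iteration baseline and the Bellman residual used to compare algorithms can be sketched in a few lines of Python. This is a minimal illustration only, not the paper's FT or RCPI algorithm. It assumes a hypothetical two-state robust MDP in which the uncertainty set for each state-action pair is a finite list of candidate transition distributions; the numbers and the names `robust_bellman`, `bellman_residual`, and `value_iteration` are all assumptions made for this example.

```python
# Hypothetical toy robust MDP (all values are illustrative assumptions).
# R[s][a] is the reward; P[s][a] is a finite list of candidate
# next-state distributions, i.e. an (s,a)-rectangular uncertainty set.
GAMMA = 0.9
R = [[1.0, 0.0], [0.0, 2.0]]
P = [
    [[[0.9, 0.1], [0.6, 0.4]], [[0.2, 0.8], [0.5, 0.5]]],
    [[[1.0, 0.0], [0.7, 0.3]], [[0.1, 0.9], [0.4, 0.6]]],
]


def robust_bellman(v):
    """Robust Bellman operator: max over actions, min over candidate models."""
    out = []
    for s in range(len(v)):
        best = float("-inf")
        for a in range(len(R[s])):
            # Adversary picks the distribution with the lowest expected value.
            worst = min(
                sum(p * vn for p, vn in zip(dist, v)) for dist in P[s][a]
            )
            best = max(best, R[s][a] + GAMMA * worst)
        out.append(best)
    return out


def bellman_residual(v):
    """Sup-norm distance between v and its image under the operator."""
    tv = robust_bellman(v)
    return max(abs(x - y) for x, y in zip(tv, v))


def value_iteration(tol=1e-8, max_iter=10_000):
    """Repeatedly apply the operator until the residual drops below tol."""
    v = [0.0] * len(R)
    for _ in range(max_iter):
        if bellman_residual(v) <= tol:
            break
        v = robust_bellman(v)
    return v


if __name__ == "__main__":
    v_star = value_iteration()
    print(v_star, bellman_residual(v_star))
```

Because the operator is a contraction with modulus `GAMMA` in the sup norm, the residual shrinks by at least that factor per sweep; policy-iteration-style methods such as FT and RCPI aim to reach a small residual in far fewer, more expensive, steps.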

Paper Structure

This paper contains 22 sections, 7 theorems, 58 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Proposition 3.1

For each $\bm{v} \in \mathbb{R}^S$:

Figures (8)

  • Figure 1: Plot of $\psi_2(\bm{v})^2$ projected onto the plane that spans the initial value function, optimal value function, and the step direction.
  • Figure 2: Rewards and transition probabilities of the Markov game for states $s_1, s_2, s_3$ from \ref{exm:local-minimum}.
  • Figure 3: The Bellman residual of each algorithm's value function plotted as a function of time for the large Markov games (top) with 200 to 1000 states, and the large inventory problems (bottom) with 40 to 200 states.
  • Figure 4: Rewards and transition probabilities of the Markov game for states $s_1, s_2, s_3$ from \ref{exm:local-minimum2}.
  • Figure 5: The Bellman residual of each algorithm's value function plotted as a function of time for the smaller MGs (left) with 20 to 100 states, and the larger MGs (right) with 200 to 1000 states.
  • ...and 3 more figures

Theorems & Definitions (16)

  • Proposition 3.1
  • Proposition 3.2
  • Example 4.1
  • Theorem 4.2
  • Proposition 5.1
  • Theorem 5.2
  • Lemma 5.3
  • Proof of \ref{prop:value-approximation-error}
  • Proof of \ref{prop:rmdp-bound}
  • Proof of \ref{thm:counter-example}
  • ...and 6 more