From Optimization to Control: Quasi Policy Iteration
Mohammad Amin Sharifi Kolarijani, Peyman Mohajerin Esfahani
TL;DR
This work addresses accelerating control algorithms for Markov decision processes (MDPs) without increasing the per-iteration cost. It introduces quasi-policy iteration (QPI), a second-order-like method that approximates the Hessian-like matrix $H = I - \gamma P$ using two exact linear constraints and a prior on the transition kernel. QPI converges linearly at the same $\mathcal{O}(n^2 m)$ per-iteration cost as value iteration (VI), for $n$ states and $m$ actions, and empirically behaves like quasi-Newton methods, with reduced sensitivity to the discount factor $\gamma$. A model-free extension, quasi-policy learning (QPL), is also given. The authors provide convergence results, a backtracking implementation that guarantees contraction, and a modified variant with secant constraints that achieves local superlinear convergence. They validate the methods through model-based and model-free numerical simulations on Garnet, Healthcare, and Graph MDPs, highlighting how priors and problem structure influence performance. Overall, QPI and QPL offer a principled, scalable acceleration of VI- and QL-like methods for both planning and learning in MDPs, with practical implications for faster convergence in large or structurally challenging problems.
Abstract
Recent control algorithms for Markov decision processes (MDPs) have been designed using an implicit analogy with well-established optimization algorithms. In this paper, we adopt the quasi-Newton method (QNM) from convex optimization to introduce a novel control algorithm termed quasi-policy iteration (QPI). In particular, QPI is based on a novel approximation of the "Hessian" matrix in the policy iteration algorithm, which exploits two linear structural constraints specific to MDPs and allows for the incorporation of prior information on the transition probability kernel. While the proposed algorithm has the same computational complexity as value iteration, it exhibits an empirical convergence behavior similar to that of QNM, with low sensitivity to the discount factor.
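
To make the optimization analogy concrete, the following is a minimal sketch of the two baseline updates that QPI sits between: value iteration, and policy iteration written as a Newton-like step with the "Hessian" $H = I - \gamma P^{\pi}$. It does not implement QPI itself. The cost-minimization convention, the array layout, and all function names are assumptions made for illustration, not taken from the paper.

```python
import numpy as np

def bellman(v, P, c, gamma):
    """Apply the Bellman operator; return (T(v), greedy policy).

    Assumed layout: P has shape (m, n, n) with P[a, i, j] the probability
    of moving from state i to state j under action a; c has shape (n, m)
    and holds stage costs (minimization convention, an assumption here).
    """
    q = c + gamma * np.einsum("aij,j->ia", P, v)  # Q-values, shape (n, m)
    return q.min(axis=1), q.argmin(axis=1)

def value_iteration_step(v, P, c, gamma):
    """One fixed-point update v <- T(v); costs O(n^2 m)."""
    return bellman(v, P, c, gamma)[0]

def policy_iteration_step(v, P, c, gamma):
    """One policy iteration update, written as a Newton-like step.

    With the greedy policy pi at v, T(v) = c_pi + gamma * P_pi v, so
    v - H^{-1} (v - T(v)) equals H^{-1} c_pi for H = I - gamma * P_pi.
    The dense linear solve adds O(n^3) work on top of the greedy step.
    """
    n = v.shape[0]
    Tv, pi = bellman(v, P, c, gamma)
    P_pi = P[pi, np.arange(n), :]        # row i is P[pi[i], i, :]
    H = np.eye(n) - gamma * P_pi         # the "Hessian" of the analogy
    return v - np.linalg.solve(H, v - Tv)

def iterate(step, P, c, gamma, tol=1e-10, max_iter=10_000):
    """Run `step` until the Bellman residual drops below `tol`."""
    v = np.zeros(c.shape[0])
    for k in range(max_iter):
        if np.max(np.abs(v - bellman(v, P, c, gamma)[0])) < tol:
            return v, k
        v = step(v, P, c, gamma)
    return v, max_iter
```

Calling `iterate(value_iteration_step, P, c, gamma)` and `iterate(policy_iteration_step, P, c, gamma)` on the same MDP allows the iteration counts to be compared. As described in the abstract, QPI replaces the exact $H$ with a cheaper approximation so that the per-iteration cost stays at the level of value iteration.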
