From Optimization to Control: Quasi Policy Iteration

Mohammad Amin Sharifi Kolarijani; Peyman Mohajerin Esfahani

TL;DR

This work accelerates control algorithms for Markov decision processes (MDPs) without increasing the per-iteration cost. It introduces quasi-policy iteration (QPI), a second-order-like method that builds an approximation of the Hessian-like matrix $H = I - \gamma P$ from two exact linear constraints and a prior on the transition kernel. QPI converges linearly with the same $\mathcal{O}(n^2 m)$ per-iteration cost as value iteration (VI), and its empirical convergence behavior resembles that of quasi-Newton methods, with reduced sensitivity to the discount factor $\gamma$. A model-free extension, quasi-policy learning (QPL), is also given. The authors provide convergence results, a backtracking implementation that guarantees contraction, and a modified variant with secant constraints that achieves local superlinear convergence. They validate the methods in model-based and model-free numerical simulations on Garnet, Healthcare, and Graph MDPs, highlighting how priors and problem structure influence performance. Overall, QPI and QPL offer a principled acceleration of VI-like and Q-learning-like methods for both planning and learning in MDPs, which matters most in large or structurally challenging problems.

Abstract

Recent control algorithms for Markov decision processes (MDPs) have been designed using an implicit analogy with well-established optimization algorithms. In this paper, we adopt the quasi-Newton method (QNM) from convex optimization to introduce a novel control algorithm called quasi-policy iteration (QPI). In particular, QPI is based on a novel approximation of the "Hessian" matrix in the policy iteration algorithm, which exploits two linear structural constraints specific to MDPs and allows for the incorporation of prior information on the transition probability kernel. While the proposed algorithm has the same computational complexity as value iteration, it exhibits an empirical convergence behavior similar to that of QNM, with low sensitivity to the discount factor.
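The optimization analogy can be made concrete with the two classical baselines. The sketch below is not QPI itself: it implements standard value iteration and standard policy iteration on a small random MDP, writing the policy iteration step in Newton form, $v_{k+1} = v_k - H_k^{-1}(v_k - T(v_k))$ with $H_k = I - \gamma P_{\pi_k}$, so that the role of the "Hessian" is visible. The cost-minimization convention, the array layout `P[s, a, s']`, and the helper names (`random_mdp`, `bellman`) are assumptions made for illustration.

```python
import numpy as np

def random_mdp(n, m, seed=0):
    """Hypothetical test MDP: n states, m actions, random kernel and costs."""
    rng = np.random.default_rng(seed)
    P = rng.random((n, m, n))
    P /= P.sum(axis=2, keepdims=True)  # each P[s, a, :] is a distribution
    c = rng.random((n, m))             # stage costs (assumed convention: minimize)
    return P, c

def bellman(v, P, c, gamma):
    """Bellman operator T(v) and the greedy policy; costs O(n^2 m) operations."""
    q = c + gamma * (P @ v)            # (n, m, n) @ (n,) -> (n, m)
    return q.min(axis=1), q.argmin(axis=1)

def value_iteration(P, c, gamma, iters):
    """First-order-like baseline: v <- T(v)."""
    v = np.zeros(P.shape[0])
    for _ in range(iters):
        v, _ = bellman(v, P, c, gamma)
    return v

def policy_iteration(P, c, gamma, max_iters=100):
    """Newton-like baseline: v <- v - H^{-1} (v - T(v)), H = I - gamma * P_pi."""
    n = P.shape[0]
    v = np.zeros(n)
    for _ in range(max_iters):
        Tv, pi = bellman(v, P, c, gamma)
        P_pi = P[np.arange(n), pi]     # (n, n) kernel of the greedy policy
        H = np.eye(n) - gamma * P_pi   # the "Hessian" of the analogy
        v_new = v - np.linalg.solve(H, v - Tv)
        if np.allclose(v_new, v, rtol=0.0, atol=1e-12):
            return v_new
        v = v_new
    return v

P, c = random_mdp(n=6, m=3)
gamma = 0.9
v_pi = policy_iteration(P, c, gamma)
v_vi = value_iteration(P, c, gamma, iters=500)
```

Solving with the exact $H_k$ costs a linear solve per iteration, which is what a quasi-Newton-style approximation of $H_k$ aims to avoid while keeping the per-iteration cost of the `bellman` call.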

Paper Structure (23 sections, 5 theorems, 86 equations, 5 figures, 1 table)

Introduction
Optimal control of MDPs
Quasi-Policy Iteration (QPI)
Optimization vs. Control
QPI Algorithm
The prior
Implementation via backtracking
Modified implementation with superlinear convergence
Other constraints
Extension to model-free control: QPL algorithm
Technical Proofs
Proof of Theorem 3.1
Proof of Corollary 3.2
Proof of Lemma 3.3
Proof of Theorem 3.4
...and 8 more sections

Key Result

Theorem 3.1

Consider the update rule (eq:QPI update general) using the approximation (eq:QPI approx), where $P_{\mathrm{prior}} \boldsymbol{1} = \boldsymbol{1}$, and let $G_{\mathrm{prior}} = (I-\gamma P_{\mathrm{prior}})^{-1}$. We have … Moreover, the iterates $v_k$ of the QPI update rule (eq:QPI update gen) with the safeguarding (eq:QPI safegaurd) converge to $v^\star$ linearly with rate $\gamma$ and a per-iteration t…
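To illustrate the quantity $G_{\mathrm{prior}} = (I-\gamma P_{\mathrm{prior}})^{-1}$, here is a minimal sketch for one specific choice of prior. It assumes that a uniform prior means $P_{\mathrm{prior}} = \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top$, which satisfies $P_{\mathrm{prior}} \boldsymbol{1} = \boldsymbol{1}$; the source does not spell this out here. Under that assumption the Sherman-Morrison formula gives the closed form $G_{\mathrm{prior}} = I + \frac{\gamma}{n(1-\gamma)}\boldsymbol{1}\boldsymbol{1}^\top$, so no matrix inversion is needed. The function name `g_prior_uniform` is hypothetical.

```python
import numpy as np

def g_prior_uniform(n, gamma):
    """Closed form of (I - gamma * P_prior)^{-1} for the ASSUMED uniform prior
    P_prior = (1/n) * ones * ones^T, obtained via Sherman-Morrison."""
    ones = np.ones((n, 1))
    return np.eye(n) + gamma / (n * (1.0 - gamma)) * (ones @ ones.T)

n, gamma = 5, 0.95
P_prior = np.full((n, n), 1.0 / n)                      # rows sum to one
G_direct = np.linalg.inv(np.eye(n) - gamma * P_prior)   # O(n^3) reference
G_closed = g_prior_uniform(n, gamma)                    # O(n^2) closed form
```

Because $G_{\mathrm{prior}}$ is then the identity plus a rank-one term, applying it to a vector takes $\mathcal{O}(n)$ operations, which is consistent with keeping the overall per-iteration cost at the level of value iteration.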

Figures (5)

Figure 1: Performance of model-based algorithms for three values of $\gamma$: (a) Garnet MDP; (b) Healthcare MDP. The bars indicate the iterations at which the safeguard is activated (for NVI, AVI, and QPI).
Figure 2: The running time of the model-based algorithms for three values of $\gamma$ corresponding to Figure 1.
Figure 3: Performance of model-free algorithms (averaged over 20 runs) for three values of $\gamma$: (a) Garnet MDP; (b) Graph MDP.
Figure 4: Performance of model-based algorithms for three values of $\gamma$ and three different priors: (a) Garnet MDP; (b) Healthcare MDP. The bars indicate the iterations at which the safeguard is activated in QPI.
Figure 5: Performance of model-free algorithms (averaged over 20 runs) for three values of $\gamma$ and three different priors: (a) Garnet MDP; (b) Graph MDP.

Theorems & Definitions (6)

Theorem 3.1: QPI
Corollary 3.2: Uniform prior
Lemma 3.3: Backtracking
Theorem 3.4: Superlinear convergence
Theorem 3.5: QPL
Remark 3.6: Asynchronous QPL
