Monotone and Conservative Policy Iteration Beyond the Tabular Case
S. R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal
TL;DR
This paper addresses the instability and lack of guarantees that arise when policy-iteration-based methods are combined with function approximation (FA). It introduces Reliable Policy Iteration (RPI), which enforces Bellman-inequality constraints during evaluation to yield monotone, lower-bounded value estimates that converge to a Bellman-consistent limit, and Conservative RPI (CRPI), which adds an FA-aware conservative policy update with per-step improvement guarantees. The authors provide an FA-generalized performance-difference analysis and prove convergence properties, including a projection interpretation of RPI's evaluation under weighted $\ell_1$ norms. Empirical results in Inventory Control and Chain Walk show RPI achieving faster, more stable learning and CRPI attaining robust terminal performance under FA, bridging the gap between tabular CPI guarantees and modern FA-based RL methods. Overall, RPI and CRPI offer a principled framework for robust, scalable RL with arbitrary function classes, enabling safer deployment of policy-iteration-inspired algorithms like TRPO and PPO in FA regimes.
Abstract
We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.
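The lower-bound property claimed for Bellman-constrained evaluation rests on a standard fact: if an estimate $V$ satisfies $V \le T^\pi V$ componentwise, then monotonicity of the Bellman operator $T^\pi$ gives $V \le T^\pi V \le (T^\pi)^2 V \le \dots \to V^\pi$. The sketch below illustrates only this fact on a toy problem and is not the paper's algorithm, whose exact optimization is not stated here. It assumes a hypothetical 3-state Markov reward process induced by a fixed policy, state values rather than action values, and the simplest possible function class (constant functions); all names and numbers are illustrative.

```python
def bellman_backup(v, P, r, gamma):
    """Apply the fixed-policy Bellman operator T^pi to the value vector v."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]


def true_values(P, r, gamma, iters=2000):
    """Approximate V^pi by repeatedly applying T^pi (a gamma-contraction)."""
    v = [0.0] * len(r)
    for _ in range(iters):
        v = bellman_backup(v, P, r, gamma)
    return v


def best_constant_lower_bound(r, gamma):
    """Largest constant c with c <= r(s) + gamma * c for every state s.

    For a constant function the Bellman inequality V <= T^pi V reduces to
    c <= r(s) / (1 - gamma) for all s, so the maximizer is min(r) / (1 - gamma).
    """
    return min(r) / (1.0 - gamma)


# Hypothetical Markov reward process (transition rows sum to 1).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5]]
r = [1.0, 0.5, 2.0]
gamma = 0.9

c = best_constant_lower_bound(r, gamma)
v_hat = [c] * len(r)                      # constrained estimate in the constant class
backup = bellman_backup(v_hat, P, r, gamma)
v_pi = true_values(P, r, gamma)           # reference values for comparison

# Small tolerances absorb floating-point rounding.
satisfies_bellman_inequality = all(v_hat[s] <= backup[s] + 1e-12 for s in range(len(r)))
is_lower_bound = all(v_hat[s] <= v_pi[s] + 1e-9 for s in range(len(r)))
print(satisfies_bellman_inequality, is_lower_bound)
```

With a richer function class the same constraint set would typically be handled by a constrained optimizer rather than in closed form; the constant class is chosen here only so the feasible maximizer can be written down directly.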
