Monotone and Conservative Policy Iteration Beyond the Tabular Case

S. R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal

TL;DR

This paper addresses the instability and lack of guarantees that arise when policy-iteration-based methods are applied with function approximation (FA). It introduces Reliable Policy Iteration (RPI), which enforces Bellman-inequality constraints during evaluation to yield monotone, lower-bounded value estimates that converge to a Bellman-consistent limit, and Conservative RPI (CRPI), which adds an FA-aware conservative policy update with per-step improvement guarantees. The authors provide an FA-generalized performance-difference analysis and prove convergence properties, including a projection interpretation of RPI's evaluation under weighted $\ell_1$ norms. Empirical results in Inventory Control and Chain Walk show RPI achieving faster, more stable learning and CRPI attaining robust terminal performance under FA, bridging the gap between tabular CPI guarantees and modern function-approximation-based RL methods. Overall, RPI and CRPI offer a principled framework for robust, scalable RL with arbitrary function classes, enabling safer deployment of policy-iteration-inspired algorithms like TRPO and PPO in FA regimes.

Abstract

We introduce Reliable Policy Iteration (RPI) and Conservative RPI (CRPI), variants of Policy Iteration (PI) and Conservative PI (CPI), that retain tabular guarantees under function approximation. RPI uses a novel Bellman-constrained optimization for policy evaluation. We show that RPI restores the textbook \textit{monotonicity} of value estimates and that these estimates provably \textit{lower-bound} the true return; moreover, their limit partially satisfies the \textit{unprojected} Bellman equation. CRPI shares RPI's evaluation, but updates policies conservatively by maximizing a new performance-difference \textit{lower bound} that explicitly accounts for function-approximation-induced errors. CRPI inherits RPI's guarantees and, crucially, admits per-step improvement bounds. In initial simulations, RPI and CRPI outperform PI and its variants. Our work addresses a foundational gap in RL: popular algorithms such as TRPO and PPO derive from tabular CPI yet are deployed with function approximation, where CPI's guarantees often fail, leading to divergence, oscillations, or convergence to suboptimal policies. By restoring PI/CPI-style guarantees for \textit{arbitrary} function classes, RPI and CRPI provide a principled basis for next-generation RL.
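
The link between a Bellman-inequality constraint and a lower bound on the true return can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions, not the paper's algorithm: the two-state MDP (`P`, `R`), the policy `MU`, the discount `GAMMA`, and the helper names `bellman`, `leq`, and `true_q` are all hypothetical, and the Bellman operator is assumed to act on action-value tables in the usual way.

```python
GAMMA = 0.9

# Hypothetical toy MDP (illustrative only, not taken from the paper).
# P[s][a][t] = probability of moving from state s to state t under action a.
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# R[s][a] = expected immediate reward.
R = [
    [1.0, 0.0],
    [0.5, 2.0],
]
N_S, N_A = 2, 2
MU = [0, 1]  # deterministic policy: MU[s] is the action taken in state s


def bellman(f, mu):
    """Return T_mu f, where (T_mu f)(s, a) = R(s, a) + gamma * E[f(s', mu(s'))]."""
    return [
        [
            R[s][a] + GAMMA * sum(P[s][a][t] * f[t][mu[t]] for t in range(N_S))
            for a in range(N_A)
        ]
        for s in range(N_S)
    ]


def leq(f, g, tol=1e-9):
    """Entrywise f <= g, up to a small floating-point tolerance."""
    return all(f[s][a] <= g[s][a] + tol for s in range(N_S) for a in range(N_A))


def true_q(mu, iters=2000):
    """Approximate Q_mu by iterating the contraction T_mu starting from zero."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        q = bellman(q, mu)
    return q


# The constant table r_min / (1 - gamma) satisfies the Bellman inequality.
r_min = min(min(row) for row in R)
f0 = [[r_min / (1.0 - GAMMA)] * N_A for _ in range(N_S)]

q_mu = true_q(MU)
feasible = leq(f0, bellman(f0, MU))  # checks T_mu f0 >= f0
lower_bound = leq(f0, q_mu)          # checks f0 <= Q_mu

# A candidate that violates the inequality need not be a lower bound.
f_bad = [[30.0] * N_A for _ in range(N_S)]
bad_feasible = leq(f_bad, bellman(f_bad, MU))
```

Here `f0` passes the inequality check and sits below `q_mu`, while the over-optimistic `f_bad` fails the check and also overshoots `q_mu`. A constant table is used only because it is easy to verify by hand; with linear features it would require a constant feature to be representable.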

Paper Structure

This paper contains 25 sections, 10 theorems, 50 equations, 5 figures, 2 algorithms.

Key Result

Theorem 3.1

Suppose the FA space $\mathcal{F}$ is a closed subset of $\mathbb{R}^{SA}$ and the initial policy $\mu_0$ and value estimate $f_0$ satisfy $T_{\mu_0} f_0 \geq f_0$. Then, the following claims hold.
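
As background on the hypothesis $T_{\mu_0} f_0 \geq f_0$: assuming $T_\mu$ denotes the usual Bellman operator of a policy $\mu$ on action-value functions, with reward $r$, transition kernel $P$, and discount $\gamma \in [0,1)$ (notation assumed here, since this summary does not define it), a standard argument shows that any $f$ with $T_\mu f \geq f$ lower-bounds the true value $Q_\mu$:

$$
(T_\mu f)(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \mu(a' \mid s')\, f(s',a'),
$$

$$
f \;\leq\; T_\mu f \;\leq\; T_\mu^2 f \;\leq\; \cdots \;\leq\; \lim_{k \to \infty} T_\mu^k f \;=\; Q_\mu .
$$

Each inequality follows by applying the monotone operator $T_\mu$ to the previous one, and the limit equals $Q_\mu$ because $T_\mu$ is a $\gamma$-contraction in the sup norm with fixed point $Q_\mu$. This is a textbook property of Bellman operators and should not be read as a restatement of the theorem's claims.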

Figures (5)

  • Figure 1: Inventory Control with linear function approximation. Left: Training curve of a single representative run (solid: true return, dashed: estimated return). Center: Averaged training curves over 100 runs (solid: mean return, shaded: mean return $\pm$ 1 std). Right: Key metrics table (AUC and terminal performance). Summary: RPI converges fastest, outperforms baselines on average across runs, and achieves a higher AUC—indicating faster, more sample-efficient learning.
  • Figure 2: Chain Walk with linear function approximation. Left: A single training run. RPI and CRPI both maintain a monotonic lower bound. This contrasts with the highly accurate estimates of USPI and CPI and the severe overestimation by AMPI-Q, far exceeding the optimal value. Center: Averaged training curves over 25 runs (solid: mean return, shaded: mean return $\pm$ 1 std). Right: Key performance metrics. Summary: Although AMPI-Q achieves the highest AUC, indicating rapid initial learning, CRPI's conservative updates lead to the best and most stable terminal performance.
  • Figure 3: Performance comparison of CRPI against RPI on Chain Walk over 6 representative runs (solid: true values, dashed: estimated values).
  • Figure 4: Comparison of $\Psi_1$ and $\Psi_0$ curves against approximate performance-difference curves across different runs and iteration stages.
  • Figure 5: Chain Walk starting from the optimal policy. Left: Averaged training curves over 10 runs for different feature matrices (solid: mean return, shaded: mean return $\pm$ 1 std). Right: Key metrics table (AUC and terminal performance). Summary: When initialized at the optimal policy, RPI, CRPI, and CPI maintain their performance, remaining stable near the optimum. In contrast, both AMPI-Q and USPI exhibit a significant degradation in performance.

Theorems & Definitions (24)

  • Theorem 3.1: RPI properties with general FA
  • Proposition 3.2: RPI generalizes PI
  • Proposition 3.3: Projection view under $\ell_1$-type-norm
  • Lemma 3.4: Approximate Performance-Difference Lemma
  • Remark 3.5
  • Proposition 3.6
  • Remark 3.7
  • Theorem 3.8
  • Theorem 3.9
  • Remark 3.10
  • ...and 14 more
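
Proposition 3.3 is listed above by title only. One plausible reading of a projection view under an $\ell_1$-type norm, offered here as an assumption rather than the paper's statement, uses the fact that a candidate $f$ constrained to lie below $Q_\mu$ entrywise has a weighted $\ell_1$ error with no absolute values. For positive weights $w(s,a)$ (hypothetical notation):

$$
\|Q_\mu - f\|_{1,w} \;=\; \sum_{s,a} w(s,a)\,\bigl(Q_\mu(s,a) - f(s,a)\bigr) \;=\; \sum_{s,a} w(s,a)\, Q_\mu(s,a) \;-\; \sum_{s,a} w(s,a)\, f(s,a).
$$

The first term does not depend on $f$, so over any feasible set contained in $\{f : f \leq Q_\mu\}$, maximizing the weighted sum $\sum_{s,a} w(s,a) f(s,a)$ is equivalent to minimizing the weighted $\ell_1$ distance to $Q_\mu$.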