A Novel Framework for Policy Mirror Descent with General Parameterization and Linear Convergence

Carlo Alfano, Rui Yuan, Patrick Rebeschini

TL;DR

This work introduces Approximate Mirror Policy Optimization (AMPO), a general framework that combines policy optimization with mirror-descent updates and uses Bregman projections to accommodate arbitrary parameterizations. AMPO comes with convergence guarantees for any policy parameterization and any mirror map (sublinear under non-decreasing step-sizes, linear under geometrically increasing step-sizes), and admits efficient $\widetilde{\mathcal{O}}(|\mathcal{A}|)$ computations for $\omega$-potential mirror maps. The authors also derive a sample complexity bound of $\widetilde{\mathcal{O}}\big(C_v^2 \nu_\mu^5 / (\varepsilon^4 (1-\gamma)^6)\big)$ for shallow neural networks, and validate the framework empirically on control tasks, showing competitive performance against PPO and the influence of the mirror-map choice on convergence. Overall, AMPO unifies and extends PMD analyses to general parameterizations and mirror maps, enabling new algorithms with rigorous guarantees and practical applicability to deep RL.

Abstract

Modern policy optimization methods in reinforcement learning, such as TRPO and PPO, owe their success to the use of parameterized policies. However, while theoretical guarantees have been established for this class of algorithms, especially in the tabular setting, the use of general parameterization schemes remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parameterizations. The policy class induced by our scheme recovers known classes, e.g., softmax, and generates new ones depending on the choice of mirror map. Using our framework, we obtain the first result that guarantees linear convergence for a policy-gradient-based method involving general parameterization. To demonstrate the ability of our framework to accommodate general parameterization schemes, we provide its sample complexity when using shallow neural networks, show that it represents an improvement upon the previous best results, and empirically validate the effectiveness of our theoretical claims on classic control tasks.

Paper Structure (40 sections, 20 theorems, 181 equations, 1 figure, 2 tables, 4 algorithms)

Key Result

Lemma 4.1

For any policies $\pi$ and $\bar{\pi}$, for any function $f^\theta\in\mathcal{F}^\Theta$, and for any $\eta>0$, we have […], where $\tilde{\pi}$ is the Bregman projected policy induced by $f^\theta$ and $h$ according to Definition 3.1, that is, $\tilde{\pi}_s = \mathop{\mathrm{argmin}}_{p \in \Delta(\mathcal{A})} \mathcal{D}_h(p, \nabla h^*(\eta f^\theta_s))$ for all $s \in \mathcal{S}$.
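As a rough illustration of the argmin that defines $\tilde{\pi}_s$, the sketch below evaluates the Bregman projection at a single state for the two mirror maps named in the examples of this listing. It relies on two standard closed forms rather than on anything stated in the text above: for the negative entropy the projection reduces to a softmax of $\eta f^\theta_s$, and for the squared $\ell_2$-norm it reduces to the Euclidean projection of $\eta f^\theta_s$ onto the simplex, computed here with the usual sort-and-threshold routine. The function names `softmax_policy` and `euclidean_policy`, and the list `scores` standing in for the values $f^\theta_s(a)$, are hypothetical and not taken from the paper.

```python
import math


def softmax_policy(scores, eta):
    """Negative-entropy mirror map (assumed closed form): softmax(eta * scores)."""
    z = [eta * x for x in scores]
    m = max(z)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in z]
    total = sum(w)
    return [v / total for v in w]


def euclidean_policy(scores, eta):
    """Squared l2-norm mirror map (assumed closed form): Euclidean projection
    of eta * scores onto the probability simplex via sort-and-threshold."""
    y = [eta * x for x in scores]
    u = sorted(y, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(u, start=1):
        cumsum += v
        t = (cumsum - 1.0) / k
        if v - t > 0:
            tau = t  # keep the threshold of the largest valid k
    return [max(v - tau, 0.0) for v in y]


# Hypothetical scores for a single state with three actions.
scores = [1.0, 0.5, -1.0]
p_entropy = softmax_policy(scores, eta=1.0)
p_l2 = euclidean_policy(scores, eta=1.0)
```

With these inputs the softmax policy keeps every action at positive probability, while the squared $\ell_2$ projection assigns exactly zero mass to the third action. This is one way the choice of mirror map changes the induced policy class.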

Figures (1)

  • Figure 1: Averaged performance over 50 runs of AMPO in CartPole and Acrobot environments. Note that the maximum values for CartPole and Acrobot are 500 and -80, respectively.

Theorems & Definitions (38)

  • Definition 3.1
  • Example 3.2: Negative entropy
  • Remark 3.3
  • Remark 3.4
  • Definition 3.5: $\omega$-potential mirror map (Krichene et al., 2015)
  • Example 3.6: Squared $\ell_2$-norm
  • Example 3.7: Negative entropy
  • Lemma 4.1
  • Proposition 4.2: Quasi-monotonic updates
  • Theorem 4.3
  • ...and 28 more