Learning mirror maps in policy mirror descent

Carlo Alfano, Sebastian Towers, Silvia Sapora, Chris Lu, Patrick Rebeschini

TL;DR

Policy Mirror Descent (PMD) regularizes RL policy updates through a mirror map, here drawn from the $\omega$-potential family, so the choice of map determines the geometry of each update. The authors parametrize mirror maps within this family and search over them with evolution strategies, a gradient-free method, to maximize the final value $V^T(\mu)$. The learned maps outperform the conventional negative entropy baseline in Grid-World, Basic Control Suite, MinAtar, and MuJoCo. Across tabular, non-tabular, and continuous control settings, they yield higher final performance and generalize better across tasks, and the results expose a disconnect between theoretical error floors and practical outcomes. The work suggests practical benefits of discovering mirror maps automatically and motivates theory that characterizes mirror-map effects beyond upper bounds.

Abstract

Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD's full potential is limited, with the majority of research focusing on a particular mirror map -- namely, the negative entropy -- which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD's efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e. Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.
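To make the search procedure concrete, the following is a minimal sketch of a gradient-free evolution strategy with antithetic Gaussian perturbations. It is not the authors' implementation: the specific ES variant, the function names `evolve` and `toy_objective`, and all hyperparameters are assumptions for illustration. In the paper's setting the black-box objective would be the final value of a PMD training run under the mirror map encoded by `theta`; here a hypothetical quadratic stands in for it so the example runs on its own.

```python
import random

def evolve(objective, theta, sigma=0.1, lr=0.05, pop=20, steps=200, seed=0):
    """Maximize a black-box objective with a simple antithetic evolution strategy."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(steps):
        grad = [0.0] * dim
        for _ in range(pop):
            # Sample one Gaussian direction and evaluate both +eps and -eps.
            eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            plus = objective([t + sigma * e for t, e in zip(theta, eps)])
            minus = objective([t - sigma * e for t, e in zip(theta, eps)])
            # Finite-difference estimate of the gradient along eps.
            for i in range(dim):
                grad[i] += (plus - minus) * eps[i] / (2.0 * sigma * pop)
        # Gradient ascent step on the search parameters.
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical stand-in for "final value of PMD under mirror-map parameters theta".
TARGET = [0.5, -1.0, 2.0]

def toy_objective(theta):
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))

theta0 = [0.0, 0.0, 0.0]
theta_star = evolve(toy_objective, theta0)
print(toy_objective(theta0), toy_objective(theta_star))
```

Only objective evaluations are needed, never gradients through the training run, which is why this family of methods suits a setting where each evaluation is a full RL run.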

Paper Structure

This paper contains 30 sections, 5 theorems, 43 equations, 4 figures, 4 tables.

Key Result

Theorem 2.2

Following update eq:singleup, we have that, for all $t\geq0$ where $\lVert \cdot \rVert_\infty$ and $\lVert \cdot \rVert_1$ represent the $\ell_\infty$ and the $\ell_1$ norms, respectively, and $\widehat{Q}$ is an estimate of the true $Q$-function. Additionally, at each iteration $T>0$, we have

Figures (4)

  • Figure 1: A plot visually demonstrating the parameterization for $\phi$.
  • Figure 2: Comparison between the learned map and the negative entropy and $\ell_2$-norm mirror maps across a range of held-out configurations of Grid-World. We display the average over 256 runs and report the standard error as a shaded region. The column "Random tasks" reports the averaged metrics for 256 randomly sampled configurations of Grid-World.
  • Figure 3: Comparison between the learned mirror map, the $\ell_2$-norm and the negative entropy across a range of standard environments. The top plots present the performance of AMPO for all mirror maps, reporting the average over 100 realizations and a shaded region denoting the standard error around the average. The middle plots report the $\omega$-potentials that induce the mirror maps. The bottom plots report the policy distribution according to \ref{eq:pi_t}, for each mirror map and score scale. The score scales are obtained by multiplying the vector $[1,\dots,|\mathcal{A}|]$ by a variable $c\in[0,4]$.
  • Figure 4: Comparison between the mirror map learned on Hopper and the negative entropy on three MuJoCo environments. The plots present the performance of PMD for both mirror maps, reporting the average over 8 realizations and a shaded region denoting the standard error around the average.
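
The policy distributions described for the bottom row of Figure 3 can be illustrated with a minimal sketch. It assumes the policy induced by a potential $\phi$ has the form $\pi(a) = \max(0, \phi(s_a + \lambda))$, with $\lambda$ chosen so the probabilities sum to one. This form, the bisection solver, and every function name below are assumptions for illustration, not the paper's code. Under this assumption, $\phi(x) = e^{x-1}$ recovers the softmax policy of the negative entropy, and $\phi(x) = x$ gives the Euclidean projection onto the simplex commonly associated with the $\ell_2$-norm.

```python
import math

def policy_from_potential(scores, phi, lo=-50.0, hi=50.0, iters=200):
    """Return p_a = max(0, phi(s_a + lam)), with lam found by bisection."""
    def total(lam):
        return sum(max(0.0, phi(s + lam)) for s in scores)
    # phi is increasing, so total(lam) is non-decreasing in lam;
    # the bracket [lo, hi] is assumed wide enough to contain the root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > 1.0:
            hi = mid
        else:
            lo = mid
    lam = 0.5 * (lo + hi)
    return [max(0.0, phi(s + lam)) for s in scores]

def neg_entropy_potential(x):
    # Assumed potential for the negative entropy: yields softmax.
    return math.exp(x - 1.0)

def l2_potential(x):
    # Assumed potential for the l2-norm: yields a sparse projection.
    return x

def scaled_scores(c, num_actions):
    # The vector [1, ..., |A|] multiplied by the scale c, as in the caption.
    return [c * (i + 1) for i in range(num_actions)]

for c in (0.0, 1.0, 4.0):
    s = scaled_scores(c, 4)
    print(c,
          policy_from_potential(s, neg_entropy_potential),
          policy_from_potential(s, l2_potential))
```

Sweeping `c` shows the qualitative difference between the two maps: the softmax policy keeps every action at positive probability, while the projection assigns exactly zero probability to low-scoring actions once the scores are spread far enough apart.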

Theorems & Definitions (8)

  • Definition 2.1: $\omega$-potential mirror map krichene2015efficient
  • Theorem 2.2: xiao2022convergence
  • Lemma D.1: Performance difference lemma, Lemma 1 in xiao2022convergence
  • Lemma D.2: Three-point descent lemma, Lemma 6 in xiao2022convergence
  • Proposition D.3: Lemma 11 in xiao2022convergence
  • proof
  • Theorem D.4: Theorems 8 and 13 in xiao2022convergence
  • proof