PPO in the Fisher-Rao geometry

Razvan-Andrei Lascu, David Šiška, Łukasz Szpruch

TL;DR

A tighter surrogate in the Fisher-Rao (FR) geometry is derived, yielding a novel variant, Fisher-Rao PPO (FR-PPO), which achieves sub-linear convergence without any dependence on the dimensionality of the action or state spaces, marking a significant step toward establishing formal convergence results for PPO-based algorithms.

Abstract

Proximal Policy Optimization (PPO) is widely used in reinforcement learning due to its strong empirical performance, yet it lacks formal guarantees for policy improvement and convergence. PPO's clipped surrogate objective is motivated by a lower bound on a linearization of the value function in the flat geometry setting. We derive a tighter surrogate objective and introduce Fisher-Rao PPO (FR-PPO) by leveraging the Fisher-Rao (FR) geometry. Our scheme provides strong theoretical guarantees, including monotonic policy improvement. In the direct parametrization setting, we show that FR-PPO achieves sub-linear convergence with no dependence on the action or state space dimensions, and for parametrized policies we further obtain sub-linear convergence up to the compatible function approximation error. Finally, although our primary focus is theoretical, we also demonstrate empirically that FR-PPO performs well across a range of standard reinforcement learning tasks.
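For readers unfamiliar with the clipped surrogate that the abstract refers to, the following is a minimal sketch of the standard PPO clipped objective as it is commonly implemented. It is not the FR-PPO surrogate, whose exact form is not stated in this summary. The function name `ppo_clip_surrogate`, its argument names, and the default `eps=0.2` are illustrative assumptions.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate (to be maximized), averaged over samples.

    All names and the default eps are assumptions made for illustration;
    this is the baseline PPO objective, not the FR-PPO objective.
    """
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = np.exp(logp_new - logp_old)
    # Unclipped term: the linearized improvement estimate.
    unclipped = ratio * advantages
    # Clipped term: the ratio is restricted to [1 - eps, 1 + eps].
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The element-wise minimum yields a pessimistic bound on the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))
```

When the new and old policies coincide, the ratio is 1 everywhere and the surrogate reduces to the mean advantage. For a positive advantage, clipping limits how much an increase in the ratio beyond `1 + eps` can be credited.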

Paper Structure

This paper contains 26 sections, 22 theorems, 171 equations, 4 figures, 4 tables, 3 algorithms.

Key Result

Lemma 2.3

Let $\rho \in \mathcal{P}(S)$. For all $\pi', \pi \in \mathcal{P}(A|S),$

Figures (4)

  • Figure 1: Training curve for Atari Breakout using PPO and FR-PPO with various clipping / penalty parameters.
  • Figure 2: Training curves for Atari environments with various clipping / penalty parameters.
  • Figure 3: Training curves for MuJoCo environments with various clipping / penalty parameters.
  • Figure 4: Training curves for Walker2d with alternative $\tau$ values.

Theorems & Definitions (43)

  • Definition 2.1: The squared Fisher-Rao (FR) distance
  • Definition 2.2: Geodesic convexity in the flat and Fisher-Rao geometry
  • Lemma 2.3: Performance difference
  • Theorem 2.4: Policy gradient without parametrization
  • Theorem 2.5
  • Theorem 2.6
  • Remark 3.2: Relation to classical concentrability assumptions
  • Theorem 3.3
  • Corollary 3.4
  • Theorem 3.5: Policy improvement
  • ...and 33 more