Wasserstein Proximal Policy Gradient

Zhaoyu Zhu, Shuhan Zhang, Rui Gao, Shuang Li

TL;DR

WPPG attains a global linear convergence rate under both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, it is simple to implement and competitive on standard continuous-control benchmarks.

Abstract

We study policy gradient methods for continuous-action, entropy-regularized reinforcement learning through the lens of Wasserstein geometry. Starting from a Wasserstein proximal update, we derive Wasserstein Proximal Policy Gradient (WPPG) via an operator-splitting scheme that alternates an optimal transport update with a heat step implemented by Gaussian convolution. This formulation avoids evaluating the policy's log density or its gradient, making the method directly applicable to expressive implicit stochastic policies specified as pushforward maps. We establish a global linear convergence rate for WPPG, covering both exact policy evaluation and actor-critic implementations with controlled approximation error. Empirically, WPPG is simple to implement and attains competitive performance on standard continuous-control benchmarks.
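
To make the two-stage structure concrete, here is a minimal particle sketch of one transport step followed by one heat step. It is a toy under stated assumptions, not the paper's algorithm: it uses a single state, a hypothetical quadratic critic `Q(a) = -0.5 * (a - target)^2`, an explicit gradient step as a first-order stand-in for the optimal transport update, and a heat-step variance of `2 * tau * step` (the paper's exact scaling is not given in this summary). The names `grad_q` and `wppg_like_step` are illustrative only.

```python
import numpy as np

def grad_q(actions, target=1.0):
    # Action-gradient of the hypothetical toy critic Q(a) = -0.5 * (a - target)^2.
    return -(actions - target)

def wppg_like_step(actions, rng, step=0.1, tau=0.5):
    # Transport step (assumed first-order form): move each particle along
    # the critic's action-gradient.
    half = actions + step * grad_q(actions)
    # Heat step: convolving a density with N(0, 2 * tau * step) is the same
    # as adding independent Gaussian noise of that variance to each sample.
    noise = rng.standard_normal(actions.shape)
    return half + np.sqrt(2.0 * tau * step) * noise

rng = np.random.default_rng(0)
# Start far from the optimum: particles centered at -2 with std 3.
particles = rng.standard_normal(20000) * 3.0 - 2.0
for _ in range(200):
    particles = wppg_like_step(particles, rng)

mean, var = particles.mean(), particles.var()
```

In this toy setting the entropy-regularized optimum is the Gibbs density proportional to `exp(Q / tau)`, a Gaussian with mean `target` and variance `tau`. The particle mean approaches `target`, and the variance settles near `2 * tau / (2 - step)`, which tends to `tau` as `step` shrinks. Note that no log density of the policy is evaluated at any point; only samples and the critic's gradient are used.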

Paper Structure (61 sections, 24 theorems, 186 equations, 7 figures, 6 tables, 3 algorithms)

Key Result

Proposition 1

Assume the range of the function $g(s,\cdot)$ covers the action space $\mathcal{A}$. Then eq:split_transport can be equivalently solved via with $\pi_{k+\frac{1}{2}}=g_{k+\frac{1}{2}}(s,\cdot)_\#\nu$.
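
The pushforward notation $g(s,\cdot)_\#\nu$ means: draw a latent sample $z\sim\nu$ and return the action $g(s,z)$. The sketch below illustrates sampling from such an implicit policy. Everything specific in it is an assumption made for illustration: the two-layer `tanh` network, the parameter names `w1` and `w2`, a standard Gaussian for $\nu$, and an action space of $(-1,1)^d$. None of these choices is taken from the paper.

```python
import numpy as np

def g(state, z, weights):
    # Hypothetical pushforward map: a small two-layer network in (state, z).
    n = z.shape[0]
    tiled = np.broadcast_to(state, (n, state.shape[0]))
    inputs = np.concatenate([tiled, z], axis=1)
    hidden = np.tanh(inputs @ weights["w1"])
    # Final tanh keeps actions inside the assumed action space (-1, 1)^d.
    return np.tanh(hidden @ weights["w2"])

def sample_actions(state, n, weights, rng, latent_dim):
    # Draw z ~ nu (assumed standard Gaussian) and push it through g(s, .).
    z = rng.standard_normal((n, latent_dim))
    return g(state, z, weights)

rng = np.random.default_rng(0)
state_dim, latent_dim, hidden_dim, action_dim = 3, 4, 16, 2
weights = {
    "w1": rng.standard_normal((state_dim + latent_dim, hidden_dim)) * 0.5,
    "w2": rng.standard_normal((hidden_dim, action_dim)) * 0.5,
}
state = np.array([0.1, -0.3, 0.7])
actions = sample_actions(state, 1000, weights, rng, latent_dim)
```

The policy's density is never written down here. The policy is defined only through its samples, which is why a method that avoids the log density and its gradient applies to it directly.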

Figures (7)

  • Figure 1: Training curves on MuJoCo continuous control benchmarks: Solid lines denote the mean episodic return, while shaded areas represent the 95% confidence interval computed over 10 independent evaluation runs with different random seeds.
  • Figure 2: Multi-Run Evaluation
  • Figure 3: Combined Humanoid Task
  • Figure 4: Ablation study on $\tau$ (left) and latent dimension (right).
  • Figure 5: Ablation on Double Q Function
  • ...and 2 more figures

Theorems & Definitions (49)

  • Proposition 1
  • Definition 1: $T_2$ transportation-information inequality
  • Remark 1
  • Theorem 1
  • Remark 2
  • Remark 3
  • Theorem 2
  • proof
  • Proposition 2: Entropy $W_2$-flow is the heat equation
  • proof
  • ...and 39 more