Proximal Policy Optimization in Path Space: A Schrödinger Bridge Perspective

Yuehu Gong; Zeyuan Wang; Yulin Chen; Yanwei Fu

Abstract

On-policy reinforcement learning with generative policies is promising but remains underexplored. A central challenge is that proximal policy optimization (PPO) is traditionally formulated in terms of action-space probability ratios, whereas diffusion- and flow-based policies are more naturally represented as trajectory-level generative processes. In this work, we propose GSB-PPO, a path-space formulation of generative PPO inspired by the Generalized Schrödinger Bridge (GSB). Our framework lifts PPO-style proximal updates from terminal actions to full generation trajectories, yielding a unified view of on-policy optimization for generative policies. Within this framework, we develop two concrete objectives: a clipping-based objective, GSB-PPO-Clip, and a penalty-based objective, GSB-PPO-Penalty. Experimental results show that while both objectives are compatible with on-policy training, the penalty formulation consistently delivers better stability and performance than the clipping counterpart. Overall, our results highlight path-space proximal regularization as an effective principle for training generative policies with PPO.

Paper Structure (18 sections, 24 equations, 3 figures, 1 table, 1 algorithm)

Introduction
Related Work
On Policy Reinforcement Learning
Generative Policies in Reinforcement Learning
On-Policy Generative Reinforcement Learning
Preliminaries
On Policy Reinforcement Learning
Generalized Schrödinger Bridge
Method
Path Space Formulation of Generative PPO
GSB Inspired PPO with Clipping
GSB Inspired PPO with Penalty
Experiments
Experimental Setup
Main Results on 10 Playground Environments
...and 3 more sections

Figures (3)

Figure 1: Comparison between GSB-PPO and PPO on the 10 playground environments. We report step-return curves over training.
Figure 2: Comparison between GSB-PPO and FPO on the 10 playground environments. We report step-return curves over training.
Figure 3: KL ablation on CheetahRun.
