Transductive Off-policy Proximal Policy Optimization

Yaozhong Gan; Renye Yan; Xiaoyang Tan; Zhe Wu; Junliang Xing

TL;DR

ToPPO addresses the inefficiency of on-policy PPO by leveraging off-policy data through a transductive, off-policy surrogate bound based on $A^{\mu}$ rather than $A^{\pi_k}$. It derives a policy-improvement lower bound and casts updates as a constrained optimization with PPO-like clipping, enabling monotonic improvement while avoiding the need to store many old-policy networks. The approach is validated on MuJoCo and Atari, where ToPPO shows improved sample efficiency and stability over PPO and competitive baselines, and is shown to mitigate biases associated with off-policy advantage estimation in prior methods. Overall, ToPPO provides both theoretical guarantees and practical guidance for effectively reusing off-policy data in policy optimization, with potential impact on sample efficiency in continuous control and related domains.
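The PPO-like clipping mentioned above can be illustrated with a minimal NumPy sketch. It is not the paper's exact objective, which this summary does not reproduce. It is the standard PPO clipped surrogate with the importance ratio taken against a behavior policy $\mu$ and with advantages assumed to be estimates of $A^{\mu}$. The function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` are illustrative assumptions.

```python
import numpy as np


def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate with an off-policy importance ratio.

    logp_new:      log pi(a|s) under the policy being optimized
    logp_behavior: log mu(a|s) under the policy that collected the data
    advantages:    advantage estimates for the sampled (s, a) pairs
                   (assumed here to approximate A^mu)
    clip_eps:      clipping range (hypothetical default, not from the paper)
    """
    ratio = np.exp(logp_new - logp_behavior)  # pi(a|s) / mu(a|s)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Elementwise minimum gives the pessimistic surrogate; average over the batch.
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))
```

When the new policy equals the behavior policy, the ratio is 1 everywhere and the surrogate reduces to the mean advantage. For a positive advantage the gain is capped once the ratio exceeds `1 + clip_eps`, while for a negative advantage the unclipped (more pessimistic) term is kept.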

Abstract

Proximal Policy Optimization (PPO) is a popular model-free reinforcement learning algorithm, esteemed for its simplicity and efficacy. However, due to its inherent on-policy nature, its proficiency in harnessing data from disparate policies is constrained. This paper introduces a novel off-policy extension to the original PPO method, christened Transductive Off-policy PPO (ToPPO). Herein, we provide theoretical justification for incorporating off-policy data in PPO training and prudent guidelines for its safe application. Our contribution includes a novel formulation of the policy improvement lower bound for prospective policies derived from off-policy data, accompanied by a computationally efficient mechanism to optimize this bound, underpinned by assurances of monotonic improvement. Comprehensive experimental results across six representative tasks underscore ToPPO's promising performance.

Paper Structure (21 sections, 5 theorems, 40 equations, 5 figures, 1 table)

Introduction
Preliminaries
Policy Improvement Lower Bound
Generalized Proximal Policy Optimization
Transductive Off-policy PPO
Derivation of the Constrained Optimization Problem
The Clipped Surrogate Objective
Selecting policies
Discussion
Reanalyze the PPO Algorithm
Reanalyze the GePPO Algorithm
Experiments
Evaluation
Performance improvement
Ablation Studies
...and 6 more sections

Key Result

Lemma 2.1

(Policy Improvement Lower Bound) Consider a current policy $\pi_{k}$ and any policies $\pi$ and $\mu$. Then the following bound holds: […], where $\epsilon=\max_{s, a}\left|A^{\pi_k}(s, a)\right|$, $\delta^{\pi, \mu}=\mathbb{E}_{s \sim \rho^{\mu}} \mathbb{D}_{\mathcal{T} \mathcal{V}}(\mu, \pi)(s)$, and $\delta_{\max}^{\pi_k, \pi} = \max_s \mathbb{D}_{\mathcal{T} \mathcal{V}}(\pi_k, \pi)(s)$. $\mathbb{D}_{\mathcal{T} \mathcal{V}}$ …
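The two divergence terms in the lemma can be made concrete for tabular policies. The sketch below assumes the common convention $\mathbb{D}_{\mathcal{T} \mathcal{V}}(p, q)(s) = \frac{1}{2}\sum_a |p(a \mid s) - q(a \mid s)|$ (the source's own definition is cut off) and treats $\rho^{\mu}$ as a normalized distribution over states. The helper names `tv_per_state`, `expected_tv`, and `max_tv`, and the example policies, are hypothetical.

```python
import numpy as np


def tv_per_state(p, q):
    """Total variation distance between two tabular policies, per state.

    p, q: arrays of shape (num_states, num_actions) whose rows sum to 1.
    Uses the 1/2 * L1 convention (an assumption; the source is truncated).
    """
    return 0.5 * np.abs(p - q).sum(axis=1)


def expected_tv(mu, pi, rho_mu):
    """delta^{pi, mu}: TV(mu, pi)(s) averaged under the state distribution rho_mu."""
    return float(np.dot(rho_mu, tv_per_state(mu, pi)))


def max_tv(pi_k, pi):
    """delta_max^{pi_k, pi}: the largest per-state TV distance."""
    return float(tv_per_state(pi_k, pi).max())


# Two states, two actions; the policies differ only in the first state.
mu = np.array([[0.5, 0.5], [0.9, 0.1]])
pi = np.array([[0.7, 0.3], [0.9, 0.1]])
rho_mu = np.array([0.25, 0.75])  # assumed normalized state distribution

delta_expected = expected_tv(mu, pi, rho_mu)
delta_max = max_tv(mu, pi)
```

The expectation weights each state's distance by how often $\mu$ visits it, so it can be much smaller than the worst-case term when the policies disagree only in rarely visited states.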

Figures (5)

Figure 1: Relative error of the value estimate $\hat{V}$ obtained with the V-trace technique with respect to the true value $V^{\pi}$, i.e., $\text{V ratio}=\left|\frac{V^{\pi}-\hat{V}}{V^{\pi}}\right|$.
Figure 2: Learning curves on the MuJoCo environments. Performance of ToPPO vs. PPO, OTRPO, TRPO, DISC, OPPO, and GePPO. The shaded region indicates the standard deviation of ten random seeds. The X-axis represents the timesteps in the environment.
Figure 3: Final performance of ToPPO vs. ToPPO NOT (ToPPO with the policy-selection constraints removed).
Figure 4: Learning curves on the Atari environments. Performance of ToPPO vs. PPO. The shaded region indicates the standard deviation of three random seeds.
Figure 5: Learning curves on some Atari environments. Performance of ToPPO vs. PPO. The shaded region indicates the standard deviation of three random seeds.
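The V ratio plotted in Figure 1 is a plain relative error and can be computed as sketched below. The function name `v_ratio` and the sample values are illustrative assumptions and are not data from the paper.

```python
import numpy as np


def v_ratio(v_true, v_est):
    """Relative error |(V^pi - V_hat) / V^pi|, computed elementwise.

    Undefined where the true value is zero, so such states should be
    excluded by the caller.
    """
    v_true = np.asarray(v_true, dtype=float)
    v_est = np.asarray(v_est, dtype=float)
    return np.abs((v_true - v_est) / v_true)


# Made-up values for two states.
ratios = v_ratio([10.0, -4.0], [9.0, -5.0])
```

A ratio near zero means the V-trace estimate is close to the true value; larger ratios indicate a more biased estimate.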

Theorems & Definitions (12)

Lemma 2.1
Lemma 3.1
Lemma A.1
proof
Lemma A.2
Lemma A.3
proof
proof
proof
proof
...and 2 more
