Reflective Policy Optimization

Yaozhong Gan, Renye Yan, Zhe Wu, Junliang Xing

TL;DR

Reflective Policy Optimization (RPO) extends on-policy reinforcement learning by using both past and future state-action information from a trajectory to guide the current policy update. It formulates a generalized surrogate objective with multi-step terms and applies PPO-style clipping; the authors' analysis shows monotonic policy improvement together with a contraction of the solution space, which speeds up convergence and improves sample efficiency. The authors provide theoretical lower bounds and validate the approach on MuJoCo continuous-control tasks and Atari games, with a public codebase. The work shows that reflective information from adjacent state-action pairs in a trajectory can improve the data efficiency of on-policy RL without relying on off-policy data.

Abstract

On-policy reinforcement learning methods, like Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), often demand extensive data per update, leading to sample inefficiency. This paper introduces Reflective Policy Optimization (RPO), a novel on-policy extension that amalgamates past and future state-action information for policy optimization. This approach empowers the agent for introspection, allowing modifications to its actions within the current state. Theoretical analysis confirms that policy performance is monotonically improved and contracts the solution space, consequently expediting the convergence procedure. Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks, culminating in superior sample efficiency. The source code of this work is available at https://github.com/Edgargan/RPO.
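
To make the idea of a multi-step surrogate with PPO-style clipping more concrete, here is a minimal sketch of what a two-ratio ($k=2$) objective could look like for a single trajectory. It is an illustration under stated assumptions, not the authors' implementation: the function name, the weight `beta`, the shared clip range `eps`, and the choice to clip each ratio separately are all hypothetical. The official code is at the repository linked above.

```python
from typing import Sequence


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def two_step_clipped_surrogate(
    ratios: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,   # assumed clip range (hypothetical default)
    beta: float = 0.3,  # assumed weight of the second-step term (hypothetical)
) -> float:
    """Illustrative two-ratio clipped surrogate over one trajectory.

    ratios[t]     = pi_new(a_t | s_t) / pi_old(a_t | s_t)
    advantages[t] = advantage estimate at step t under the old policy
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    n = len(ratios)
    if n == 0:
        raise ValueError("trajectory must contain at least one step")

    # First term: the standard PPO clipped objective, averaged over steps.
    first = 0.0
    for r, a in zip(ratios, advantages):
        first += min(r * a, clip(r, 1 - eps, 1 + eps) * a)
    first /= n

    # Second term: weight the next step's advantage by the product of the
    # current and next ratios, clipping each ratio separately (an assumption).
    second = 0.0
    for t in range(n - 1):
        r0, r1, a1 = ratios[t], ratios[t + 1], advantages[t + 1]
        unclipped = r0 * r1 * a1
        clipped = clip(r0, 1 - eps, 1 + eps) * clip(r1, 1 - eps, 1 + eps) * a1
        second += min(unclipped, clipped)
    if n > 1:
        second /= n - 1

    return first + beta * second
```

With `beta=0.0` the sketch reduces to the ordinary PPO clipped objective, which makes it easy to see what the extra term contributes. Clipping the product of the two ratios as a single quantity instead would correspond to a different design choice.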

Paper Structure

This paper contains 15 sections, 14 theorems, 48 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Lemma 3.1

Consider a current policy $\hat{\pi}$ and any policy $\pi$. Then we have

Figures (7)

  • Figure 1: (a) The CliffWalking environment. (b) Total number of times the agent fell into the cliff during training. (c) Number of steps the agent took to reach the goal $G$ during training. RPO-3 denotes the variant with $k=3$, which uses three ratios.
  • Figure 2: Learning curves on the Gym environments, comparing RPO with PPO. The shaded region indicates the standard deviation across ten random seeds. The X-axis shows environment timesteps and the Y-axis shows the average return.
  • Figure 3: Panels (a) and (b) compare RPO with RPO-3 (the variant with $k=3$, which uses three ratios). Panels (c) and (d) compare RPO with RPO-clip($r_1r_2$), the variant in which the two ratios are clipped together.
  • Figure 4: Normalized Improvement of RPO vs. PPO in 54 Atari 2600 games.
  • Figure 5: Learning curves on the Gym environments. Performance of RPO vs. PPO, TRPO, OTRPO, GePPO, ISPO and TayPO.
  • ...and 2 more figures

Theorems & Definitions (21)

  • Lemma 3.1
  • Theorem 3.2
  • Corollary 3.3
  • Theorem 4.1
  • Theorem 4.2
  • Lemma 1.1
  • Corollary 1.2
  • proof
  • Lemma 1.3
  • Lemma 1.4
  • ...and 11 more