Optimistic Policy Regularization

Mai Pham; Vikrant Vaze; Peter Chin

TL;DR

Optimistic Policy Regularization (OPR) is a lightweight mechanism that preserves and reinforces historically successful trajectories during policy optimization. The results indicate that anchoring policy updates to empirically successful trajectories can improve both sample efficiency and final performance.

Abstract

Deep reinforcement learning agents frequently suffer from premature convergence, where early entropy collapse causes the policy to discard exploratory behaviors before discovering globally optimal strategies. We introduce Optimistic Policy Regularization (OPR), a lightweight mechanism designed to preserve and reinforce historically successful trajectories during policy optimization. OPR maintains a dynamic buffer of high-performing episodes and biases learning toward these behaviors through directional log-ratio reward shaping and an auxiliary behavioral cloning objective. When instantiated on Proximal Policy Optimization (PPO), OPR substantially improves sample efficiency on the Arcade Learning Environment. Across 49 Atari games evaluated at the 10-million step benchmark, OPR achieves the highest score in 22 environments despite baseline methods being reported at the standard 50-million step horizon. Beyond arcade benchmarks, OPR also generalizes to the CAGE Challenge 2 cyber-defense environment, surpassing the competition-winning Cardiff agent while using the same PPO architecture. These results demonstrate that anchoring policy updates to empirically successful trajectories can improve both sample efficiency and final performance.
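The buffer and the auxiliary cloning term described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Episode`, `GoodEpisodeBuffer`, `maybe_add`, and `bc_loss` are hypothetical, the admission rule (accept an episode while the buffer is not full, or when its return is at least the mean return of the stored episodes) is an assumption, and the toy tabular policy stands in for a neural network. The directional log-ratio reward shaping term is not sketched here because its exact form is not given in this summary.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Tuple


@dataclass
class Episode:
    # (state_id, action) pairs; a toy encoding assumed for illustration.
    steps: List[Tuple[int, int]]
    ret: float  # total episode return


class GoodEpisodeBuffer:
    """FIFO store of high-return episodes (hypothetical admission rule)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # A bounded deque drops the oldest episode when a new one is appended.
        self.episodes: Deque[Episode] = deque(maxlen=capacity)

    def threshold(self) -> float:
        # Assumed criterion: mean return of the episodes currently stored.
        return sum(e.ret for e in self.episodes) / len(self.episodes)

    def maybe_add(self, episode: Episode) -> bool:
        # Accept while filling up, or when the episode meets the threshold.
        if len(self.episodes) < self.capacity or episode.ret >= self.threshold():
            self.episodes.append(episode)
            return True
        return False


def bc_loss(policy_probs: Dict[int, List[float]], buffer: GoodEpisodeBuffer) -> float:
    """Mean negative log-likelihood of buffered actions under the policy."""
    nll = [
        -math.log(policy_probs[state][action])
        for episode in buffer.episodes
        for state, action in episode.steps
    ]
    return sum(nll) / len(nll) if nll else 0.0


if __name__ == "__main__":
    buf = GoodEpisodeBuffer(capacity=2)
    for ret in (1.0, 3.0, 1.5, 2.5):
        accepted = buf.maybe_add(Episode(steps=[(0, 1)], ret=ret))
        print(f"return={ret} accepted={accepted}")
    uniform = {0: [0.5, 0.5]}
    print("behavioral cloning loss:", bc_loss(uniform, buf))
```

In a PPO training loop, a term like `bc_loss` would typically be added to the policy loss with a small weight, so that the policy keeps assigning probability to actions taken in stored episodes. That weighting is likewise an assumption of this sketch.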

Paper Structure (17 sections, 8 equations, 5 figures, 4 tables)


Introduction
Related Work
Background: Proximal Policy Optimization (PPO)
Proposed Method: Optimistic Policy Regularization
The Good-Episode Memory Buffer
Directional Log-Ratio Reward Shaping
Auxiliary Behavioral Cloning
Experiments
Experimental Setup
Main Results
Detailed Game Analysis
Detailed Performance at 50M Steps (500 Epochs)
Breakthrough Learning Dynamics
Cyber-Security Domain: CAGE Challenge 2
Conclusion
...and 2 more sections

Figures (5)

Figure 1: The Optimistic Policy Regularization Architecture. Top-performing trajectories are conditionally stored in a FIFO Good-Episode Buffer. During optimization, these historical successes provide both a directional log-ratio reward shaping signal on the primary policy and an auxiliary Behavioral Cloning objective, shielding the agent against pessimistic exploration collapse.
Figure 2: Performance dominance across the Atari suite. The left panel compares the total number of environments where each evaluated configuration achieved the highest peak score at the 10M-step benchmark across the full 49-game suite. The right panel shows the same metric for the 14-game subset evaluated after 50M environment steps. OPR maintains a clear performance advantage across both evaluation horizons.
Figure 3: Learning dynamics of OPR on representative Atari environments where the method demonstrates strong improvements. The green curve shows the performance of PPO augmented with OPR during training, with shaded regions indicating variability across evaluations. Dashed horizontal lines denote benchmark scores achieved by baseline methods (DQN, A2C, PPO, ACPER, and A2C+SIL) after 50M environment interactions. The vertical dashed line marks the 10M-step evaluation point used for the primary comparison in this work. In many environments, OPR approaches or surpasses the final baseline performance within the first 10M interactions and continues improving thereafter. Notably, several curves show no clear performance plateau even near the 50M interaction horizon, suggesting that OPR does not prematurely saturate and continues improving policies over extended training.
Figure 4: Performance of OPR on the CAGE Challenge 2 attacker resilience environment. The right panel shows a zoomed view of the final phase, where OPR surpasses Cardiff's competition-winning solution.
Figure 5: Breakthrough Learning Dynamics (Balanced Environments): Reward escalation for the remaining 6 environments. In these settings, OPR remains highly competitive with high-resource baselines, demonstrating stable strategy refinement even in environments where dense-reward imitation (SIL) or large-scale prioritized replay (ACPER) provides a strong specialized advantage.
