Improving Policy Optimization via $\varepsilon$-Retrain
Luca Marzari, Priya L. Donti, Changliu Liu, Enrico Marchesini
TL;DR
This work addresses how to enforce behavioral preferences in reinforcement learning while preserving monotonic policy improvement. It introduces $\varepsilon$-retrain, which collects retrain areas around states where the agent violated a preference and mixes restarts from these areas with the standard uniform restarts according to a decaying factor $\varepsilon$. The method is compatible with TRPO, PPO, and Lagrangian variants. The authors prove a bound on policy improvement under the mixed restart distribution and use formal verification to quantify how closely trained agents adhere to the desired behaviors. Experiments on locomotion, power network, and navigation tasks, including real-world embodied tests, show stronger safety and sample efficiency. The approach offers a practical, verification-backed way to improve safe RL, with potential for adoption in safety-critical control systems.
Abstract
We present $\varepsilon$-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor $\varepsilon$, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
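The restart scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that $\varepsilon$ acts as the probability of restarting from a retrain area, that a retrain area is an axis-aligned box of fixed radius around a violating state, and that $\varepsilon$ decays multiplicatively. The names `RetrainSampler`, `add_violation`, `sample_restart`, `decay_epsilon`, and the `radius` and `decay` parameters are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[List[float], List[float]]  # (lower bounds, upper bounds)


@dataclass
class RetrainSampler:
    """Hypothetical sketch of a mixed restart distribution."""
    low: List[float]            # lower bounds of the uniform restart region
    high: List[float]           # upper bounds of the uniform restart region
    epsilon: float = 0.5        # assumed: probability of using a retrain area
    decay: float = 0.99         # assumed: multiplicative decay per call
    min_epsilon: float = 0.0
    radius: float = 0.1         # assumed: half-width of each retrain area
    areas: List[Box] = field(default_factory=list)
    rng: random.Random = field(default_factory=random.Random)

    def add_violation(self, state: List[float]) -> None:
        """Store a box around a state where the preference was violated,
        clipped to the bounds of the uniform restart region."""
        lo = [max(l, s - self.radius) for s, l in zip(state, self.low)]
        hi = [min(h, s + self.radius) for s, h in zip(state, self.high)]
        self.areas.append((lo, hi))

    def sample_restart(self) -> Tuple[List[float], bool]:
        """Return (restart state, whether it came from a retrain area)."""
        use_retrain = bool(self.areas) and self.rng.random() < self.epsilon
        lo, hi = self.rng.choice(self.areas) if use_retrain else (self.low, self.high)
        return [self.rng.uniform(a, b) for a, b in zip(lo, hi)], use_retrain

    def decay_epsilon(self) -> None:
        """Shrink epsilon so later restarts revert to the uniform distribution."""
        self.epsilon = max(self.min_epsilon, self.epsilon * self.decay)
```

In a training loop, one would call `add_violation` whenever an episode breaks the preference, `sample_restart` at each environment reset, and `decay_epsilon` on whatever schedule is chosen (for instance once per policy update), so that restarts gradually return to the uniform distribution.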
