Improving Policy Optimization via $\varepsilon$-Retrain

Luca Marzari; Priya L. Donti; Changliu Liu; Enrico Marchesini

TL;DR

This work addresses how to enforce behavioral preferences in reinforcement learning while preserving monotonic policy improvement. It introduces $\varepsilon$-retrain, which collects retrain areas around states where the agent violated the preference and uses a decaying factor $\varepsilon$ to switch between restarting from these areas and the standard uniform restart distribution. The method is compatible with TRPO, PPO, and their Lagrangian variants. The authors prove a bound on policy improvement under the mixed restart distribution and use formal verification to quantify how well agents adhere to the desired behaviors. Experiments across locomotion, power network, and navigation tasks, including real-world embodied tests, show stronger safety and sample efficiency. The approach offers a practical, verification-backed way to improve safe RL in safety-critical control systems.

Abstract

We present $\varepsilon$-retrain, an exploration strategy encouraging a behavioral preference while optimizing policies with monotonic improvement guarantees. To this end, we introduce an iterative procedure for collecting retrain areas -- parts of the state space where an agent did not satisfy the behavioral preference. Our method switches between the typical uniform restart state distribution and the retrain areas using a decaying factor $\varepsilon$, allowing agents to retrain on situations where they violated the preference. We also employ formal verification of neural networks to provably quantify the degree to which agents adhere to these behavioral preferences. Experiments over hundreds of seeds across locomotion, power network, and navigation tasks show that our method yields agents that exhibit significant performance and sample efficiency improvements.
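The switching rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each retrain area is stored as a pair of per-feature lower and upper bounds, that $\varepsilon$ is used as the probability of restarting inside a retrain area, and that $\varepsilon$ decays multiplicatively. The names `sample_restart_state`, `decay_epsilon`, and `sample_uniform` are hypothetical.

```python
import random


def sample_restart_state(retrain_areas, sample_uniform, epsilon, rng=random):
    """Pick the initial state for the next episode.

    Assumption: with probability `epsilon` (and only if at least one retrain
    area has been collected) the state is drawn uniformly inside a randomly
    chosen retrain area; otherwise the usual uniform restart is used.
    """
    if retrain_areas and rng.random() < epsilon:
        lows, highs = rng.choice(retrain_areas)
        return [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
    return sample_uniform()


def decay_epsilon(epsilon, decay=0.99, floor=0.0):
    """Assumed multiplicative decay schedule; the paper's schedule may differ."""
    return max(floor, epsilon * decay)


if __name__ == "__main__":
    # One hypothetical retrain area over a 2-feature state space.
    areas = [([0.0, 1.0], [0.5, 2.0])]
    eps = 0.8
    for _ in range(3):
        state = sample_restart_state(areas, lambda: [0.0, 0.0], eps)
        print(state)
        eps = decay_epsilon(eps)
```

As $\varepsilon$ decays toward zero, the sampler falls back to the uniform restart distribution, so late training matches the standard setting.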

Paper Structure (21 sections, 3 theorems, 16 equations, 12 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries and Related Work
Constrained MDP
Formal Verification of Neural Networks
Policy Optimization via $\varepsilon$-retrain
Generation and Refinement Processes
Retrain Area Generation
Refinement Procedure
Policy Improvement
Limitations
Experiments
Implementation Details
Empirical Evaluation
Provably Verifying Navigation Behaviors
Real (embodied) experiments
...and 6 more sections

Key Result

Lemma 1

Given two $\alpha$-coupled policies, $\pi$ and $\pi'$, we have that: $\left\vert \mathbb{E}_{s_t\sim\pi'}[\tilde{A}(s_t)] - \mathbb{E}_{s_t\sim\pi}[\tilde{A}(s_t)]\right\vert \leq 4\alpha(1-(1-\alpha)^t)\max_{s,a}\vert A_\pi(s,a)\vert$. It follows that: $\vert \psi(\pi') - L_\pi(\pi')\vert \leq \ldots$

Figures (12)

Figure 1: Explanatory overview of $\varepsilon$-retrain.
Figure 2: Overview of FV for neural networks.
Figure 3: (left) The agent collides with an obstacle, receiving a positive cost. (right) A retrain area is created from that state.
Figure 4: Retrain area generation. (a) Collision with an obstacle. (b) A previous unsafe state led to the collision. (c) $\omega$-bubble size used to initialize the retrain area. Note that the $\omega$-bubble is the same for all input features and is drawn in different sizes only for clarity of presentation.
Figure 5: Left: Example of similar unsafe states for a subset of the input features; the two states are within distance $\beta$. Right: Example of different unsafe situations; at least a couple of input features have a distance greater than $\beta$.
...and 7 more figures
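The captions of Figures 4 and 5 suggest how retrain areas are created and compared. The sketch below is one possible reading, under stated assumptions: a retrain area is a per-feature interval of half-width $\omega$ around an unsafe state, two areas count as similar when all of their bounds differ by at most $\beta$, and similar areas are merged by taking the enclosing interval. The merge rule and all function names are hypothetical and are not taken from the paper.

```python
def make_retrain_area(state, omega):
    """Build an omega-bubble: the same half-width omega on every input feature."""
    lows = [x - omega for x in state]
    highs = [x + omega for x in state]
    return lows, highs


def is_similar(area_a, area_b, beta):
    """Assumed similarity test: every lower and upper bound is within beta."""
    (lows_a, highs_a), (lows_b, highs_b) = area_a, area_b
    lows_close = all(abs(a - b) <= beta for a, b in zip(lows_a, lows_b))
    highs_close = all(abs(a - b) <= beta for a, b in zip(highs_a, highs_b))
    return lows_close and highs_close


def merge_areas(area_a, area_b):
    """Assumed refinement: replace two similar areas by their enclosing box."""
    (lows_a, highs_a), (lows_b, highs_b) = area_a, area_b
    lows = [min(a, b) for a, b in zip(lows_a, lows_b)]
    highs = [max(a, b) for a, b in zip(highs_a, highs_b)]
    return lows, highs


def add_retrain_area(areas, unsafe_state, omega, beta):
    """Insert a new area, merging it into the first similar existing one."""
    new_area = make_retrain_area(unsafe_state, omega)
    for i, area in enumerate(areas):
        if is_similar(area, new_area, beta):
            areas[i] = merge_areas(area, new_area)
            return areas
    areas.append(new_area)
    return areas
```

Merging similar areas keeps the collection small, so that repeated violations in nearly the same situation do not dominate the restart distribution.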

Theorems & Definitions (4)

Definition 1: $\alpha$-coupled policies (TRPO)
Lemma 1 (TRPO)
Lemma 2
Corollary 1
