A Provable Approach for End-to-End Safe Reinforcement Learning
Akifumi Wachi, Kohei Miyaguchi, Takumi Tanabe, Rei Sato, Youhei Akimoto
TL;DR
Provably Lifetime Safe RL (PLS) tackles end-to-end safety in reinforcement learning in two stages. It first learns a return-conditioned policy offline via safe return-conditioned supervised learning. It then deploys that policy and safely tunes a two-dimensional target return vector $(R,G)$ online, using Gaussian process (GP) models of the actual returns. The theory connects target returns to realized returns through a GP-based bound, which guarantees safety with probability at least $1-\Delta$ and yields near-optimal reward from a finite number of online samples. Empirically, PLS outperforms offline and online baselines on Safety-Gym benchmarks: it satisfies the safety constraints across tasks while achieving high rewards, and its online computation is modest because only the low-dimensional target return is optimized. The approach offers a practical, data-efficient route to end-to-end safe RL, though it leaves open the challenge of guaranteeing near-optimal policies in addition to near-optimal target returns.
Abstract
A longstanding goal in safe reinforcement learning (RL) is a method that ensures the safety of a policy throughout the entire process, from learning to operation. However, existing safe RL paradigms inherently struggle to achieve this objective. We propose a method, called Provably Lifetime Safe RL (PLS), that integrates offline safe RL with safe policy deployment to address this challenge. PLS learns a policy offline using return-conditioned supervised learning and then deploys the resulting policy while cautiously optimizing a limited set of parameters, known as target returns, using Gaussian processes (GPs). Theoretically, we justify the use of GPs by analyzing the mathematical relationship between target and actual returns. We then prove that PLS finds near-optimal target returns while guaranteeing safety with high probability. Empirically, we demonstrate that PLS outperforms baselines in both safety and reward performance, thereby achieving the longstanding goal of obtaining high rewards while ensuring the safety of a policy throughout its lifetime, from learning to operation.
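To make the deployment stage concrete, the following is a minimal sketch of one cautious selection step over target returns. It is not the algorithm from the paper: it assumes a generic confidence-bound rule in the spirit of safe Bayesian optimization, target returns normalized to $[0,1]^2$, a zero-mean GP with a squared-exponential kernel, and illustrative hyperparameters. The names `rbf`, `gp_posterior`, and `select_target`, and the parameters `threshold` and `beta`, are all hypothetical.

```python
import numpy as np

def rbf(a, b, length=0.5, amp=0.25):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return amp * np.exp(-0.5 * d2 / length**2)

def gp_posterior(X, y, Xq, noise=1e-4, amp=0.25):
    """Zero-mean GP regression: posterior mean and std at query points Xq."""
    K = rbf(X, X, amp=amp) + noise * np.eye(len(X))
    Ks = rbf(X, Xq, amp=amp)                      # shape (n, m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = np.clip(amp - (v ** 2).sum(axis=0), 1e-12, None)
    return mean, np.sqrt(var)

def select_target(X, rewards, costs, candidates, threshold, beta=2.0):
    """Return the index of the candidate target (R, G) to try next.

    X holds previously deployed targets; rewards and costs hold the actual
    returns observed for them. A candidate counts as safe only if the GP
    upper confidence bound on its actual cost is within the threshold.
    Among safe candidates, the one with the highest optimistic reward wins.
    Returns None if no candidate is certified safe (assumed convention).
    """
    r_mean, r_std = gp_posterior(X, rewards, candidates)
    c_mean, c_std = gp_posterior(X, costs, candidates)
    safe = c_mean + beta * c_std <= threshold
    if not safe.any():
        return None  # caller should fall back to a known-safe target
    score = np.where(safe, r_mean + beta * r_std, -np.inf)
    return int(np.argmax(score))
```

In use, a caller would start from a target known to be safe, deploy the return-conditioned policy with the selected target, append the observed actual returns to `rewards` and `costs`, and call `select_target` again. Because only the two-dimensional target is searched, each step needs just a small GP fit over a grid of candidates.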
