Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation

Juntao Dai; Yaodong Yang; Qian Zheng; Gang Pan

TL;DR

This work addresses Safe RL under finite-horizon, non-discounted constraints, where existing Advantage-based Estimation (ABE) methods, which rely on the infinite-horizon discounted advantage function, can misestimate constraint changes and produce unsafe updates. It introduces Gradient-based Estimation (GBE), a first-order technique that uses analytic gradients computed along finite trajectories to estimate changes in the objective and the constraint. These estimates define a constrained surrogate problem, and solving it within a trust region at each iteration yields Constrained Gradient-based Policy Optimization (CGPO). The authors provide theoretical error bounds for the surrogate, develop an adaptive trust-region mechanism, and show, using differentiable Brax environments and world-model augmentation, that CGPO converges faster and more safely, with higher sample efficiency, than baseline methods. The result is a practical framework for reliable policy updates in Safe RL settings where finite-horizon constraints are common, such as safety-critical robotic control.

Abstract

A key aspect of Safe Reinforcement Learning (Safe RL) is estimating the constraint condition for the next policy, which guides the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-discounted constraints, resulting in updates that violate the safety constraints. In response, we propose the first estimation method for finite-horizon non-discounted constraints in deep Safe RL, termed Gradient-based Estimation (GBE), which relies on the analytic gradient derived along trajectories. Our theoretical and empirical analyses demonstrate that GBE can effectively estimate constraint changes over a finite horizon. Constructing a surrogate optimization problem with GBE, we develop a novel Safe RL algorithm called Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible optimal policies by iteratively solving sub-problems within trust regions. Our empirical results show that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.
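
To make the idea of estimating a constraint change from a gradient along a trajectory concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the scalar dynamics, the linear policy, the quadratic cost, and the function names `rollout_cost_and_grad` and `first_order_estimate` are all assumptions chosen for illustration, and the derivative is propagated by hand where a differentiable simulator would use automatic differentiation.

```python
def rollout_cost_and_grad(theta, s0=1.0, horizon=10):
    """Return the finite-horizon, non-discounted cost J_C(theta) and dJ_C/dtheta.

    Toy system (hypothetical, for illustration only):
        a_t = theta * s_t,   s_{t+1} = s_t + a_t,   c_t = s_t ** 2
    """
    s, ds = s0, 0.0           # state s_t and its sensitivity ds_t/dtheta
    cost, grad = 0.0, 0.0
    for _ in range(horizon):
        cost += s * s         # accumulate c_t with no discount factor
        grad += 2.0 * s * ds  # chain rule: dc_t/dtheta = 2 * s_t * ds_t/dtheta
        # Differentiate s_{t+1} = (1 + theta) * s_t with respect to theta.
        # The tuple assignment evaluates both right-hand sides with the old s.
        s, ds = (1.0 + theta) * s, (1.0 + theta) * ds + s
    return cost, grad


def first_order_estimate(theta, delta, s0=1.0, horizon=10):
    """Estimate J_C(theta + delta) as J_C(theta) + delta * dJ_C/dtheta."""
    cost, grad = rollout_cost_and_grad(theta, s0, horizon)
    return cost + grad * delta


if __name__ == "__main__":
    theta, delta = -0.5, 0.01
    estimated = first_order_estimate(theta, delta)
    actual, _ = rollout_cost_and_grad(theta + delta)
    print("estimated cost after update:", estimated)
    print("actual cost after update:   ", actual)
```

In this toy setting the estimate is a first-order Taylor expansion, so its error grows with the square of the step size; that is one way to see why such an estimate would be paired with a trust region that limits how far each update can move.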
