Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation

Juntao Dai, Yaodong Yang, Qian Zheng, Gang Pan

TL;DR

This work tackles Safe RL under finite-horizon, non-discounted constraints, where existing Advantage-based Estimation (ABE) methods, which assume an infinite-horizon discounted setting, can misestimate constraint changes and produce unsafe updates. It introduces Gradient-based Estimation (GBE), a first-order technique that uses analytic gradients computed along finite trajectories to estimate changes in the objective and the constraint, and uses these estimates to build a constrained surrogate problem. Solving that surrogate within trust regions yields Constrained Gradient-based Policy Optimization (CGPO). The authors provide theoretical error bounds for the surrogate, develop an adaptive trust-region mechanism, and report, using differentiable Brax environments and world-model augmentation, that CGPO converges faster and more safely, with higher sample efficiency, than baseline methods. The results give a practical framework for reliable policy updates in Safe RL settings with finite-horizon constraints, such as safety-critical robotic control.

Abstract

A key aspect of Safe Reinforcement Learning (Safe RL) involves estimating the constraint condition for the next policy, which is crucial for guiding the optimization of safe policy updates. However, the existing Advantage-based Estimation (ABE) method relies on the infinite-horizon discounted advantage function. This dependence leads to catastrophic errors in finite-horizon scenarios with non-discounted constraints, resulting in updates that violate the safety constraints. In response, we propose the first estimation method for finite-horizon non-discounted constraints in deep Safe RL, termed Gradient-based Estimation (GBE), which relies on the analytic gradient derived along trajectories. Our theoretical and empirical analyses demonstrate that GBE can effectively estimate constraint changes over a finite horizon. Constructing a surrogate optimization problem with GBE, we develop a novel Safe RL algorithm called Constrained Gradient-based Policy Optimization (CGPO). CGPO identifies feasible optimal policies by iteratively solving sub-problems within trust regions. Our empirical results reveal that CGPO, unlike baseline algorithms, successfully estimates the constraint functions of subsequent policies, thereby ensuring the efficiency and feasibility of each update.
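The idea of estimating a finite-horizon, non-discounted constraint from an analytic gradient can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not the authors' implementation: a 1-D toy system with dynamics `x' = x + dt * a`, a linear policy `a = k * x + b`, a quadratic cost `c(x) = x**2`, and the hypothetical helper names `rollout_cost_and_grad` and `gbe_estimate`. The gradient is propagated along the trajectory by hand so that the example needs no autodiff library.

```python
def rollout_cost_and_grad(theta, x0=1.0, horizon=20, dt=0.1):
    """Roll out the toy system and return (J, dJ/dtheta).

    J is the non-discounted finite-horizon cost sum_{t=1}^{T} x_t**2.
    theta = (k, b) parameterises the assumed linear policy a = k * x + b.
    """
    k, b = theta
    x = x0
    dx_dk = dx_db = 0.0          # sensitivities of the state w.r.t. k and b
    cost = g_k = g_b = 0.0
    for _ in range(horizon):
        # Differentiate x' = x + dt * (k * x + b) with the chain rule.
        # The right-hand sides use the state *before* it is advanced.
        dx_dk = dx_dk + dt * (x + k * dx_dk)
        dx_db = dx_db + dt * (1.0 + k * dx_db)
        x = x + dt * (k * x + b)
        # Accumulate the cost and its gradient at the new state.
        cost += x * x
        g_k += 2.0 * x * dx_dk
        g_b += 2.0 * x * dx_db
    return cost, (g_k, g_b)


def gbe_estimate(theta, delta, **env):
    """First-order estimate J(theta + delta) ~ J(theta) + grad . delta."""
    cost, grad = rollout_cost_and_grad(theta, **env)
    return cost + sum(g * d for g, d in zip(grad, delta))


theta0 = (-0.5, 0.2)   # illustrative policy parameters
delta = (0.01, 0.01)   # small policy update

estimate = gbe_estimate(theta0, delta)
actual, _ = rollout_cost_and_grad((theta0[0] + delta[0], theta0[1] + delta[1]))
print(f"estimated cost {estimate:.4f}, actual cost {actual:.4f}")
```

For a small update the first-order estimate should track the true cost of the updated policy closely, with an error that grows with the size of the update; this is the motivation for restricting updates to a trust region.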

Paper Structure

This paper contains 57 sections, 8 theorems, 79 equations, 16 figures, 4 tables, 3 algorithms.

Key Result

Lemma 4.0

Assume $\bm{\theta}_0 \in \Theta$ and $\mathcal{J}_f(\bm{\theta})$ is twice differentiable in a neighborhood surrounding $\bm{\theta}_0$. Let $\bm{\delta}$ be a small update from $\bm{\theta}_0$. If we estimate $\mathcal{J}_f(\bm{\theta}_0+\bm{\delta})$ as $\hat{\mathcal{J}}_f(\bm{\theta}_0+\bm{\del

Figures (16)

  • Figure 1: Advantage-based Estimation fails even in simple environments under finite-horizon constraints. (a) The cost obtained by the agent while traversing along the x-axis, namely, $c_t = c(x_t)$. (b) Relative errors in the estimation of changes in the finite-horizon cumulative constraint (i.e., $\sum_{t=1}^{T} c_t \leq b$). The ABE method generates relative errors even greater than $1.0$, showing completely incorrect estimations. Refer to Appendix \ref{app:simple_env} for more details.
  • Figure 2: The computational relationship between the policy update $\bm{\delta}$, the gradient of the objective function $\bm{g}$, and the gradient of the constraint function $\bm{q}$ varies in three scenarios.
  • Figure 3: Gradient computation graph for the short-horizon approach. Here, $\mathcal{F}(\bm{s})$ represents the differentiable dynamics of the environment, $R$ is the reward signal, $C$ is the cost signal, $\pi_\theta$ is the parameterized policy, and $V^R_\phi$ and $V^C_\phi$ are the value functions for the return and constraint.
  • Figure 4: Training curves of selected algorithms on different tasks, showing episodic return and constraint over 5 random seeds. Solid lines represent the mean and shaded areas indicate the variance; the curves are not smoothed. CGPO demonstrates superior efficiency in improvement and constraint satisfaction. The rest of the training curves can be found in Appendix \ref{app:more_results}.
  • Figure 5: GBE and ABE errors across four training stages for updates of equal step length, averaged over 100 repetitions, where $\text{Relative Error} = \frac{\text{Estimation Error}}{\text{Constraint Change}}$. GBE effectively predicts the constraint function of the next policy without failing like ABE.
  • ...and 11 more figures

Theorems & Definitions (15)

  • Lemma 4.0
  • proof
  • Theorem 5.1: Solvability Conditions for the Sub-problem
  • proof
  • Corollary 5.1
  • Theorem 5.2: Worst-Case Performance Update and Constraint Violation
  • proof
  • Lemma 1.0
  • proof
  • Theorem 1.1: Solvability Conditions for the Sub-problem
  • ...and 5 more
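
The trust-region sub-problem and its solvability conditions are not reproduced on this page, so the following is only a sketch under stated assumptions. It supposes the sub-problem is linear in the update: maximise `g . delta` subject to `c + q . delta <= 0` and `||delta|| <= radius`, where `g` and `q` stand for the objective and constraint gradients and `c` for the current constraint violation (all symbol names, the function `solve_subproblem`, and the status strings are hypothetical). Under that assumption the problem has a closed-form solution with three cases.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_subproblem(g, q, c, radius):
    """Maximise g.delta s.t. c + q.delta <= 0 and ||delta|| <= radius.

    Assumed linear surrogate; returns (delta, status).
    """
    q_sq = dot(q, q)
    q_norm = math.sqrt(q_sq)

    # Case 1: even the best step along -q cannot satisfy the constraint,
    # so the sub-problem is infeasible; fall back to reducing the cost.
    if c - radius * q_norm > 0.0:
        if q_norm == 0.0:
            return [0.0] * len(g), "infeasible"
        return [-radius * qi / q_norm for qi in q], "recovery"

    # Case 2: the unconstrained trust-region maximiser is already feasible.
    g_norm = math.sqrt(dot(g, g))
    step = [radius * gi / g_norm for gi in g] if g_norm > 0.0 else [0.0] * len(g)
    if c + dot(q, step) <= 0.0:
        return step, "objective_only"

    # Case 3: the linear constraint is active. Move to the closest point on
    # the hyperplane c + q.delta = 0, then spend the remaining radius along
    # the component of g orthogonal to q.
    base = [-c * qi / q_sq for qi in q]
    coef = dot(g, q) / q_sq
    g_perp = [gi - coef * qi for gi, qi in zip(g, q)]
    perp_norm = math.sqrt(dot(g_perp, g_perp))
    if perp_norm < 1e-12:
        return base, "both_active"
    slack = math.sqrt(max(radius ** 2 - dot(base, base), 0.0))
    delta = [bi + slack * pi / perp_norm for bi, pi in zip(base, g_perp)]
    return delta, "both_active"


g, q = [1.0, 0.0], [1.0, 1.0]
for c in (-2.0, -0.5, 2.0):
    delta, status = solve_subproblem(g, q, c, radius=1.0)
    print(c, status, [round(d, 4) for d in delta])
```

The feasibility test in the first case, `c - radius * ||q|| <= 0`, is the solvability condition of this assumed linear form; the paper's Theorem 5.1 may state its condition differently.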