Enhancing Efficiency of Safe Reinforcement Learning via Sample Manipulation

Shangding Gu, Laixi Shi, Yuhao Ding, Alois Knoll, Costas Spanos, Adam Wierman, Ming Jin

TL;DR

This work addresses the sample inefficiency of safe reinforcement learning by introducing Efficient Safe Policy Optimization (ESPO), a primal-based on-policy method that leverages gradient conflict between reward and safety objectives to adaptively manipulate sampling. ESPO integrates a three-mode optimization—focusing on reward, safety, or both—with a gradient-conflict–aware sample size strategy, and provides convergence and stability guarantees in a tabular softmax-policy setting. Theoretical results show a convergence rate of $\widetilde{O}\left(\sqrt{\frac{SA}{(1-\gamma)^3T}}\right)$ and reduced oscillation, along with provable sample-efficiency benefits. Empirically, ESPO outperforms state-of-the-art primal-based and primal-dual baselines on Safety-MuJoCo and Omnisafe, achieving higher rewards with fewer samples and shorter training times, demonstrating practical impact for safety-critical control.
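To give a feel for how the stated rate scales, the short sketch below evaluates $\sqrt{\frac{SA}{(1-\gamma)^3T}}$ numerically. It drops the constants and logarithmic factors hidden by $\widetilde{O}$, so the numbers only indicate scaling, not an actual error bound. The function name `rate_bound` and its parameter names are illustrative and do not come from the paper.

```python
import math


def rate_bound(num_states, num_actions, gamma, num_iters):
    """Evaluate sqrt(S*A / ((1 - gamma)^3 * T)).

    Constants and log factors hidden by the O-tilde notation are ignored,
    so this is a scaling illustration only (an assumption of this sketch).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if num_iters <= 0:
        raise ValueError("num_iters must be positive")
    return math.sqrt(num_states * num_actions / ((1.0 - gamma) ** 3 * num_iters))


if __name__ == "__main__":
    # Quadrupling the number of iterations T halves the bound.
    for t in (1_000, 4_000, 16_000):
        print(t, rate_bound(num_states=10, num_actions=4, gamma=0.9, num_iters=t))
```

The $1/\sqrt{T}$ dependence means that quadrupling $T$ halves the value, while a discount factor closer to 1 inflates it through the $(1-\gamma)^{-3/2}$ term.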

Abstract

Safe reinforcement learning (RL) is crucial for deploying RL agents in real-world applications, as it aims to maximize long-term rewards while satisfying safety constraints. However, safe RL often suffers from sample inefficiency, requiring extensive interactions with the environment to learn a safe policy. We propose Efficient Safe Policy Optimization (ESPO), a novel approach that enhances the efficiency of safe RL through sample manipulation. ESPO employs an optimization framework with three modes: maximizing rewards, minimizing costs, and balancing the trade-off between the two. By dynamically adjusting the sampling process based on the observed conflict between reward and safety gradients, ESPO theoretically guarantees convergence, optimization stability, and improved sample complexity bounds. Experiments on the Safety-MuJoCo and Omnisafe benchmarks demonstrate that ESPO significantly outperforms existing primal-based and primal-dual-based baselines in terms of reward maximization and constraint satisfaction. Moreover, ESPO achieves substantial gains in sample efficiency, requiring 25% to 29% fewer samples than baselines, and reduces training time by 21% to 38%.
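The three modes and the conflict-aware sampling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the slack thresholds, the 25% grow/shrink factors, the dot-product conflict test, and the PCGrad-style projection used in the "both" mode are all choices made for this sketch, and every function name is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def gradient_conflict(g_reward, g_safety):
    """Assumed conflict test: the two improvement directions have a
    negative inner product (angle larger than 90 degrees)."""
    return dot(g_reward, g_safety) < 0.0


def select_mode(cost_estimate, cost_limit, slack_low, slack_high):
    """Pick one of the three modes from the estimated cost.

    The slack band around the cost limit is an assumption of this sketch.
    """
    if cost_estimate <= cost_limit - slack_low:
        return "reward"   # comfortably safe: maximize reward only
    if cost_estimate >= cost_limit + slack_high:
        return "safety"   # clearly unsafe: minimize cost only
    return "both"         # near the limit: balance the two objectives


def combine(g_reward, g_safety):
    """Update direction for the 'both' mode.

    On conflict, each gradient is projected onto the normal plane of the
    other (a PCGrad-style projection, swapped in here as an assumption),
    then the two are averaged.
    """
    d = dot(g_reward, g_safety)
    if d < 0.0:
        rr, ss = dot(g_reward, g_reward), dot(g_safety, g_safety)
        p_r = [a - d / ss * b for a, b in zip(g_reward, g_safety)]
        p_s = [b - d / rr * a for a, b in zip(g_reward, g_safety)]
        return [0.5 * (a + b) for a, b in zip(p_r, p_s)]
    return [0.5 * (a + b) for a, b in zip(g_reward, g_safety)]


def next_sample_size(base_size, conflict, grow=0.25, shrink=0.25):
    """Collect more samples under conflict, fewer otherwise.

    The grow/shrink fractions are illustrative hyperparameters.
    """
    factor = 1.0 + grow if conflict else 1.0 - shrink
    return max(1, int(round(base_size * factor)))


if __name__ == "__main__":
    g_r, g_s = [1.0, 0.0], [-1.0, 1.0]
    conflict = gradient_conflict(g_r, g_s)
    print(select_mode(25.0, 25.0, 2.0, 2.0), combine(g_r, g_s),
          next_sample_size(1000, conflict))
```

The intuition behind this design is that conflicting gradients make the update direction harder to estimate, so more samples are spent there, while agreeing gradients allow a smaller batch.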

Paper Structure

This paper contains 37 sections, 9 theorems, 60 equations, 5 figures, 7 tables, 1 algorithm.

Key Result

Theorem 4.1

Consider the tabular setting with the policy class defined in Eq. (2), and any $\delta \in (0,1)$. For Algorithm 1, applying $T_{\mathsf{pi}} = \widetilde{O}\left(\frac{T \log\left(\frac{|\mathcal{S}||\mathcal{A}|}{\delta}\right)}{(1-\gamma)^3|\mathcal{S}||\mathcal{A}|}\right)$ [...] Here, the expectation is taken with respect to the randomness of the output $\widehat{\pi}$, which [...]

Figures (5)

  • Figure 1: Oscillation analysis comparing our method with existing safe RL methods.
  • Figure 2: Comparison of our algorithm (ESPO) with PCRPO [gu2023pcrpo] and CRPO [xu2021crpo] on the Safety-MuJoCo benchmark. Our algorithm consistently outperforms the SOTA baselines across multiple performance metrics, including reward maximization, safety assurance, and learning efficiency.
  • Figure 3: Comparison of our algorithm (ESPO) with PCPO [yang2019projection], CUP [yang2022constrained], and PPOLag [ji2023omnisafe] on the Omnisafe benchmark. Our algorithm performs significantly better than the SOTA baselines in reward, safety, and efficiency.
  • Figure 4: Ablation experiments with different cost limits and sample sizes.
  • Figure 5: Performance comparisons of safe RL methods on SafetyHumanoidStandup-v4 tasks.

Theorems & Definitions (10)

  • Theorem 4.1
  • Proposition 4.2
  • Proposition 4.3
  • Lemma A.1: Performance difference lemma [kakade2002approximately]
  • Lemma A.2
  • proof
  • Lemma A.3: Performance improvement bound for approximated NPG
  • Lemma A.4: Suboptimality gap bound for update rules of Algorithm 1
  • Lemma A.5
  • Lemma A.6: The frequency of optimizing reward objective