Safety Optimized Reinforcement Learning via Multi-Objective Policy Optimization

Homayoun Honari, Mehran Ghafarian Tamizi, Homayoun Najjaran

TL;DR

Safety Optimized Reinforcement Learning (SORL) reframes Safe RL as a multi-objective policy optimization problem, enabling simultaneous pursuit of reward and safety without constraining the policy search space. It introduces a safety critic-driven reward shaping scheme, with an aggressiveness parameter Δ guided by a derived safety guarantee to control the safety-performance tradeoff. The method employs dual safety critics and dynamic λ updates within a model-free RL framework, achieving a Pareto-optimal balance between safety violations and returns across seven robotic tasks. Empirical results show SORL reduces safety violations substantially while maintaining competitive or superior returns compared to six state-of-the-art Safe RL methods, underscoring its practicality for safety-critical applications.

Abstract

Safe reinforcement learning (Safe RL) refers to a class of techniques that aim to prevent RL algorithms from violating constraints during decision-making and trial-and-error exploration. This paper introduces a novel model-free Safe RL algorithm, formulated within the multi-objective policy optimization framework, in which the policy is optimized towards optimality and safety simultaneously. Optimality is pursued through the environment reward function, which is subsequently shaped using a safety critic. The advantage of the Safety Optimized RL (SORL) algorithm over traditional Safe RL algorithms is that it removes the need to constrain the policy search space. This allows SORL to find a natural tradeoff between safety and optimality without sacrificing performance on either because of strict search space constraints. Through our theoretical analysis of SORL, we propose a condition under which SORL's converged policy guarantees safety, and then use it to introduce an aggressiveness parameter that allows this tradeoff to be fine-tuned. The experimental results obtained in seven different robotic environments indicate a considerable reduction in the number of safety violations along with higher, or competitive, policy returns in comparison to six different state-of-the-art Safe RL methods. The results demonstrate the significant superiority of the proposed SORL algorithm in safety-critical applications.

Paper Structure (17 sections, 2 theorems, 19 equations, 3 figures, 1 algorithm)

This paper contains 17 sections, 2 theorems, 19 equations, 3 figures, 1 algorithm.

Key Result

Lemma 1

In an environment where Assumption Horizon holds, for any trajectory $\tau=\{(s_0,a_0),\ldots,(s_{|\tau|},a_{|\tau|})\}$ where $s_0\in S_{irrecoverable}$, $s_t\sim P(s_{t-1},a_{t-1})$, and $s_{|\tau|+1}\sim P(s_{|\tau|},a_{|\tau|})\in S_{unsafe}$, and for any $t\in\{1,\ldots,|\tau|\}$ we have:

Figures (3)

  • Figure 1: Visualization of the simulated safety-concerned robotics environments used to evaluate SORL, covering three main safety topics: system-level safety (a-d), collision avoidance (e, f), and safe manipulation (g). The top row also showcases some possible constraint violations for the system-level safety benchmarks.
  • Figure 2: Benchmark results of SORL compared with six other Safe RL algorithms. (Top row): Return values achieved during the training phase (higher is better). (Middle row): Episodic failure rates incurred during the training phase (lower is better). (Bottom row): Pareto optimality plots of the return and failure rate values (closer to the top right corner is better). The bottom-row plots also display the specific value of $\Delta$ used to run SORL. For easier comparison, the return and failure rate values are normalized, and the failure rate is scaled to lie between $-1$ and $0$. The Pareto-optimal solutions are highlighted by the dotted line.
  • Figure 3: Undiscounted return of the policy versus the total number of violations during the training phase.
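
The Pareto optimality plot described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes min-max normalization (the caption does not say which normalization the authors use), and the function names and all numbers below are hypothetical, not results from the paper. Each method becomes a point (normalized return, negated normalized failure rate), so higher is better on both axes, and a point is Pareto-optimal when no other point is at least as good on both axes and strictly better on one.

```python
from typing import List, Tuple


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values to [0, 1]. Assumed normalization; the paper may use another."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points (both coordinates: higher is better)."""
    front = []
    for i, (ri, si) in enumerate(points):
        dominated = any(
            rj >= ri and sj >= si and (rj > ri or sj > si)
            for j, (rj, sj) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical per-method results, for illustration only.
returns = [100.0, 80.0, 120.0, 60.0]
failure_rates = [0.30, 0.10, 0.50, 0.20]

norm_returns = min_max_normalize(returns)
# Negate so the failure axis lies in [-1, 0] and "closer to 0" means safer.
neg_failures = [-f for f in min_max_normalize(failure_rates)]

points = list(zip(norm_returns, neg_failures))
front = pareto_front(points)
print(front)
```

In this toy data the fourth method is dominated by the second, which has both a higher return and a lower failure rate, so it is left off the front. The remaining three each win on at least one axis against every competitor.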

Theorems & Definitions (4)

  • Definition 1
  • Lemma 1
  • Theorem 1
  • proof