GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model

Zhehua Zhou; Xuan Xie; Jiayang Song; Zhan Shu; Lei Ma

TL;DR

Safe Reinforcement Learning (SRL) struggles with both safety and task performance early in learning, when data are scarce. The authors propose GenSafe, a Generalizable Safety Enhancer that builds a Reduced Order Markov Decision Process (ROMDP) from online data. The ROMDP acts as a low-dimensional safety predictor and is used to generate action-level corrections that make constraint satisfaction more likely. It is constructed via a five-step abstraction pipeline (state, action, cost, transition, policy), using t-SNE for state reduction and a Gaussian Mixture Model for discretization, and a short-horizon value function $V^r_{C^r}$ is derived by a modified value iteration. Experiments on eight Safety-Gym tasks show that GenSafe consistently reduces constraint violations across a range of SRL methods while maintaining competitive task performance.
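To make the idea of a short-horizon cost value on a tabular reduced model concrete, the following is a minimal sketch and not the paper's modified value iteration, whose details are not given in this summary. It assumes a finite-horizon Bellman expectation backup under a fixed reduced policy; the function name `short_horizon_cost_values` and the containers `P`, `C`, and `pi` are hypothetical names introduced only for illustration.

```python
def short_horizon_cost_values(P, C, pi, horizon, gamma=1.0):
    """Expected cumulative cost over `horizon` steps for each reduced state.

    Assumed (hypothetical) tabular layout:
      P[s][a]  -> dict {next_state: probability}
      C[s][a]  -> immediate cost of taking reduced action a in reduced state s
      pi[s][a] -> probability that the reduced policy picks a in s
    """
    n_states = len(C)
    values = [0.0] * n_states  # zero cost-to-go at horizon 0
    for _ in range(horizon):
        new_values = []
        for s in range(n_states):
            total = 0.0
            for a, p_a in enumerate(pi[s]):
                # Expected cost-to-go after the transition from (s, a).
                expected_next = sum(p * values[s2] for s2, p in P[s][a].items())
                total += p_a * (C[s][a] + gamma * expected_next)
            new_values.append(total)
        values = new_values
    return values


# Toy reduced model: state 0 is safe and always moves to state 1,
# state 1 is unsafe (cost 1) and absorbing. One reduced action only.
P = [[{1: 1.0}], [{1: 1.0}]]
C = [[0.0], [1.0]]
pi = [[1.0], [1.0]]
values = short_horizon_cost_values(P, C, pi, horizon=3)
```

In this toy model the value of the safe state grows with the horizon because every trajectory ends up in the unsafe absorbing state, which is the kind of look-ahead signal a future-cost constraint could use.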

Abstract

Safe Reinforcement Learning (SRL) aims to realize a safe learning process for Deep Reinforcement Learning (DRL) algorithms by incorporating safety constraints. However, the efficacy of SRL approaches often relies on accurate function approximations, which are notably challenging to achieve in the early learning stages due to data insufficiency. To address this issue, we introduce in this work a novel Generalizable Safety Enhancer (GenSafe) that is able to overcome the challenge of data insufficiency and enhance the performance of SRL approaches. Leveraging model order reduction techniques, we first propose an innovative method to construct a Reduced Order Markov Decision Process (ROMDP) as a low-dimensional approximator of the original safety constraints. Then, by solving the reformulated ROMDP-based constraints, GenSafe refines the actions of the agent to increase the likelihood of constraint satisfaction. Essentially, GenSafe acts as an additional safety layer for SRL algorithms. We evaluate GenSafe on multiple SRL approaches and benchmark problems. The results demonstrate its capability to improve safety performance, especially in the early learning phases, while maintaining satisfactory task performance. Our proposed GenSafe not only offers a novel measure to augment existing SRL methods but also shows broad compatibility with various SRL algorithms, making it applicable to a wide range of systems and SRL problems.

Paper Structure (30 sections, 26 equations, 7 figures, 2 algorithms)

Introduction
Related Work
Model-free SRL
Model-based SRL
Preliminary
Constrained Markov Decision Process (CMDP)
Safe Reinforcement Learning (SRL)
PPO-Lagrangian
Reduced Order Markov Decision Process
Data Samples
Construction of ROMDP
State Abstraction
Action Abstraction
Cost Abstraction
Transition Abstraction
...and 15 more sections

Figures (7)

Figure 1: SRL with GenSafe. At each timestep, the current SRL policy recommends an action $a_t$ based on the current state $s_t$. Then, the proposed GenSafe performs an action correction to identify a modified action $a_m$ that is more likely to satisfy the safety constraints. This corrective process involves solving an optimization problem that considers both the immediate and future cost constraints, which are derived from the constructed ROMDP. We utilize the set of data samples $\mathcal{D}$ observed during the learning process to construct the ROMDP, which serves as a low-dimensional approximator of the original cost function in the CMDP.
Figure 2: Example of the state abstraction. By applying t-SNE, a set of original states $\mathcal{D}_s =\{s_1, \ldots, s_7\}$ is transformed into a corresponding set of deterministic low-dimensional states $\mathcal{D}_l = \{l_1, \ldots, l_7\}$, where similar high-dimensional data points are represented by nearby low-dimensional states. The mapping function $f_l$, trained with $\mathcal{D}_s$ and $\mathcal{D}_l$, reduces the high-dimensional state space $S$ to a two-dimensional state space $S^l$. The GMM classifier $f_{\text{GMM}}$ then divides $S^l$ into $k_s=4$ regions, each assigned an index $v_s \in \{1,2,3,4\}$. The reduced state $s^r$ is thus determined by the state abstraction function $s^r = f_s(s) = f_{\text{GMM}}(f_l(s))$, e.g., we have $s^r_1 = f_s(s_1) = s^r_3 = f_s(s_3) = 1$ and $\mathcal{D}_{s^r} = \{s^r_1,\ldots,s^r_7\} = \{1,4,1,2,3,4,3\}$ in this example.
Figure 3: Example of the action abstraction. For a two-dimensional action space $A \subseteq \mathbb{R}^2$, we discretize each dimension into $k_a =3$ intervals, which results in a total of $k_a^{n_a}=9$ grid cells. Each grid cell is assigned an index $v_a \in \{1,\ldots,9\}$, and $a_c(1),\ldots,a_c(9)$ denote the centers of the grid cells. For a set of original applied actions $\mathcal{D}_a = \{a_1,\ldots,a_4 \}$, we thus have $a^r_1 = f_a(a_1) =5$, $a^r_2 = f_a(a_2) = a^r_3 = f_a(a_3) = 9$, $a^r_4 = f_a(a_4) = 1$.
Figure 4: Eight SRL tasks used in our experiments. Readers are referred to the supplementary material and Ji et al. (2023) for more details about these task environments.
Figure 5: (a) The set of low-dimensional states $\mathcal{D}_{l} = \{l_1, \ldots, l_{20000}\}$ distributed over a two-dimensional state space $S^l$, derived by applying t-SNE to the set of original states $\mathcal{D}_{s} = \{s_1, \ldots, s_{20000} \}$. The dimensions $l_x$ and $l_y$ represent two abstract features of these low-dimensional states. Each low-dimensional state corresponds to an observed high-dimensional data point that has a cost of either $c=0$ (blue) or $c=1$ (green), where $c=0$ and $c=1$ denote safe and unsafe observations, respectively. (b) Results of the GMM classifier. Each low-dimensional state $l_i$ is categorized into one of the $k_s=100$ cluster regions, with each region represented by a different color and assigned an index $v_s \in \{1,2,\ldots,100 \}$. The reduced state $s^r$ corresponding to each original state $s$ is therefore given by the cluster index, i.e., $s^r = v_s = f_s(s) = f_\text{GMM}(l) = f_\text{GMM}(f_l(s))$. This results in a dataset of reduced states $\mathcal{D}_{s^r} = \{s^r_1,\ldots,s^r_{20000}\}$.
...and 2 more figures
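The grid-based action abstraction described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a box-shaped action space with uniform bins, a row-major ordering of the cell indices (the caption does not specify the ordering), and clipping of out-of-range actions to the nearest cell. The helper name `make_action_abstraction` and the sample actions are hypothetical.

```python
import math


def make_action_abstraction(low, high, k_a):
    """Build (f_a, a_c) for a uniform grid with k_a bins per action dimension.

    f_a maps a continuous action to a 1-based cell index in {1, ..., k_a**n_a};
    a_c maps a cell index back to the center of that cell.
    Assumption: row-major index ordering.
    """
    n_a = len(low)
    widths = [(h - l) / k_a for l, h in zip(low, high)]

    def f_a(action):
        index = 0
        for d in range(n_a):
            # Bin along dimension d, clipped so boundary actions stay in range.
            b = math.floor((action[d] - low[d]) / widths[d])
            b = min(max(b, 0), k_a - 1)
            index = index * k_a + b
        return index + 1  # 1-based, matching v_a in the caption

    def a_c(v_a):
        rem = v_a - 1
        bins = []
        for _ in range(n_a):
            bins.append(rem % k_a)
            rem //= k_a
        bins.reverse()  # undo least-significant-first extraction
        return [low[d] + (bins[d] + 0.5) * widths[d] for d in range(n_a)]

    return f_a, a_c


# Two-dimensional action space [-1, 1]^2 with k_a = 3, i.e. 9 grid cells.
f_a, a_c = make_action_abstraction(low=[-1.0, -1.0], high=[1.0, 1.0], k_a=3)
reduced = [f_a(a) for a in ([0.0, 0.0], [0.9, 0.9], [0.8, 0.7], [-0.9, -0.9])]
```

With these hypothetical sample actions, the reduced indices follow the same pattern as the caption's example (one action in the central cell, two sharing a corner cell, one in the opposite corner), and `a_c` recovers a representative continuous action for each reduced action.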

Theorems & Definitions (9)

Remark 1
Definition 1: ROMDP
Remark 2
Example 1
Remark 3
Example 2
Example 3
Remark 4
Remark 5
