GenSafe: A Generalizable Safety Enhancer for Safe Reinforcement Learning Algorithms Based on Reduced Order Markov Decision Process Model
Zhehua Zhou, Xuan Xie, Jiayang Song, Zhan Shu, Lei Ma
TL;DR
Safe Reinforcement Learning (SRL) struggles with both safety and task performance in the early learning stages, when too little data is available for accurate function approximation. The authors propose GenSafe, a Generalizable Safety Enhancer that builds a Reduced Order Markov Decision Process (ROMDP) from online data. The ROMDP serves as a low-dimensional safety predictor and is used to generate action-level corrections that increase constraint satisfaction. It is constructed via a five-step abstraction pipeline (state, action, cost, transition, policy), using techniques such as t-SNE for state reduction and a Gaussian Mixture Model for discretization, and its short-horizon value function $V^r_{C^r}$ is obtained by a modified value iteration. Experiments on eight Safety-Gym tasks show that GenSafe consistently reduces constraint violations across a range of SRL methods while maintaining competitive task performance, indicating broad applicability to safer online learning in complex systems.
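To make the idea of a short-horizon cost value on a reduced model concrete, the following is a minimal sketch of finite-horizon policy evaluation on a small tabular MDP. It is plain value iteration under a fixed abstract policy, not the authors' modified variant, whose details are not given here. The three abstract states, two abstract actions, the transition table `P`, the cost table `cost`, and the uniform `policy` are all hypothetical values chosen for illustration.

```python
def short_horizon_cost_value(P, cost, policy, horizon, gamma=1.0):
    """Expected cumulative cost over `horizon` steps for each abstract state.

    P[s][a][s2]  : probability of moving from state s to s2 under action a
    cost[s][a]   : immediate cost of taking action a in state s
    policy[s][a] : probability that the abstract policy picks a in s
    """
    n_states = len(P)
    values = [0.0] * n_states  # zero cost-to-go at the end of the horizon
    for _ in range(horizon):
        new_values = []
        for s in range(n_states):
            v = 0.0
            for a, pi_sa in enumerate(policy[s]):
                expected_next = sum(p * values[s2] for s2, p in enumerate(P[s][a]))
                v += pi_sa * (cost[s][a] + gamma * expected_next)
            new_values.append(v)
        values = new_values
    return values


# Hypothetical reduced model: state 0 = safe, 1 = near hazard, 2 = hazard (absorbing).
# Action 0 = retreat, action 1 = advance.
P = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
]
cost = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
policy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

values = short_horizon_cost_value(P, cost, policy, horizon=2)
print(values)
```

States closer to the hazard receive a higher predicted cost, which is the kind of signal a safety predictor can compare against a cost budget before an action is executed.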
Abstract
Safe Reinforcement Learning (SRL) aims to realize a safe learning process for Deep Reinforcement Learning (DRL) algorithms by incorporating safety constraints. However, the efficacy of SRL approaches often relies on accurate function approximations, which are notably challenging to achieve in the early learning stages due to data insufficiency. To address this issue, we introduce a novel Generalizable Safety enhancer (GenSafe) that overcomes the challenge of data insufficiency and enhances the performance of SRL approaches. Leveraging model order reduction techniques, we first propose a method to construct a Reduced Order Markov Decision Process (ROMDP) as a low-dimensional approximator of the original safety constraints. Then, by solving the reformulated ROMDP-based constraints, GenSafe refines the actions of the agent to increase the likelihood of constraint satisfaction. Essentially, GenSafe acts as an additional safety layer for SRL algorithms. We evaluate GenSafe on multiple SRL approaches and benchmark problems. The results demonstrate its capability to improve safety performance, especially in the early learning phases, while maintaining satisfactory task performance. GenSafe not only offers a novel means of augmenting existing SRL methods but also shows broad compatibility with various SRL algorithms, making it applicable to a wide range of systems and SRL problems.
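The notion of an additional safety layer that refines the agent's action can be illustrated with a small sketch. It does not solve the ROMDP-based constraints described above; it swaps in a simple line search between the proposed action and a known conservative action, and returns the first candidate whose predicted cost is within budget. The function `refine_action`, the `fallback` action, the `predict_cost` callable, and the `budget` are all hypothetical names and assumptions introduced for illustration.

```python
def refine_action(proposed, fallback, predict_cost, budget, steps=10):
    """Move `proposed` toward `fallback` until the predicted cost fits the budget.

    predict_cost is any callable mapping an action (list of floats) to a
    predicted safety cost; in a GenSafe-like setup it would be backed by the
    reduced model, but here it is an assumed stand-in.
    """
    best, best_cost = None, float("inf")
    for k in range(steps + 1):
        alpha = k / steps  # 0 keeps the proposed action, 1 uses the fallback
        candidate = [(1 - alpha) * p + alpha * f for p, f in zip(proposed, fallback)]
        c = predict_cost(candidate)
        if c <= budget:
            return candidate  # smallest correction that satisfies the budget
        if c < best_cost:
            best, best_cost = candidate, c
    return best  # no candidate is within budget: return the least costly one


# Hypothetical cost predictor: cost grows with the magnitude of the first component.
def toy_cost(action):
    return abs(action[0])


refined = refine_action([1.0, 0.0], [0.0, 0.0], toy_cost, budget=0.55)
print(refined)
```

Because the search starts at the proposed action, an action that is already predicted to be safe passes through unchanged, so the layer only intervenes when the predictor flags a likely violation.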
