OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research

Jiaming Ji; Jiayi Zhou; Borong Zhang; Juntao Dai; Xuehai Pan; Ruiyang Sun; Weidong Huang; Yiran Geng; Mickel Liu; Yaodong Yang

TL;DR

This work introduces OmniSafe, an open-source infrastructure designed to accelerate Safe Reinforcement Learning (SafeRL) research by providing a highly modular framework, parallelized training capabilities, and rigorous reproducibility. It identifies the lack of cohesive SafeRL tooling and demonstrates OmniSafe's broad algorithm coverage (On-Policy, Off-Policy, Model-Based, Offline), along with Adapter/Wrapper abstractions that unify problem formulations and environments. The platform emphasizes safety by supporting a wide range of SafeRL paradigms (e.g., primal-dual and penalty methods) and includes extensive documentation, tutorials, and an Experiment Grid to streamline multi-seed experiments. Through validations in Safety-Gymnasium MuJoCo Navigation and Velocity environments, OmniSafe aims to standardize SafeRL tooling and lower barriers to rapid, reliable research toward safer AI systems.

Abstract

AI systems empowered by reinforcement learning (RL) algorithms have immense potential to catalyze societal advancement, yet their deployment is often impeded by significant safety concerns. Particularly in safety-critical applications, researchers have raised concerns about unintended harms or unsafe behaviors of unaligned RL agents. The philosophy of safe reinforcement learning (SafeRL) is to align RL agents with harmless intentions and safe behavioral patterns. In SafeRL, agents learn optimal policies from environment feedback while also minimizing the risk of unintended harm or unsafe behavior. However, because SafeRL algorithms are intricate to implement, combining methodologies across various domains presents a formidable challenge. This has led to the absence of a cohesive and effective learning framework in contemporary SafeRL research. In this work, we introduce a foundational framework designed to expedite SafeRL research. Our comprehensive framework encompasses an array of algorithms spanning different RL domains and places heavy emphasis on safety elements. We aim to make the SafeRL research process more streamlined and efficient, thereby facilitating further research in AI safety. Our project is released at: https://github.com/PKU-Alignment/omnisafe.

Paper Structure (14 sections, 15 figures, 4 tables)

Introduction
Lack of OSS Infrastructure for SafeRL Research
Features of OmniSafe
(1) High Modularity.
(2) High-performance parallel computing acceleration.
(3) Code reliability and reproducibility.
(4) Fostering the Growth of SafeRL Community.
DataFlows of OmniSafe
Conclusion and Outlook
Implemented Algorithms of OmniSafe
Safety-Gymnasium Experiment Results of OmniSafe
Experiment Grid of OmniSafe
The Documentation of OmniSafe
Features of OmniSafe

Figures (15)

Figure 1: The core features of OmniSafe include (a) comprehensive API documentation with user guides, examples, and best practices for efficient learning, available at https://omnisafe.readthedocs.io; (b) streamlined algorithm training through single-file execution, simplifying setup and management; (c) versatility through algorithm-level abstraction and API interfaces; and (d) enhanced training stability and speed through environment-level asynchronous parallelism and agent asynchronous learning.
Figure 2: A high-level depiction of OmniSafe's distributed dataflow process. Each process periodically syncs weights and all-reduces gradients with other processes. First, Vectorized Environments generate trajectories of the agent's interactions with the environment. Second, the EnvWrapper monitors and governs the environment's status (e.g., Auto-Reset) and outputs. Then, the Adapter assigns a suitable execution plan that handles data pre-processing. Next, the Learner gathers the pre-processed data, calls the learning algorithm, and trains the model. Lastly, the ActionWrapper transforms the model's outputs into actions interpretable by the environments, completing one cycle of the dataflow.
Figure 3: Training curves in Safety-Gymnasium MuJoCo Velocity environments, covering all classical reinforcement learning algorithms mentioned in \ref{compare}. The rewards are obtained from 1e6 steps of interaction.
Figure 4: Training curves in Safety-Gymnasium MuJoCo Navigation and Velocity environments. The rewards and costs are obtained from 1e7 steps of interaction. Dashed black lines indicate the target cost value for a safe policy, which is set to 25.0.
Figure 5: OmniSafe's Experiment Grid. The left side of the figure displays the main function of the run_experiment_grid.py file, while the right side shows the status of the Experiment Grid execution. In this example, three distinct random seeds are selected for the SafetyAntVelocity-v1 and SafetyWalker2dVelocity-v1 environments, and the PPO-Lag and TRPO-Lag algorithms are then executed.
...and 10 more figures
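
The Experiment Grid example in the Figure 5 caption (two algorithms, two environments, three seeds) amounts to running every combination of those settings. The snippet below is a minimal, library-free sketch of that expansion, not OmniSafe's own API and not the contents of run_experiment_grid.py. The helper name `expand_grid`, the dictionary keys, and the concrete seed values are assumptions made for illustration; only the algorithm and environment names come from the caption.

```python
from itertools import product

# Hypothetical grid mirroring the Figure 5 example.
# The seed values are assumed; the caption only says three distinct seeds.
GRID = {
    "algo": ["PPO-Lag", "TRPO-Lag"],
    "env_id": ["SafetyAntVelocity-v1", "SafetyWalker2dVelocity-v1"],
    "seed": [0, 5, 10],
}


def expand_grid(grid):
    """Return one config dict per combination of the grid's values."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


configs = expand_grid(GRID)

# Each config would be handed to one training run; here it is only printed.
for index, config in enumerate(configs):
    print(index, config)
```

With two algorithms, two environments, and three seeds, the grid expands to 2 x 2 x 3 = 12 independent runs, which is what makes a pool of parallel worker processes a natural fit for executing it.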
