Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
TL;DR
Robust-Gymnasium is a unified, modular benchmark for systematically evaluating robust reinforcement learning under disruptions to observations, actions, rewards, and the environment. The paper formalizes these disruptions within a disrupted-MDP framework and assembles over 60 tasks from robotics, safe RL, and multi-agent RL, enabling standardized assessment of state-of-the-art standard, robust, safe, and multi-agent RL algorithms. It also introduces a novel LLM-driven adversarial disturbance. Experimental results show that existing methods often struggle under multi-source, adversarial, and non-stationary disruptions, which underscores the need for new robust RL approaches and supports the benchmark’s utility for development and comparison. The work aims to accelerate progress toward reliable, real-world-capable RL systems by providing rich task diversity, flexible disruption design, and clear evaluation protocols.
Abstract
Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchmarks for robust RL. Current robust RL policies often focus on a specific type of uncertainty and are evaluated in distinct, one-off environments. In this work, we introduce Robust-Gymnasium, a unified modular benchmark designed for robust RL that supports a wide variety of disruptions across all key RL components: agents' observed state and reward, agents' actions, and the environment. Offering over sixty diverse task environments spanning control and robotics, safe RL, and multi-agent RL, it provides an open-source and user-friendly tool for the community to assess current methods and foster the development of robust RL algorithms. In addition, we benchmark existing standard and robust RL algorithms within this framework, uncovering significant deficiencies in each and offering new insights.
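To make the four disruption sources concrete, the following is a minimal illustrative sketch in plain Python. It is not the Robust-Gymnasium API: `ToyEnv`, `DisruptedEnv`, `gaussian_noise`, and the `gain` parameter are hypothetical names invented for this example, and the one-dimensional dynamics are an assumption chosen only to keep the code self-contained. The wrapper perturbs the observation and reward the agent sees, the action the environment receives, and a parameter of the environment itself.

```python
import random
from typing import Callable, Optional, Tuple

# A disruptor maps a clean scalar signal to a perturbed one (hypothetical type).
Disruptor = Callable[[float], float]


def identity(x: float) -> float:
    """No disruption: pass the signal through unchanged."""
    return x


class ToyEnv:
    """Hypothetical 1-D task: the state moves by gain * action; reward is -|state|."""

    def __init__(self, gain: float = 1.0) -> None:
        self.gain = gain  # environment parameter that a disruption may alter
        self.state = 0.0

    def reset(self) -> float:
        self.state = 1.0
        return self.state

    def step(self, action: float) -> Tuple[float, float]:
        self.state += self.gain * action
        return self.state, -abs(self.state)


class DisruptedEnv:
    """Wraps an environment and disrupts observation, action, reward, and dynamics."""

    def __init__(
        self,
        env: ToyEnv,
        obs_disruptor: Disruptor = identity,
        action_disruptor: Disruptor = identity,
        reward_disruptor: Disruptor = identity,
        env_disruptor: Optional[Callable[[ToyEnv], None]] = None,
    ) -> None:
        self.env = env
        self.obs_disruptor = obs_disruptor
        self.action_disruptor = action_disruptor
        self.reward_disruptor = reward_disruptor
        self.env_disruptor = env_disruptor

    def reset(self) -> float:
        # The agent only ever sees the disrupted observation.
        return self.obs_disruptor(self.env.reset())

    def step(self, action: float) -> Tuple[float, float]:
        if self.env_disruptor is not None:
            self.env_disruptor(self.env)  # e.g. shift a dynamics parameter
        obs, reward = self.env.step(self.action_disruptor(action))
        return self.obs_disruptor(obs), self.reward_disruptor(reward)


def gaussian_noise(sigma: float, seed: int = 0) -> Disruptor:
    """Build a disruptor that adds zero-mean Gaussian noise from a seeded generator."""
    rng = random.Random(seed)
    return lambda x: x + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    # Noisy observations combined with a fixed action offset (two disruption sources).
    env = DisruptedEnv(
        ToyEnv(),
        obs_disruptor=gaussian_noise(0.1),
        action_disruptor=lambda a: a + 0.05,
    )
    obs = env.reset()
    for _ in range(3):
        obs, reward = env.step(-0.5 * obs)  # simple proportional controller
        print(f"obs={obs:.3f} reward={reward:.3f}")
```

Keeping each disruption as an independent callable is one way to mirror the modular design the abstract describes: sources can be enabled separately or combined, which is the multi-source setting in which the TL;DR reports that existing methods often struggle.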
