A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Ankita Kushwaha; Kiran Ravish; Preeti Lamba; Pawan Kumar

TL;DR

The paper addresses Safe Reinforcement Learning within the CMDP framework, formalizing safety as constraints $J_{c^{(i)}}(\pi)\le d_i$ imposed alongside the primary return $J(\pi)$. It reviews foundational CMDP theory, including Lagrangian duality via $\mathcal{L}(\pi,\lambda)$ and linear-programming formulations over occupancy measures. It then covers state-of-the-art single-agent methods (Lagrangian-based optimization, safety shields, and risk-sensitive approaches) and their SafeMARL extensions (MACPO, scalable MAPPO-Lagrangian, shielding, and Stackelberg methods). It also discusses five open challenges (zero-violation safe exploration, partial observability, decentralized safety, non-stationarity, and competitive multi-agent safety) and positions the CMDP as a unifying language bridging machine learning, control theory, and formal methods. The survey finds that single-agent SafeRL offers a mature set of tools for practical deployment in safety-critical domains, and identifies SafeMARL as a vibrant frontier that requires advances in theory, scalability, and robustness before it can serve real-world multi-agent systems.
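
For readers new to the notation, the constrained problem and its Lagrangian can be written out as below. This is a sketch of the standard discounted CMDP form, not necessarily the exact convention of the paper: the discount factor $\gamma$, the reward $r$, the number of constraints $m$, and the sign convention of $\mathcal{L}$ are assumptions introduced here for illustration.

$$
\begin{aligned}
J(\pi) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t,a_t)\Big], \qquad
J_{c^{(i)}}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c^{(i)}(s_t,a_t)\Big], \\
&\max_{\pi}\; J(\pi) \quad \text{subject to} \quad J_{c^{(i)}}(\pi) \le d_i, \quad i = 1,\dots,m, \\
\mathcal{L}(\pi,\lambda) &= J(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c^{(i)}}(\pi) - d_i\big), \qquad \lambda_i \ge 0 .
\end{aligned}
$$

Under this convention, Lagrangian-based methods seek a saddle point $\max_{\pi}\min_{\lambda \ge 0} \mathcal{L}(\pi,\lambda)$: each multiplier $\lambda_i$ increases while constraint $i$ is violated and shrinks toward zero once it is satisfied.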

Abstract

Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview of SafeRL formulations based on Constrained Markov Decision Processes (CMDPs) and extensions to Multi-Agent Safe RL (SafeMARL). We review theoretical foundations of CMDPs, covering definitions, constrained optimization techniques, and fundamental theorems. We then summarize state-of-the-art algorithms in SafeRL for single agents, including policy gradient methods with safety guarantees and safe exploration strategies, as well as recent advances in SafeMARL for cooperative and competitive settings. Additionally, we propose five open research problems to advance the field, with three focusing on SafeMARL. Each problem is described with motivation, key challenges, and related prior work. This survey is intended as a technical guide for researchers interested in SafeRL and SafeMARL, highlighting key concepts, methods, and open future research directions.
