Safe Multi-Agent Reinforcement Learning with Convergence to Generalized Nash Equilibrium

Zeyang Li, Navid Azizan

TL;DR

A novel theoretical framework for safe MARL with state-wise constraints, where safety requirements are enforced at every state the agents visit, and a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained Markov games, achieving an optimal balance between feasibility and performance.

Abstract

Multi-agent reinforcement learning (MARL) has achieved notable success in cooperative tasks, demonstrating impressive performance and scalability. However, deploying MARL agents in real-world applications presents critical safety challenges. Current safe MARL algorithms are largely based on the constrained Markov decision process (CMDP) framework, which enforces constraints only on discounted cumulative costs and lacks an all-time safety assurance. Moreover, these methods often overlook the feasibility issue (the system will inevitably violate state constraints within certain regions of the constraint set), resulting in either suboptimal performance or increased constraint violations. To address these challenges, we propose a novel theoretical framework for safe MARL with $\textit{state-wise}$ constraints, where safety requirements are enforced at every state the agents visit. To resolve the feasibility issue, we leverage a control-theoretic notion of the feasible region, the controlled invariant set (CIS), characterized by the safety value function. We develop a multi-agent method for identifying CISs, ensuring convergence to a Nash equilibrium on the safety value function. By incorporating CIS identification into the learning process, we introduce a multi-agent dual policy iteration algorithm that guarantees convergence to a generalized Nash equilibrium in state-wise constrained cooperative Markov games, achieving an optimal balance between feasibility and performance. Furthermore, for practical deployment in complex high-dimensional systems, we propose $\textit{Multi-Agent Dual Actor-Critic}$ (MADAC), a safe MARL algorithm that approximates the proposed iteration scheme within the deep RL paradigm. Empirical evaluations on safe MARL benchmarks demonstrate that MADAC consistently outperforms existing methods, delivering much higher rewards while reducing constraint violations.

Paper Structure

This paper contains 9 sections, 6 theorems, 30 equations, 4 figures, 3 algorithms.

Key Result

Lemma 1

The safety value function $V_h^{\pi_h}$ satisfies a self-consistency condition. Furthermore, the associated safety self-consistency operator $\mathcal{T}_h^{\pi_h}$ is a contraction mapping, and $V_h^{\pi_h}$ is the unique fixed point of $\mathcal{T}_h^{\pi_h}$.
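
Lemma 1 says that the safety self-consistency operator is a contraction whose unique fixed point is the safety value function, which suggests that the function can be computed by repeated application of the operator. The exact operator is not reproduced in this summary, so the sketch below assumes the discounted form common in the reachability literature, $(\mathcal{T}V)(s) = (1-\gamma)h(s) + \gamma \max\{h(s), V(s')\}$, with deterministic closed-loop dynamics under a fixed joint policy and the convention that $h(s) \le 0$ means the constraint holds. The four-state chain, the array `next_state`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def safety_operator(V, h, next_state, gamma):
    """Apply the ASSUMED discounted safety self-consistency operator once.

    (T V)(s) = (1 - gamma) * h(s) + gamma * max(h(s), V(s')),
    where s' = next_state[s] is the successor under a fixed joint policy.
    """
    return (1.0 - gamma) * h + gamma * np.maximum(h, V[next_state])


def evaluate_safety_value(h, next_state, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Fixed-point iteration; converges because the operator is a gamma-contraction."""
    V = np.zeros_like(h, dtype=float)
    for _ in range(max_iters):
        V_new = safety_operator(V, h, next_state, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical 4-state chain (not from the paper):
#   state 0 loops on itself and is safe,
#   state 1 moves to state 0,
#   state 2 moves to state 3,
#   state 3 loops on itself and violates the constraint (h > 0).
h = np.array([-1.0, -0.5, -0.5, 1.0])
next_state = np.array([0, 0, 3, 3])
gamma = 0.9

V = evaluate_safety_value(h, next_state, gamma)

# States with V <= 0 form the feasible region under this policy.
feasible = np.flatnonzero(V <= 0).tolist()
print("safety values:", np.round(V, 4))
print("feasible states:", feasible)

# Numerical check of the contraction property on random inputs.
rng = np.random.default_rng(0)
A, B = rng.normal(size=4), rng.normal(size=4)
lhs = np.max(np.abs(safety_operator(A, h, next_state, gamma)
                    - safety_operator(B, h, next_state, gamma)))
rhs = gamma * np.max(np.abs(A - B))
print("contraction holds:", lhs <= rhs + 1e-12)
```

In this toy chain, state 2 currently satisfies the constraint ($h = -0.5$) yet has a positive safety value because it inevitably reaches state 3. This is the feasibility issue the abstract describes: membership in the constraint set does not imply membership in the feasible region.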

Figures (4)

  • Figure 1: Snapshots of three robots.
  • Figure 2: Training curves on HalfCheetah environments. While standard MARL baselines (HASAC and HAPPO) achieve higher rewards, they violate the constraints heavily. MADAC outperforms safe MARL baselines (MACPO and MAPPO-Lagrangian) by achieving significantly higher rewards, while maintaining equal or superior compliance with safety constraints.
  • Figure 3: Training curves on Walker2D environments. While standard MARL baselines (HASAC and HAPPO) achieve higher rewards, they violate the constraints heavily. MADAC outperforms safe MARL baselines (MACPO and MAPPO-Lagrangian) by achieving significantly higher rewards, while maintaining equal or superior compliance with safety constraints.
  • Figure 4: Training curves on Ant environments. Standard MARL baselines (HASAC and HAPPO) violate the constraints heavily. MADAC achieves rewards comparable to HASAC and substantially higher than HAPPO, while adhering to safety requirements. MADAC outperforms safe MARL baselines (MACPO and MAPPO-Lagrangian) by achieving significantly higher rewards, while maintaining equal or superior compliance with safety constraints.

Theorems & Definitions (20)

  • Definition 1: State-wise Constrained Cooperative Markov Game
  • Definition 2: Constraint Set
  • Definition 3: Safety Value Function [yu2022reachability, li2024safe]
  • Definition 4: Controlled Invariant Set [li2024safe]
  • Lemma 1: Self-consistency Condition for Safety Value Function [yu2022reachability, li2024safe]
  • Remark 1
  • Theorem 1: Convergence of Multi-Agent Safety Policy Iteration
  • proof
  • Remark 2
  • Definition 5: Invariant Action Set
  • ...and 10 more