Solving Multi-Agent Safe Optimal Control with Distributed Epigraph Form MARL

Songyuan Zhang, Oswin So, Mitchell Black, Zachary Serlin, Chuchu Fan

TL;DR

This work tackles the multi-agent safe optimal control problem (MASOCP) with zero constraint violation by casting it into epigraph form and distributing the optimization via Def-MARL. It extends the epigraph approach to the centralized training distributed execution (CTDE) paradigm: an inner, centralized MARL optimization trains a $z$-conditioned policy, and a distributed outer optimization computes the minimal safe cost upper bound $z$ online. The approach trains stably, satisfies the safety constraints, and performs competitively across diverse simulations and hardware experiments, outperforming penalty-based and Lagrangian baselines that are sensitive to hyperparameters. This framework offers a practical route to safe, scalable coordination in real-world multi-robot systems without sacrificing performance.

Abstract

Tasks for multi-robot systems often require the robots to collaborate and complete a team goal while maintaining safety. This problem is usually formalized as a constrained Markov decision process (CMDP), which targets minimizing a global cost and bringing the mean of constraint violation below a user-defined threshold. Inspired by real-world robotic applications, we define safety as zero constraint violation. While many safe multi-agent reinforcement learning (MARL) algorithms have been proposed to solve CMDPs, these algorithms suffer from unstable training in this setting. To tackle this, we use the epigraph form for constrained optimization to improve training stability and prove that the centralized epigraph form problem can be solved in a distributed fashion by each agent. This results in a novel centralized training distributed execution MARL algorithm named Def-MARL. Simulation experiments on 8 different tasks across 2 different simulators show that Def-MARL achieves the best overall performance, satisfies safety constraints, and maintains stable training. Real-world hardware experiments on Crazyflie quadcopters demonstrate the ability of Def-MARL to safely coordinate agents to complete complex collaborative tasks compared to other methods.
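
To make the term concrete, the following is the standard textbook epigraph form for a generic constrained problem, not the paper's exact multi-agent formulation. The symbols $f$ (cost), $h$ (constraint), and $x$ (decision variable) are placeholders assumed for illustration; $z$ plays the role of the auxiliary cost upper bound mentioned above, and the last form assumes the inner minimum is attained.

$$
\min_{x} f(x) \ \ \text{s.t.}\ \ h(x) \le 0
\quad\Longleftrightarrow\quad
\min_{x,\,z} z \ \ \text{s.t.}\ \ f(x) \le z,\ \ h(x) \le 0
\quad\Longleftrightarrow\quad
\min_{z} z \ \ \text{s.t.}\ \ \min_{x} \max\{\, h(x),\ f(x) - z \,\} \le 0 .
$$

The last form separates an inner unconstrained minimization over $x$ for a fixed $z$ from an outer one-dimensional search over $z$, which is the inner/outer split the abstract refers to.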

Paper Structure

This paper contains 39 sections, 5 theorems, 62 equations, 16 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

Dynamic programming can be applied to EF-MASOCP (eq: ef-macocp), resulting in

Figures (16)

  • Figure 1: Two agents using Def-MARL to safely and collaboratively inspect a moving target. We propose a novel safe MARL algorithm, Def-MARL, that solves the multi-agent safe optimal control problem. Def-MARL translates the original problem to its epigraph form to avoid unstable training and extends the epigraph form to the CTDE paradigm for distributed execution. (a): Long exposure photo of the trajectories of the drones. The trajectory of the target is shown in green and that of the agents is shown in blue. (b)-(i): Snapshots of the agents' policy. Using Def-MARL, the agents learn to collaborate to maintain visual contact with the target at all times, with each agent being responsible only when the target is on their side.
  • Figure 2: Def-MARL algorithm. Randomly sampled initial states and $z^0$ are used to collect trajectories in $x$ and $z$ using the current policy $\pi$. In the centralized training (orange blocks), distributed constraint-value functions $V^h_i$ and policies $\pi_i$ and a centralized cost-value function $V^l$ are jointly trained. During distributed execution (green blocks), the distributed $V^h_i$ are used to solve the outer problem (eq: dec-ef-macocp-2) to compute the optimal $z_i$, which is used in each agent's $z$-conditioned policy.
  • Figure 3: Simulation Environments. Visualization of the (top) modified MPE [lowe2017multi] and (bottom) Safe Multi-agent MuJoCo [gu2023safe] environments we consider.
  • Figure 4: Comparison on modified MPE ($N=3$) and Safe Multi-agent MuJoCo. Def-MARL is consistently closest to the top-left corner in all environments, achieving low cost with near $100\%$ safety rate. The dots show the mean values and the error bars show one standard deviation.
  • Figure 5: Converged states in Corridor. Def-MARL achieves the global minimum, while other baselines converge to a different optimum (partly) due to training using a different cost function.
  • ...and 11 more figures
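
The outer problem described in the Figure 2 caption can be sketched as a one-dimensional search: each agent looks for the smallest $z_i$ at which its learned constraint-value function reports safety. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. The function name `smallest_safe_z`, the search bounds, the use of bisection, and the assumption that the constraint value is non-increasing in $z$ are all hypothetical; `toy_v_h` stands in for a trained $V^h_i$ evaluated at a fixed observation.

```python
from typing import Callable


def smallest_safe_z(
    v_h: Callable[[float], float],
    z_lo: float,
    z_hi: float,
    tol: float = 1e-6,
) -> float:
    """Return (approximately) the smallest z in [z_lo, z_hi] with v_h(z) <= 0.

    Assumption (hypothetical, for illustration): v_h is non-increasing in z,
    so the safe set {z : v_h(z) <= 0} is an interval extending to z_hi.
    """
    if v_h(z_hi) > 0.0:
        # No z in the range is predicted safe; fall back to the upper bound.
        return z_hi
    if v_h(z_lo) <= 0.0:
        # Even the lowest cost bound is predicted safe.
        return z_lo
    lo, hi = z_lo, z_hi  # invariant: v_h(lo) > 0 and v_h(hi) <= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_h(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return hi


def toy_v_h(z: float) -> float:
    """Toy stand-in for one agent's constraint-value function (hypothetical)."""
    return 2.0 - z


z_star = smallest_safe_z(toy_v_h, z_lo=0.0, z_hi=10.0)
```

In this toy case the returned `z_star` lies just above the crossing point of `toy_v_h`. In a real deployment the callable would wrap a neural network forward pass, and the value found would be passed to that agent's $z$-conditioned policy.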

Theorems & Definitions (12)

  • Proposition 1
  • Remark 1: Effect of $z$ on the learned policy
  • Theorem 1
  • proof
  • Lemma 1
  • Lemma 2
  • proof
  • proof: Proof of (lem: z_star_optimal)
  • Lemma 3
  • proof
  • ...and 2 more