Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control

Songyuan Zhang, Oswin So, Mitchell Black, Chuchu Fan

TL;DR

This work addresses safe, high-performance control for multi-agent systems (MAS) with unknown discrete-time dynamics and limited sensing. It introduces DGPPO, a framework that jointly learns a discrete graph CBF (DGCBF) and a safe policy. Two ideas make this possible: showing that the constraint-value function $V^{h,{\bm{\mu}}}({\mathbf{x}})$ is a discrete CBF (DCBF), and using score-function gradients to perform constrained policy optimization without a fixed nominal policy. The DGCBF extension enables safety under changing neighborhoods through attention-based graph representations, and the full DGPPO algorithm combines stochastic MAPPO-style updates with deterministic rollouts that are used to train the DGCBF. Empirical results across LiDAR, MuJoCo, and VMAS environments show robust performance with a single hyperparameter set, near-100% safety rates, and competitive task costs, and they indicate that the method scales to more agents. The approach reduces reliance on handcrafted controllers or known dynamics, providing scalable, safety-guaranteed learning for complex MAS with input constraints and dynamic topology.
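To make the constraint-value idea concrete, here is a minimal Python sketch. It assumes that the constraint-value function of a deterministic policy is the largest value of a constraint function `h` along the rollout, with `h(x) <= 0` meaning safe. That definition, the finite horizon, and the toy 1D dynamics, policy, and `h` are all illustrative assumptions, not taken from the paper. The sketch only shows that such a value cannot increase along a deterministic rollout, which is the intuition for using it as a barrier candidate. It does not reproduce the paper's exact DCBF condition.

```python
# Hypothetical toy setup: scalar state, input-constrained proportional policy.
# None of these names or dynamics come from the paper.

def step(x, u, dt=0.1):
    """Assumed discrete-time dynamics: x_{k+1} = x_k + dt * u_k."""
    return x + dt * u

def policy(x):
    """Deterministic policy with the input constraint |u| <= 1."""
    return max(-1.0, min(1.0, -x))

def h(x):
    """Assumed constraint function: the state is safe when h(x) <= 0."""
    return abs(x) - 1.0

def rollout(x0, horizon):
    """Deterministic rollout returning the states x_0, ..., x_horizon."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1], policy(xs[-1])))
    return xs

def constraint_values(xs):
    """Finite-horizon stand-in for V(x_k): max of h over x_k, ..., x_T."""
    values = [h(x) for x in xs]
    for k in range(len(values) - 2, -1, -1):
        values[k] = max(values[k], values[k + 1])
    return values

xs = rollout(x0=0.8, horizon=20)
V = constraint_values(xs)

# A suffix maximum can never grow as k advances along the same rollout.
non_increasing = all(V[k + 1] <= V[k] for k in range(len(V) - 1))
print("V non-increasing along rollout:", non_increasing)
print("rollout safe if V[0] <= 0:", V[0] <= 0.0)
```

Because `V[0]` upper-bounds `h` at every later state of the rollout, checking `V[0] <= 0` certifies safety of the whole truncated trajectory under these assumptions.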

Abstract

Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints), and high safety rates (matching the most conservative baselines), with a constant set of hyperparameters across all environments.

Paper Structure

This paper contains 56 sections, 11 theorems, 77 equations, 15 figures, 3 tables.

Key Result

Theorem 1

The set $\mathcal{C} \coloneqq\{ {\mathbf{x}} \mid B({\mathbf{x}}) \leq 0 \}$ is control invariant under any policy ${\bm{\mu}}$ that satisfies

Figures (15)

  • Figure 1: DGPPO algorithm. In addition to the normal MAPPO path (top) using stochastic rollouts, we introduce a second path (bottom) that uses deterministic rollouts to learn a DGCBF.
  • Figure 2: Environments. We test on (top) LiDAR, (bottom) MuJoCo, and VMAS environments.
  • Figure 3: Comparison on $N = 3$ agents, shown as mean $\pm$ standard deviation. Methods closer to the top left yield lower costs and higher safety rates.
  • Figure 4: Training stability. DGPPO yields smoother training curves compared to the baselines.
  • Figure 5: Scaling to $N=5, 7$. Unlike other methods, DGPPO performs similarly with more agents.
  • ...and 10 more figures

Theorems & Definitions (25)

  • Definition 1
  • Theorem 1
  • Theorem 2: Discrete Policy CBF
  • Definition 2: Discrete GCBF
  • Remark 1: Discontinuity due to neighborhood changes
  • Corollary 1: Discrete Policy GCBF
  • proof
  • Theorem A1
  • proof
  • Theorem A2: Approximate Gradient Projection for Decoupled Policy Parameters
  • ...and 15 more