Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form

Xuefeng Wang, Lei Zhang, Henglin Pu, Husheng Li, Ahmed H. Qureshi

TL;DR

This work proposes a continuous-time constrained MDP (CT-CMDP) formulation and a MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation, together with a physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time.

Abstract

Multi-agent reinforcement learning (MARL) has made significant progress in recent years, but most algorithms still rely on a discrete-time Markov Decision Process (MDP) with fixed decision intervals. This formulation is often ill-suited to complex multi-agent dynamics, particularly in high-frequency or irregular time-interval settings, where it leads to degraded performance; this motivates the development of continuous-time MARL (CT-MARL). Existing CT-MARL methods are mainly built on Hamilton-Jacobi-Bellman (HJB) equations. However, they rarely account for safety constraints such as collision penalties, since these introduce discontinuities that make HJB-based learning difficult. To address this challenge, we propose a continuous-time constrained MDP (CT-CMDP) formulation and a novel MARL framework that transforms discrete MDPs into CT-CMDPs via an epigraph-based reformulation. We then solve the resulting problem with a novel physics-informed neural network (PINN)-based actor-critic method that enables stable and efficient optimization in continuous time. We evaluate our approach on continuous-time safe multi-particle environments (MPE) and safe multi-agent MuJoCo benchmarks. Results demonstrate smoother value approximations, more stable training, and improved performance over safe MARL baselines, validating the effectiveness and robustness of our method.
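
As a generic illustration of what an epigraph-based reformulation does, the block below states the standard static form from constrained optimization. It is a sketch only, not the paper's continuous-time definition: the symbols $J$ (cost), $h$ (constraint function), $\pi$ (decision variable), and $z$ (auxiliary scalar) are assumed here for illustration, and the minima are assumed to be attained.

```latex
\min_{\pi} \; J(\pi) \quad \text{s.t.} \quad h(\pi) \le 0
\qquad\Longleftrightarrow\qquad
\min_{z \in \mathbb{R}} \; z \quad \text{s.t.} \quad
\min_{\pi} \, \max\bigl\{\, J(\pi) - z,\; h(\pi) \,\bigr\} \le 0 .
```

For a fixed $z$, the inner problem over $\pi$ carries no explicit constraint; the constraint enters only through the $\max$, and the remaining outer problem is a search over the single scalar $z$.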

Paper Structure

This paper contains 38 sections, 4 theorems, 86 equations, 15 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

Suppose the assumptions in Sec. sec:cmdp hold. For all $(t,x,z)\in[0,\infty)\times\mathcal{X}\times\mathbb{R}$, the constrained value $v$ and auxiliary value $V$ are related by

Figures (15)

  • Figure 1: Overview of the proposed epigraph-based CT-MARL framework. The pipeline begins with data collection, where individual agent rollouts are aggregated into a centralized rollout $\mathcal{X}_R$ for training. The outer optimization computes the optimal $z^*$ to balance discounted cumulative cost and safety constraints. The inner optimization corresponds to critic learning, where return networks $V^{\text{ret}}_\psi(x)$ and constraint value networks $V^{\text{cons}}_\phi(x)$ are optimized jointly with the optimal auxiliary state $z^*$. Actor learning then leverages the advantage function to improve policies.
  • Figure 2: Overall results for adapted MPE environments.
  • Figure 3: Performance of constraints and cost over MPE settings.
  • Figure 4: Overall results for adapted multi-agent MuJoCo environments.
  • Figure 5: Ablation study of different loss terms in critic network over MPE.
  • ...and 10 more figures
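
The outer optimization over the auxiliary variable $z$ mentioned in the Figure 1 caption can be illustrated with a toy sketch. This is not the paper's algorithm: a small finite list of (cost, constraint violation) pairs stands in for the learned return and constraint networks, the helper names `aux_value` and `solve_outer` are hypothetical, and bisection is an assumed choice of one-dimensional search that relies on the auxiliary value being non-increasing in $z$.

```python
from typing import Sequence, Tuple

# Each candidate is (cost, violation); violation <= 0 means the constraint holds.
Candidate = Tuple[float, float]


def aux_value(z: float, candidates: Sequence[Candidate]) -> float:
    """Toy auxiliary value: min over candidates of max(cost - z, violation)."""
    return min(max(cost - z, violation) for cost, violation in candidates)


def solve_outer(candidates: Sequence[Candidate],
                lo: float, hi: float, tol: float = 1e-6) -> float:
    """Bisection for the smallest z in [lo, hi] with aux_value(z) <= 0.

    Assumes aux_value(lo) > 0 and relies on aux_value being
    non-increasing in z, which holds for the max/min form above.
    """
    if aux_value(hi, candidates) > 0:
        raise ValueError("no feasible z inside the bracket [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if aux_value(mid, candidates) <= 0:
            hi = mid   # mid is feasible, so tighten from above
        else:
            lo = mid   # mid is infeasible, so raise the lower end
    return hi


# Hypothetical data: the cheapest candidate violates the constraint.
candidates = [(1.0, 0.5), (2.0, -0.1), (3.0, -1.0)]
z_star = solve_outer(candidates, lo=0.0, hi=10.0)
best_feasible_cost = min(c for c, v in candidates if v <= 0)
```

In this toy setting `z_star` converges to the lowest cost among constraint-satisfying candidates, which is the sense in which $z^*$ balances cost against the safety constraint.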

Theorems & Definitions (10)

  • Definition 1: Epigraph Reformulation
  • Lemma 3.1: Value Equivalence
  • Lemma 3.2: Optimality Condition
  • Theorem 3.3: Epigraph-based HJB PDE
  • Definition 2: Epigraph-based Q-function
  • Lemma 3.4: Epigraph-based advantage function
  • proof
  • proof
  • proof
  • proof