Safe Multi-Agent Reinforcement Learning for Behavior-Based Cooperative Navigation

Murad Dawood; Sicong Pan; Nils Dengler; Siqi Zhou; Angela P. Schoellig; Maren Bennewitz

TL;DR

This work tackles safe behavior-based cooperative navigation with a team of $N$ robots by steering the formation centroid to a target and maintaining inter-robot distances without per-robot targets. It pairs a centralized SAC-based MARL framework with attention-based critics and a distributed NMPC safety filter that overrides unsafe actions, ensuring zero collisions during training and execution. The approach demonstrates faster convergence, robust zero-collision performance in simulation and on real robots, and safe transfer to unseen configurations, while revealing the MPC layer’s role in enabling exploration and safety. Overall, the method advances practical deployment of safe MARL for scalable, centroid-based formation control in real-world robotic teams.
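As a rough illustration of the centroid-based objective, the Python sketch below computes the formation centroid, its distance to a single shared target, and a simple formation error. The function names and the error metric (mean absolute deviation of pairwise distances from one desired distance) are assumptions made for this example; they are not taken from the paper.

```python
import math


def centroid(positions):
    """Return the mean (x, y) of the robots' positions."""
    n = len(positions)
    return (sum(p[0] for p in positions) / n,
            sum(p[1] for p in positions) / n)


def centroid_goal_distance(positions, goal):
    """Distance from the formation centroid to the single shared target."""
    cx, cy = centroid(positions)
    return math.hypot(goal[0] - cx, goal[1] - cy)


def formation_error(positions, desired_distance):
    """Assumed metric: mean absolute deviation of all pairwise
    inter-robot distances from one desired distance."""
    errors = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            errors.append(abs(d - desired_distance))
    return sum(errors) / len(errors)
```

Only the centroid is compared with the target, so no robot needs its own reference position; the distance term is what keeps the team in formation.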

Abstract

In this paper, we address the problem of behavior-based cooperative navigation of mobile robots using safe multi-agent reinforcement learning (MARL). Our work is the first to focus on cooperative navigation without individual reference targets for the robots, using a single target for the formation's centroid. This eliminates the complexities involved in having several path planners to control a team of robots. To ensure safety, our MARL framework uses model predictive control (MPC) to prevent actions that could lead to collisions during training and execution. We demonstrate the effectiveness of our method in simulation and on real robots, achieving safe behavior-based cooperative navigation without using individual reference targets, with zero collisions and faster target reaching compared to baselines. Finally, we study the impact of MPC safety filters on the learning process, revealing that they yield faster convergence during training, and we show that our approach can be safely deployed on real robots, even during early stages of training.
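The idea of a safety filter that overrides unsafe actions can be sketched in a few lines of Python. This is a deliberately simplified stand-in and not the paper's method: instead of solving an optimal control problem as an NMPC filter would, it rolls a unicycle model forward under the proposed action and falls back to a full stop if any predicted position comes too close to a static obstacle. The unicycle model, the constant-action rollout, the stop fallback, and all names and parameters (`rollout`, `filter_action`, `safe_radius`, `horizon`, `dt`) are assumptions for illustration.

```python
import math


def rollout(state, action, horizon, dt):
    """Predict (x, y) positions of a unicycle robot that holds a
    constant (v, omega) action for `horizon` steps of length `dt`."""
    x, y, theta = state
    v, omega = action
    path = []
    for _ in range(horizon):
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += omega * dt
        path.append((x, y))
    return path


def filter_action(state, action, obstacles, safe_radius, horizon=10, dt=0.1):
    """Return the RL action if the predicted path stays clear of every
    obstacle; otherwise override it with a stop command.

    Assumption: obstacles are static points, so stopping is safe.
    """
    for point in rollout(state, action, horizon, dt):
        for obstacle in obstacles:
            if math.dist(point, obstacle) < safe_radius:
                return (0.0, 0.0)  # override the unsafe action
    return action
```

For example, `filter_action((0.0, 0.0, 0.0), (1.0, 0.0), [(0.5, 0.0)], 0.3)` returns the stop command because the straight-line rollout passes through the obstacle, while the same action with a distant obstacle is passed through unchanged. A real MPC filter would instead search for the closest safe action rather than stopping outright.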

Paper Structure (18 sections, 4 equations, 8 figures, 5 tables)

Introduction
Related Work
Problem Statement
Our Approach
Multi-Agent Reinforcement Learning (MARL)
Attention-Based Critics
Model Predictive Safety Filter
Safety Filter Formulation
Prediction Model
Optimal Control Problem
Experimental Evaluation
Training With the MPC Filter
Testing Against Baselines in Simulation
Can we execute our method without the MPC?
Integrating the MPC Safety Layer with the Baselines
...and 3 more sections

Figures (8)

Figure 1: Real-world example for the behavior-based cooperative navigation control. The robots start from random locations and navigate cooperatively to reach the targets for the centroid of the formation (shown in red) while aiming to maintain the predefined distances with respect to each other. The blue and green shades show the robots team at the first and second goals, respectively.
Figure 2: The observation space per robot includes lidar readings (red lines), distances and headings to the goal ($d_{i}^{g}$, $\theta_{i}^{g}$), two neighbors ($d_{ij}$, $\theta_{ij}$), and the closest obstacle ($d_{i}^{obs}$, $\theta_{i}^{obs}$). Additionally, robots have information about the centroid's distance to the goal ($d_{c}^{g}$).
Figure 3: Illustrations of the CTDE architecture and attention module. (a) At each time step, the agent interacts with the environment to receive the current observation Obs$_{RL}$, including relative information about the two neighbors and the closest obstacle, and outputs the Action$_{RL}$. The MPC controller receives the State$_{\mathit{MPC}}$, overrides any unsafe actions to prevent collisions, and sends Action$_{\mathit{MPC}}$ to the robot. During training (blue dashed), critics share all agents' observations, while during execution (red dashed), each actor accesses only its own observation. (b) In the attention-based critics, observations are encoded separately and fed into attention heads, which calculate weights from the query and the key-value pairs. The attention output is then concatenated with the states and fed into the critics.
Figure 4: The figures show the impact of the safety filter on training, with bold lines representing the average and shaded areas indicating the standard deviation across three random seeds. (a,c) depict the average formation error in the first and second training environments, while (b,d) display the number of goals achieved per episode, in both environments. The agent with the MPC (ATT_MPC) consistently surpasses the pure learning agent (ATT) in reducing formation errors and increasing goal achievements. This demonstrates that incorporating a safety filter reduces the number of episodes required to achieve the desired performance compared to pure learning agents.
Figure 5: Target Reaching Configurations test. The top row indicates the starting configurations, while the bottom row shows the final reached configurations by our approach. The yellow arrows indicate the starting orientation of the robots, while the red circles show the target location for the centroid of the formation. The configurations are arranged from left to right in terms of difficulty. The Collinear and Facing each other configurations were not experienced during the training, making them more difficult compared to the other configurations. However, our approach was able to successfully complete the tasks with zero collisions.
...and 3 more figures
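The attention step described in the caption of Figure 3(b) can be illustrated with a minimal single-head example. The sketch assumes standard scaled dot-product attention and plain Python lists in place of learned encoders; the caption does not specify the exact attention form, and the names `softmax`, `attend`, and `critic_input` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query.

    The query stands for one agent's encoded observation; keys and
    values stand for the other agents' encodings.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


def critic_input(own_encoding, query, keys, values):
    """Concatenate the agent's own encoding with the attention output,
    mirroring the concatenation step before the critic."""
    _, attended = attend(query, keys, values)
    return list(own_encoding) + attended
```

With two identical keys the weights are equal, so the attention output is the plain average of the two value vectors; dissimilar keys shift the weighting toward the agents most relevant to the query.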
