Multi-agent assignment via state augmented reinforcement learning

Leopoldo Agorio, Sean Van Alen, Miguel Calvo-Fullana, Santiago Paternain, Juan Andres Bazerque

TL;DR

The paper tackles multi-agent assignment under conflicting regional visitation constraints by replacing standard regularization with a state-augmented MDP where Lagrange multipliers become part of the state and oscillate to induce alternating feasible policies. Coordination across agents is achieved via a gossip-based distributed dual-update that shares multiplier information without full state sharing, enabling fully distributed online execution. The approach combines offline policy training conditioned on multipliers with online, networked dual updates, and provides almost-sure feasibility guarantees under reasonable assumptions. Numerical experiments, including intermittent communications and realistic robot navigation (Gazebo), validate that all constraints are satisfied by the team, illustrating practical impact for scalable, constraint-aware multi-agent systems.

Abstract

We address the conflicting requirements of a multi-agent assignment problem through constrained reinforcement learning, emphasizing the inadequacy of standard regularization techniques for this purpose. Instead, we recur to a state augmentation approach in which the oscillation of dual variables is exploited by agents to alternate between tasks. In addition, we coordinate the actions of the multiple agents acting on their local states through these multipliers, which are gossiped through a communication network, eliminating the need to access other agent states. By these means, we propose a distributed multi-agent assignment protocol with theoretical feasibility guarantees that we corroborate in a monitoring numerical experiment.
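
The oscillation of the dual variables mentioned above can be illustrated with a toy sketch. It is not the authors' algorithm: it assumes a single agent, one dual update per time step, and a hand-coded stand-in policy that simply visits the region with the largest multiplier (the paper instead trains a policy conditioned on the multipliers). The function name `run_dual_dynamics`, the step size `eta`, and the requirement vector `c` are hypothetical choices made for illustration.

```python
def run_dual_dynamics(c, eta=0.1, num_steps=2000):
    """Toy projected dual ascent with the multipliers acting as part of the state.

    c[m] is the required visitation frequency of region m (assumed sum(c) < 1).
    Returns the empirical visitation frequencies and the largest multiplier seen.
    """
    num_regions = len(c)
    lam = [0.0] * num_regions      # one multiplier per constraint
    visits = [0] * num_regions
    lam_max = 0.0
    for _ in range(num_steps):
        # Stand-in for a multiplier-conditioned policy: serve the region whose
        # constraint is currently the most violated (largest multiplier).
        region = max(range(num_regions), key=lambda m: lam[m])
        visits[region] += 1
        for m in range(num_regions):
            reward = 1.0 if m == region else 0.0
            # A multiplier grows while its region is under-visited and shrinks
            # when the region is visited; it is projected onto [0, inf).
            lam[m] = max(0.0, lam[m] + eta * (c[m] - reward))
        lam_max = max(lam_max, max(lam))
    freq = [v / num_steps for v in visits]
    return freq, lam_max


freq, lam_max = run_dual_dynamics([0.3, 0.5])
print(freq, lam_max)
```

In this sketch the multipliers never settle: each one rises until its region is served and then drops, so the agent keeps alternating between regions. Because the multipliers stay bounded, the long-run visitation frequencies approach or exceed the requirements in `c`, which is the mechanism a fixed regularization weight cannot reproduce.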

Paper Structure (7 sections, 7 theorems, 12 equations, 3 figures, 1 algorithm)

Key Result

proposition 1

Assume that the policy of each agent is parameterized by a vector $\theta_n\in\mathbb{R}^n$ and that $A_n\sim \pi_{\theta_n}(\cdot\mid S_n,\lambda)$. Let $\mathcal{L}(\pi_\theta,\lambda)$ and $Q_{n,\lambda}^{\pi_\theta}(S_n,A_n)$ be the functions defined in eqn_lagrangian and eqn_q_others respective

Figures (3)

  • Figure 1: Gossip protocol between two agents $n$ and $i$ communicating through the link $\epsilon_{ni}$. Both agents aim to know if the shaded zone $\mathcal{S}_m$ is occupied at time $\tau=0$, as defined by $\max\{\mathds{1}[S_{\tau i}\in \mathcal{S}_m],\mathds{1}[S_{\tau n}\in \mathcal{S}_m] \}$. Since agent $n$ is in $\mathcal{S}_m$ at time $t=0$, it knows the actual reward. Hence $\hat{R}_{n,\tau,m,t}=1$ for all $t\geq 0$. Agent $i$, by contrast, is outside $\mathcal{S}_m$ at $t=0$, so its estimate $\hat{R}_{i,\tau,m,0}=0$ is incorrect; it is corrected at time $t=1$ by gossiping from agent $n$, so that $\hat{R}_{i,\tau,m,t}=1$ for all $t\geq 1$.
  • Figure 2: Simulation results obtained after executing, for $200{,}000$ iterations, a two-agent policy trained to monitor the regions shown in (e), with requirements $c=[0.3,0.3,0.3,0.3]$. Colors in (a)--(c) correspond to the matching colored region in (e). A subset policy for $\lambda=[5,2.5,0,5]$ and $100$ steps of the runtime trajectory corresponding to $k=(128{,}000,128{,}100)$ are shown in (e), where the communication range of the agents is shown by dashed lines.
  • Figure 3: (a) ROS2 implementation of the multi-agent assignment Algorithm 1 in a Gazebo TurtleBot environment; (b) floorplan patrol with low-level navigation control; (c) constraint satisfaction as a function of the time step (in thousands) for the floorplan (top) and the TurtleBot (bottom).
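
The gossip step described in the Figure 1 caption can be sketched as a synchronous max-consensus over the communication links. This is a minimal illustration under stated assumptions, not the paper's protocol: the function `gossip_max`, the edge-list representation, and the synchronous rounds are hypothetical, and each agent is assumed to exchange only its current binary estimate of whether the zone was occupied.

```python
def gossip_max(local_indicators, edges, num_rounds):
    """Synchronous max-gossip of binary occupancy estimates.

    local_indicators[a] is 1 if agent a was itself inside the zone, else 0.
    edges is a list of undirected links (a, b) between agent indices.
    Returns the list of estimates after each round, starting with round 0.
    """
    estimates = list(local_indicators)
    history = [list(estimates)]
    for _ in range(num_rounds):
        updated = list(estimates)
        for a, b in edges:
            # Both endpoints keep the larger of the two estimates they held
            # at the start of the round.
            shared = max(estimates[a], estimates[b])
            updated[a] = max(updated[a], shared)
            updated[b] = max(updated[b], shared)
        estimates = updated
        history.append(list(estimates))
    return history


# Two agents as in Figure 1: agent 0 plays the role of n (inside the zone),
# agent 1 plays the role of i (outside the zone).
print(gossip_max([1, 0], [(0, 1)], num_rounds=2))
# prints [[1, 0], [1, 1], [1, 1]]
```

The second agent starts with an incorrect estimate of 0 and holds the correct value 1 from the first gossip round onward, mirroring the caption. On a larger connected network the same update would need as many rounds as the graph diameter before every agent agrees, and no agent ever needs another agent's full state.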

Theorems & Definitions (11)

  • proposition 1
  • lemma 1
  • proposition 2
  • theorem 1
  • lemma 2
  • proof
  • lemma 3
  • proof
  • lemma 4
  • proof
  • ...and 1 more