Cooperative Multi-Agent Assignment over Stochastic Graphs via Constrained Reinforcement Learning

Leopoldo Agorio, Sean Van Alen, Santiago Paternain, Miguel Calvo-Fullana, Juan Andres Bazerque

TL;DR

The paper addresses coordinating a team of N agents to satisfy joint region-coverage constraints in dynamic environments modeled as a constrained MARL problem with stochastic communication. It develops a state-augmented MDP where dual variables cycle and are shared via a gossip-based one-bit network, augmented by a contractive dual update to bound estimator error. An offline-online training framework yields a distributed, realizable policy that achieves almost-sure feasibility for the time-averaged constraints, with an error that can be made arbitrarily small by design choices. Numerical experiments with five robots patrolling six regions under time-varying ad-hoc connectivity validate the theory and illustrate robust coordination despite intermittent communication.

Abstract

Constrained multi-agent reinforcement learning offers the framework to design scalable and almost surely feasible solutions for teams of agents operating in dynamic environments to carry out conflicting tasks. We address the challenges of multi-agent coordination through an unconventional formulation in which the dual variables are not driven to convergence but are free to cycle, enabling agents to adapt their policies dynamically based on real-time constraint satisfaction levels. The coordination relies on a light single-bit communication protocol over a network with stochastic connectivity. Using this gossiped information, agents update local estimates of the dual variables. Furthermore, we modify the local dual dynamics by introducing a contraction factor, which lets us use finite communication buffers and keep the estimation error bounded. Under this model, we provide theoretical guarantees of almost sure feasibility and corroborate them with numerical experiments in which a team of robots successfully patrols multiple regions, communicating under a time-varying ad-hoc network.
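To illustrate why a contraction factor keeps a dual variable bounded, the sketch below iterates a projected dual update on a single constraint. The update form, the function names `dual_step` and `run`, and the values of `step` and `contraction` are assumptions chosen for illustration. They are not the paper's exact dynamics.

```python
def dual_step(lam, reward_avg, requirement, step=0.1, contraction=0.9):
    """One projected dual update with a contraction factor (hypothetical form)."""
    # Slack is positive when the averaged reward meets the requirement.
    slack = reward_avg - requirement
    # Shrink the previous multiplier, move against the slack, project onto lam >= 0.
    return max(0.0, contraction * lam - step * slack)


def run(reward_avg, requirement, lam0=0.0, iters=200, **kwargs):
    """Iterate dual_step with a constant averaged reward and return the final multiplier."""
    lam = lam0
    for _ in range(iters):
        lam = dual_step(lam, reward_avg, requirement, **kwargs)
    return lam


if __name__ == "__main__":
    # Persistently violated constraint: the multiplier grows but stays bounded.
    print(run(reward_avg=0.0, requirement=0.2))
    # Satisfied constraint: the multiplier decays back to zero.
    print(run(reward_avg=0.5, requirement=0.2, lam0=1.0))
```

Under this assumed form, a constraint violated by a constant deficit drives the multiplier toward `step * deficit / (1 - contraction)` rather than letting it grow without bound. Without the contraction (`contraction=1.0`) the same iteration would increase linearly.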

Paper Structure

This paper contains 16 sections, 11 theorems, 62 equations, 5 figures, 2 algorithms.

Key Result

Proposition 1

Assume that the policy of each agent is parameterized by a vector $\theta^n\in\mathbb{R}^n$ and that $A^n\sim\pi_{\theta^n}(\cdot| S^n,\lambda)$, where $S^n$ represents the local state of agent $n$. Let $\mathcal{L}(\pi_\theta,\lambda)$ and $Q^{n}_{\pi_\theta}(S^n,A^n,\lambda)$ be the functions defi

Figures (5)

  • Figure 1: Agent $n$ must receive the information that agent $n^\prime$ is in zone $\mathcal{S}_m$ across the communication graph. Agents $n \in \mathcal{V} = \{1, \dots, N\}$ are nodes in the stochastic graph $\mathcal{G}(\mathcal{V}, \mathcal{E}, p)$, where solid and dashed lines represent the presence or absence of an edge, respectively, at a particular time. If an edge is present, the two nodes connected to the edge can exchange information. Otherwise, the two neighboring agents must wait for a future time in which the edge is present.
  • Figure 2: Gossip timeline: Each agent aims to estimate the global vector of rewards $r(S_{\tau})$. After $t-\tau$ time steps, the local estimation of $r(S_{\tau})$ obtained by agent $n$ is $R_{\tau,t}^n$.
  • Figure 3: (a) Floor plan and sample trajectories for each of the $N=5$ agents, $n=0,\ldots,4$. Black lines represent walls, and colored circles represent the $M=6$ zones to be patrolled by the agents. Each colored dot along a trajectory indicates a new position for each time step. The $12$ gray rectangular regions define distinct possible tiles or observations of an agent’s position, which are the inputs to the policy. Figures (b)--(f): Occupation heat maps. The complete trajectory of an agent across $40{,}000$ timesteps is represented by dots, indicating the position reached by an agent. Each dot color represents the frequency of occupation at that position, with darker hues indicating more frequent positions, under a logarithmic colorbar scale.
  • Figure 4: Satisfaction of the constraints for each zone $m=1,\ldots,6$. Constraint requirements were defined as $0.1$ for zone 1, $0.2$ for zone 2, etc. Dashed lines indicate these constraints. Minimum and maximum satisfaction values are plotted for each timestep for each zone and are filled by a shaded region. Colors of constraint and satisfaction values match those of the zones as depicted in Fig. 3.
  • Figure 5: (a) Matrix of communication frequencies between agents, counted over time. For each pair of agents ($n$, $n^\prime$), the number of timesteps during which these agents communicate is summed and divided by the total timesteps of simulation, $40{,}000$. Frequencies are indicated by colors matching those in the included colorbar, with white indicating no communication. We define an agent as never communicating with itself, so diagonal entries are white. Figure (b): A snapshot of gossip neighborhood sizes per agent. Neighborhood sizes are plotted for $1{,}000$ time-steps of execution phase. $N=5$ lines are present, one for each agent and of a color matching said agent’s trajectory in Fig. 3. Figure (c): Margin of constraint satisfaction for each communication disc of sizes $d=1,\ldots,6$. A minimum and maximum difference between satisfaction and constraint values across all zones $m = 1,\ldots,6$ are found and plotted for each disc size, as indicated by the bold blue lines. A band shades the area between the maximum and minimum differences.
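
The gossip mechanism in the captions of Figures 1 and 2 can be sketched as a toy simulation. Each agent starts with one local bit, and every edge of a fixed graph is present independently with probability `p` at each time step. When an edge is present, its two endpoints merge what they knew at the start of that step. The flooding-style merge rule, the line topology, and the names `gossip_round` and `steps_until_complete` are assumptions made for illustration and are not the paper's protocol.

```python
import random


def gossip_round(known, edges, p, rng):
    """Merge the known entries of both endpoints of every edge that is up this step."""
    # Use start-of-step information only, so data travels one hop per step.
    snapshot = [dict(k) for k in known]
    for a, b in edges:
        if rng.random() < p:  # edge present with probability p
            known[a].update(snapshot[b])
            known[b].update(snapshot[a])


def steps_until_complete(bits, edges, p, seed=0, max_steps=10_000):
    """Return (steps, estimates) once every agent knows every bit, or (None, estimates)."""
    rng = random.Random(seed)
    # Agent n initially knows only its own bit, stored as {agent index: bit}.
    known = [{n: bit} for n, bit in enumerate(bits)]
    for t in range(1, max_steps + 1):
        gossip_round(known, edges, p, rng)
        if all(len(k) == len(bits) for k in known):
            return t, known
    return None, known


if __name__ == "__main__":
    bits = [1, 0, 0, 1, 1]                    # hypothetical local rewards
    line = [(0, 1), (1, 2), (2, 3), (3, 4)]   # hypothetical line topology
    for p in (1.0, 0.5, 0.2):
        steps, _ = steps_until_complete(bits, line, p)
        print(f"p={p}: complete after {steps} steps")
```

With always-present edges, the delay on this line graph equals its diameter. Lower edge probabilities lengthen the delay at random, which corresponds to the lag $t-\tau$ in Figure 2 between when a reward is generated and when a local estimate $R_{\tau,t}^n$ becomes complete.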

Theorems & Definitions (28)

  • Example 1
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Remark 1
  • Theorem 1
  • proof
  • ...and 18 more