Cooperative Multi-Agent Assignment over Stochastic Graphs via Constrained Reinforcement Learning
Leopoldo Agorio, Sean Van Alen, Santiago Paternain, Miguel Calvo-Fullana, Juan Andres Bazerque
TL;DR
The paper addresses the coordination of a team of N agents that must satisfy joint region-coverage constraints in dynamic environments, formulated as a constrained multi-agent reinforcement learning (MARL) problem with stochastic communication. It develops a state-augmented MDP in which the dual variables are left free to cycle rather than driven to convergence, and each agent maintains local estimates of them from information gossiped over a one-bit protocol. A contraction factor in the local dual update keeps the estimation error bounded. An offline-online training framework yields a distributed, realizable policy that achieves almost-sure feasibility for the time-averaged constraints, with an error that can be made arbitrarily small by design choices. Numerical experiments with five robots patrolling six regions under time-varying ad-hoc connectivity validate the theory and illustrate robust coordination despite intermittent communication.
Abstract
Constrained multi-agent reinforcement learning offers a framework to design scalable and almost surely feasible solutions for teams of agents operating in dynamic environments to carry out conflicting tasks. We address the challenges of multi-agent coordination through an unconventional formulation in which the dual variables are not driven to convergence but are free to cycle, enabling agents to adapt their policies dynamically based on real-time constraint satisfaction levels. The coordination relies on a lightweight single-bit communication protocol over a network with stochastic connectivity. Using this gossiped information, agents update local estimates of the dual variables. Furthermore, we modify the local dual dynamics by introducing a contraction factor, which lets us use finite communication buffers and keep the estimation error bounded. Under this model, we provide theoretical guarantees of almost sure feasibility and corroborate them with numerical experiments in which a team of robots successfully patrols multiple regions, communicating under a time-varying ad-hoc network.
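
To illustrate why a contraction factor makes finite communication buffers workable, here is a minimal sketch. It is not the paper's update rule: the linear recursion, the names gamma (contraction factor), eta (step size), and c (constraint level), and the 0/1 satisfaction bits are all assumptions made for illustration, and the gossip protocol itself is omitted. Under these assumptions, a bit that is d steps old is weighted by gamma**d, so discarding everything older than the buffer changes the estimate by at most a geometric tail.

```python
import random

def full_dual(bits, gamma, eta, c):
    """Contracted recursion lam <- gamma*lam + eta*(c - bit) over all bits.

    Hypothetical update for illustration only; the paper's exact rule may differ.
    """
    lam = 0.0
    for b in bits:
        lam = gamma * lam + eta * (c - b)
    return lam

def buffered_dual(bits, gamma, eta, c, buffer_len):
    """Same recursion, restarted from zero on the last buffer_len bits (buffer_len >= 1)."""
    return full_dual(bits[-buffer_len:], gamma, eta, c)

def truncation_bound(gamma, eta, c, buffer_len):
    """Geometric-tail bound on |full - buffered| for bits in {0, 1} and 0 < gamma < 1."""
    return eta * max(c, 1.0 - c) * gamma ** buffer_len / (1.0 - gamma)

# Assumed toy values: 500 random one-bit signals and a 40-slot buffer.
rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(500)]
gamma, eta, c, B = 0.9, 0.1, 0.5, 40

error = abs(full_dual(bits, gamma, eta, c) - buffered_dual(bits, gamma, eta, c, B))
bound = truncation_bound(gamma, eta, c, B)
print(f"truncation error {error:.6f} <= bound {bound:.6f}")
```

With gamma = 1 (no contraction) the tail weights would not decay, and no finite buffer would give such a bound in this toy model.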
