Multi Agent Reinforcement Learning for Sequential Satellite Assignment Problems
Joshua Holder, Natasha Jaques, Mehran Mesbahi
TL;DR
The paper tackles stateful, large-scale sequential assignment problems (SAP) by proposing REDA, a MARL approach that learns per-agent value estimates and feeds them into a centralized, feasible assignment mechanism to produce socially optimal allocations. It leverages a Q-function decomposition $Q^{\pi^\alpha}(s,x)=\sum_i Q_i^{\pi^\alpha}(s,x^i)$ and a distributable mapping $x=\alpha(\mathbf{Q}^{\pi^\alpha})$ to approximate joint optimization without sacrificing feasibility. A contraction-based theoretical justification shows convergence of the per-agent Q-values under REDA, ensuring alignment with the joint objective. Empirical validation on a simple dictator environment and a realistic satellite constellation (324 satellites, 450 tasks) demonstrates substantial improvements over MARL baselines (COMA, IQL, IPPO) and classical methods (HAAL), highlighting REDA’s ability to scale, reduce conflicts, and manage power constraints. The work has practical impact for autonomous, scalable management of large distributed systems such as satellite networks and power grids.
Abstract
The assignment problem is a classic combinatorial optimization problem in which a group of agents must be assigned to a group of tasks such that maximum utility is achieved while satisfying assignment constraints. Given the utility of each agent completing each task, polynomial-time algorithms exist to solve a single assignment problem in its simplest form. However, in many modern-day applications such as satellite constellations, power grids, and mobile robot scheduling, assignment problems unfold over time, with the utility for a given assignment depending heavily on the state of the system. We apply multi-agent reinforcement learning to this problem, learning the value of assignments by bootstrapping from a known polynomial-time greedy solver and then learning from further experience. We then choose assignments using a distributed optimal assignment mechanism rather than by selecting them directly. We demonstrate that this algorithm is theoretically justified and avoids pitfalls experienced by other RL algorithms in this setting. Finally, we show that our algorithm significantly outperforms other methods in the literature, even while scaling to realistic scenarios with hundreds of agents and tasks.
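To make the idea of choosing assignments through a mechanism rather than directly more concrete, the following is a minimal sketch of the mapping $x=\alpha(\mathbf{Q})$ from the TL;DR on a toy problem. It assumes a small table of per-agent values `q_values[i][j]` standing in for learned estimates $Q_i(s, x^i = j)$; the numbers, the function names `alpha` and `independent_argmax`, and the brute-force search are illustrative assumptions, not the paper's implementation. In particular, the exhaustive search is swapped in for the distributed optimal assignment mechanism the abstract refers to, and is only practical for a handful of agents.

```python
from itertools import permutations


def alpha(q_values):
    """Toy assignment mechanism: maximize the sum of per-agent values.

    q_values[i][j] is a hypothetical estimate of agent i's value for task j.
    Each task goes to at most one agent, so the result is always feasible.
    Brute force over permutations; requires n_agents <= n_tasks.
    """
    n_agents = len(q_values)
    n_tasks = len(q_values[0])
    best_total, best_assignment = float("-inf"), None
    for tasks in permutations(range(n_tasks), n_agents):
        total = sum(q_values[i][tasks[i]] for i in range(n_agents))
        if total > best_total:
            best_total, best_assignment = total, tasks
    return list(best_assignment), best_total


def independent_argmax(q_values):
    """Each agent picks its own best task, ignoring the other agents."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_values]


# Made-up values for 3 agents and 3 tasks (illustration only).
Q = [
    [5.0, 4.0, 1.0],
    [6.0, 1.0, 2.0],
    [3.0, 2.0, 2.5],
]

greedy = independent_argmax(Q)
assignment, total = alpha(Q)
print("independent choices:", greedy)
print("mechanism assignment:", assignment, "total value:", total)
```

In this toy table every agent's individually preferred task is task 0, so selecting actions directly produces a conflict, while the mechanism returns a conflict-free assignment with the highest summed value.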
