Multi Agent Reinforcement Learning for Sequential Satellite Assignment Problems

Joshua Holder, Natasha Jaques, Mehran Mesbahi

TL;DR

The paper tackles stateful, large-scale sequential assignment problems (SAP) by proposing REDA, a MARL approach that learns per-agent value estimates and feeds them into a centralized, feasible assignment mechanism to produce socially optimal allocations. It leverages a Q-function decomposition $Q^{\pi^\alpha}(s,x)=\sum_i Q_i^{\pi^\alpha}(s,x^i)$ and a distributable mapping $x=\alpha(\mathbf{Q}^{\pi^\alpha})$ to approximate joint optimization without sacrificing feasibility. A contraction-based theoretical justification shows convergence of the per-agent Q-values under REDA, ensuring alignment with the joint objective. Empirical validation on a simple dictator environment and a realistic satellite constellation (324 satellites, 450 tasks) demonstrates substantial improvements over MARL baselines (COMA, IQL, IPPO) and classical methods (HAAL), highlighting REDA’s ability to scale, reduce conflicts, and manage power constraints. The work has practical impact for autonomous, scalable management of large distributed systems such as satellite networks and power grids.

Abstract

Assignment problems are a classic combinatorial optimization problem in which a group of agents must be assigned to a group of tasks such that maximum utility is achieved while satisfying assignment constraints. Given the utility of each agent completing each task, polynomial-time algorithms exist to solve a single assignment problem in its simplest form. However, in many modern-day applications such as satellite constellations, power grids, and mobile robot scheduling, assignment problems unfold over time, with the utility for a given assignment depending heavily on the state of the system. We apply multi-agent reinforcement learning to this problem, learning the value of assignments by bootstrapping from a known polynomial-time greedy solver and then learning from further experience. We then choose assignments using a distributed optimal assignment mechanism rather than by selecting them directly. We demonstrate that this algorithm is theoretically justified and avoids pitfalls experienced by other RL algorithms in this setting. Finally, we show that our algorithm significantly outperforms other methods in the literature, even while scaling to realistic scenarios with hundreds of agents and tasks.

Paper Structure

This paper contains 23 sections, 4 theorems, 24 equations, 4 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 1

Let $\pi^\alpha: S \to X$ be a constant, deterministic joint policy, and let $Q_i^{\pi^\alpha}$ denote the $Q$-function of an individual agent $i$ with respect to this joint policy. Then, in the assignment problem setting, where $r(s,x)=\sum_{i=1}^n\sum_{j=1}^m\hat{\beta}_{ij}(s)x_{ij} = \sum_{i=1}^n r_i(s,x^i)$, the joint $Q$-function decomposes as $Q^{\pi^\alpha}(s,x)=\sum_{i=1}^n Q_i^{\pi^\alpha}(s,x^i)$.
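To make the role of the decomposition $Q^{\pi^\alpha}(s,x)=\sum_i Q_i^{\pi^\alpha}(s,x^i)$ from the TL;DR concrete, the following is a minimal sketch of choosing a joint assignment by maximizing the summed per-agent values, compared with letting each agent pick its own best task. It is not the authors' implementation: the value matrix `Q` is made up, the function names are hypothetical, the feasibility constraint is assumed to be "each task goes to at most one agent", and the brute-force search only stands in for the mapping $\alpha$ on tiny inputs.

```python
from itertools import permutations


def independent_argmax(q):
    """Each agent picks its own best task, ignoring every other agent."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def social_assignment(q):
    """Brute-force stand-in for alpha (assumes n agents <= m tasks).

    Enumerates every way to give the agents distinct tasks and keeps the
    one with the largest summed value. Exponential cost: toy sizes only.
    """
    n, m = len(q), len(q[0])
    best = max(permutations(range(m), n),
               key=lambda tasks: sum(q[i][j] for i, j in enumerate(tasks)))
    return list(best)


def total_value(q, tasks):
    """Sum of per-agent values for the given task choice."""
    return sum(q[i][j] for i, j in enumerate(tasks))


def conflict_fraction(tasks):
    """Fraction of agents that share a task with at least one other agent."""
    return sum(tasks.count(j) > 1 for j in tasks) / len(tasks)


# Hypothetical per-agent estimates Q_i(s, x^i): rows are agents, columns are tasks.
Q = [
    [10.0, 9.0, 1.0],
    [10.0, 2.0, 1.0],
    [8.0, 7.0, 6.0],
]

greedy = independent_argmax(Q)  # [0, 0, 0]: every agent wants task 0
joint = social_assignment(Q)    # [1, 0, 2]: summed value 25.0, no conflicts

print(greedy, conflict_fraction(greedy))
print(joint, total_value(Q, joint), conflict_fraction(joint))
```

In this example the independent choice sends all three agents to the same task, while the joint choice gives up agent 0's favorite task so that the total is highest among conflict-free assignments. At realistic sizes a polynomial-time assignment solver would replace the enumeration.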

Figures (4)

  • Figure 1: Architecture of REDA. 1) Calculate independent estimates of future utility for each agent, combine into a matrix. 2) Select joint assignment $x_k=\alpha(\mathbf{Q}_k^\pi)$ which maximizes social utility, not utility for any given agent. 3) Execute $x_k$ in environment and observe results. 4) Train agents' independent value estimates based on minibatch from replay buffer.
  • Figure 2: Performance over $5$ runs of various algorithms in the dictator environment, shown with standard deviation shaded. Note that after $\epsilon$ decays to $0$ at $t=10{,}000$, performance for REDA instantly approaches the theoretical maximum, while other algorithms remain significantly below the maximum.
  • Figure 3: Performance over $5$ runs of various algorithms in a realistic constellation environment with $324$ satellites and $450$ tasks, shown with standard deviation shaded. $\epsilon$ is decayed to zero over $300$k time steps. REDA consistently converges and obtains more reward than all other tested algorithms.
  • Figure 4: Performance of the tested algorithms on three metrics: the percentage of satellites without charge at the end of the episode (lower is better), the percentage of satellites completing the same task as another satellite (lower is better), and the average number of time steps satellites are assigned to the same tasks (higher is better). REDA outperforms IQL and IPPO across the board and, unlike classical methods such as HAAL, avoids having satellites run out of power.
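The four steps in the Figure 1 caption can be sketched as a toy training loop. This is only an illustration under strong assumptions, not the paper's algorithm as implemented: the environment has a single state with a fixed, made-up benefit matrix, the per-agent estimates are a table rather than learned networks, `alpha` is a brute-force search, and the bootstrapped target (each agent evaluated at its own component of the next joint assignment) as well as all hyperparameters are choices made for this sketch.

```python
import random
from collections import deque
from itertools import permutations


def alpha(q):
    """Brute-force joint assignment: distinct task per agent, max summed value."""
    n, m = len(q), len(q[0])
    return list(max(permutations(range(m), n),
                    key=lambda t: sum(q[i][j] for i, j in enumerate(t))))


def train(benefit, steps=500, gamma=0.5, lr=0.5, batch=4, seed=0):
    rng = random.Random(seed)
    n, m = len(benefit), len(benefit[0])
    # Step 1: independent per-agent value estimates, stored as one matrix.
    q = [[0.0] * m for _ in range(n)]
    buffer = deque(maxlen=10_000)
    for k in range(steps):
        # Exploration rate decays linearly to zero at the halfway point.
        eps = max(0.0, 1.0 - 2.0 * k / steps)
        if rng.random() < eps:
            x = rng.sample(range(m), n)  # random conflict-free assignment
        else:
            x = alpha(q)                 # Step 2: socially best assignment
        # Step 3: execute x and record each agent's own reward.
        for i, j in enumerate(x):
            buffer.append((i, j, benefit[i][j]))
        # Step 4: update the per-agent estimates from a replay minibatch.
        nxt = alpha(q)
        for i, j, r in rng.sample(list(buffer), min(batch, len(buffer))):
            target = r + gamma * q[i][nxt[i]]
            q[i][j] += lr * (target - q[i][j])
    return q


# Hypothetical benefits: rows are agents, columns are tasks.
BENEFIT = [[5.0, 1.0], [4.0, 3.0]]
q = train(BENEFIT)
learned = alpha(q)
print(learned, alpha(BENEFIT))
```

With these made-up benefits, the assignment selected from the learned table should agree with the one that maximizes the immediate benefit, since the single-state toy has no dynamics that would make the two differ.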

Theorems & Definitions (8)

  • Theorem 1: Decomposition of $Q^{\pi^\alpha}$ into $Q_i^{\pi^\alpha}$
  • proof
  • Lemma 1
  • proof
  • Theorem 1: Decomposition of $Q^{\pi^\alpha}$ into $Q_i^{\pi^\alpha}$
  • proof
  • Lemma 1
  • proof