SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning

Manav Vora; Gokul Puthumanaillam; Hiroyasu Tsukamoto; Melkior Ornik

TL;DR

SCoUT derives counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages, enabling precise credit assignment for both send and recipient-selection decisions.

Abstract

Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning \emph{when} to communicate and \emph{with whom} requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce \textbf{SCoUT} (\textbf{S}calable \textbf{Co}mmunication via \textbf{U}tility-guided \textbf{T}emporal grouping), which addresses both of these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples \textit{soft} agent groups every $K$ environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. A group-aware critic predicts values for each agent group and maps them to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender's contribution from the recipient's aggregated messages. This counterfactual computation enables credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos, and code: \href{https://scout-comm.github.io/}{https://scout-comm.github.io/}
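The grouping step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it only shows, under stated assumptions, how Gumbel-Softmax samples give soft assignments $Y$, how $G = YY^\top$ yields a pairwise affinity, and how per-group values could be mapped to per-agent baselines through $Y$. The function names, the logits, the temperature `tau`, and the group values are all hypothetical, and the weighted-sum mapping in `per_agent_baselines` is one plausible reading of "through the same soft assignments".

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Draw one relaxed (soft) one-hot sample from a row of group logits."""
    noisy = []
    for l in logits:
        # Clamp U away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((l + gumbel) / tau)
    m = max(noisy)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def affinity(Y):
    """G = Y Y^T: entry (i, j) grows when agents i and j share groups."""
    n = len(Y)
    return [[sum(a * b for a, b in zip(Y[i], Y[j])) for j in range(n)]
            for i in range(n)]

def per_agent_baselines(Y, group_values):
    """Assumed mapping: each agent's baseline is the Y-weighted sum of group values."""
    return [sum(y * v for y, v in zip(row, group_values)) for row in Y]

rng = random.Random(0)
# Hypothetical logits for 3 agents and 2 latent groups.
group_logits = [[2.0, 0.0], [1.5, 0.1], [0.0, 2.5]]
Y = [gumbel_softmax(row, tau=0.5, rng=rng) for row in group_logits]
G = affinity(Y)
# Hypothetical per-group values, standing in for the group-aware critic's outputs.
baselines = per_agent_baselines(Y, group_values=[1.0, -1.0])
```

In a training loop following the abstract, `Y` and `G` would be resampled only every $K$ environment steps and reused in between.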

Paper Structure (85 sections, 26 equations, 9 figures, 8 tables)

Introduction
Contributions.
Related Work
Foundations and CTDE training.
Learned communication in MARL.
Targeted routing, attention, and structure.
Scalability and many-agent benchmarks.
Problem Setting
Communication interface.
Policies and objective.
Learning setup (CTDE).
SCoUT
Agent backbone and descriptors
Temporal soft grouping and affinity prior
Grouping objective.
...and 70 more sections

Figures (9)

Figure 1: SCoUT forward-pass overview across two timescales. At each macro-step boundary $t_b$, the grouping module samples soft assignments $Y_{t_b}$ and forms an affinity matrix $G_{t_b}=Y_{t_b}Y_{t_b}^\top$, which is held fixed over the subsequent $K$ primitive steps and used as a log-bias $\log(G_{t_b})$ for recipient selection. At each primitive step $t$, each agent embeds its local observation $o_t^i$ and mailbox input $m_t^i$, updates a shared GRU backbone, and outputs a three-headed policy: environment action $a_t^i$, send decision $c_t^i$, and recipient $\rho_t^i$. If $c_t^i=1$, the agent transmits message content $x_t^i=z_t^{\mathrm{msg},i}$ to the chosen recipient; recipients aggregate incoming messages into the next-step mailbox.
Figure 2: Battle training curves across scales. Episode return vs environment steps (shaded regions indicate variability across seeds). SCoUT learns rapidly and consistently across scales, while baselines exhibit substantial scale- and topology-sensitivity.
Figure 3: Pursuit scaling summary. Catch% (top) and $\mathrm{TT}_{50}$ (bottom) vs. number of pursuers $m$ (error bars: std over 20 evaluation seeds). In the bottom panel, we plot $\mathrm{TT}_{50}{=}500$ when R$_{50}{<}50\%$ (Table \ref{tab:pursuit_ablations} reports these cases as N/A).
Figure 4: Battle reference map at $64$ vs $64$ (map size $40\times 40$). Red squares denote controlled agents and blue squares denote opponent agents.
Figure 5: Pursuit reference map at 20P--8E (map size $40\times 40$). Red circles denote pursuers and blue circles denote evaders. Orange translucent squares indicate local observation windows for pursuers.
...and 4 more figures
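
The Figure 1 caption describes two mechanisms that a short sketch can make concrete: using $\log(G_{t_b})$ as a bias on recipient selection, and aggregating incoming messages into a mailbox from which one sender's contribution can be removed. The code below is a sketch under assumptions, not the paper's method. Sum aggregation, masking out the sender as a possible recipient, the `eps` constant, and all function names are hypothetical choices made for illustration. How the counterfactual mailbox is turned into an advantage is not specified in this excerpt and is not attempted here.

```python
import math

def recipient_probs(logits, affinity_row, sender, eps=1e-8):
    """Softmax over recipients with log(G) added to the logits as a bias.

    Assumption: an agent cannot address itself, so its own entry is masked.
    """
    biased = [
        -math.inf if j == sender else l + math.log(g + eps)
        for j, (l, g) in enumerate(zip(logits, affinity_row))
    ]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def mailbox(messages, recipient):
    """Sum the message vectors addressed to `recipient` (sum aggregation is assumed)."""
    dim = len(messages[0][2])
    box = [0.0] * dim
    for _, dst, x in messages:
        if dst == recipient:
            box = [b + xi for b, xi in zip(box, x)]
    return box

def counterfactual_mailbox(box, x_sender):
    """Mailbox the recipient would have seen had this sender stayed silent."""
    return [b - xi for b, xi in zip(box, x_sender)]

# Hypothetical messages as (sender, recipient, content) triples.
messages = [(0, 2, [1.0, 0.0]), (1, 2, [0.5, 2.0])]
box = mailbox(messages, recipient=2)
cf = counterfactual_mailbox(box, messages[0][2])

# With equal logits, the affinity row alone decides the recipient distribution.
probs = recipient_probs([0.0, 0.0, 0.0], [1.0, 0.9, 0.1], sender=0)
```

Under sum aggregation the removal is exact and costs one subtraction per sender, which is why no second forward pass over the remaining messages is needed in this sketch; a mean or attention-based aggregator would need a different correction.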
