Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning
Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun
TL;DR
The paper addresses bandwidth constraints in multi-agent reinforcement learning (MARL) by generalizing Differentiable Discrete Communication Learning (DDCL) to unbounded, signed signals, so that discrete inter-agent messages can be optimized end-to-end through a plug-and-play layer. It derives a differentiable, upper-bounded communication cost for unbounded signals and shows experimentally that DDCL reduces communication by over an order of magnitude while matching or improving task performance across four MARL+Comms algorithms and several benchmarks. The results also lend empirical support to the Bitter Lesson: a simple Transformer-based policy equipped with DDCL matches complex, hand-crafted communication architectures, suggesting that general, scalable mechanisms can stand in for bespoke designs. The authors commit to releasing open-source code for reproducibility and back the method with a theoretical analysis.
Abstract
Effective communication in multi-agent reinforcement learning (MARL) is critical for success but is constrained by bandwidth. Past approaches have been limited to complex gating mechanisms that only decide whether to communicate, not how precisely. Learning to optimize message precision at the bit level is fundamentally harder, because the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug-and-play layer for any MARL architecture. We validate our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate how agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state-of-the-art MARL algorithms, showing that it reduces bandwidth by over an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the "Bitter Lesson" in MARL communication: a simple Transformer-based policy leveraging DDCL matches the performance of complex, specialized architectures, questioning the necessity of bespoke communication designs.
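To make the gradient-flow problem concrete, the sketch below illustrates one standard way to send a real-valued message as unbounded, signed integers while keeping the channel equivalent to additive noise: subtractive dithered quantization. This is an illustration under stated assumptions and not necessarily the scheme used in the paper. The function name `dithered_quantize`, the step size `delta`, and the shared random generator are all hypothetical choices made for the example.

```python
import numpy as np

def dithered_quantize(message, delta, rng):
    """Illustrative subtractive dithered quantizer (an assumption, not the paper's code).

    Returns the integers that would be transmitted and the receiver's reconstruction.
    """
    # Dither assumed to be shared by sender and receiver, e.g. via a common seed.
    dither = rng.uniform(-0.5, 0.5, size=message.shape)
    # The sender transmits only these integers; they are signed and unbounded.
    symbols = np.round(message / delta + dither).astype(np.int64)
    # The receiver subtracts the same dither and rescales by the step size.
    reconstruction = (symbols - dither) * delta
    return symbols, reconstruction

rng = np.random.default_rng(0)
message = np.array([-3.7, 0.02, 12.5])
symbols, reconstruction = dithered_quantize(message, delta=0.5, rng=rng)
error = reconstruction - message  # magnitude never exceeds delta / 2
```

With uniform subtractive dither, the reconstruction error is bounded by `delta / 2` and, by a standard result on dithered quantization, is uniformly distributed and independent of the message. The discrete channel can therefore be treated as additive noise during training, which lets gradients pass from receiver to sender. A larger `delta` yields smaller integers (cheaper to encode) at the cost of coarser messages, which is the kind of precision trade-off the abstract describes.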
