Learning what to say and how precisely: Efficient Communication via Differentiable Discrete Communication Learning

Aditya Kapoor, Yash Bhisikar, Benjamin Freed, Jan Peters, Mingfei Sun

TL;DR

The paper addresses bandwidth constraints in multi-agent reinforcement learning by generalizing Differentiable Discrete Communication Learning (DDCL) to unbounded, signed signals, enabling end-to-end optimization of discrete inter-agent messages as a plug-and-play layer. It derives a differentiable upper bound on the communication cost for unbounded signals and reports experiments in which DDCL reduces communication by over an order of magnitude while maintaining or improving task performance across four MARL+Comms algorithms and several benchmarks. A key finding is empirical support for the Bitter Lesson: a simple Transformer-based policy equipped with DDCL can rival or surpass complex, hand-crafted communication architectures, suggesting that scalable, general mechanisms may outperform bespoke designs. The work includes a reproducibility commitment with open-source code and theoretical results on the quantization error, pointing to a practical path toward efficient, scalable MARL communication systems.

Abstract

Effective communication in multi-agent reinforcement learning (MARL) is critical for success but constrained by bandwidth, yet past approaches have been limited to complex gating mechanisms that only decide whether to communicate, not how precisely. Learning to optimize message precision at the bit level is fundamentally harder, as the required discretization step breaks gradient flow. We address this by generalizing Differentiable Discrete Communication Learning (DDCL), a framework for end-to-end optimization of discrete messages. Our primary contribution is an extension of DDCL to support unbounded signals, transforming it into a universal, plug-and-play layer for any MARL architecture. We verify our approach with three key results. First, through a qualitative analysis in a controlled environment, we demonstrate how agents learn to dynamically modulate message precision according to the informational needs of the task. Second, we integrate our variant of DDCL into four state-of-the-art MARL algorithms, showing it reduces bandwidth by over an order of magnitude while matching or exceeding task performance. Finally, we provide direct evidence for the "Bitter Lesson" in MARL communication: a simple Transformer-based policy leveraging DDCL matches the performance of complex, specialized architectures, questioning the necessity of bespoke communication designs.

Paper Structure

This paper contains 27 sections, 2 theorems, 24 equations, 12 figures, 6 tables, 1 algorithm.

Key Result

Theorem A.1

For any real-valued signal $z \in \mathbb{R}$, let the reconstructed signal $\hat{z}$ be generated by the quantization procedure described in the paper's quantization procedure section. The resulting reconstruction error, defined as $e = \hat{z} - z$, is statistically independent of the original signal $z$.
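
The independence claim can be checked numerically with a small sketch. The code below is not the paper's implementation: it assumes the quantization procedure is a standard subtractive-dither scheme (the sender adds shared uniform noise and rounds, the receiver subtracts the same noise), and the function names, the step size `delta`, and the Gaussian test signal are all hypothetical choices made for illustration.

```python
import numpy as np


def dithered_quantize(z, delta, rng):
    """Sender side (assumed scheme): add uniform dither, round to an integer message."""
    # u stands in for the shared randomness known to both sender and receiver.
    u = rng.uniform(-delta / 2, delta / 2, size=np.shape(z))
    m = np.round((z + u) / delta).astype(np.int64)
    return m, u


def reconstruct(m, u, delta):
    """Receiver side (assumed scheme): rescale the message and subtract the same dither."""
    return m * delta - u


rng = np.random.default_rng(0)
delta = 0.5                                   # hypothetical quantization step
z = rng.normal(0.0, 3.0, size=200_000)        # unbounded, signed test signal

m, u = dithered_quantize(z, delta, rng)
z_hat = reconstruct(m, u, delta)
e = z_hat - z                                 # reconstruction error

# Under this scheme the error is bounded by delta / 2 and, empirically,
# uncorrelated with the signal, which is consistent with independence.
max_abs_error = float(np.max(np.abs(e)))
corr_e_z = float(np.corrcoef(e, z)[0, 1])
var_e = float(np.var(e))
```

Under these assumptions the error behaves like uniform noise on $[-\delta/2, \delta/2]$ with variance close to $\delta^2/12$, regardless of how large $z$ is. A near-zero correlation is only a necessary condition for independence, so this is a sanity check rather than a substitute for the proof.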

Figures (12)

  • Figure 1: Qualitative analysis of the learned communication protocol in the 'CommunicatingGoalEnv' toy problem. (a) Success Rate and Communication Rate plotted against different $\lambda$ values. The episodic plot illustrates a "lossless compression" regime where the success rate remains perfect (1.0) while the required communication bits are significantly reduced as $\lambda$ increases to $8 \times 10^{-3}$. (b) A per-timestep comparison of the learned communication policy with the ground-truth goal sampling frequency. The strong negative correlation ($r = -0.993$) demonstrates that the agent learns a frequency-aware code, allocating the fewest bits to the most probable goals.
  • Figure 2: Performance versus communication bandwidth across all benchmark environments. Each point represents an algorithm variant's mean performance (Success Rate) and communication cost (Bits per episode, log scale) over 5 different seeds. Error bars denote 95% confidence intervals. Note that Y-axis scales are often focused on a specific range and may not start at zero. The top-left of each plot represents the ideal outcome (high success, low communication cost), while the bottom-right is the worst. Our DDCL-enhanced variants (red markers) consistently operate on the left side of the plots, demonstrating significant communication savings. The global Pareto frontier, representing the best possible trade-offs, is marked with a thick black border, while algorithm-specific frontiers are marked with a thin black border. 'STE_X' refers to a baseline using a Straight-Through Estimator to quantize 32-bit float messages to 'X' bits.
  • Figure 3: An overview of the generalized DDCL procedure. A sender's unbounded, real-valued signal $z$ is perturbed, quantized, and sent as a discrete message $m$. The receiver uses shared randomness to reconstruct the signal $\hat{z}$ in a way that allows gradients to flow back to the sender.
  • Figure 4: A comparison of the learned communication bit costs (left) with the goal sampling frequency (right) in the 8x8 grid. Our method learns a spatially smooth code that mirrors the underlying probability distribution, assigning the lowest cost to the most frequent goal at '(0,0)' and progressively higher costs to locations further away. Grid coordinates that were not sampled as goals are shown in white.
  • Figure 5: Communication bits per goal, sorted by the goal's frequency category. This chart clearly illustrates the learned inverse relationship: high-frequency goals are encoded with very few bits, while low-frequency goals require significantly more.
  • ...and 7 more figures
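
The inverse relationship between goal frequency and bit cost described in the captions of Figures 1, 4 and 5 mirrors what an ideal entropy code would do. The sketch below illustrates that reference point only; the goal probabilities are hypothetical and are not taken from the paper, and the Shannon code length $-\log_2 p$ is a textbook baseline, not the cost function that DDCL optimizes.

```python
import numpy as np

# Hypothetical goal-sampling probabilities (illustrative, not from the paper).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Ideal (Shannon) code length per goal: frequent goals get fewer bits.
ideal_bits = -np.log2(p)

# Expected bits per message under the ideal code (the entropy of p),
# compared with a fixed-length code over the same four goals.
expected_bits = float(np.sum(p * ideal_bits))
fixed_length_bits = float(np.log2(len(p)))

# Pearson correlation between frequency and bit cost; negative by construction.
r = float(np.corrcoef(p, ideal_bits)[0, 1])
```

A learned protocol whose per-goal bit cost correlates negatively with goal frequency, as reported for Figure 1(b), is behaving qualitatively like this ideal code.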

Theorems & Definitions (5)

  • Theorem A.1: Statistical Independence of Reconstruction Error
  • proof: Proof of Theorem A.1: Statistical Independence of Reconstruction Error
  • Theorem A.2: Statistical Independence Between $e$ and $z$.
  • proof
  • proof