Reward-Independent Messaging for Decentralized Multi-Agent Reinforcement Learning

Naoto Yoshida, Tadahiro Taniguchi

TL;DR

The paper addresses how decentralized agents in partially observable multi-agent reinforcement learning (MARL) can establish effective communication without aligned rewards. It introduces MARL-CPC, a framework based on collective predictive coding (CPC) that treats inter-agent messages as latent variables aiding joint state inference, with a variational objective decomposed per agent. Two algorithms, Bandit-CPC and IPPO-CPC, show that CPC-enabled communication emerges and yields near-maximum group welfare even in non-cooperative tasks, outperforming traditional message-as-action baselines. The work provides a principled, scalable approach to emergent communication in fully decentralized, reward-independent MARL, enabling coordination in complex environments. The findings suggest that CPC-based communication can be a robust, information-centric alternative to conventional reward-driven signaling in decentralized systems.

Abstract

In multi-agent reinforcement learning (MARL), effective communication improves agent performance, particularly under partial observability. We propose MARL-CPC, a framework that enables communication among fully decentralized, independent agents without parameter sharing. MARL-CPC incorporates a message learning model based on collective predictive coding (CPC) from emergent communication research. Unlike conventional methods that treat messages as part of the action space and assume cooperation, MARL-CPC links messages to state inference, supporting communication in non-cooperative, reward-independent settings. We introduce two algorithms, Bandit-CPC and IPPO-CPC, and evaluate them in non-cooperative MARL tasks. Benchmarks show that both outperform standard message-as-action approaches, establishing effective communication even when messages offer no direct benefit to the sender. These results highlight MARL-CPC's potential for enabling coordination in complex, decentralized environments.
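To make the last point concrete, the following is a minimal illustrative sketch, not the paper's environment or algorithm. It assumes a hypothetical two-agent setting in which only the sender observes a binary context, only the receiver acts, and the receiver earns a reward of 1 when its arm matches the context. The names `CONTEXTS`, `MESSAGES`, `ARMS`, and `best_receiver_reward` are invented for this example. The sketch shows only how much a receiver can gain from an informative message, regardless of whether the sender is rewarded for sending it.

```python
from itertools import product

CONTEXTS = (0, 1)   # hidden context seen only by the sender (assumed uniform)
MESSAGES = (0, 1)   # discrete message alphabet (hypothetical size)
ARMS = (0, 1)       # actions available to the receiver


def best_receiver_reward(msg_policy):
    """Expected reward of the best deterministic receiver.

    msg_policy[c][m] is P(message = m | context = c).
    The receiver earns 1 when its arm equals the context, else 0.
    """
    best = 0.0
    # Enumerate every deterministic decoding: one arm per message.
    for decoding in product(ARMS, repeat=len(MESSAGES)):
        expected = 0.0
        for c in CONTEXTS:
            for m in MESSAGES:
                if decoding[m] == c:
                    # 0.5 is the assumed uniform prior over contexts.
                    expected += 0.5 * msg_policy[c][m]
        best = max(best, expected)
    return best


# A sender whose message ignores the context, as can happen when the
# sender's own reward does not depend on what it sends.
uninformative = {0: [0.5, 0.5], 1: [0.5, 0.5]}

# A sender whose message encodes the context, as an objective tied to
# state inference would encourage.
informative = {0: [1.0, 0.0], 1: [0.0, 1.0]}

print(best_receiver_reward(uninformative))  # 0.5
print(best_receiver_reward(informative))    # 1.0
```

In this toy setting the receiver cannot exceed chance with an uninformative message, while a message that encodes the sender's observation lets it reach the maximum reward.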

Paper Structure

This paper contains 22 sections, 12 equations, 9 figures, 1 algorithm.

Figures (9)

  • Figure 1: Graphical model of the CPC module (2 agents).
  • Figure 2: Overview of the MARL-CPC architecture. The figure shows a model with two agents. The components of each agent are represented by filled regions (white and gray, respectively). The central panel corresponds to the CPC module, which forms a pseudo-joint agent and enables message generation and exchange. Based on the messages $\bm{m}$ and the hidden states $z$ acquired through the CPC module, each agent performs action selection and value estimation. The dashed arrows indicate paths through which gradients do not propagate during learning.
  • Figure 3: Agent architectures compared in the experiments. A) Independent agents without communication [de2020independent]. B) Message agents, where communication is defined as an extension of action [cangelosi1998emergence, foerster2016learning]. C) CPC-based agents in which messages function as auxiliary variables for the state inference process (ours). D) Agents whose observations are shared in advance (performance upper bound).
  • Figure 4: Multi-agent conditional bandit environment.
  • Figure 5: Results in Bandit environment.
  • ...and 4 more figures