On the Fundamental Limitations of Decentralized Learnable Reward Shaping in Cooperative Multi-Agent Reinforcement Learning

Aditya Akella

TL;DR

This work investigates whether decentralized learnable reward shaping can solve coordination challenges in cooperative multi-agent reinforcement learning. It compares a fully decentralized approach (DMARL-RSA) to centralized MAPPO and independent IPPO on the Simple Spread task, using both heuristic shaping and per-agent learnable rewards. The findings show a strong centralized advantage: MAPPO achieves a global reward of $1.92 \pm 0.87$, while DMARL-RSA and IPPO lag far behind at $-24.20 \pm 0.09$ and $-23.19 \pm 0.96$, respectively, despite DMARL-RSA attaining higher landmark coverage ($0.888 \pm 0.029$) than MAPPO ($0.273 \pm 0.008$). The authors identify three fundamental barriers—non-stationarity, exponential credit assignment, and misalignment between local and global objectives—that prevent decentralized reward learning from achieving global coordination, underscoring the necessity of centralized coordination (or hybrid approaches) for effective multi-agent cooperation.

Abstract

Recent advances in learnable reward shaping have shown promise in single-agent reinforcement learning by automatically discovering effective feedback signals. However, the effectiveness of decentralized learnable reward shaping in cooperative multi-agent settings remains poorly understood. We propose DMARL-RSA, a fully decentralized system where each agent learns individual reward shaping, and evaluate it on cooperative navigation tasks in the simple_spread_v3 environment. Despite sophisticated reward learning, DMARL-RSA achieves only -24.20 +/- 0.09 average reward, compared to MAPPO with centralized training at 1.92 +/- 0.87 -- a 26.12-point gap. DMARL-RSA performs similarly to simple independent learning (IPPO: -23.19 +/- 0.96), indicating that advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Interestingly, decentralized methods achieve higher landmark coverage (0.888 +/- 0.029 for DMARL-RSA, 0.960 +/- 0.045 for IPPO out of 3 total) but worse overall performance than centralized MAPPO (0.273 +/- 0.008 landmark coverage) -- revealing a coordination paradox between local optimization and global performance. Analysis identifies three critical barriers: (1) non-stationarity from concurrent policy updates, (2) exponential credit assignment complexity, and (3) misalignment between individual reward optimization and global objectives. These results establish empirical limits for decentralized reward learning and underscore the necessity of centralized coordination for effective multi-agent cooperation.
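The gap and the coordination paradox quoted in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the means reported above; the dictionary layout and the helper name `reward_gap` are illustrative assumptions, not part of the paper's code.

```python
# Mean values as reported in the abstract (standard deviations omitted).
# The structure and names here are hypothetical, chosen for illustration.
reported = {
    "MAPPO": {"reward": 1.92, "coverage": 0.273},
    "DMARL-RSA": {"reward": -24.20, "coverage": 0.888},
    "IPPO": {"reward": -23.19, "coverage": 0.960},
}


def reward_gap(method_a, method_b):
    """Difference in mean reward (a minus b), rounded to two decimals."""
    return round(reported[method_a]["reward"] - reported[method_b]["reward"], 2)


# Centralized vs. decentralized reward gap quoted in the abstract.
gap_mappo_dmarl = reward_gap("MAPPO", "DMARL-RSA")

# Gap between plain independent learning and learnable shaping.
gap_ippo_dmarl = reward_gap("IPPO", "DMARL-RSA")

# Rank the methods by each metric, best first.
by_reward = sorted(reported, key=lambda m: reported[m]["reward"], reverse=True)
by_coverage = sorted(reported, key=lambda m: reported[m]["coverage"], reverse=True)

print("MAPPO - DMARL-RSA reward gap:", gap_mappo_dmarl)
print("IPPO - DMARL-RSA reward gap:", gap_ippo_dmarl)
print("Ranking by reward:", by_reward)
print("Ranking by coverage:", by_coverage)
```

On these numbers, MAPPO ranks first on reward and last on landmark coverage, which is the paradox the abstract describes. DMARL-RSA and IPPO differ by about one reward point, which is small next to the gap to MAPPO.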

Paper Structure

This paper contains 20 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Centralized vs Decentralized MARL Architecture Comparison - Technical diagram showing MAPPO's centralized critic versus DMARL-RSA's independent components.
  • Figure 2: Coordination Paradigms in Multi-Agent Systems - Conceptual diagram illustrating centralized coordination versus decentralized decision-making approaches.
  • Figure 3: Learning Curves Comparing Centralized and Decentralized Coordination Approaches - Performance trajectories showing MAPPO, DMARL-RSA, and IPPO across 5,000 training episodes.
  • Figure 4: Final Performance Analysis with Statistical Significance Testing - Bar chart displaying mean rewards $\pm$ standard deviations for all three methods after convergence.
  • Figure 5: Coordination Paradox - Local vs Global Optimization Trade-off. Decentralized methods achieve higher landmark coverage but worse overall performance.