Reducing Variance Caused by Communication in Decentralized Multi-agent Deep Reinforcement Learning

Changxi Zhu, Mehdi Dastani, Shihan Wang

TL;DR

This work analyzes how communication among critics in Decentralized Communicating Critics and Decentralized Actors (DCCDA) MADRL affects policy-gradient variance, proving $Var(\hat{g}^i_{DCCDA}) \geq Var(\hat{g}^i_{CTDE})$ under ideal and noisy conditions. It then introduces a modular variance-reduction framework comprising a message-dependent baseline (OB) and KL-based policy regularization to align actors with critic guidance, showing these techniques can be plugged into existing DCCDA methods. The authors validate their approach on StarCraft Multi-Agent Challenge and Traffic Junction, reporting reduced gradient variance and improved learning performance for OB-KL variants (e.g., GAAC-OB-KL, IPPO-Comm-OB-KL). The results suggest practical benefits for robust coordination in partially observable multi-agent systems with communication, with potential extension to continuous message spaces and broader Comm-MADRL settings.
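
To make the role of a baseline concrete, the following minimal sketch compares score-function (REINFORCE) gradient samples with and without a baseline. It is not the paper's message-dependent baseline (OB): it assumes a hypothetical three-armed bandit, a softmax policy, and the expected reward as the baseline. All names and numbers are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def score_function_gradients(logits, rewards, baseline, n_samples, rng):
    """Per-sample REINFORCE gradients w.r.t. the logits of a softmax policy."""
    probs = softmax(logits)
    actions = rng.choice(len(probs), size=n_samples, p=probs)
    # For a softmax policy, d log pi(a) / d logits = one_hot(a) - probs.
    grad_log_pi = np.eye(len(probs))[actions] - probs
    advantages = rewards[actions] - baseline
    return grad_log_pi * advantages[:, None]

rng = np.random.default_rng(0)
logits = np.zeros(3)                      # hypothetical uniform policy
rewards = np.array([10.0, 11.0, 12.0])    # hypothetical per-action rewards
value = softmax(logits) @ rewards         # expected reward, used as the baseline

g_plain = score_function_gradients(logits, rewards, 0.0, 20000, rng)
g_base = score_function_gradients(logits, rewards, value, 20000, rng)

# Total variance of the gradient samples, summed over components.
var_plain = g_plain.var(axis=0).sum()
var_base = g_base.var(axis=0).sum()
print(f"variance without baseline: {var_plain:.3f}")
print(f"variance with baseline:    {var_base:.3f}")
```

Both estimators target the same expected gradient, because subtracting an action-independent baseline leaves the expectation unchanged; only the spread of the samples differs.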

Abstract

In decentralized multi-agent deep reinforcement learning (MADRL), communication can help agents gain a better understanding of the environment and better coordinate their behaviors. Nevertheless, communication may involve uncertainty, which potentially introduces variance to the learning of decentralized agents. In this paper, we focus on a specific decentralized MADRL setting with communication and conduct a theoretical analysis of the variance that communication causes in policy gradients. We propose modular techniques to reduce the variance in policy gradients during training. We integrate our modular techniques into two existing algorithms for decentralized MADRL with communication and evaluate them on multiple tasks in the StarCraft Multi-Agent Challenge and Traffic Junction domains. The results show that decentralized MADRL communication methods extended with our proposed techniques not only achieve high-performing agents but also reduce variance in policy gradients during training.

Paper Structure

This paper contains 38 sections, 11 theorems, 39 equations, 6 figures, 6 tables, 2 algorithms.

Key Result

Theorem 1

The DCCDA sample gradient has a variance greater than or equal to that of the CTDE sample gradient in the idealistic communication setting: $Var(\hat{g}^i_{DCCDA}) \geq Var(\hat{g}^i_{CTDE})$.
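
As a rough intuition for why an estimator that depends on an additional random quantity can have larger variance, the toy sketch below compares two scalar estimators with the same mean, one of which carries extra independent noise. This illustrates the general phenomenon only; it is not the paper's proof and does not model CTDE or DCCDA. The variable names and noise scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)   # randomness shared by both estimators
m = rng.normal(size=n)   # extra independent randomness (hypothetical "message" noise)

g_without = 2.0 * x              # estimator that does not depend on m
g_with = 2.0 * x + 0.5 * m       # same mean, plus zero-mean noise from m

print(f"variance without extra noise: {g_without.var():.3f}")
print(f"variance with extra noise:    {g_with.var():.3f}")
```

Because the added noise is zero-mean and independent of `x`, the two estimators agree in expectation while the second has strictly larger variance.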

Figures (6)

  • Figure 1: DCCDA methods integrated with OB and KL.
  • Figure 2: Averaged win rate of all methods.
  • Figure 3: Averaged win rate when ablating OB and KL.
  • Figure 4: The training and execution phases for CTDE (without communication), CTDE (with communication), DTDE (without communication), and DCCDA using actor-critic methods.
  • Figure 5: Variance in policy gradient norm of all methods.
  • ...and 1 more figure

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Lemma 3
  • Theorem 2
  • Theorem 3
  • ...and 1 more