Off-Policy Multi-Agent Decomposed Policy Gradients

Yihan Wang, Beining Han, Tonghan Wang, Heng Dong, Chongjie Zhang

TL;DR

This paper investigates causes that hinder the performance of MAPG algorithms and presents a multi-agent decomposed policy gradient method (DOP), which introduces the idea of value function decomposition into the multi-agent actor-critic framework and formally shows that DOP critics have sufficient representational capability to guarantee convergence.

Abstract

Multi-agent policy gradient (MAPG) methods have recently witnessed vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate the causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issues of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.

Paper Structure

This paper contains 35 sections, 4 theorems, 43 equations, 4 figures, 2 algorithms.

Key Result

Proposition 1

[Stochastic DOP policy improvement theorem] Under a mild assumption, for any pre-update policy $\bm\pi^o$ that is updated to $\bm\pi$ by the stochastic DOP policy gradient, let $\pi_i(a_i | \tau_i) = \pi_i^o(a_i | \tau_i) + \beta_{a_i, \bm\tau} \delta$, where $\delta>0$ is a sufficiently small number. If it holds that [...], then the joint policy is improved by the update.
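
The perturbation form $\pi_i = \pi_i^o + \beta\,\delta$ can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration rather than the paper's stated setup: a single state, a fixed critic that is linear in per-agent local Q values with non-negative weights, and a hypothetical choice of $\beta$ (the mean-centred local Q values, so that each agent's probabilities still sum to one). Under those assumptions, shifting probability toward actions with higher local Q raises the expected value of the decomposed critic.

```python
import numpy as np

def expected_q_tot(policies, local_qs, weights, bias=0.0):
    """Expected value of a linear critic sum_i k_i * Q_i(a_i) + b
    under independent per-agent policies (single state, assumed setup)."""
    per_agent = [float(np.dot(p, q)) for p, q in zip(policies, local_qs)]
    return float(np.dot(weights, per_agent) + bias)

def perturb(policy, local_q, delta):
    """Return policy + beta * delta, where beta is the mean-centred local Q.
    beta sums to zero, so the result still sums to one.
    This choice of beta is hypothetical, used only for illustration."""
    beta = local_q - local_q.mean()
    new_policy = policy + beta * delta
    if np.any(new_policy < 0.0):
        raise ValueError("delta is too large for this policy")
    return new_policy

# Two agents with 3 and 2 actions; all numbers are made up.
local_qs = [np.array([1.0, 0.0, -1.0]), np.array([0.5, 2.0])]
weights = np.array([0.7, 0.3])  # non-negative mixing weights (assumed)
old = [np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.4])]

new = [perturb(p, q, delta=0.01) for p, q in zip(old, local_qs)]
j_old = expected_q_tot(old, local_qs, weights)
j_new = expected_q_tot(new, local_qs, weights)
print(f"before: {j_old:.6f}  after: {j_new:.6f}")
```

Because the critic is held fixed, this only checks one-step improvement with respect to the critic's own estimates; it does not verify the proposition itself, whose precise condition is not reproduced here.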

Figures (4)

  • Figure 1: A decomposed critic.
  • Figure 2: Bias-variance trade-off of DOP on the didactic example. Left: gradient variance; Middle: performance; Right: average bias in Q estimations; Right-bottom: the element in the $i$th row and $j$th column is the local Q value learned by DOP for agent $i$ taking action $j$.
  • Figure 3: Comparisons with baselines and ablations on the SMAC benchmark.
  • Figure 4: Left and middle: performance comparisons with COMA and MAAC on MPE. Right: The learned credit assignment mechanism on task $\mathtt{Mill}$ by deterministic DOP.

Theorems & Definitions (10)

  • Proposition 1
  • proof
  • proof
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Proposition 1
  • proof