TAPE: Leveraging Agent Topology for Cooperative Multi-Agent Policy Gradient

Xingzhou Lou; Junge Zhang; Timothy J. Norman; Kaiqi Huang; Yali Du

TL;DR

This work tackles the centralized-decentralized mismatch (CDM) issue in centralized-training multi-agent policy gradient (MAPG) methods by introducing an agent topology that defines coalitions for policy updates, leading to two topology-guided MAPG variants, Stochastic TAPE and Deterministic TAPE. The paper proves a policy-improvement theorem for the stochastic variant and explains theoretically how the topology enhances cooperation through more diverse parameter updates, with the increase in update diversity proportional to $p^2$, where $p$ is the edge probability of the topology. Empirically, topologies generated by the Erdős–Rényi (ER) model yield the most diverse and effective coalitions, improving performance on matrix games, Level-Based Foraging, and SMAC while mitigating CDM; a heuristic graph-search method analyzes topology choices and demonstrates a practical compromise between cooperation and CDM. Overall, TAPE shows that an explicit coalition structure over policy updates can both promote cooperation and suppress detrimental cross-agent interference, with adaptive topology learning left as future work.

Abstract

Multi-Agent Policy Gradient (MAPG) has made significant progress in recent years. However, centralized critics in state-of-the-art MAPG methods still face the centralized-decentralized mismatch (CDM) issue, in which sub-optimal actions by some agents affect other agents' policy learning. Using individual critics for policy updates avoids this issue but severely limits cooperation among agents. To address this issue, we propose an agent topology framework, which decides whether other agents should be considered in the policy gradient and achieves a compromise between facilitating cooperation and alleviating the CDM issue. The agent topology allows agents to use coalition utility as the learning objective, instead of the global utility given by centralized critics or the local utility given by individual critics. To constitute the agent topology, various models are studied. We propose Topology-based multi-Agent Policy gradiEnt (TAPE) for both stochastic and deterministic MAPG methods. We prove the policy improvement theorem for stochastic TAPE and give a theoretical explanation for the improved cooperation among agents. Experimental results on several benchmarks show that the agent topology both facilitates agent cooperation and alleviates the CDM issue, improving the performance of TAPE. Finally, multiple ablation studies and a heuristic graph search algorithm are devised to show the efficacy of the agent topology.
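The idea of a coalition utility that sits between local and global utility can be illustrated with a minimal sketch. It assumes, hypothetically, that the topology is a directed adjacency matrix sampled with edge probability `p`, that every agent always includes itself, and that an agent's coalition utility is the plain sum of the local utilities of the agents it is connected to. The function names and the utility values are illustrative and are not taken from the paper.

```python
import random


def sample_er_topology(n_agents, p, rng):
    """Sample a directed ER-style agent topology as an adjacency matrix.

    adj[i][j] == 1 means agent i includes agent j in its coalition.
    Assumption: self-loops are always present, so each agent counts itself.
    """
    adj = [[0] * n_agents for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(n_agents):
            if i == j or rng.random() < p:
                adj[i][j] = 1
    return adj


def coalition_utilities(adj, local_utilities):
    """Assumed form: agent i's coalition utility is the sum of the local
    utilities of all agents j with adj[i][j] == 1."""
    return [
        sum(u for connected, u in zip(row, local_utilities) if connected)
        for row in adj
    ]


rng = random.Random(0)
local_q = [1.0, -0.5, 2.0, 0.25]  # made-up local utilities for 4 agents

edgeless = sample_er_topology(4, 0.0, rng)  # p = 0: individual objectives
full = sample_er_topology(4, 1.0, rng)      # p = 1: every agent sees the total
mixed = sample_er_topology(4, 0.3, rng)     # in between: partial coalitions

edgeless_u = coalition_utilities(edgeless, local_q)
full_u = coalition_utilities(full, local_q)
mixed_u = coalition_utilities(mixed, local_q)
```

Under these assumptions, `p = 0` reduces each agent's objective to its own local utility and `p = 1` gives every agent the same global sum, while intermediate values of `p` produce agent-specific coalitions.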

Paper Structure (14 sections, 39 equations, 6 figures, 1 table, 2 algorithms)

Introduction
Preliminaries
Related Work
Topology-based Multi-Agent Policy Gradient
Stochastic TAPE
Deterministic TAPE
Analysis
Agent Topology
Theoretical Results
Experiment
Matrix Game
Level-Based Foraging
StarCraft Multi-Agent Challenge
Conclusion and Future Work

Figures (6)

Figure 1: (a) shows the three proposed matrix games of different difficulty levels, each with its own color: blue represents Easy, green represents Medium, and red represents Hard. (b), (c) and (d) give evaluation results. Stochastic TAPE has the best performance because the agents directly maximize joint utility to achieve strong cooperation. The only difference between TAPE and DOP is that TAPE adopts the agent topology. Although COMA is seen as a weak baseline on SMAC, it achieves much better performance than DOP. QMIX fails to perform well in these games because they are not monotonic.
Figure 2: (a) shows the scenario 6x6-3p-4f in LBF, where 6x6-3p-4f stands for a 6x6 grid-world with 3 players and 4 fruits. (b) In 8x8-2p-3f, stochastic TAPE achieves the best performance. (c) In the more difficult task 15x15-4p-5f, deterministic TAPE outperforms its base method and all other baselines. Compare stochastic TAPE against DOP, and deterministic TAPE against PAC.
Figure 3: Experiment results on SMAC. (a-c) give the results on hard maps, and (d-f) give the results on super-hard maps. After adopting our agent topology to facilitate cooperation and alleviate the CDM issue, stochastic TAPE and deterministic TAPE outperform their respective base methods. Compare stochastic TAPE against DOP, and deterministic TAPE against PAC.
Figure 4: (a) and (b) show the results and performance of using different models to constitute agent topologies. BA is the Barabási–Albert model, WS is the Watts–Strogatz model, ER is the Erdős–Rényi model, and Edgeless and FC (Fully-Connected) are the topologies adopted in DOP and PAC respectively. ER has the most diverse topologies and the strongest performance. (c) and (d) show the performance of stochastic TAPE and deterministic TAPE in MMM2 with different values of the hyperparameter $p$ for the ER model. The evaluation metric is test win rate, and scores are normalized by the base method. The base method DOP corresponds to $p=0$, and the base method PAC corresponds to $p=1$. The boxplot is obtained with four different random seeds, and the red lines show the mean performance.
Figure 5: The heatmaps show the difference between the frequency of edges being present and the probability $p$. Source and Destination represent the starting node and destination node of an edge. During training, over 1 million agent topologies are generated. By the law of large numbers, the difference is always around 0 when the heuristic graph search technique is not used, as in (b). In (a) and (c), we adopt the heuristic graph search technique to choose the agent topology with the strongest performance. When $p$ is too small (0.01 in (a)), the connections among agents are too sparse, weakening cooperation among agents. Therefore, agent topologies with more edges can facilitate cooperation and are preferred by the graph search technique. As a result, the difference is always positive in (a). On the contrary, when the connections are too dense ($p=0.3$ in (c)), topologies with fewer edges are preferred because they stop the bad influence of sub-optimal actions from spreading and perform better, resulting in negative differences in (c).
...and 1 more figure
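The law-of-large-numbers argument in the Figure 5 caption can be checked with a small simulation. The sketch below covers only the case without graph search: each directed edge between distinct agents is drawn independently with probability `p`, and the empirical edge frequency is compared with `p`. The function name, the sample count, and the choice to report zero on the diagonal are assumptions made for illustration; the heuristic graph search itself is not reproduced here.

```python
import random


def edge_frequency_gap(n_agents, p, n_samples, seed=0):
    """Return a matrix of (empirical edge frequency - p) for each directed
    edge (source i, destination j), sampling every edge independently.

    Assumption: diagonal entries are not sampled and are reported as 0.0.
    """
    rng = random.Random(seed)
    counts = [[0] * n_agents for _ in range(n_agents)]
    for _ in range(n_samples):
        for i in range(n_agents):
            for j in range(n_agents):
                if i != j and rng.random() < p:
                    counts[i][j] += 1
    return [
        [(counts[i][j] / n_samples - p) if i != j else 0.0
         for j in range(n_agents)]
        for i in range(n_agents)
    ]


gap = edge_frequency_gap(n_agents=5, p=0.3, n_samples=20000)
max_abs_gap = max(abs(g) for row in gap for g in row)
print(f"largest |frequency - p| over all edges: {max_abs_gap:.4f}")
```

With unselected sampling, every entry of `gap` stays close to zero, which matches the caption's description of panel (b). A selection step that prefers denser or sparser topologies would push the entries positive or negative, as the caption describes for panels (a) and (c).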

Theorems & Definitions (2)

Definition 1: Coalition Utility
Definition 2: Coalition $Q$
