MaskMA: Towards Zero-Shot Multi-Agent Decision Making with Mask-Based Collaborative Learning

Jie Liu; Yinmin Zhang; Chuming Li; Chao Yang; Yaodong Yang; Yu Liu; Wanli Ouyang

TL;DR

MaskMA is a mask-based collaborative learning framework for zero-shot multi-agent decision making. It targets two obstacles: the mismatch between centralized training and decentralized execution, and the generalization gap caused by varying agent numbers and action spaces. It combines a transformer backbone with a Mask-Based Training Strategy (MTS) and a Generalizable Action Representation (GAR), and is evaluated on the SMAC benchmark by training a single model on 11 maps and testing it on 60 unseen maps. Under decentralized execution, MaskMA achieves a 77.8% average zero-shot win rate on the unseen maps, outperforms the MADT baseline, and performs well on downstream tasks such as varied policies collaboration, ally malfunction, and ad hoc team play. These results suggest that MaskMA is a promising step toward a generalist multi-agent decision-making model.

Abstract

Building a single generalist agent with strong zero-shot capability has recently sparked significant advancements. However, extending this capability to multi-agent decision-making scenarios presents challenges. Most current works struggle with zero-shot transfer because of two challenges particular to multi-agent settings: (a) a mismatch between centralized training and decentralized execution; and (b) difficulties in creating generalizable representations across diverse tasks due to varying agent numbers and action spaces. To overcome these challenges, we propose a Mask-Based collaborative learning framework for Multi-Agent decision making (MaskMA). First, to handle the mismatch, we randomly mask part of the units and collaboratively learn the policies of the unmasked units. Second, MaskMA integrates a generalizable action representation that divides the action space into intrinsic actions, which relate solely to the unit itself, and interactive actions, which involve interactions with other units. This flexibility allows MaskMA to tackle tasks with varying agent numbers and thus different action spaces. Extensive experiments in SMAC show that MaskMA, with a single model trained on 11 training maps, achieves a 77.8% average zero-shot win rate on 60 unseen test maps under decentralized execution, while also performing effectively on other types of downstream tasks (e.g., varied policies collaboration, ally malfunction, and ad hoc team play).
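The split between intrinsic and interactive actions can be illustrated with a minimal sketch. It is not the paper's implementation: the dot-product scoring, the embedding size, the count of six intrinsic actions, and all function names are assumptions made for illustration. The point it shows is that the number of action logits grows with the number of other units while the parameters stay fixed.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def action_logits(own, others, intrinsic_weights):
    """Logits for one unit: a fixed set of intrinsic actions followed by
    one interactive action per other unit (hypothetical scoring rule)."""
    # Intrinsic actions depend only on the unit's own embedding.
    intrinsic = [dot(w, own) for w in intrinsic_weights]
    # Interactive actions are scored against each other unit's embedding,
    # so their count follows the number of units on the map.
    interactive = [dot(own, other) for other in others]
    return intrinsic + interactive


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
dim, n_intrinsic = 4, 6  # assumed sizes, chosen only for the example
intrinsic_weights = [[rng.gauss(0, 1) for _ in range(dim)]
                     for _ in range(n_intrinsic)]
own = [rng.gauss(0, 1) for _ in range(dim)]

# The same weights serve maps with different numbers of other units.
for n_others in (3, 9):
    others = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_others)]
    probs = softmax(action_logits(own, others, intrinsic_weights))
    print(n_others, len(probs))
```

Because the interactive logits are computed per target unit rather than by a fixed-width output layer, the same weights apply whether a map has 3 or 9 other units.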

Paper Structure (38 sections, 1 equation, 8 figures, 10 tables)

This paper contains 38 sections, 1 equation, 8 figures, 10 tables.

Introduction
Related Work
Masked Training in Single-agent Decision Making.
Multi-agent Decision Making as Sequence Modeling.
Action Representation.
Method
Formulation
Multi-Agent Preliminaries
Multi-Task Imitation Learning
Mask-Based Training Strategy
Generalizable Action Representation
Loss Function
Experiments
Setup.
Baseline
...and 23 more sections

Figures (8)

Figure 1: Win rate on training and testing maps. The blue line separates the 11 training maps on the left from the 60 testing maps on the right, where the performance on the testing maps is the zero-shot performance. The orange line demonstrates the substantial performance advantage of MaskMA over MADT.
Figure 2: MaskMA employs the transformer architecture combined with a generalizable action representation and is trained through a mask-based training strategy. It effectively generalizes skills and knowledge from training maps to various downstream tasks, including unseen maps, varied policies collaboration, ally malfunction, and ad hoc team play.
Figure 3: Visualization of the attention matrix in MTS. Left: The attention matrix displays a causal structure along the timestep dimension, complemented by a non-causal configuration within each discrete timestep. Right: The final attention matrix used for training is obtained by randomly masking elements of the left attention matrix.
Figure 4: Visualization of Ad Hoc Team Play on 7m_vs_9m with Marine Inclusion Time = 0.8. This experiment demonstrates that when new Marines are added near the end of an episode, MaskMA can still quickly incorporate them into the team and enable them to contribute effectively. (a) Initial distribution of agents' positions. (b) Prior to the addition of the new Marine, our team is left with only three severely wounded agents, on the brink of defeat. (c) The new agent (indicated by the red arrow) joins our team and immediately engages the enemy. (d) With the assistance of the newly added agent, our team successfully defeats the enemy.
Figure 5: (a) Comparison of learning curves, measured by win rate on training maps in the Decentralized Execution setting. MaskMA consistently outperforms MADT on average win rate. (b) Ablation on timestep length, measured by win rate on training maps in the Decentralized Execution setting. MaskMA performs better with a longer timestep. (c) Ablation on the number of training maps, measured by win rate on unseen maps in the Decentralized Execution setting. As the number of training maps increases (especially from 5 to 8), the model's performance on various unseen maps improves, indicating better generalization ability.
...and 3 more figures
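
The attention structure described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: tokens are assumed to be ordered timestep by timestep, masking is applied independently per matrix element, and each token is always allowed to attend to itself so that no row ends up empty. The function names and the `mask_ratio` parameter are hypothetical.

```python
import random


def block_causal_mask(n_steps, n_units):
    """Left matrix in Figure 3: causal across timesteps, non-causal within
    a timestep. Token index = timestep * n_units + unit (assumed layout).
    mask[q][k] is True when query q may attend to key k."""
    size = n_steps * n_units
    return [[(k // n_units) <= (q // n_units) for k in range(size)]
            for q in range(size)]


def random_mask(mask, mask_ratio, rng):
    """Right matrix in Figure 3: randomly drop allowed entries.
    Assumption: the diagonal is never dropped, so every token keeps
    at least its own entry."""
    return [[allowed and (q == k or rng.random() >= mask_ratio)
             for k, allowed in enumerate(row)]
            for q, row in enumerate(mask)]


base = block_causal_mask(n_steps=2, n_units=3)
masked = random_mask(base, mask_ratio=0.5, rng=random.Random(0))

# Print both matrices; '#' marks an allowed attention entry.
for name, matrix in (("base", base), ("masked", masked)):
    print(name)
    for row in matrix:
        print("".join("#" if allowed else "." for allowed in row))
```

Setting `mask_ratio` to 0.0 returns the unmasked block-causal matrix, and setting it to 1.0 leaves each token attending only to itself, which in this sketch corresponds to the fully decentralized end of the range.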
