ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination

Xihuai Wang; Shao Zhang; Wenhao Zhang; Wentao Dong; Jingxiao Chen; Ying Wen; Weinan Zhang

TL;DR

This paper presents ZSC-Eval, the first evaluation toolkit and benchmark for zero-shot coordination (ZSC) algorithms, and conducts a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation.

Abstract

Zero-shot coordination (ZSC) is a new cooperative multi-agent reinforcement learning (MARL) challenge that aims to train an ego agent to work with diverse, unseen partners during deployment. The significant difference between the deployment-time partners' distribution and the training partners' distribution, which is determined by the training algorithm, makes ZSC a unique out-of-distribution (OOD) generalization challenge. The potential distribution gap between evaluation and deployment-time partners leads to inadequate evaluation, which is exacerbated by the lack of appropriate evaluation metrics. In this paper, we present ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms. ZSC-Eval consists of: 1) Generation of evaluation partner candidates through behavior-preferring rewards to approximate the deployment-time partners' distribution; 2) Selection of evaluation partners by Best-Response Diversity (BR-Div); 3) Measurement of generalization performance with various evaluation partners via the Best-Response Proximity (BR-Prox) metric. We use ZSC-Eval to benchmark ZSC algorithms in the Overcooked and Google Research Football environments and obtain novel empirical findings. We also conduct a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation. ZSC-Eval is now available at https://github.com/sjtu-marl/ZSC-Eval.
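The measurement step can be illustrated with a small sketch. The abstract does not define BR-Prox formally, so the code below rests on assumptions: it treats the per-partner score as the ratio of the ego agent's return to the return of that partner's best response, and aggregates scores with a simple interquartile mean. The function names (`proximity`, `interquartile_mean`, `aggregate_proximity`) are hypothetical and are not taken from the ZSC-Eval codebase.

```python
from statistics import mean


def proximity(ego_return: float, br_return: float) -> float:
    """Assumed per-partner score: ego return relative to the
    return achieved by the partner's best response."""
    if br_return <= 0:
        raise ValueError("best-response return must be positive for a ratio")
    return ego_return / br_return


def interquartile_mean(values):
    """Mean of the middle portion of the sorted values, dropping
    len(values) // 4 entries from each end (a simple trimmed mean)."""
    ordered = sorted(values)
    n = len(ordered)
    lo, hi = n // 4, n - n // 4
    return mean(ordered[lo:hi])


def aggregate_proximity(ego_returns, br_returns):
    """Aggregate the assumed per-partner scores over all evaluation partners."""
    scores = [proximity(e, b) for e, b in zip(ego_returns, br_returns)]
    return interquartile_mean(scores)
```

Under these assumptions, a score near 1 would mean the ego agent coordinates with a partner about as well as that partner's best response does; the trimmed aggregate reduces the influence of unusually easy or hard partners.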

Paper Structure (33 sections, 2 equations, 26 figures, 11 tables, 1 algorithm)

Introduction
Related Work
Background
Decentralized Markov Decision Process
Limitations of Current Evaluation Methods
ZSC-Eval
Generation of Behavior-preferring Agents as Candidates
Selection of Evaluation Partners by Best Response Diversity
Measurement of ZSC Capability by Best Response Proximity
Experiments
Effectiveness of ZSC-Eval
Benchmark Results and Empirical Findings in Overcooked
Evaluating Zero-shot Coordination Capability in Google Research Football
Conclusion
Comparisons among Evaluation Methods
...and 18 more sections

Figures (26)

Figure 1: ZSC-Eval. 1) Generation: generating behavior-preferring agents and their best responses; 2) Selection: selecting evaluation partners by maximizing Best Response Diversity; 3) Measurement: evaluating the ego agent with the evaluation partners and computing Best Response Proximity.
Figure 2: (a) Different partners may respond to similar BRs. (b) Population diversity of BRs to partner subsets selected by two methods with different sizes. A higher vertical axis value at the same subset size indicates more diverse BRs in the subset.
Figure 3: Visualization of high-level behaviors of human proxy agents, different self-play populations, our evaluation partner candidates, and evaluation partners in Overcooked layouts.
Figure 4: BR-Prox performance with 95% confidence intervals of ZSC algorithms with different population sizes in Overcooked. '12\25', '24\50' and '36\75' mean that co-play methods (FCP, MEP, TrajeDi and HSP) are trained with populations of 12, 24 and 36 and that the evolution method (COLE) is trained with populations of 25, 50 and 75. Note that SP and E3T are not population-based.
Figure 5: BR-Prox performance of ZSC algorithms in Overcooked with multiple recipes.
...and 21 more figures
