ZSC-Eval: An Evaluation Toolkit and Benchmark for Multi-agent Zero-shot Coordination

Xihuai Wang, Shao Zhang, Wenhao Zhang, Wentao Dong, Jingxiao Chen, Ying Wen, Weinan Zhang

TL;DR

This paper presents ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms, and conducts a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation.

Abstract

Zero-shot coordination (ZSC) is a new cooperative multi-agent reinforcement learning (MARL) challenge that aims to train an ego agent to work with diverse, unseen partners during deployment. The significant difference between the deployment-time partners' distribution and the training partners' distribution determined by the training algorithm makes ZSC a unique out-of-distribution (OOD) generalization challenge. The potential distribution gap between evaluation and deployment-time partners leads to inadequate evaluation, which is exacerbated by the lack of appropriate evaluation metrics. In this paper, we present ZSC-Eval, the first evaluation toolkit and benchmark for ZSC algorithms. ZSC-Eval consists of: 1) Generation of evaluation partner candidates through behavior-preferring rewards to approximate deployment-time partners' distribution; 2) Selection of evaluation partners by Best-Response Diversity (BR-Div); 3) Measurement of generalization performance with various evaluation partners via the Best-Response Proximity (BR-Prox) metric. We use ZSC-Eval to benchmark ZSC algorithms in the Overcooked and Google Research Football environments and obtain novel empirical findings. We also conduct a human experiment with current ZSC algorithms to verify ZSC-Eval's consistency with human evaluation. ZSC-Eval is now available at https://github.com/sjtu-marl/ZSC-Eval.
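
The abstract names the BR-Prox metric without giving its formula. As a rough illustration only, the sketch below assumes a proximity-style score is the ratio between the return the ego agent achieves with an evaluation partner and the return that partner's approximate best response (BR) achieves, averaged over partners. The function name `br_prox_sketch`, the ratio form, and the plain mean are all assumptions made for this example; they are not taken from the ZSC-Eval code base or confirmed by the text above.

```python
from statistics import mean


def br_prox_sketch(ego_returns, br_returns):
    """Hypothetical proximity score (an assumption, not the paper's definition).

    ego_returns[i]: episode return of the ego agent paired with partner i.
    br_returns[i]:  episode return of partner i's approximate best response
                    paired with the same partner.
    Returns the mean of the per-partner ratios ego / BR.
    """
    if len(ego_returns) != len(br_returns):
        raise ValueError("need one ego return and one BR return per partner")
    if not ego_returns:
        raise ValueError("need at least one evaluation partner")

    ratios = []
    for ego, br in zip(ego_returns, br_returns):
        if br <= 0:
            # A ratio is only meaningful against a positive BR return.
            raise ValueError("BR return must be positive")
        ratios.append(ego / br)
    return mean(ratios)


# Two hypothetical partners: the ego agent reaches half of the BR return
# with the first partner and matches the BR with the second.
score = br_prox_sketch([50.0, 100.0], [100.0, 100.0])
```

Under these assumptions, a score near 1 would mean the ego agent coordinates with each evaluation partner about as well as that partner's best response does, while lower scores indicate a larger generalization gap.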

Paper Structure

This paper contains 33 sections, 2 equations, 26 figures, 11 tables, 1 algorithm.

Figures (26)

  • Figure 1: ZSC-Eval. 1) Generation: generating behavior-preferring agents and their best responses; 2) Selection: selecting evaluation partners by maximizing Best Response Diversity; 3) Measurement: evaluating the ego agent with the evaluation partners and computing Best Response Proximity.
  • Figure 2: (a) Different partners may respond to similar BRs. (b) Population diversity of BRs to partner subsets selected by two methods with different sizes. A higher vertical axis value at the same subset size indicates more diverse BRs in the subset.
  • Figure 3: Visualization of high-level behaviors of human proxy agents, different self-play populations, our evaluation partner candidates, and evaluation partners in Overcooked layouts.
  • Figure 4: BR-Prox performance with 95% confidence intervals of ZSC algorithms with different population sizes in Overcooked. '12\25', '24\50' and '36\75' mean that co-play methods (FCP, MEP, TrajeDi and HSP) are trained with populations of 12, 24 and 36 and that the evolution method (COLE) is trained with populations of 25, 50 and 75. Note that SP and E3T are not population-based.
  • Figure 5: BR-Prox performance of ZSC algorithms in Overcooked with multiple recipes.
  • ...and 21 more figures