Efficient Reinforcement Learning for Zero-Shot Coordination in Evolving Games

Bingyu Hui, Lebin Yu, Quanming Yao, Yunpeng Qu, Xudong Zhang, Jian Wang

TL;DR

This work tackles zero-shot coordination in evolving multi-agent environments by introducing ScaPT, a scalable population training framework. It combines a hierarchical meta-agent that shares parameters to simulate large populations with a mutual information-based diversity regularizer, enabling efficient and diverse population training. Instantiated in a value-based RL setting and evaluated on matrix games and Hanabi, ScaPT consistently outperforms prior methods, especially as population size grows under fixed computational budgets. The findings underscore the critical role of scalable population design and diversity-promoting objectives for robust zero-shot coordination in complex, evolving games.

Abstract

Zero-shot coordination (ZSC), a key challenge in multi-agent game theory, has recently become a hot topic in reinforcement learning (RL) research, especially in complex evolving games. It focuses on the generalization ability of agents, requiring them to coordinate well, without any fine-tuning, with collaborators drawn from a diverse and potentially evolving pool of previously unseen partners. Population-based training, which approximates such an evolving partner pool, has been shown to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources and mainly focus on optimizing diversity in small populations, neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes Scalable Population Training (ScaPT), an efficient RL training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it alongside representative frameworks in the cooperative game Hanabi and confirms its superiority.

Paper Structure

This paper contains 31 sections, 2 theorems, 10 equations, 31 figures, 5 tables, 2 algorithms.

Key Result

Theorem 1

Given $F(u_j,h_j,a_j)$, if $F$ is updated to $F'$ such that: then the corresponding term $I_j$ in $\hat{I}(A;U|H)$ is updated to $I'_j$ and satisfies $I'_j \geq I_j$.
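
To make the quantity $I(A;U|H)$ concrete, the sketch below computes the exact conditional mutual information for a small tabular population. It is an illustration under stated assumptions, not the paper's estimator $\hat{I}$: it assumes $U$ is an agent index drawn uniformly and independently of $H$, $H$ is a discrete history, and $A$ is a discrete action. The function name `conditional_mutual_information` and its array layout are hypothetical.

```python
import numpy as np


def conditional_mutual_information(policies, h_probs=None, eps=1e-12):
    """Exact I(A;U|H) in nats for tabular policies (illustrative only).

    policies[u, h, a] = pi_u(a | h), shape (n_agents, n_histories, n_actions).
    h_probs is a distribution over histories (uniform if None).
    Assumption: U is uniform over agents and independent of H.
    """
    policies = np.asarray(policies, dtype=float)
    n_agents, n_hist, _ = policies.shape
    if h_probs is None:
        h_probs = np.full(n_hist, 1.0 / n_hist)
    h_probs = np.asarray(h_probs, dtype=float)

    # Population-average policy: p(a | h) = mean over u of pi_u(a | h).
    mean_policy = policies.mean(axis=0)

    # KL(pi_u(. | h) || mean policy) for every agent and history.
    log_ratio = np.log((policies + eps) / (mean_policy[None, :, :] + eps))
    kl = (policies * log_ratio).sum(axis=2)

    # Average over the uniform agent index, then weight by p(h).
    return float((h_probs * kl.mean(axis=0)).sum())


# Two agents that always pick different actions are maximally distinguishable.
distinct = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
# Three agents with the same uniform policy carry no information about U.
identical = np.full((3, 2, 4), 0.25)

mi_distinct = conditional_mutual_information(distinct)
mi_identical = conditional_mutual_information(identical)
```

Under these assumptions the value is zero when all agents act identically and reaches $\log 2$ nats for two agents with disjoint deterministic actions, which is the sense in which a larger $I(A;U|H)$ corresponds to a more diverse population.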

Figures (31)

  • Figure 1: The diagram of different training paradigms in evolving games.
  • Figure 2: Comparison of a common population and a hierarchical meta-agent population.
  • Figure 3: Results for matrix dimension = 50 with different population sizes.
  • Figure 4: Comparison of a common population and a meta-agent population: (a) training time as population size increases; (b) resource consumption as population size increases; (c) sum of 1ZSC-XP scores over 40 experiments with different architectures using the MEP method.
  • Figure 5: Detailed pair-wise 1ZSC-XP scores of TrajeDi, MEP and CMIMP. Darker colors represent higher scores; each row shows the coordination scores of one main agent paired with each of 40 non-ZSC agents, giving a $5\times40$ heat map.
  • ...and 26 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • proof