Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach

Nhat-Minh Huynh; Hoang-Giang Cao; I-Chen Wu

TL;DR

The paper addresses the difficulty of training multi-agent systems for Pommerman under sparse rewards and delayed action effects. It proposes a two-stage framework: Curriculum Learning, with three progressive phases for acquiring core skills, followed by Population-based Self-play, which evolves strategies within a diverse agent population. The first key component is adaptive annealing of the dense exploration reward. The training reward is $r_t = \alpha_t e_t + (1-\alpha_t) R$ with $\alpha_t = 1 - \tanh(k x)$ and $k=1.2$, where $e_t$ is the exploration reward, $R$ is the sparse game reward, and $x$ is agent performance measured by enemy deaths. As $x$ improves, $\alpha_t$ falls and the emphasis shifts gradually from exploration to the sparse game reward, which lets agents develop their own strategies. The second component is Elo-based matchmaking. The expected score of agent $A$ against agent $B$ is $E_A = \frac{1}{1+10^{(R_B-R_A)/400}}$, ratings are updated by $R'_A = R_A + K(S_A - E_A)$, and softmax-derived pairing probabilities bias opponent selection toward stronger agents to foster progressive learning. Experimental results show that the trained agent, which uses no communication between teammates, can outperform top learning agents and several rule-based baselines in Pommerman. It achieves high win rates against historical benchmarks and approaches the strength of tree-search-based opponents, which highlights the practical value of population-based self-play and adaptive exploration in complex multi-agent domains.
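The annealed reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `annealing_factor` and `shaped_reward` are hypothetical, and clamping $x$ to $[0, 2]$ is an assumption based on the Figure 2 caption, which says the factor is only calculated over that range.

```python
import math

K = 1.2  # steepness constant k reported in the summary


def annealing_factor(x: float, k: float = K) -> float:
    """Return alpha = 1 - tanh(k * x).

    x is the agent's performance (average enemy deaths).
    Clamping x to [0, 2] is an assumption taken from the Figure 2 caption.
    """
    x = min(max(x, 0.0), 2.0)
    return 1.0 - math.tanh(k * x)


def shaped_reward(exploration_reward: float, game_reward: float,
                  x: float, k: float = K) -> float:
    """Blend the two rewards: r = alpha * e + (1 - alpha) * R."""
    alpha = annealing_factor(x, k)
    return alpha * exploration_reward + (1.0 - alpha) * game_reward


if __name__ == "__main__":
    # Print how the weight on exploration decays as performance improves.
    for x in (0.0, 0.5, 1.0, 1.5, 2.0):
        print(f"x={x:.1f}  alpha={annealing_factor(x):.4f}")
```

With no enemy deaths ($x = 0$) the factor is 1, so the agent is driven entirely by the exploration reward. As $x$ approaches 2 the factor approaches 0 and the sparse game reward dominates.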

Abstract

Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. It is an ideal benchmark for multi-agent training, providing a battleground for two teams with communication capabilities among allied agents. Pommerman presents significant challenges for model-free reinforcement learning due to delayed action effects, sparse rewards, and false positives, where opponent players can lose due to their own mistakes. This study introduces a system that trains multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems that arise when deploying a multi-agent training system for competitive games: sparse rewards and the need for a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor, based on agents' performance, that dynamically adjusts the dense exploration reward during training. Additionally, we implement a matchmaking mechanism that uses the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
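The Elo-based matchmaking can be illustrated with a short Python sketch. The expected-score and update formulas are the standard Elo ones, $E_A = 1/(1+10^{(R_B-R_A)/400})$ and $R'_A = R_A + K(S_A - E_A)$. Everything else is an assumption made for illustration: the value `k=32.0`, the `temperature` parameter, the choice to apply the softmax directly to ratings, and all function names are hypothetical and not taken from the paper.

```python
import math
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_rating(r_a: float, s_a: float, e_a: float, k: float = 32.0) -> float:
    """Elo update R'_A = R_A + K * (S_A - E_A). k=32 is an assumed value."""
    return r_a + k * (s_a - e_a)


def opponent_probabilities(ratings: list[float],
                           temperature: float = 400.0) -> list[float]:
    """Softmax over ratings so stronger agents are sampled more often.

    Applying the softmax to raw ratings with this temperature is an
    assumption; the summary only says the probabilities are softmax-derived.
    """
    top = max(ratings)  # subtract the max for numerical stability
    weights = [math.exp((r - top) / temperature) for r in ratings]
    total = sum(weights)
    return [w / total for w in weights]


def sample_opponent(ratings: list[float], rng: random.Random) -> int:
    """Draw an opponent index according to the softmax probabilities."""
    probs = opponent_probabilities(ratings)
    return rng.choices(range(len(ratings)), weights=probs, k=1)[0]


if __name__ == "__main__":
    population = [1400.0, 1500.0, 1600.0]
    print(opponent_probabilities(population))
    print(sample_opponent(population, random.Random(0)))
```

A lower `temperature` concentrates the pairing on the strongest agents, while a higher one moves the distribution toward uniform sampling.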

Paper Structure (12 sections, 4 equations, 7 figures, 2 tables)

Introduction
Approaches
  Curriculum Learning Stage
    Curriculum Learning with Incremental Difficulty Agents
    Adaptive Exploration Reward by Performance
  Population-based Self-play Stage
    Population-based Self-play System
    Match Making Probability
Experiment Results
  Curriculum Learning Stage
  Self-play Stage
Conclusions

Figures (7)

Figure 1: Overview of our multi-agent training system with two stages: curriculum learning and population-based self-play.
Figure 2: Annealing factor $\alpha$ as a function of performance, with $k = 1.2$. Note that the annealing factor is only calculated over the range [0, 2], which is the region marked by the red square.
Figure 3: Average of Enemy Deaths.
Figure 4: Exploration reward annealing during the training process.
Figure 5: Average number of enemy deaths when using a linear annealing factor.
...and 2 more figures
