Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach
Nhat-Minh Huynh, Hoang-Giang Cao, I-Chen Wu
TL;DR
The paper addresses the difficulty of training multi-agent systems for Pommerman under sparse rewards and delayed action effects with a two-stage framework: Curriculum Learning, whose three progressive phases teach core skills, followed by Population-based Self-play, which evolves strategies within a diverse agent population. The first key contribution is adaptive annealing of the dense exploration reward. The training reward is $r_t = \alpha_t e_t + (1-\alpha_t) R$, where $e_t$ is the dense exploration reward, $R$ is the sparse game reward, and $\alpha_t = 1 - \tanh(k x)$ with $k=1.2$. As agent performance $x$ (measured by enemy deaths) improves, $\alpha_t$ decreases and the emphasis shifts from exploration to the sparse game reward, enabling self-directed strategy development. The second contribution is Elo-based matchmaking: the expected score of agent $A$ against agent $B$ is $E_A = \frac{1}{1+10^{(R_B-R_A)/400}}$, ratings are updated as $R'_A = R_A + K(S_A - E_A)$ with $S_A$ the actual match score, and softmax-derived pairing probabilities bias opponent selection toward stronger agents to foster progressive learning. Experimental results show that the trained agent, which uses no communication among allies, can outperform top learning agents and several rule-based baselines in Pommerman, achieving high win rates against historical benchmarks and approaching the strength of strong tree-search-based opponents. These results highlight the practical impact of population-based self-play and adaptive exploration in complex multi-agent domains.
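The annealed reward described above can be illustrated with a minimal sketch. It assumes that the performance measure $x$ is a non-negative scalar (for example, an average count of enemy deaths); the summary does not say how $x$ is computed or normalized, so that choice, along with the function names, is hypothetical.

```python
import math


def annealing_factor(x: float, k: float = 1.2) -> float:
    """Return alpha = 1 - tanh(k * x).

    x is the agent's performance measure (assumed non-negative here);
    k = 1.2 is the value quoted in the summary.
    """
    return 1.0 - math.tanh(k * x)


def shaped_reward(exploration_reward: float, game_reward: float,
                  x: float, k: float = 1.2) -> float:
    """Blend the dense exploration reward e_t with the sparse game reward R.

    Implements r_t = alpha * e_t + (1 - alpha) * R.
    """
    alpha = annealing_factor(x, k)
    return alpha * exploration_reward + (1.0 - alpha) * game_reward


if __name__ == "__main__":
    # Illustrative values only: watch the weight move from e_t to R as x grows.
    for x in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(x, annealing_factor(x), shaped_reward(0.5, 1.0, x))
```

With $x = 0$ the factor is exactly 1, so the agent receives only the exploration reward; as $x$ grows the factor decays toward 0 and the sparse game reward dominates.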
Abstract
Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. It is an ideal benchmark for multi-agent training, providing a battleground for two teams in which allied agents can communicate. Pommerman poses significant challenges for model-free reinforcement learning because of delayed action effects, sparse rewards, and false positives, in which an agent is credited with a win that resulted from the opponents' own mistakes. This study introduces a system that trains multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems that arise when deploying a multi-agent training system for competitive games: sparse rewards and the design of a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor, based on agents' performance, that dynamically adjusts the dense exploration reward during training. Additionally, we implement a matchmaking mechanism that uses the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
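The Elo-based matchmaking can be sketched in a few lines. The expected-score and update formulas are the standard Elo ones quoted in the TL;DR; everything else is an assumption made for illustration: the $K$ value of 32, the softmax taken directly over ratings, the temperature of 400, and all function names. The paper's actual pairing rule may differ.

```python
import math
import random


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))


def update_rating(r_a: float, r_b: float, score_a: float,
                  k_factor: float = 32.0) -> float:
    """Return A's new rating; score_a is 1 for a win, 0.5 draw, 0 loss.

    k_factor = 32 is an assumed value, not taken from the paper.
    """
    return r_a + k_factor * (score_a - expected_score(r_a, r_b))


def opponent_probabilities(ratings: dict, temperature: float = 400.0) -> dict:
    """Softmax over ratings, so stronger agents are picked more often.

    The temperature is a hypothetical knob controlling how strong the bias is.
    """
    top = max(ratings.values())  # subtract the max for numerical stability
    weights = {name: math.exp((r - top) / temperature)
               for name, r in ratings.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_opponent(ratings: dict, rng: random.Random,
                    temperature: float = 400.0) -> str:
    """Draw one opponent name according to the softmax probabilities."""
    probs = opponent_probabilities(ratings, temperature)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


if __name__ == "__main__":
    pool = {"agent_a": 1400.0, "agent_b": 1500.0, "agent_c": 1650.0}
    print(opponent_probabilities(pool))
    print(sample_opponent(pool, random.Random(0)))
```

A lower temperature concentrates play on the strongest members of the population, while a higher one keeps pairings closer to uniform and preserves diversity.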
