VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon

Cameron Angliss; Jiaxun Cui; Jiaheng Hu; Arrasy Rahman; Peter Stone

TL;DR

VGC-Bench targets the problem of generalizing AI policies across the enormous space of Pokémon VGC team configurations. The authors formalize Pokémon VGC as a two-player zero-sum partially observable stochastic game (POSG) and release a benchmark with a large human-play dataset and a suite of baselines: heuristic agents, LLM-based agents, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods. The experiments expose a clear generalization gap: an agent trained and evaluated on a single team in a mirror match performs well, but as the number of training teams increases, the same method performs worse and becomes more exploitable, even though it generalizes better to unseen teams. The work provides open-source infrastructure (PettingZoo integration, dataset, and baselines) and outlines concrete directions (generalization to many teams, opponent modeling, and model-based search) for advancing multi-agent policy learning in highly combinatorial domains. Overall, VGC-Bench offers a reproducible platform for progress toward superhuman, generalizable AI in competitive Pokémon battles.

Abstract

Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately $10^{139}$ team configurations, far larger than those of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human-play dataset of over 700,000 battle logs and a range of baseline agents based on heuristics, large language models, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that as the number of teams increases, the best-performing algorithm in the single-team setting has worse performance and is more exploitable, but has improved generalization to unseen teams. Our code and dataset are open-sourced at https://github.com/cameronangliss/vgc-bench and https://huggingface.co/datasets/cameronangliss/vgc-battle-logs.
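
To make the terms "fictitious play" and "exploitable" concrete, the sketch below runs fictitious play on a toy zero-sum matrix game (rock-paper-scissors) and measures exploitability as the sum of both players' best-response gains. This is only an illustration under stated assumptions: it is not the VGC-Bench implementation, the function names `fictitious_play` and `exploitability` are hypothetical, and a Pokémon VGC battle is a far larger, partially observable game in which best responses must be approximated by learned policies rather than computed exactly.

```python
import numpy as np


def fictitious_play(payoff, iterations=10000):
    """Fictitious play on a two-player zero-sum matrix game (toy illustration).

    payoff[i, j] is the row player's payoff; the column player gets -payoff[i, j].
    Returns the empirical (time-averaged) strategies of both players.
    """
    n_rows, n_cols = payoff.shape
    row_counts = np.zeros(n_rows)
    col_counts = np.zeros(n_cols)
    # Arbitrary opening moves so the first best responses are well defined.
    row_counts[0] += 1
    col_counts[0] += 1
    for _ in range(iterations):
        # Each player best-responds to the opponent's empirical action counts.
        row_br = np.argmax(payoff @ col_counts)
        col_br = np.argmin(row_counts @ payoff)
        row_counts[row_br] += 1
        col_counts[col_br] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()


def exploitability(payoff, row_strategy, col_strategy):
    """Sum of both players' best-response gains; zero at a Nash equilibrium."""
    best_row_value = np.max(payoff @ col_strategy)   # best the row player can get
    worst_row_value = np.min(row_strategy @ payoff)  # what a best-responding column player concedes
    return best_row_value - worst_row_value


# Rock-paper-scissors payoffs for the row player (rows/columns: rock, paper, scissors).
rps = np.array([[0, -1, 1],
                [1, 0, -1],
                [-1, 1, 0]])

# A fixed pure strategy is maximally exploitable in this game.
pure_rock = np.array([1.0, 0.0, 0.0])
print(exploitability(rps, pure_rock, pure_rock))  # 2.0

# The time-averaged fictitious-play strategies are far harder to exploit.
row_avg, col_avg = fictitious_play(rps)
print(row_avg, col_avg, exploitability(rps, row_avg, col_avg))
```

In this toy game the averaged strategies approach the uniform mix, so their exploitability shrinks toward zero as the number of iterations grows. The benchmark's finding that exploitability rises with the number of training teams is a statement about the same kind of quantity, measured in the much harder VGC setting.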
