VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon

Cameron Angliss, Jiaxun Cui, Jiaheng Hu, Arrasy Rahman, Peter Stone

TL;DR

VGC-Bench addresses the challenge of generalizing AI policies across the vast space of Pokémon VGC team configurations. The authors formalize Pokémon VGC as a two-player zero-sum partially observable stochastic game (POSG), build a benchmark backed by a human-play dataset of over 700,000 battle logs, and provide baselines spanning heuristics, LLM-based agents, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods. The experiments expose a trade-off: an agent trained on a single team performs well in mirror matches, but as the number of training teams grows, the best single-team algorithm performs worse and becomes more exploitable, even as its generalization to unseen teams improves. The work releases open-source infrastructure (PettingZoo integration, dataset, and baselines) and outlines concrete directions (generalization to many teams, opponent modeling, and model-based search) for robust multi-agent policy learning in highly combinatorial domains. Overall, VGC-Bench offers a reproducible platform for progress toward generalizable AI in competitive Pokémon battles.

Abstract

Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately $10^{139}$ team configurations, far larger than those of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent's team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human-play dataset of over 700,000 battle logs and a range of baseline agents based on heuristics, large language models, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that as the number of teams increases, the best-performing algorithm in the single-team setting has worse performance and is more exploitable, but has improved generalization to unseen teams. Our code and dataset are open-sourced at https://github.com/cameronangliss/vgc-bench and https://huggingface.co/datasets/cameronangliss/vgc-battle-logs.
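
The abstract names fictitious play among the empirical game-theoretic methods. As a rough illustration of the idea only, and not the authors' implementation, the sketch below runs classical fictitious play on a small zero-sum matrix game. The function name `fictitious_play`, the rock-paper-scissors payoff matrix, and the iteration count are all assumptions chosen for the example; in the benchmark the "actions" would be learned policies rather than three fixed moves.

```python
def fictitious_play(payoff, iterations=20000):
    """Approximate equilibrium strategies of a zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column
    plays j; the column player receives the negative of that value.
    Returns the empirical action frequencies of both players.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Start from an arbitrary first action for each player (an assumption).
    row_counts = [1] + [0] * (n_rows - 1)
    col_counts = [1] + [0] * (n_cols - 1)

    for _ in range(iterations):
        # Value of each row action against the column player's history.
        row_values = [
            sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
            for i in range(n_rows)
        ]
        # Row payoff conceded by each column action against the row history.
        col_values = [
            sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        # Each player best-responds to the opponent's empirical mixture.
        best_row = max(range(n_rows), key=row_values.__getitem__)
        best_col = min(range(n_cols), key=col_values.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1

    row_total, col_total = sum(row_counts), sum(col_counts)
    return (
        [c / row_total for c in row_counts],
        [c / col_total for c in col_counts],
    )


# Toy example: rock-paper-scissors from the row player's perspective.
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
row_strategy, col_strategy = fictitious_play(RPS)
print(row_strategy, col_strategy)
```

In this toy game the empirical frequencies drift toward the uniform mixture, which is the unique equilibrium of rock-paper-scissors. Self-play and double oracle differ in which opponents each new policy is trained against, but they share this loop of repeatedly responding to a population of earlier strategies.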

Paper Structure

This paper contains 34 sections, 13 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: VGC-Bench Overview. VGC-Bench captures the multi-agent multi-team dynamics with PettingZoo integration, provides human-play datasets and a range of baselines, and standardizes evaluation protocols.
  • Figure 2: Pokémon Showdown Gameplay. Closer Pokémon are on the agent's side, and farther Pokémon are on the opponent's side. 1) Pokémon's health bar with percentage of current health remaining. 2) Current status of all party members, with solid colors for revealed, translucent colors for unrevealed, and greyed-out colors for fainted. 3) Effects on Pokémon, including boosts and status conditions. 4) Active side conditions and global fields/weather with a number of turns remaining. 5) Active Tera type being used by Pokémon.
  • Figure 3: Cross-Play Win Rate Heatmaps for varying team set sizes.
  • Figure 4: A randomly initialized exploiter agent trains to exploit the strongest agent, as determined in Table \ref{tab:alpharank}.
  • Figure 5: An exploiter agent initialized from the behavior cloning agent trains to exploit the strongest agent, as determined in Table \ref{tab:alpharank}.
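
The cross-play heatmaps in Figure 3 report how often each agent beats each other agent. As a minimal sketch of how such a table could be assembled from match records, the code below aggregates wins into a nested win-rate table. The record format `(row_agent, col_agent, row_won)`, the helper names, and the agent labels are hypothetical, and the records are made-up toy data rather than results from the paper. The `worst_case_win_rate` helper is only a crude proxy over a fixed agent pool; the paper's exploitability analysis instead trains a dedicated exploiter agent (Figures 4 and 5).

```python
from collections import defaultdict


def cross_play_win_rates(match_results):
    """Build win_rate[row_agent][col_agent] from match records.

    Each record is a (row_agent, col_agent, row_won) tuple, where row_won
    is True when the row agent won that battle.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for row_agent, col_agent, row_won in match_results:
        games[(row_agent, col_agent)] += 1
        wins[(row_agent, col_agent)] += int(row_won)

    table = defaultdict(dict)
    for (row_agent, col_agent), played in games.items():
        table[row_agent][col_agent] = wins[(row_agent, col_agent)] / played
    return {row: dict(cols) for row, cols in table.items()}


def worst_case_win_rate(table, agent):
    """Lowest win rate the agent achieves against any opponent in the table."""
    return min(table[agent].values())


# Made-up toy records for two hypothetical agents.
toy_results = [
    ("agent_a", "agent_b", True),
    ("agent_a", "agent_b", True),
    ("agent_a", "agent_b", False),
    ("agent_a", "agent_b", True),
    ("agent_b", "agent_a", False),
    ("agent_b", "agent_a", True),
]
table = cross_play_win_rates(toy_results)
print(table)
print(worst_case_win_rate(table, "agent_a"))
```

Each row of the resulting table corresponds to one row of a heatmap, so one table per team-set size would yield the kind of side-by-side comparison the figure describes.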