The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

Paper Structure (117 sections, 33 equations, 26 figures, 5 tables)

Figures (26)

  • Figure 1: Game Benchmarks. Pokémon creates vast partially observable state spaces (see Appendix \ref{sec:state-space-derivation}). Data from \cite{tesauro1995temporal, silver2018general, silver2017mastering, brown2018superhuman, brown2019superhuman, billings2002challenge, vinyals2019grandmaster, openai2019dota, baker2022video, wangvoyager, acher2024gpt4chess, ma2026mixing, yan2025pokerbench, ma2024llmstarcraft}.
  • Figure 2: Pokémon Battling.
  • Figure 3: Baseline Performance. (Left) Agents vs. Humans: Official ratings on the Showdown ladder. Statistics from the Top 500 leaderboard are provided as a frame of reference for experienced human players. (Center/Right) RL vs. RL and LLM vs. LLM: GXE is measured relative only to methods within each plot. We differentiate between prior Metamon RL policies \cite{grigsby2025human} and baselines newly developed for this work; PC-Llama3.1-8B represents the original PokéChamp agent \cite{karten2025pok}.
  • Figure 4: Speedrunning Route (Early Game). Milestones from Littleroot Town (1) to Defeating Roxanne (15), with game frames from each waypoint. The geographic overview (right) maps key locations. Although progression appears linear, the route requires substantial exploration and backtracking: agents must revisit earlier areas, navigate branching paths, and manage nonlinear dependencies between objectives. We provide splits from the human world record as an upper bound.
  • Figure 5: Speedrunning Track Baseline Results. Cumulative wall-clock time, actions, tokens, and cost at each milestone for five frontier models (mean $\pm$ min/max range across runs). Gemini 3 Flash completes the route fastest ($\sim$2:24 mean) but requires more actions than Gemini 3 Pro. Claude Sonnet 4.5 completes all milestones but with the highest variance and 3--4$\times$ the cost of Gemini variants. GPT-5.2 falls between the two families in both time and cost.
  • ...and 21 more figures