PokéChamp: an Expert-level Minimax Language Agent
Seth Karten, Andy Luu Nguyen, Chi Jin
TL;DR
PokéChamp presents an expert-level minimax language agent for Pokémon battles by replacing action sampling, opponent modeling, and leaf-value estimation with LLM-powered components within a one-step-lookahead world model. The approach enables planning under partial observability without task-specific training and is supported by a large real-player dataset derived from Pokémon Showdown. Evaluations across Gen 8–9 formats, puzzles, and online ladder play demonstrate state-of-the-art performance against both heuristic and LLM-based bots, as well as expert-level play against humans (Elo $\approx 1300$-$1500$). The work contributes a scalable framework for integrating LLMs with game-theoretic planning in complex multiagent settings and provides benchmarks and datasets to support future research in AI for strategic games.
Abstract
We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, Pokéllmon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30%-10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark for integrating LLM technologies with game-theoretic algorithms that address general multiagent problems. Videos, code, and the dataset are available at https://sites.google.com/view/pokechamp-llm.
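To make the three replaced modules concrete, the following is a minimal sketch of a one-step-lookahead minimax step with pluggable components. It is an illustration under stated assumptions, not the authors' implementation: the function name `one_step_minimax`, the callables `propose_actions`, `predict_opponent`, `transition`, and `evaluate`, and the toy payoff table are all hypothetical. In PokéChamp the first, second, and fourth roles would be filled by LLM calls; here they are plain Python stubs.

```python
from typing import Callable, Hashable, Sequence, Tuple

State = Hashable
Action = str


def one_step_minimax(
    state: State,
    propose_actions: Callable[[State], Sequence[Action]],    # stand-in for (1) player action sampling
    predict_opponent: Callable[[State], Sequence[Action]],   # stand-in for (2) opponent modeling
    transition: Callable[[State, Action, Action], State],    # hypothetical world model
    evaluate: Callable[[State], float],                      # stand-in for (3) value estimation
) -> Tuple[Action, float]:
    """Pick the action whose worst-case leaf value is highest.

    Assumes both candidate lists are non-empty.
    """
    best_action, best_value = None, float("-inf")
    for ours in propose_actions(state):
        # The opponent is assumed to pick the reply that is worst for us.
        worst = min(
            evaluate(transition(state, ours, theirs))
            for theirs in predict_opponent(state)
        )
        if worst > best_value:
            best_action, best_value = ours, worst
    return best_action, best_value


# Toy example: leaf values are invented numbers, not results from the paper.
PAYOFF = {
    ("attack", "attack"): 0.4,
    ("attack", "switch"): 0.7,
    ("switch", "attack"): 0.5,
    ("switch", "switch"): 0.6,
}

action, value = one_step_minimax(
    state="start",
    propose_actions=lambda s: ["attack", "switch"],
    predict_opponent=lambda s: ["attack", "switch"],
    transition=lambda s, a, b: (a, b),   # next state is just the joint action
    evaluate=lambda s: PAYOFF[s],
)
print(action, value)
```

In this toy table, "attack" has a worst case of 0.4 and "switch" a worst case of 0.5, so the maximin choice is "switch". Restricting both candidate lists to a few proposals is what keeps the search tractable; the abstract attributes that pruning to the LLM-based sampling and opponent modeling.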
