PokéChamp: an Expert-level Minimax Language Agent

Seth Karten, Andy Luu Nguyen, Chi Jin

TL;DR

PokéChamp presents an expert-level minimax language agent for Pokémon battles by replacing action sampling, opponent modeling, and leaf-value estimation with LLM-powered components within a one-step-lookahead world model. The approach enables planning under partial observability without task-specific training and is supported by a large real-player dataset derived from Pokémon Showdown. Comprehensive evaluations across Gen 8–9 formats, puzzles, and online ladder play demonstrate state-of-the-art performance against both heuristic and LLM-based bots, as well as expert-level play against humans (Elo $\approx 1300$-$1500$). The work contributes a scalable framework for integrating LLMs with game-theoretic planning in complex multiagent settings and provides benchmarks and datasets to propel future research in AI for strategic games.

Abstract

We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76% against the best existing LLM-based bot and 84% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, Pokéllmon powered by GPT-4o, with a 64% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30% to 10% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks and puzzles to evaluate specific battling skills. We further provide key updates to the local game engine. We hope this work fosters further research that leverages Pokémon battles as a benchmark for integrating LLM technologies with game-theoretic algorithms that address general multiagent problems. Videos, code, and the dataset are available at https://sites.google.com/view/pokechamp-llm.
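The three replaced modules can be pictured as pluggable functions inside an otherwise ordinary depth-limited minimax. The following is a minimal sketch under stated assumptions, not the authors' implementation: every name in it (`minimax_action`, `propose_player_actions`, `propose_opponent_actions`, `estimate_value`, and the `toy_*` helpers) is hypothetical, and the toy functions are hand-written stand-ins for what would be LLM calls in the paper's setting. The battle model (two HP totals, two moves) is likewise invented purely so the example runs.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, int]  # hypothetical toy state: (my_hp, opponent_hp)
Action = str


def minimax_action(
    state: State,
    depth: int,
    propose_player_actions: Callable[[State], List[Action]],    # module (1)
    propose_opponent_actions: Callable[[State], List[Action]],  # module (2)
    estimate_value: Callable[[State], float],                   # module (3)
    step: Callable[[State, Action, Action], State],
    is_terminal: Callable[[State], bool],
) -> Tuple[Optional[Action], float]:
    """Return (best action, minimax value) from the player's point of view."""
    if depth == 0 or is_terminal(state):
        return None, estimate_value(state)
    player_actions = propose_player_actions(state)
    opponent_actions = propose_opponent_actions(state)
    if not player_actions or not opponent_actions:
        # Nothing to search over: fall back to the value estimate.
        return None, estimate_value(state)
    best_action, best_value = None, float("-inf")
    for a in player_actions:
        # Assume the opponent answers with the reply that is worst for us.
        worst = float("inf")
        for b in opponent_actions:
            _, value = minimax_action(
                step(state, a, b), depth - 1,
                propose_player_actions, propose_opponent_actions,
                estimate_value, step, is_terminal,
            )
            worst = min(worst, value)
        if worst > best_value:
            best_action, best_value = a, worst
    return best_action, best_value


# Toy stand-ins (assumptions for illustration only, not real game rules).
def toy_actions(state: State) -> List[Action]:
    return ["attack", "heal"]


def toy_step(state: State, a: Action, b: Action) -> State:
    my_hp, opp_hp = state
    if a == "attack":
        opp_hp -= 30
    else:
        my_hp += 20
    if b == "attack":
        my_hp -= 30
    else:
        opp_hp += 20
    return my_hp, opp_hp


def toy_value(state: State) -> float:
    return float(state[0] - state[1])  # HP lead as a crude value estimate


def toy_terminal(state: State) -> bool:
    return state[0] <= 0 or state[1] <= 0


best = minimax_action((50, 50), 1, toy_actions, toy_actions,
                      toy_value, toy_step, toy_terminal)
print(best)
```

In this sketch the search cost grows with the number of proposed actions per side, which is why restricting both proposal functions to a few plausible candidates shrinks the tree, and why a value estimate at the depth cutoff avoids searching to the end of the battle.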

Paper Structure

This paper contains 42 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: PokéChamp reaches the 70th-90th percentile of players and a 1300-1500 Elo rating against real players. Higher Elo and percentile denote better performance.
  • Figure 2: PokéChamp uses one-step lookahead prompts to gain admissible heuristic information regarding the likely effect of actions under the current metagame.
  • Figure 3: An example of teambuilding for competitive Pokémon. A player must decide on six Pokémon for their team. For each Pokémon, a player must configure the item, ability, moves, stats (EVs/IVs), and nature.
  • Figure 4: PokéChamp replaces three components of minimax tree search with LLM-based generations: (1) sampling potential actions for the player, corresponding to the first part of the edge between states, (2) modeling the opponent and sampling opponent actions, corresponding to the second part of the edge between states, and (3) generating a potential game state value at the depth $K$ cutoff. PokéChamp provides the action with the best minimax value to be used in battle.
  • Figure 5: Left: Elo distribution for collected battles across game formats. Right: Relationship between game length and player Elo rating.
  • ...and 4 more figures