Table of Contents
Fetching ...

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

Lingfeng Li, Yunlong Lu, Yuefei Zhang, Jingyu Yao, Yixin Zhu, KeYuan Cheng, Yongyi Wang, Qirui Zheng, Xionghui Yang, Wenxin Li

TL;DR

It is shown that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability, establishing a scalable and reusable framework for assessing interactive AI capabilities.

Abstract

Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.

BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors

TL;DR

It is shown that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability, establishing a scalable and reusable framework for assessing interactive AI capabilities.

Abstract

Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making, yet systematic evaluation of these capabilities remains challenging. Existing benchmarks for LLMs primarily assess static reasoning through isolated tasks and fail to capture dynamic strategic abilities. Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools, incurring quadratic computational costs and lacking stable performance anchors for longitudinal tracking. The central challenge is establishing a scalable evaluation framework that measures LLM strategic reasoning against consistent, interpretable standards rather than volatile peer models. Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement with stable cross-temporal interpretability. Built on the Botzone platform's established competitive infrastructure, our BotzoneBench evaluates LLMs across eight diverse games spanning deterministic perfect-information board games to stochastic imperfect-information card games. Through systematic assessment of 177,047 state-action pairs from five flagship models, we reveal significant performance disparities and identify distinct strategic behaviors, with top-performing models achieving proficiency comparable to mid-to-high-tier specialized game AI in multiple domains. This anchored evaluation paradigm generalizes beyond games to any domain with well-defined skill hierarchies, establishing a scalable and reusable framework for assessing interactive AI capabilities.
Paper Structure (37 sections, 2 figures, 20 tables)

This paper contains 37 sections, 2 figures, 20 tables.

Figures (2)

  • Figure 1: Anchored evaluation in BotzoneBench compared to existing game benchmarks. Existing benchmarks employ llm-vs-llm tournaments requiring $O(N^2)$ pairwise evaluations with relative rankings. BotzoneBench anchors evaluation to fixed skill-calibrated bot hierarchies, enabling $O(N)$ scalable assessment with absolute ratings across eight games spanning deterministic perfect-information (Tic-Tac-Toe, Gomoku, Reversi, Chess, Ataxx) and stochastic imperfect-information settings (Texas Hold'em, Fight the Landlord, Mahjong).
  • Figure 2: Prompt template for llm-based bots.Underscored text is dynamically generated for each state. The system prompt provides static game rules and output format requirements, while the user prompt delivers dynamic turn-specific information: player identity, current game state, action history, and legal moves. This design supplies all essential context while leaving strategic reasoning entirely to llm.