ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao

TL;DR

ChessArena presents a chess-centric benchmark to probe the strategic reasoning of large language models through multi-mode play, a Glicko-based leaderboard, and fine-grained tasks that dissect basic understanding, move selection, and puzzle solving. It demonstrates substantial gaps between contemporary LLMs and a human-amateur chess engine (Maia-1100), and shows that post-training with chess-focused data and reinforcement learning can yield meaningful performance gains, particularly for non-thinking models. The work combines a scalable evaluation platform with a data-driven training pipeline (SFT and GRPO) to generate a stronger chess-aware LLM (Qwen3-8B-Chess) and provides rich datasets for ongoing research. Overall, ChessArena offers a practical, extensible framework to study and advance strategic reasoning in LLMs and to collect high-quality reasoning data for domain-specific training.

Abstract

Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine reasoning skills, particularly complex strategic reasoning, or are they primarily excelling at sophisticated pattern recognition within their training data? To address this question, this paper presents a chess testbed, ChessArena, to evaluate the strategic reasoning capabilities of LLMs. Chess requires complex strategic reasoning capabilities, including long-term planning, strict rule comprehension, and multi-turn conversation memorization. Specifically, ChessArena is a competitive framework in which LLMs play against each other under four different play modes. The testbed is equipped with a ranking algorithm and a leaderboard, and it can also evaluate fine-grained capabilities including basic understanding, move selection, and puzzle solving. Over 13 LLMs with different modes are evaluated in ChessArena, playing over 800 games. The results reveal significant shortcomings in current LLMs: no model can beat Maia-1100 (a chess engine at human amateur level), and some even fail to defeat a random player that selects moves arbitrarily. We also present a strong baseline for the testbed: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
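
To make the ranking step concrete, the following is a minimal sketch of a single Glicko-1 rating update, computed from a player's results over one rating period. The summary above only says the leaderboard is Glicko-based. Treating it as plain Glicko-1 is an assumption made here for illustration, and the function names (g, expected_score, glicko1_update) are hypothetical rather than taken from ChessArena.

```python
import math

# Glicko-1 scaling constant q = ln(10) / 400.
Q = math.log(10) / 400


def g(rd):
    """Attenuation factor: discounts results against opponents with uncertain ratings."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def expected_score(r, r_opp, rd_opp):
    """Expected score of a player rated r against an opponent (r_opp, rd_opp)."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko1_update(r, rd, results):
    """Return the updated (rating, rating deviation) after one rating period.

    results is a list of (opponent_rating, opponent_rd, score) tuples,
    where score is 1.0 for a win, 0.5 for a draw, and 0.0 for a loss.
    Assumption: plain Glicko-1, without the between-period RD inflation step.
    """
    if not results:
        return r, rd
    inv_d2 = 0.0   # accumulates 1 / d^2
    delta = 0.0    # accumulates sum of g_j * (s_j - E_j)
    for r_opp, rd_opp, score in results:
        g_j = g(rd_opp)
        e_j = expected_score(r, r_opp, rd_opp)
        inv_d2 += Q**2 * g_j**2 * e_j * (1.0 - e_j)
        delta += g_j * (score - e_j)
    denom = 1.0 / rd**2 + inv_d2
    return r + (Q / denom) * delta, math.sqrt(1.0 / denom)


# One win against a lower-rated opponent and two losses against higher-rated ones.
new_r, new_rd = glicko1_update(
    1500.0, 200.0,
    [(1400.0, 30.0, 1.0), (1550.0, 100.0, 0.0), (1700.0, 300.0, 0.0)],
)
print(round(new_r), round(new_rd, 1))
```

With these inputs the rating drops to roughly 1464 and the rating deviation shrinks from 200 to about 151. This shows why a model needs "a certain number of competitions" before its rating is considered reliable: each game reduces the uncertainty attached to the rating.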

Paper Structure

This paper contains 62 sections, 16 equations, 9 figures, 19 tables.

Figures (9)

  • Figure 1: Overview of ChessArena competitions, fine-grained evaluation, and ChessLLM training. (1) An LLM can be integrated into ChessArena to compete against other models. After a certain number of competitions, each model is assigned a reliable Glicko rating and added to the leaderboard. (2) Three additional evaluation tasks are integrated into ChessArena to evaluate the chess capabilities at a fine-grained level. (3) We can extract high-quality chess reasoning data from the gameplay process, which can be used for training an LLM specially for chess.
  • Figure 2: Input prompt format for Blitz and Standard chess competition. Whether to provide legal moves is optional.
  • Figure 3: Input prompt format for Bullet chess competition. Whether to provide legal moves is optional. Thinking is forbidden.
  • Figure 4: Input prompt format for Blindfold chess competition. Whether to provide legal moves is optional. This is a multi-round conversation template. LLMs should reconstruct the chessboard from the conversation history.
  • Figure 5: Input prompt format for basic understanding.
  • ...and 4 more figures