ChessQA: Evaluating Large Language Models for Chess Understanding

Qianfeng Wen, Zhenwei Tang, Ashton Anderson

TL;DR

ChessQA introduces a dynamic, five-category benchmark for evaluating LLM chess understanding, spanning Structural, Motifs, Short Tactics, Position Judgment, and Semantic tasks. It combines sourced data (Lichess puzzles, evaluations, ChessBase commentary) with tooling built on python-chess, Stockfish, and FAISS to benchmark 15 contemporary LLMs across two modes of reasoning. The results reveal persistent weaknesses in every category, but clear gains from explicit, multi-move reasoning and from larger models. The work provides a versioned, extensible evaluation framework and a public leaderboard, so that progress in LLM chess understanding can be tracked and compared over time. By highlighting category-specific challenges such as Short Tactics and Position Judgment, ChessQA offers targeted diagnostics to guide future model development and evaluation. Overall, ChessQA offers a practical, reproducible way to measure reasoning, planning, and language understanding in a domain-specific LLM setting.

Abstract

Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.

Paper Structure

This paper contains 87 sections, 8 figures, 4 tables, 8 algorithms.

Figures (8)

  • Figure 1: ChessQA at a glance.
  • Figure 2: Task distribution in ChessQA.
  • Figure 3: The overall and per-category performance comparison. * denotes thinking enabled.
  • Figure 4: Breakdown of response evaluation results.
  • Figure 5: Performance comparison with respect to the number of tokens per problem. * denotes thinking enabled.
  • ...and 3 more figures