Explore the Reasoning Capability of LLMs in the Chess Testbed

Shu Wang, Lei Ji, Renxi Wang, Wenxiao Zhao, Haokun Liu, Yifan Hou, Ying Nian Wu

TL;DR

This work investigates the reasoning capabilities of large language models in chess, treating chess as a testbed for long-term planning and short-term tactical analysis. It introduces MATE, a dataset of roughly 1 million chess positions annotated with expert strategy and tactic descriptions, and finetunes the open-source LLaMA-3-8B model to compare against commercial LLMs. The results show that providing language-based explanations and integrating strategy and tactic markedly improve move selection, with the MATE-ST setup (strategy and tactic combined) delivering the strongest performance. The findings suggest that language-based reasoning can augment chess play and may generalize to other complex, multi-step tasks.

Abstract

Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactic. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
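To make the move-selection task concrete, here is a minimal sketch of how one annotated position could be turned into a prompt and how predictions could be scored. It is an illustration only, not the authors' data format or evaluation code: the `Candidate` and `Sample` classes, their field names, the use of FEN and UCI notation, the two-candidate setup, and the prompt wording are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Candidate:
    """One candidate move with optional language annotations (hypothetical schema)."""
    move: str           # move in UCI notation, e.g. "e2e4" (assumed encoding)
    strategy: str = ""  # long-term explanation, empty if not annotated
    tactic: str = ""    # short-term explanation, empty if not annotated


@dataclass
class Sample:
    """One position with two candidate moves (hypothetical schema)."""
    fen: str                                # board position as a FEN string (assumed)
    candidates: Tuple[Candidate, Candidate]
    better: int                             # index (0 or 1) of the better move


LABELS = "AB"


def build_prompt(sample: Sample) -> str:
    """Render a sample as a text prompt asking which candidate is better."""
    lines = [f"Position (FEN): {sample.fen}", "Candidate moves:"]
    for label, cand in zip(LABELS, sample.candidates):
        lines.append(f"{label}. {cand.move}")
        if cand.strategy:
            lines.append(f"   Strategy: {cand.strategy}")
        if cand.tactic:
            lines.append(f"   Tactic: {cand.tactic}")
    lines.append("Which move is better? Answer A or B.")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], samples: Sequence[Sample]) -> float:
    """Fraction of samples where the predicted label matches the better move."""
    if not samples:
        return 0.0
    correct = sum(
        pred.strip().upper() == LABELS[s.better]
        for pred, s in zip(predictions, samples)
    )
    return correct / len(samples)


# Illustrative sample: the annotation text is made up for this sketch.
example = Sample(
    fen="rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    candidates=(
        Candidate("e2e4", strategy="Takes more space in the center."),
        Candidate("a2a3", strategy="Does little for central control."),
    ),
    better=0,
)

if __name__ == "__main__":
    print(build_prompt(example))
    print(accuracy(["A"], [example]))
```

Leaving the `strategy` or `tactic` field empty drops that line from the prompt, which gives one simple way to compare strategy-only, tactic-only, and combined inputs under the same scoring function.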

Paper Structure

This paper contains 31 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Strategy and Tactic. (a) White's E2 pawn moves to E4, takes more space in the center, and exerts pressure on Black. Black will have a hard time developing its pieces. (b) White's E2 bishop moves to F3 and pins the knight on C6. The black knight cannot move, or the A8 rook behind it will be taken. White will take the black knight for free on the next move.
  • Figure 2: A data example in MATE-Strategy&Tactic.
  • Figure 3: Dataset Summary. (a) Distribution of samples across the MATE subsets. (b) Distribution of strategies in MATE. (c) Distribution of tactics in MATE.
  • Figure 4: Case Study: Claude 3.5 Sonnet.
  • Figure 5: Case Study: o1-preview.
  • ...and 1 more figures