Complete Chess Games Enable LLM Become A Chess Master

Yinqi Zhang, Xintian Han, Haolong Li, Kedi Chen, Shaohui Lin

TL;DR

This work reframes chess as a text-based game for large language models: positions are encoded in Forsyth-Edwards Notation (FEN) and moves are generated as text. A large-scale dataset of FEN-BestMove pairs, built from Stockfish evaluations, enables supervised fine-tuning of ChessLLM (open-llama-3B) to play complete games. ChessLLM achieves an Elo rating of $1788$ against Stockfish when allowed to sample up to 10 times, and it produces legal moves reliably (Pass@1 > $90\%$). Training on long-round data adds roughly $350$ Elo over short-round data. The study evaluates the model both in actual games and on an auxiliary evaluation set, and shows that data quality and quantity strongly drive performance. It also compares ChessLLM with other models and identifies RLHF and self-play as directions for future work. The approach suggests a practical route for extending LLMs to strategic, abstract domains beyond natural language tasks, with potential implications for AI-assisted decision-making in complex games.
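
To make the FEN-BestMove idea concrete, here is a minimal Python sketch of what one such text pair could look like. The record layout (`input`/`target`), the helper names, and the use of a UCI-style move string are assumptions for illustration; this summary does not specify the paper's actual data format, and the move below is a placeholder rather than real Stockfish output.

```python
# Illustrative sketch only: field names and move notation are assumptions,
# not the paper's actual data format.

# Standard chess starting position in Forsyth-Edwards Notation (FEN).
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def expand_fen_board(fen: str) -> list[str]:
    """Expand the piece-placement field of a FEN string into 8 rows of 8 characters."""
    placement = fen.split()[0]
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit encodes that many consecutive empty squares.
            row += "." * int(ch) if ch.isdigit() else ch
        if len(row) != 8:
            raise ValueError(f"bad rank in FEN: {rank!r}")
        rows.append(row)
    if len(rows) != 8:
        raise ValueError("FEN must describe 8 ranks")
    return rows


def make_training_pair(fen: str, best_move: str) -> dict[str, str]:
    """Pair a FEN position (model input) with a best move as text (target)."""
    return {"input": fen, "target": best_move}


board = expand_fen_board(START_FEN)
# "e2e4" is a placeholder label, not an actual engine evaluation.
pair = make_training_pair(START_FEN, "e2e4")
```

The point of the sketch is that both the position and the move are plain strings, so the task fits ordinary supervised fine-tuning on text.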

Abstract

Large language models (LLMs) have shown remarkable abilities in text generation, question answering, language translation, reasoning, and many other tasks. They continue to advance rapidly and are becoming increasingly influential in fields ranging from technology and business to education and entertainment. Despite this success, the ability of LLMs to play abstract games such as chess is underexplored. Playing chess requires a language model to output legal and reasonable moves from textual inputs. Here, we propose ChessLLM, a large language model that plays full chess games. We transform the game into a textual format in which each position is encoded in Forsyth-Edwards Notation (FEN) and paired with its best move. We show that, with supervised fine-tuning alone, our model achieves a professional-level Elo rating of 1788 in matches against the standard Elo-rated Stockfish when permitted to sample 10 times. We further show that data quality matters: supervision on long-round data yields a 350-point Elo improvement over short-round data.
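
For readers unfamiliar with Elo ratings, the following Python sketch shows the standard Elo expected-score formula and its inverse, which converts a match score against an opponent of known rating into a rating estimate. It is a generic illustration under that assumption. The function names are hypothetical, and this summary does not state how the paper itself computes the 1788 figure.

```python
import math


def expected_score(rating: float, opponent_rating: float) -> float:
    """Standard Elo expected score (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def performance_rating(score_fraction: float, opponent_rating: float) -> float:
    """Invert the Elo formula: the rating whose expected score equals score_fraction.

    score_fraction is the average points per game and must lie strictly
    between 0 and 1, because the logarithm diverges at the endpoints.
    """
    if not 0.0 < score_fraction < 1.0:
        raise ValueError("score_fraction must be strictly between 0 and 1")
    return opponent_rating + 400.0 * math.log10(score_fraction / (1.0 - score_fraction))


# Scoring exactly 50% against an opponent implies the same rating as that opponent.
even_match = performance_rating(0.5, 1700.0)
```

Under this formula, a 350-point rating gap corresponds to the stronger side scoring roughly 88% of the available points.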

Paper Structure

This paper contains 31 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Left: Pass@1 increases with the number of training tokens, and increases further after long-round data is introduced. Right: Elo rating of ChessLLM as a function of the number of training tokens. Skill level indicates the level of Stockfish.
  • Figure 2: One example of training data.
  • Figure 3: Left: Best-move accuracy of ChessLLM trained on short-round data; accuracy increases with the number of training tokens. Right: Legal-move accuracy of ChessLLM trained on short-round data; accuracy increases with the number of training tokens.
  • Figure 4: Left: Correlation between ChessLLM's best-move accuracy and its Elo rating. Right: Correlation between ChessLLM's legal-move accuracy and its Elo rating.
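
The captions above refer to Pass@1 and legal-move accuracy, and the abstract mentions sampling up to 10 times. The Python sketch below shows one plausible reading of these ideas: resample until a legal move appears, and measure the fraction of positions whose first sample is legal. Both function names, the `sample_move` callable, and the retry policy are assumptions for illustration. The set of legal moves is supplied by the caller, for example from a chess library, and is not computed here.

```python
from typing import Callable, Iterable, Optional, Sequence


def first_legal_move(
    sample_move: Callable[[], str],
    legal_moves: Iterable[str],
    max_samples: int = 10,
) -> Optional[str]:
    """Draw up to max_samples candidate moves and return the first legal one.

    sample_move stands in for one stochastic generation from the model.
    Returns None if no sample is legal within the budget.
    """
    legal = set(legal_moves)
    for _ in range(max_samples):
        move = sample_move().strip()
        if move in legal:
            return move
    return None


def legal_move_accuracy(
    first_samples: Sequence[str],
    legal_move_sets: Sequence[Iterable[str]],
) -> float:
    """Fraction of positions whose first sampled move is legal (a Pass@1-style metric)."""
    if len(first_samples) != len(legal_move_sets):
        raise ValueError("need one legal-move set per sampled move")
    if not first_samples:
        raise ValueError("no positions to evaluate")
    hits = sum(
        1 for move, legal in zip(first_samples, legal_move_sets) if move in set(legal)
    )
    return hits / len(first_samples)


# Deterministic stand-in for a model: two illegal outputs, then a legal one.
fake_outputs = iter(["e2e5", "zz", "e2e4"])
chosen = first_legal_move(lambda: next(fake_outputs), ["e2e4", "d2d4"])
```

Keeping the legality check outside the model means the same helper works with any move notation, as long as the model output and the legal-move set use the same one.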