Tracking vs. Deciding: The Dual-Capability Bottleneck in Searchless Chess Transformers

Quanhao Li, Wei Jiang

Abstract

A human-like chess engine should mimic the style, errors, and consistency of a strong human player rather than maximize playing strength. We show that training from move sequences alone forces a model to learn two capabilities: state tracking, which reconstructs the board from move history, and decision quality, which selects good moves from that reconstructed state. These impose contradictory data requirements: low-rated games provide the diversity needed for tracking, while high-rated games provide the quality signal for decision learning. Removing low-rated data degrades performance. We formalize this tension as a dual-capability bottleneck, P <= min(T, Q), where overall performance is limited by the weaker capability. Guided by this view, we scale the model from 28M to 120M parameters to improve tracking, then introduce Elo-weighted training to improve decisions while preserving diversity. A 2 x 2 factorial ablation shows that scaling improves tracking, weighting improves decisions, and their combination is superadditive. Linear weighting works best, while overly aggressive weighting harms tracking despite lower validation loss. We also introduce a coverage-decay formula, t* = log(N/k_crit)/log b, as a reliability horizon for intra-game degeneration risk. Our final 120M-parameter model, without search, reached a Lichess bullet rating of 2570 over 253 rated games. On human move prediction it achieves 55.2% Top-1 accuracy, exceeding Maia-2 rapid and Maia-2 blitz. Unlike position-based methods, sequence input naturally encodes full game history, enabling history-dependent decisions that single-position models cannot exhibit.
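The reliability-horizon formula above can be made concrete with a minimal sketch. Assuming N training positions cover a region of state space, coverage per position decays by a factor b per ply, and tracking degrades once coverage falls below a critical count k_crit, then t* marks the ply depth at which that threshold is crossed. The numeric values below are illustrative placeholders, not figures from the paper:

```python
import math

def reliability_horizon(N: float, k_crit: float, b: float) -> float:
    """Coverage-decay reliability horizon: t* = log(N / k_crit) / log(b).

    N      -- effective number of covering training positions (assumed)
    k_crit -- critical coverage count below which tracking degrades (assumed)
    b      -- per-ply coverage-decay factor, b > 1 (assumed)
    """
    return math.log(N / k_crit) / math.log(b)

# Illustrative values only: with N = 1e9, k_crit = 100, b = 2.0,
# coverage halves each ply and crosses k_crit after about 23 plies.
t_star = reliability_horizon(N=1e9, k_crit=100, b=2.0)
```

Under this reading, raising N (more data) or lowering b (less positional drift per ply) extends the horizon only logarithmically and linearly in 1/log b, respectively, which is consistent with the paper's framing of t* as an intra-game degeneration risk bound rather than a hard cutoff.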

Paper Structure

This paper contains 62 sections, 9 equations, 15 figures, and 13 tables.

Figures (15)

  • Figure 1: Core variable relationships. Two controllable levers (model capacity and data weighting) feed two capabilities (tracking $T$ and decision quality $Q$), governed by the bottleneck $P \le \min(T, Q)$.
  • Figure 2: Experimental causal chain. Each step's failure diagnosis motivates the next intervention, from baseline through filtering failure, model scaling, Elo weighting, and the discovery of the weighting sweet spot.
  • Figure 3: Layer-wise board-state probe accuracy. The all-Elo base model (V1.0) maintains higher square accuracy throughout the network, with both models peaking in the final layer.
  • Figure 4: Probe accuracy by evaluation slice. The gap between V1.0 and V1.1 is smallest on standard and opening positions, and largest on non-standard, middlegame, and endgame slices, precisely where low-Elo games contribute the most diverse positions.
  • Figure 5: Linear-probe board-state accuracy for three models (V1.0, V1.0w, V2.2) under unified evaluation. Scaling from 28M to 120M produces a large tracking improvement (93.4% $\to$ 98.0%), while weighting at fixed 28M scale does not improve probe accuracy (and slightly reduces it).
  • ...and 10 more figures