Exploring Human-AI Conceptual Alignment through the Prism of Chess

Semyon Lomasov, Judah Goldfeder, Mehmet Hamza Erol, Matthew So, Yao Yan, Addison Howard, Nathan Kutz, Ravid Shwartz Ziv

TL;DR

This work interrogates whether neural chess programs truly grasp human concepts or rely on pattern-matching. It combines a novel Chess960 dataset, three probing methods, and layer-wise activation analysis on a 270M-parameter transformer to assess conceptual alignment across layers. The findings show strong human-aligned concept detection in early layers (up to ~85%), but deep layers converge on alien representations, and Chess960 perturbs concept recognition by 10–20%, indicating reliance on memorized patterns rather than abstract principles. These results reveal a fundamental tension between optimizing for performance and maintaining human-aligned reasoning, with important implications for designing creative AI and guiding future interpretability research; the authors release dataset and code to foster further investigation.

Abstract

Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20% across all methods, revealing the model's reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.
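
To make "randomized starting positions" concrete, the following is a minimal sketch of how a Chess960 back rank can be drawn under the standard Chess960 constraints (bishops on opposite-coloured squares, king between the two rooks). It is an illustration only, not the authors' dataset pipeline; the function names `chess960_back_rank` and `is_valid_back_rank` are hypothetical.

```python
import random


def chess960_back_rank(rng):
    """Return a random Chess960 back rank as an 8-character string (files a-h)."""
    squares = [None] * 8
    # Bishops must stand on opposite-coloured squares: one even index, one odd.
    squares[rng.choice(range(0, 8, 2))] = "B"
    squares[rng.choice(range(1, 8, 2))] = "B"
    # Queen and both knights take random free squares.
    for piece in ("Q", "N", "N"):
        free = [i for i, p in enumerate(squares) if p is None]
        squares[rng.choice(free)] = piece
    # The three remaining squares get rook, king, rook from left to right,
    # which guarantees the king sits between the rooks.
    free = [i for i, p in enumerate(squares) if p is None]
    for i, piece in zip(free, "RKR"):
        squares[i] = piece
    return "".join(squares)


def is_valid_back_rank(rank):
    """Check the Chess960 constraints on an 8-character back rank."""
    if sorted(rank) != sorted("RNBQKBNR"):
        return False
    b1, b2 = [i for i, p in enumerate(rank) if p == "B"]
    r1, r2 = [i for i, p in enumerate(rank) if p == "R"]
    return (b1 + b2) % 2 == 1 and r1 < rank.index("K") < r2


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so the sketch is repeatable
    for _ in range(3):
        print(chess960_back_rank(rng))
```

Because the piece placement changes from game to game, memorized opening lines from the standard starting position no longer apply, which is the property the abstract relies on.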

Paper Structure

This paper contains 19 sections, 4 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: From board to move: tracking where human concepts disappear in the processing pipeline. The model encodes positions as FEN strings, appends move tokens, and processes them through transformer layers. Recording activations at each layer reveals where strategic understanding shifts from human-recognizable patterns to alien representations.
  • Figure 2: Human concepts fade as the network goes deeper: early layers think like humans, late layers think like aliens. Layer-wise accuracy for detecting six chess concepts using Logistic Regression probing. For a comparison of all 3 probing methods, see the appendix. Early layers (2-5) achieve 70-85% accuracy, dropping to 50-65% by layer 15 across all methods. Standard chess shows smooth degradation while Chess960 becomes erratic, revealing unstable representations without memorized patterns. This universal decline exposes the trade-off between human interpretability and performance.
  • Figure 3: Full results of Figure 2. The overall trend supports the takeaway that human concepts fade as the network goes deeper.
  • Figure 4: Examples of Human-Concept Categories. The top left illustrates a black bishop outpost, where a black bishop is anchored by pawns deep into the white position. The top right illustrates a black weak queen, where the queen is either directly under attack, or is pinned to a piece. On the bottom left, we have a white king located on a side of the board where neither player has any pawns, and on the bottom right, we illustrate a white rook on a semi-open file, where there is no white pawn opposing it.
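
The layer-wise Logistic Regression probing described in the Figure 2 caption can be sketched as follows. This is a hedged illustration, not the authors' code: the "activations" are synthetic Gaussian vectors in which a concept signal is deliberately made weaker at deeper layers, the probe is a plain gradient-descent logistic regression written with the standard library, and all names (`fit_logistic_probe`, `synthetic_activations`, `layerwise_probe`) and hyperparameters are assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_probe(X, y, lr=0.1, epochs=200):
    """Fit a linear probe with full-batch gradient descent on the log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of held-out examples whose concept label the probe recovers."""
    hits = 0
    for x, t in zip(X, y):
        predicted = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += predicted == (t == 1)
    return hits / len(X)


def synthetic_activations(rng, n, dim, signal):
    """Stand-in for one layer's activations: noise plus a label-dependent shift."""
    X, y = [], []
    for _ in range(n):
        t = rng.randint(0, 1)  # 1 = concept present, 0 = absent
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if t else -signal
        X.append(x)
        y.append(t)
    return X, y


def layerwise_probe(signals, seed=0, n_train=200, n_test=100, dim=8):
    """Train one probe per layer and return held-out accuracy for each."""
    rng = random.Random(seed)
    accuracies = []
    for signal in signals:
        X_tr, y_tr = synthetic_activations(rng, n_train, dim, signal)
        X_te, y_te = synthetic_activations(rng, n_test, dim, signal)
        w, b = fit_logistic_probe(X_tr, y_tr)
        accuracies.append(probe_accuracy(w, b, X_te, y_te))
    return accuracies


if __name__ == "__main__":
    # Hypothetical signal strengths for an early, a middle, and a late layer.
    for depth, acc in enumerate(layerwise_probe([2.0, 1.0, 0.0])):
        print(f"layer slot {depth}: probe accuracy {acc:.2f}")
```

In a real analysis, `synthetic_activations` would be replaced by activations recorded from the model at each layer for the annotated positions; accuracy near 50% on a balanced binary concept then indicates that a linear probe cannot read the concept from that layer.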