Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents

Yoann Poupart

TL;DR

This work addresses the opacity of neural planning in chess AI by introducing Contrastive Sparse Autoencoders (CSAE) to disentangle planning concepts from latent representations. By contrasting optimal rollouts $\mathbb{S}^+_{\leq T}(s_0)$ with suboptimal rollouts $\mathbb{S}^-_{\leq T}(s_0)$, the method separates common from differentiating features in a dictionary of concepts ($c$ and $d$, respectively), coupled with a reconstruction objective and a linear probe for validation. The approach yields interpretable concepts such as safe positions and rook threats, supported by automated sanity checks and a dynamic concept taxonomy, and is validated through qualitative analyses and clustering studies. This framework advances transparent reasoning in multi-step planning for chess agents and lays the groundwork for applying contrastive, dictionary-based interpretability to other planning-centric domains.

Abstract

AI has led chess systems to a superhuman level, yet these systems rely heavily on black-box algorithms. This is untenable for ensuring transparency to the end-user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. To address this, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focus on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.
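
To make the contrastive setup summarised above more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the autoencoder encodes the concatenation of the root hidden state and one node's hidden state into non-negative features split into a $c$ block and a $d$ block. The class name `ContrastiveSAE`, the layer sizes, the random weights, and the four loss terms (reconstruction, L1 sparsity, agreement of the $c$-features across the pair, and a logistic probe on the $d$-features) are all illustrative assumptions.

```python
import math
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class ContrastiveSAE:
    """Toy CSAE; sizes, weights and loss terms are assumptions, not the paper's."""

    def __init__(self, d_hidden, n_c, n_d, seed=0):
        rng = random.Random(seed)
        d_in = 2 * d_hidden  # input is the concatenation [h(s_0), h(s_t)]
        n_feat = n_c + n_d
        self.n_c = n_c
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(n_feat)]
        self.W_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_feat)] for _ in range(d_in)]
        self.w_probe = [rng.gauss(0.0, 0.1) for _ in range(n_d)]

    def encode(self, h_root, h_node):
        # Returns (c-features, d-features), both non-negative.
        f = relu(matvec(self.W_enc, h_root + h_node))
        return f[: self.n_c], f[self.n_c:]

    def decode(self, c, d):
        return matvec(self.W_dec, c + d)

    def losses(self, h_root, h_pos, h_neg):
        c_pos, d_pos = self.encode(h_root, h_pos)
        c_neg, d_neg = self.encode(h_root, h_neg)
        # Mean squared reconstruction error, summed over both trajectories.
        recon = 0.0
        for h_node, c, d in ((h_pos, c_pos, d_pos), (h_neg, c_neg, d_neg)):
            x = h_root + h_node
            x_hat = self.decode(c, d)
            recon += sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # L1 sparsity penalty (activations are already non-negative).
        sparsity = sum(c_pos + d_pos + c_neg + d_neg)
        # Assumed contrast term: c-features should agree across the pair.
        shared = sum((a - b) ** 2 for a, b in zip(c_pos, c_neg))
        # Assumed linear probe: d-features should separate optimal from suboptimal.
        z_pos = sum(w * a for w, a in zip(self.w_probe, d_pos))
        z_neg = sum(w * a for w, a in zip(self.w_probe, d_neg))
        probe = softplus(-z_pos) + softplus(z_neg)
        return {"recon": recon, "sparsity": sparsity, "shared": shared, "probe": probe}


rng = random.Random(1)
h_root, h_pos, h_neg = ([rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3))
model = ContrastiveSAE(d_hidden=8, n_c=4, n_d=4)
print(model.losses(h_root, h_pos, h_neg))
```

In a real training loop these terms would be weighted and minimised with a gradient-based optimiser. The sketch only shows how a trajectory pair could be mapped to $c$- and $d$-features and scored.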

Paper Structure (52 sections, 12 equations, 15 figures, 4 tables)

This paper contains 52 sections, 12 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Better viewed in colour. Our proposed framework aims to retrieve planning concepts, represented as icons at the bottom. To do so, we analyse the plans of a chess-playing agent by sampling an optimal trajectory $\mathbb{S}^+_{\leq 3}(s_0)$ (in green) and a suboptimal trajectory $\mathbb{S}^-_{\leq 3}(s_0)$ (in blue) from a root node $s_0$. The star represents a concept meaningful to the optimal trajectory, while the lightning bolt represents a concept relevant to the suboptimal trajectory.
  • Figure 2: Modelling components; first, the boards are encoded into planes (a) and fed to the network backbone (b). The different heads use the extracted features to make heuristic predictions (c) guiding the MCTS when encountering new nodes (d).
  • Figure 3: Better viewed in colour. (a) Contrastive SAEs are trained on the contrast between an optimal trajectory (green) and suboptimal trajectories (blue). They take as input the root hidden state $h(s_0)$ and a subsequent node's hidden state $h(s_t^\pm)$. The $c$-features are represented in red, and the $d$-features in blue and green. (b) Schematic view of concept extraction from different rollouts. The dynamical concepts from the rollout $\mathbb{S}^+_{\leq 3}(s_0)$ are extracted in $d^+$, and those from $\mathbb{S}^-_{\leq 3}(s_0)$ in $d^-$.
  • Figure 4: (a) Illustration of the process of interpreting a feature using activation maximisation. The most activated samples are retrieved and analysed. (b) In order to compare a pair of features, the first indicator is the correlation of the feature activation (right). It is also possible to count common samples retrieved using activation maximisation.
  • Figure 5: Agglomerative clustering of the test samples after an NMF, followed by a t-SNE for visualisation (Maaten2008VisualizingDU; scikit-learn). We present the first 100 clusters, and colours are repeated. Each colour represents 5 different clusters, and the colours are independent between (a) and (b). While the structures are similar (due to the t-SNE projection), the labels are uncorrelated, suggesting a difference in representations for the $c$-features and $d$-features.
  • ...and 10 more figures
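
The pairwise feature comparison described in the caption of Figure 4 (b) can be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: each feature is represented as a list of activations over the same samples, and the helper names `pearson`, `top_k_samples` and `common_top_k`, the toy activations, and the choice `k=3` are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two activation vectors of equal length."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / math.sqrt(var_a * var_b)


def top_k_samples(acts, k):
    """Indices of the k most activated samples (activation maximisation)."""
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    return set(order[:k])


def common_top_k(a, b, k):
    """Number of samples shared by the top-k sets of two features."""
    return len(top_k_samples(a, k) & top_k_samples(b, k))


# Toy activations of two features over six samples (made-up values).
feat_a = [0.0, 0.9, 0.1, 0.8, 0.0, 0.7]
feat_b = [0.6, 0.8, 0.0, 0.9, 0.0, 0.1]
print("correlation:", round(pearson(feat_a, feat_b), 3))
print("common top-3 samples:", common_top_k(feat_a, feat_b, k=3))
```

The two indicators are complementary: correlation uses every sample, whereas the top-k overlap only looks at the samples one would actually inspect when interpreting a feature.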