Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart
TL;DR
This work addresses the opacity of neural planning in chess AI by introducing Contrastive Sparse Autoencoders (CSAE), which disentangle planning concepts from latent representations. By contrasting optimal rollouts $S^+_{\leq T}(s_0)$ with suboptimal rollouts $S^-_{\leq T}(s_0)$, the method separates common features ($c$) from differentiating features ($d$) within a learned dictionary of concepts, and pairs this with a reconstruction objective and a linear probe for validation. The approach yields interpretable concepts such as safe positions and rook threats. These are supported by automated sanity checks and a dynamic concept taxonomy, and are examined through qualitative analyses and clustering studies. The framework advances transparent reasoning in multi-step planning for chess agents and lays the groundwork for applying contrastive, dictionary-based interpretability to other planning-centric domains.
Abstract
AI has brought chess systems to a superhuman level, yet these systems rely heavily on black-box algorithms. This opacity makes it hard to guarantee transparency to the end user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. To address this, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We focus primarily on a qualitative analysis of the CSAE features before proposing an automated feature taxonomy. Furthermore, to evaluate the quality of our trained CSAE, we devise sanity checks to rule out spurious correlations in our results.
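To make the idea of training on pairs of trajectories more concrete, the following is a minimal NumPy sketch of what a contrastive sparse-autoencoder objective could look like for one batch of paired activations. It is an illustration under stated assumptions, not the loss used in this work: the function name `csae_loss`, the index-based split of the dictionary into common features $c$ and differentiating features $d$, the squared-gap penalty on $c$, and the weights `l1` and `contrast` are all hypothetical choices. The linear probe is omitted.

```python
import numpy as np


def csae_loss(h_pos, h_neg, W_enc, b_enc, W_dec, n_common, l1=1e-3, contrast=1.0):
    """Toy contrastive-SAE objective (hypothetical; not the paper's exact loss).

    h_pos, h_neg: (batch, d_model) activations taken from an optimal and a
                  suboptimal trajectory that start from the same position.
    The first `n_common` dictionary features play the role of c (common);
    the remaining ones play the role of d (differentiating).
    """

    def encode(h):
        # Sparse, non-negative feature activations via a ReLU encoder.
        return np.maximum(h @ W_enc + b_enc, 0.0)

    f_pos, f_neg = encode(h_pos), encode(h_neg)

    # Reconstruct both activations from the same decoder dictionary.
    recon = np.mean((f_pos @ W_dec - h_pos) ** 2) + np.mean((f_neg @ W_dec - h_neg) ** 2)

    # L1 penalty encourages few active features per input.
    sparsity = np.mean(np.abs(f_pos)) + np.mean(np.abs(f_neg))

    # Assumed contrastive term: common features should agree across the pair,
    # which leaves the differentiating features free to encode what differs.
    common_gap = np.mean((f_pos[:, :n_common] - f_neg[:, :n_common]) ** 2)

    total = recon + l1 * sparsity + contrast * common_gap
    return {"total": total, "recon": recon, "sparsity": sparsity, "common_gap": common_gap}


# Usage on random data (shapes are illustrative only).
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
h_pos = rng.normal(size=(4, d_model))
h_neg = rng.normal(size=(4, d_model))

losses = csae_loss(h_pos, h_neg, W_enc, b_enc, W_dec, n_common=8)
print(sorted(losses))
```

In this sketch, feeding the same activations as both members of the pair drives the common-feature term to zero, so only reconstruction and sparsity remain. In practice the parameters would be optimised with an autodiff framework rather than evaluated once as here.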
