Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, Samuel Marks
TL;DR
This work tackles the challenge of evaluating interpretable features learned by sparse autoencoders (SAEs) applied to language models trained on chess and Othello transcripts. It introduces board-state properties (BSPs) and two objective metrics—Coverage and Board reconstruction—to quantify SAE quality, alongside a novel training method, $p$-annealing, which transitions from $L_1$ to a non-convex $L_p^p$ sparsity penalty. The authors show that SAEs trained with $p$-annealing can match the performance of Gated SAEs on existing proxies while providing improved insights via the new metrics, and they reveal that coverage and board reconstruction capture aspects of interpretability not reflected in traditional metrics. The results, together with open-sourcing of 500+ SAEs, establish an actionable, domain-grounded framework for advancing dictionary learning in LM interpretability and offer a practical path toward more objective evaluation of learned features.
Abstract
What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. These settings carry natural collections of interpretable features -- for example, "there is a knight on F3" -- which we leverage into $\textit{supervised}$ metrics for SAE quality. To guide progress in interpretable dictionary learning, we introduce a new SAE training technique, $\textit{p-annealing}$, which improves performance on prior unsupervised metrics as well as our new metrics.
