Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Adam Karvonen; Benjamin Wright; Can Rager; Rico Angell; Jannik Brinkmann; Logan Smith; Claudio Mayrink Verdun; David Bau; Samuel Marks

TL;DR

This work tackles the challenge of evaluating the interpretable features learned by sparse autoencoders (SAEs) applied to language models trained on chess and Othello transcripts. It introduces board state properties (BSPs) and two supervised metrics, coverage and board reconstruction, to quantify SAE quality, alongside a new training method, $p$-annealing, which transitions from an $L_1$ penalty to a non-convex $L_p^p$ sparsity penalty. The authors show that SAEs trained with $p$-annealing can match the performance of Gated SAEs on existing proxy metrics while also improving on the new metrics, and that coverage and board reconstruction capture aspects of interpretability not reflected in the existing proxies. Together with the open-sourcing of 500+ SAEs, the results establish a domain-grounded framework for advancing dictionary learning in LM interpretability and offer a practical path toward more objective evaluation of learned features.
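As a rough illustration of the transition from $L_1$ to $L_p^p$ described above, the following minimal sketch computes an $L_p^p$ sparsity penalty and anneals $p$ downward during training. The linear schedule, the end value `p_end=0.5`, and the function names `lp_penalty` and `p_schedule` are assumptions made for illustration; this summary does not specify the authors' schedule or any rescaling of the sparsity coefficient.

```python
import numpy as np

def lp_penalty(acts: np.ndarray, p: float) -> float:
    """Batch-mean of sum_i |f_i|^p over SAE feature activations.

    With p = 1 this is the usual L1 penalty; for 0 < p < 1 it is non-convex.
    """
    return float(np.mean(np.sum(np.abs(acts) ** p, axis=-1)))

def p_schedule(step: int, total_steps: int,
               p_start: float = 1.0, p_end: float = 0.5) -> float:
    """Hypothetical linear schedule that moves p from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + (p_end - p_start) * frac

# One batch row with feature activations 0, 4 and 9.
acts = np.array([[0.0, 4.0, 9.0]])
l1_value = lp_penalty(acts, p_schedule(0, 100))      # p = 1.0
lp_value = lp_penalty(acts, p_schedule(100, 100))    # p = 0.5
```

Lowering $p$ reduces the penalty on large activations relative to small ones, so the penalty behaves more like a count of active features as $p$ decreases.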

Abstract

What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. These settings carry natural collections of interpretable features -- for example, "there is a knight on F3" -- which we leverage into $\textit{supervised}$ metrics for SAE quality. To guide progress in interpretable dictionary learning, we introduce a new SAE training technique, $\textit{p-annealing}$, which improves performance on prior unsupervised metrics as well as our new metrics.

Paper Structure (36 sections, 14 equations, 6 figures, 5 tables)

Introduction
Background
Language models for Othello and chess
Othello.
Chess.
Sparse autoencoders
Measuring autoencoder quality for chess and Othello models
Board state properties in chess and Othello models
Measuring SAE quality with board state properties
Coverage.
Board reconstruction.
Training methodologies for SAEs
Standard SAEs
Gated SAEs
p-Annealing
...and 21 more sections

Figures (6)

Figure 1: We find SAE features that detect interpretable board state properties (BSPs) with high precision (i.e., above 0.95). This figure illustrates three distinct chessboard states, each an example of a BSP associated with a high activation of a particular SAE feature. Left: A board state detector identifies a knight on square f3, owned by the player to move. Middle: A rook threat detector indicates an immediate threat posed by a rook to a queen regardless of location and piece threatened. Right: A pin detector recognizes moves that resolve a check on a diagonal by creating a pin, again, regardless of location and piece pinned.
Figure 2: Comparison of the coverage and board reconstruction metrics for chess SAE quality on $\mathcal{G}_\text{board state}$. The coverage score reports the mean of maximal F1 scores over BSPs. The top row corresponds to coverage, and the bottom row corresponds to board reconstruction. The left column contains a scatterplot of loss recovered vs. $L_0$, with the color scheme corresponding to the coverage score and each point representing different hyperparameters. We differentiate between SAE training methods with shapes.
Figure 3: Comparison of the coverage and board reconstruction metrics for chess SAE quality on $\mathcal{G}_\text{strategy}$. The metrics represent the average coverage and board reconstruction obtained across all BSPs in $\mathcal{G}_\text{strategy}$. The coverage score reports the mean of maximal F1 scores over BSPs. The absolute coverage scores vary significantly between strategy BSPs, as discussed in Appendix \ref{app:linear-probes}. The top row corresponds to coverage, and the bottom row corresponds to board reconstruction. The left column contains a scatterplot of loss recovered vs. $L_0$, with the color scheme corresponding to the coverage score and each point representing different hyperparameters. We differentiate between SAE training methods with shapes.
Figure 4: Comparison of the coverage and board reconstruction metrics for Othello SAE quality on $\mathcal{G}_\text{board state}$. The coverage score reports the mean of maximal F1 scores over BSPs. The top row corresponds to coverage, and the bottom row corresponds to board reconstruction. The left column contains a scatterplot of loss recovered vs. $L_0$, with the color scheme corresponding to the coverage score and each point representing different hyperparameters. We differentiate between SAE training methods with shapes.
Figure 5: Comparison of the relative reconstruction bias metric $\gamma$ quantifying feature activation shrinkage across a suite of SAEs. $\gamma < 1$ indicates shrinkage. A perfectly unbiased SAE would have $\gamma = 1$.
...and 1 more figure
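
The captions above describe the coverage score as the mean of maximal F1 scores over BSPs. A minimal sketch of that computation follows, assuming feature activations have already been binarized into on/off indicators (the thresholding rule is not given in this summary) and that each BSP is a boolean label per board state. The function name `coverage` and the array layout are hypothetical.

```python
import numpy as np

def coverage(feature_on: np.ndarray, bsp_on: np.ndarray) -> float:
    """Mean over BSPs of the best F1 score achieved by any single feature.

    feature_on: (n_samples, n_features) boolean, feature active on a sample.
    bsp_on:     (n_samples, n_bsps) boolean, BSP true on a sample.
    """
    f = feature_on.astype(float)
    b = bsp_on.astype(float)
    tp = f.T @ b                            # (n_features, n_bsps) true positives
    predicted = f.sum(axis=0)[:, None]      # TP + FP for each feature
    actual = b.sum(axis=0)[None, :]         # TP + FN for each BSP
    denom = predicted + actual              # equals 2TP + FP + FN
    f1 = np.divide(2.0 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.max(axis=0).mean())

# Toy data: 4 board states, 2 features, 2 BSPs.
feature_on = np.array([[1, 0], [1, 0], [0, 1], [0, 0]], dtype=bool)
bsp_on = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=bool)
score = coverage(feature_on, bsp_on)
```

In the toy data, feature 0 classifies the first BSP perfectly (F1 of 1), while the best feature for the second BSP catches one of its two positive states (F1 of 2/3), so the score is the mean of those two values.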
