Codebook Features: Sparse and Discrete Interpretability for Neural Networks

Alex Tamkin; Mohammad Taufeeque; Noah D. Goodman

TL;DR

This work introduces codebook features: sparse, discrete bottlenecks inserted at each layer of a neural network that compress activations into a small set of learned codes. The authors finetune Transformer models with codebook modules using a standard language-modeling objective plus a reconstruction term, and show that end-to-end training is feasible even for networks with dozens of layers and large codebooks. Code activations align with disentangled concepts in both a controlled finite state machine (FSM) task and natural language modeling, and activating specific codes causally steers model behavior, including topic-focused generation. Overall, codebook features appear to be a promising unit of interpretation and control for neural networks, and the open-source code and models enable broader exploration across domains.

Abstract

Understanding neural networks is challenging in part because of the dense, continuous nature of their hidden states. We explore whether we can train neural networks to have hidden states that are sparse, discrete, and more interpretable by quantizing their continuous features into what we call codebook features. Codebook features are produced by finetuning neural networks with vector quantization bottlenecks at each layer, producing a network whose hidden features are the sum of a small number of discrete vector codes chosen from a larger codebook. Surprisingly, we find that neural networks can operate under this extreme bottleneck with only modest degradation in performance. This sparse, discrete bottleneck also provides an intuitive way of controlling neural network behavior: first, find codes that activate when the desired behavior is present, then activate those same codes during generation to elicit that behavior. We validate our approach by training codebook Transformers on several different datasets. First, we explore a finite state machine dataset with far more hidden states than neurons. In this setting, our approach overcomes the superposition problem by assigning states to distinct codes, and we find that we can make the neural network behave as if it is in a different state by activating the code for that state. Second, we train Transformer language models with up to 410M parameters on two natural language datasets. We identify codes in these models representing diverse, disentangled concepts (ranging from negative emotions to months of the year) and find that we can guide the model to generate different topics by activating the appropriate codes during inference. Overall, codebook features appear to be a promising unit of analysis and control for neural networks and interpretability. Our codebase and models are open-sourced at https://github.com/taufeeque9/codebook-features.
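The bottleneck described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`top_k_codes`, `codebook_output`, `quantize`), the use of cosine similarity to select codes, the toy two-dimensional codebook, and the `forced_ids` steering argument are all assumptions made for illustration. The sketch shows only the forward pass, in which a dense activation is replaced by the sum of its `k` best-matching codes, and how forcing a chosen code stands in for "activating" it during generation. Training details are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_codes(activation, codebook, k):
    """Return the indices of the k codebook vectors most similar to the activation.

    Assumption: similarity is cosine similarity; the abstract does not specify it.
    """
    scores = [cosine(activation, code) for code in codebook]
    ranked = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def codebook_output(code_ids, codebook):
    """Sum the selected code vectors; this sum replaces the dense activation."""
    dim = len(codebook[0])
    return [sum(codebook[i][d] for i in code_ids) for d in range(dim)]


def quantize(activation, codebook, k, forced_ids=None):
    """Quantize an activation into k codes.

    If forced_ids is given (hypothetical steering hook), those codes are used
    instead of the ones the activation would have selected.
    """
    if forced_ids is not None:
        ids = list(forced_ids)
    else:
        ids = top_k_codes(activation, codebook, k)
    return ids, codebook_output(ids, codebook)


# Toy codebook with four 2-dimensional codes (illustrative values only).
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.7, 0.7]]

# Normal forward pass: the activation is replaced by its two closest codes.
ids, out = quantize([0.9, 0.2], codebook, k=2)

# Steering: ignore the activation and force code 2 to be active.
steered_ids, steered_out = quantize([0.9, 0.2], codebook, k=2, forced_ids=[2])
```

In this toy setting the list of active code indices is the discrete state one would inspect, and overriding that list is the intervention. A real model would apply this per token and per layer.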

Paper Structure (52 sections, 6 figures, 13 tables)

Introduction
Method
Training with codebooks
Using codebooks for understanding and control
Algorithmic sequence modeling
Generating hypotheses for the role of codes
Steering the network by activating codes
Language modeling
Steering the network by activating topic codes
Related work
Discussion and future work
Author contributions
General Training and Optimization Details
Layer norm
Optimizer hyperparameters
...and 37 more sections

Figures (6)

Figure 1: Codebook features attempt to combine the expressivity of neural networks with the sparse, discrete state often found in traditional software.
Figure 2: Interventions on the state and state-plus-digit codes in a sequence. Changing just the MLP codes to codes associated with another state shifts the output distribution almost entirely to the target state. Changing codes in other layers has a much smaller effect. Normalized JS Div stands for the normalized Jensen-Shannon Divergence, where the initial difference (None) is normalized to 1.
Figure 3: Codes are better classifiers of simple textual features than neurons. Y-axis: precision of a given code at classifying a regular expression. X-axis: precision of the best neuron in the network, with a threshold chosen to match the recall of the code. Red line: $y=x$.
Figure 4: Code activation frequencies appear to follow a power law. Frequency of code activations by rank from the TinyStories 1-layer attention-only codebook model. The x-axis denotes the rank of the code in terms of frequency on a subset of the training set. We observe that most codes activate very rarely, while a long tail of codes activate very frequently.
Figure 5: Codebook training overcomes the superposition challenge in the first layer. We plot the fraction of codes which are pure at each layer, meaning they activate only on a single state (in the case of bigrams) or state + first digit (in the case of trigrams). We see very high levels of purity for both bigram and trigram models. Because the number of hidden states is 128, and there are 1000 trigram combinations for the model to learn, the network cannot allocate each state to a different neuron. The high purity of the codes demonstrates that codebook training has mostly resolved the superposition problem at the first layer. Code purity declines in higher layers as the model forms its prediction of the next token (see \ref{fig:fsm-jsd-purity-plots}). Experiment performed on the MLP codebooks of the $k=1$ Attn + MLP codebook TokFSM model over all 100 and 1000 possible combinations of the first two and three digits, respectively.
...and 1 more figure
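Two quantities in the captions above can be sketched in a few lines: the normalized Jensen-Shannon divergence of Figure 2 and the code purity of Figure 5. This is an illustrative reading of the captions, not the authors' evaluation code. The function names, the choice to compare each output distribution against the target state's distribution, the natural-log base, and the `(code_id, state)` record format are all assumptions.

```python
import math
from collections import defaultdict


def kl(p, q):
    """KL divergence KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def normalized_jsd(intervened, target, baseline):
    """JS divergence to the target, scaled so the no-intervention case equals 1.

    Assumption: 'baseline' is the model's output with no intervention ("None"
    in Figure 2) and 'target' is the output distribution of the target state.
    """
    reference = js_divergence(baseline, target)
    if reference == 0:
        raise ValueError("baseline already matches target; cannot normalize")
    return js_divergence(intervened, target) / reference


def pure_fraction(activations):
    """Fraction of codes that activate on exactly one state.

    'activations' is a list of (code_id, state) records (hypothetical format).
    """
    states_by_code = defaultdict(set)
    for code_id, state in activations:
        states_by_code[code_id].add(state)
    if not states_by_code:
        return 0.0
    pure = sum(1 for states in states_by_code.values() if len(states) == 1)
    return pure / len(states_by_code)


# Toy next-token distributions over two outcomes (illustrative values only).
target = [0.0, 1.0]
baseline = [1.0, 0.0]
intervened = [0.1, 0.9]

no_change = normalized_jsd(baseline, target, baseline)
after_edit = normalized_jsd(intervened, target, baseline)

# Codes 0 and 2 fire on a single state each; code 1 fires on two states.
records = [(0, "s1"), (0, "s1"), (1, "s1"), (1, "s2"), (2, "s3")]
purity = pure_fraction(records)
```

Under this reading, a value near 0 means the intervention moved the output almost entirely to the target state, and a value near 1 means it had little effect.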
