Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Hakaze Cho, Haolin Yang, Yanshu Li, Brian M. Kurkoski, Naoya Inoue

TL;DR

This work introduces Binary Autoencoder (BAE), a novel autoencoder variant that binarizes hidden activations and imposes a minibatch entropy constraint to promote global sparsity and feature independence, addressing dense, non-atomized representations in traditional sparse autoencoders. The method combines a linear input projection, 1-bit binarization of activations, a dual loss (self-regression and entropy/covariance) framework, and a gradient-estimation approach for the non-differentiable binarization, with an optional more-bit decoding pathway to mitigate information loss. Empirically, BAE enables accurate entropy estimation of hidden state sets and yields a large, diverse set of atomized, interpretable features, outperforming baselines in feature extraction while suppressing dense and dead features. The paper also introduces ComSem, a linguistically grounded interpretation pipeline, and demonstrates BAE’s utility for analyzing LM behavior, layer information bandwidth, and in-context learning dynamics, offering a practical tool for mechanistic interpretability in large language models.

Abstract

Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by training-time regularization applied to single training instances, with no explicit guarantee of global sparsity across instances. This yields a large number of dense (simultaneously inactive) features, which harms feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1 bit via a step function and apply gradient estimation to enable backpropagation; we therefore term the model Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which can be leveraged to characterize the inference dynamics of LLMs. (2) Feature untangling. Compared to typical methods, thanks to its improved training strategy, BAE avoids dense features while producing the largest number of interpretable features among all compared methods.
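
To make application (1) concrete, the sketch below estimates an entropy value from a minibatch of binary activations. It is only an illustration under stated assumptions: the paper's exact entropy objective is not reproduced on this page, so the code sums per-channel Bernoulli entropies computed from empirical firing rates. That sum equals the joint entropy only when channels are independent and is an upper bound otherwise. The names `bernoulli_entropy` and `marginal_entropy_bits` and the toy `batch` are hypothetical.

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; defined as 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def marginal_entropy_bits(batch):
    """Sum of per-channel entropies for a minibatch of 0/1 activation vectors.

    Assumption: channels are treated as independent, so this is an upper
    bound on the joint entropy rather than the paper's exact loss.
    """
    n = len(batch)
    width = len(batch[0])
    total = 0.0
    for j in range(width):
        # Empirical firing rate of channel j over the minibatch.
        rate = sum(row[j] for row in batch) / n
        total += bernoulli_entropy(rate)
    return total

# Toy minibatch: 4 instances, 4 binary channels (made-up values).
batch = [
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
entropy_bits = marginal_entropy_bits(batch)
```

In this toy batch, the always-off channel and the always-on channel contribute nothing to the estimate, since a channel whose value never changes carries no information about which instance produced it.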

Paper Structure

This paper contains 62 sections, 1 theorem, 8 equations, 44 figures, 4 tables.

Key Result

Theorem 5.1

Let $X_1\sim \mathrm{Bernoulli}(p_1)$ and $X_2\sim \mathrm{Bernoulli}(p_2)$. The information of an observation $x\in\{0,1\}$ (i.e., the actual value of the hidden activation in a specified channel) can be written as $I_X(x):=-\log \Pr[X=x]$. Then:

Figures (44)

  • Figure 1: Feed-forward computation and training objective of BAE. Hidden states $h_0$ from LLM layers are mapped by $W_\text{in}$, binarized into $h_B$ via a step function, and projected back by $W_\text{out}$ as $\hat{h_0}$. $\hat{h_0}$ is fed to the self-regression loss, while $h_B$ is fed to the information bottleneck loss. More-bit decode: to reduce the information loss of the BAE, as described in the paper's section on more-bit decoding, we aggregate real-valued hidden activation elements from multiple binary bits and perform decoding using the reconstructed real-valued vector.
  • Figure 2: Evaluation of BAE entropy calculation on the synthetic dataset. Horizontal axis: rank $r$ of the current dataset, vertical axis: calculated entropy, the green/red color refers to whether $\mathcal{L}_e$ is enabled, and the opacity refers to the $\mathcal{L}_r$ on the whole input set.
  • Figure 3: Entropy calculated on hidden states of Llama 3.2-1B on Pile inputs, extracted from specific layers and token locations. The curve colors refer to the extracted layers; the scatter colors refer to the $\mathcal{L}_r$ on the whole input set.
  • Figure 4: Entropy calculated on the hidden states extracted from specific layers and the last token from ICL inputs from SST-2. The curve colors refer to the number of demonstrations.
  • Figure 5: Feature activation frequency distribution of Layer 11 (more layers in the paper's appendix).
  • ...and 39 more figures
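
The feed-forward path described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the threshold at zero, the absence of bias terms, the use of mean squared error as the self-regression loss, and all names and weight values here are assumptions. The gradient estimation for the step function and the more-bit decode are not shown.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def step(v):
    """Binarize to 1 bit per channel. Assumption: threshold at zero."""
    return [1 if x > 0 else 0 for x in v]

def bae_forward(h0, W_in, W_out):
    """Forward pass: h0 -> W_in -> step -> h_B -> W_out -> h0_hat."""
    h_b = step(matvec(W_in, h0))
    h0_hat = matvec(W_out, h_b)
    # Assumption: self-regression loss taken as mean squared error.
    loss_r = sum((a - b) ** 2 for a, b in zip(h0, h0_hat)) / len(h0)
    return h_b, h0_hat, loss_r

# Toy sizes: 2-dimensional hidden state, 3 binary channels (made-up weights).
W_in = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W_out = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
h0 = [1.0, -2.0]

h_b, h0_hat, loss_r = bae_forward(h0, W_in, W_out)
```

Because `step` has zero gradient almost everywhere, training such a model needs a gradient estimate for the binarization, which is the role the abstract assigns to gradient estimation.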

Theorems & Definitions (1)

  • Theorem 5.1: Burst Features Carry More Information
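
The definition $I_X(x):=-\log \Pr[X=x]$ used in Theorem 5.1 can be evaluated directly for a binary channel. The sketch below is a worked example of that definition only; it does not reproduce the theorem's conclusion, which is not shown on this page. The base-2 logarithm, the function name, and the firing probabilities are assumptions chosen for illustration.

```python
import math

def self_information_bits(p, x):
    """Information -log2 Pr[X = x] for X ~ Bernoulli(p) and x in {0, 1}."""
    prob = p if x == 1 else 1.0 - p
    return -math.log2(prob)

# A channel that fires on 1% of inputs versus one that fires on 50%.
rare_fire = self_information_bits(0.01, 1)     # rare channel observed active
common_fire = self_information_bits(0.5, 1)    # frequent channel observed active
rare_silent = self_information_bits(0.01, 0)   # rare channel observed inactive
```

Under this definition, observing the rarely active channel firing is far more informative than observing the frequently active one firing, while observing the rare channel silent is almost uninformative.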