Binary Autoencoder for Mechanistic Interpretability of Large Language Models

Hakaze Cho; Haolin Yang; Yanshu Li; Brian M. Kurkoski; Naoya Inoue

TL;DR

This work introduces Binary Autoencoder (BAE), a novel autoencoder variant that binarizes hidden activations and imposes a minibatch entropy constraint to promote global sparsity and feature independence, addressing dense, non-atomized representations in traditional sparse autoencoders. The method combines a linear input projection, 1-bit binarization of activations, a dual loss (self-regression and entropy/covariance) framework, and a gradient-estimation approach for the non-differentiable binarization, with an optional more-bit decoding pathway to mitigate information loss. Empirically, BAE enables accurate entropy estimation of hidden state sets and yields a large, diverse set of atomized, interpretable features, outperforming baselines in feature extraction while suppressing dense and dead features. The paper also introduces ComSem, a linguistically grounded interpretation pipeline, and demonstrates BAE’s utility for analyzing LM behavior, layer information bandwidth, and in-context learning dynamics, offering a practical tool for mechanistic interpretability in large language models.
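The summary above does not say which gradient estimator is used for the non-differentiable binarization. As a minimal sketch, assuming a clipped straight-through estimator (a common choice, not confirmed as the paper's), the forward pass applies the step function and the backward pass forwards the upstream gradient unchanged wherever the input lies within a clipping range. The function names, `threshold`, and `clip` are hypothetical.

```python
def step_forward(x, threshold=0.0):
    """Forward pass: 1-bit binarization of each unit with a step function."""
    return [1.0 if v > threshold else 0.0 for v in x]


def step_backward_ste(x, grad_output, clip=1.0):
    """Backward pass (assumed straight-through surrogate).

    The true derivative of the step function is zero almost everywhere,
    so the upstream gradient is passed through unchanged where |x| <= clip
    and set to zero elsewhere.
    """
    return [g if abs(v) <= clip else 0.0 for v, g in zip(x, grad_output)]


# Toy pre-activations and an upstream gradient from the decoder side.
x = [-2.0, -0.3, 0.4, 1.7]
upstream = [0.5, -1.0, 2.0, 0.25]

bits = step_forward(x)
grad_x = step_backward_ste(x, upstream)
```

Here the two units with inputs outside the clipping range receive no gradient, while the other two receive the upstream gradient as if the binarization were the identity.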

Abstract

Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by training-time regularization applied to single training instances, with no explicit guarantee of global sparsity across instances. This produces a large number of dense (simultaneously inactive) features, which harms feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1 bit via a step function and apply gradient estimation to enable backpropagation; we therefore term the model Binary Autoencoder (BAE). We empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which can be leveraged to characterize the inference dynamics of LLMs. (2) Feature untangling. Compared to typical methods, and owing to its improved training strategy, BAE avoids dense features while producing the largest number of interpretable features among the compared methods.
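To illustrate why binarization makes entropy cheap to estimate, the following is a minimal sketch rather than the paper's estimator. It binarizes a toy minibatch with a step function and sums the per-unit Bernoulli entropies. That sum equals the joint entropy only if the units are independent and is an upper bound otherwise; the exact loss used by BAE is not given in this abstract. All names and the toy data are hypothetical.

```python
import math


def binarize(pre_activations, threshold=0.0):
    """Step function: map each real-valued unit to one bit."""
    return [[1 if v > threshold else 0 for v in row] for row in pre_activations]


def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable; zero when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def minibatch_marginal_entropy(bits):
    """Sum of per-unit entropies, using firing rates measured on the minibatch."""
    n = len(bits)
    width = len(bits[0])
    total = 0.0
    for j in range(width):
        p = sum(row[j] for row in bits) / n  # empirical firing rate of unit j
        total += bernoulli_entropy(p)
    return total


# Toy minibatch: 4 instances, 3 hidden units (values are made up).
batch = [
    [0.7, -1.2, 0.1],
    [0.3, -0.4, -0.9],
    [1.5, -2.0, 0.6],
    [0.2, -0.1, -0.3],
]
bits = binarize(batch)
entropy_bits = minibatch_marginal_entropy(bits)
```

In this toy batch the first unit always fires and the second never does, so both contribute zero entropy; only the third unit, which fires on half of the instances, contributes one bit. Units that fire on every instance or on none carry no information about which instance is present.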
