Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders
Charles O'Neill, Alim Gumran, David Klindt
TL;DR
The paper proves an amortisation gap for sparse autoencoders (SAEs): when sparse codes in $N$ dimensions are projected into $M < N$ dimensions by a map satisfying the restricted isometry property, a one-pass linear-nonlinear SAE encoder cannot recover all sparse codes optimally. By decoupling encoding from decoding, it empirically compares SAE encoders, MLPs, and traditional sparse coding on synthetic data and GPT-2 activations, and finds that more expressive encoders (notably sparse coding, and in some settings MLPs) give better sparse inference and more interpretable features, at a higher compute cost. The findings generalise to large language model activations, suggesting that richer encoders can improve interpretability without sacrificing validity, albeit with more computation and a harder optimisation problem. The work clarifies when and why amortised inference in SAEs is suboptimal and points toward practical strategies for better feature extraction in neural network interpretability.
Abstract
A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. Using compressed sensing theory, we prove that an SAE encoder is inherently insufficient for accurate sparse inference, even in solvable cases. We then decouple the encoding and decoding processes to empirically explore the conditions under which more sophisticated sparse inference methods outperform traditional SAE encoders. Our results reveal substantial gains in the correct inference of sparse codes with only minimal increases in compute. We demonstrate that this generalises to SAEs applied to large language models, where more expressive encoders achieve greater interpretability. This work opens new avenues for understanding neural network representations and analysing large language model activations.
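To make the contrast between one-pass and iterative sparse inference concrete, the following is a minimal sketch rather than the paper's experimental setup. It assumes a fixed random Gaussian dictionary `D` with unit-norm columns, non-negative $K$-sparse codes, and a non-negative variant of ISTA as the stand-in for "traditional sparse coding". All sizes, the penalty `lam`, and the function names are hypothetical choices for illustration. A single ISTA step started from zero is itself a linear map followed by a ReLU, so it serves here as an untrained one-pass encoder; because it is untrained, the size of the gap it shows says nothing about trained SAEs.

```python
import numpy as np


def make_data(n_samples=200, N=64, M=32, K=3, seed=0):
    """Draw non-negative K-sparse codes in R^N and project them to R^M (M < N)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((M, N))
    D /= np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm dictionary columns
    S = np.zeros((n_samples, N))
    for i in range(n_samples):
        support = rng.choice(N, size=K, replace=False)
        S[i, support] = rng.uniform(1.0, 2.0, size=K)
    X = S @ D.T  # each row is x = D s
    return D, S, X


def ista_nonneg(X, D, lam=0.05, n_iter=1):
    """Non-negative ISTA for min_s 0.5 * ||x - D s||^2 + lam * sum(s), s >= 0.

    With n_iter=1 and a zero start this reduces to
    relu(step * (D^T x - lam)), i.e. a linear-nonlinear encoder.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    S_hat = np.zeros((X.shape[0], D.shape[1]))
    for _ in range(n_iter):
        grad = (S_hat @ D.T - X) @ D  # rows are D^T (D s - x)
        S_hat = np.maximum(0.0, S_hat - step * (grad + lam))
    return S_hat


def code_mse(S_hat, S):
    """Mean squared error between inferred and true sparse codes."""
    return float(np.mean((S_hat - S) ** 2))


D, S, X = make_data()
err_one_pass = code_mse(ista_nonneg(X, D, n_iter=1), S)
err_iterative = code_mse(ista_nonneg(X, D, n_iter=200), S)
print(f"one-pass code MSE:  {err_one_pass:.4f}")
print(f"iterative code MSE: {err_iterative:.4f}")
```

Both calls share the same dictionary, so the only difference is how much computation is spent on inference per input, which is the sense in which encoding is decoupled from decoding here.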
