Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana
TL;DR
The paper investigates the (in)abilities of sparse autoencoders (SAEs) to extract causally meaningful, interpretable features from the representations of Transformers trained on synthetic formal languages (Dyck-2, Expr, and an English PCFG). It finds that while SAEs can uncover semantically meaningful latents, the identifiability and causal impact of those latents are highly sensitive to inductive biases and hyperparameters, mirroring findings from vision. To address causality, the authors introduce a weakly supervised causal regularization term that leverages latent interpolations. In the English fragment, this approach yields latents with more predictable causal effects on next-token distributions, albeit with a trade-off in grammaticality and with limited success on Dyck-2 and Expr. The work argues that causality should be a first-class concern in SAE objectives and offers a proof of concept that such objectives can uncover causally relevant representations in language models, motivating further exploration of causality-aware interpretability pipelines for NLP models.
Abstract
Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, and find that interpretable latents often emerge among the features learned by our SAEs. However, similar to vision, we find that performance is highly sensitive to the inductive biases of the training pipeline. Moreover, we show that latents correlating with certain features of the input do not always have a causal impact on the model's computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground up. Motivated by this, we propose and perform preliminary investigations of an approach that promotes learning of causally relevant features in our formal language setting.
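
For readers unfamiliar with the testbed, Dyck-2 is the language of balanced strings over two bracket types. The following is a minimal sketch of a membership check, not the paper's data pipeline. The bracket symbols `(`, `)`, `[`, `]` and the function name `is_dyck2` are assumptions made for illustration; the paper's actual vocabulary may differ.

```python
def is_dyck2(s: str) -> bool:
    """Return True if s is a balanced string over two bracket types."""
    # Closing -> opening bracket. These four symbols are an assumed alphabet.
    pairs = {")": "(", "]": "["}
    openers = set(pairs.values())
    stack = []
    for ch in s:
        if ch in openers:
            # Remember which bracket type must be closed next.
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            # Symbol outside the assumed alphabet.
            return False
    # Every opened bracket must have been closed.
    return not stack


if __name__ == "__main__":
    for example in ["([])[]", "([)]", "(("]:
        print(example, is_dyck2(example))
```

The stack makes explicit the state a model has to track to predict valid next tokens: which bracket types are open, and in what order.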
