Analyzing (In)Abilities of SAEs via Formal Languages

Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana

TL;DR

The paper investigates the (in)abilities of sparse autoencoders (SAEs) to extract causally meaningful, interpretable features from the hidden representations of Transformers trained on synthetic formal languages (Dyck-2, Expr, and an English PCFG). It finds that while SAEs can uncover semantically meaningful latents, whether such latents are identified and whether they have a causal impact is highly sensitive to inductive biases and hyperparameters, mirroring findings from vision. To address causality, the authors introduce a causal regularization term with weak supervision that leverages latent interpolations. In the English fragment, this approach yields latents with more predictable causal effects on next-token distributions, albeit with a trade-off in grammaticality and limited success on Dyck-2 and Expr. The work argues that causality should be a first-class concern in SAE objectives and provides a proof of concept that such approaches can uncover causally relevant representations in language-model contexts, motivating further exploration of causality-aware interpretability pipelines for NLP models.

Abstract

Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, and find that interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find that performance is highly sensitive to inductive biases of the training pipeline. Moreover, we show that latents correlating with certain features of the input do not always induce a causal impact on the model's computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting.
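
For readers unfamiliar with the testbed, Dyck-2 is the language of balanced strings over two bracket types. The sketch below checks membership and reports the nesting depth of each token, the quantity the bracket-matching feature in Figure 2 is described in terms of. The bracket symbols `(`, `)`, `[`, `]` and the function name `dyck2_depths` are assumptions for illustration; the paper's actual vocabulary and tokenization may differ.

```python
# Assumed bracket symbols; the paper's token vocabulary may differ.
PAIRS = {"(": ")", "[": "]"}


def dyck2_depths(tokens):
    """Return the nesting depth of each token, or None if the
    sequence is not a well-formed Dyck-2 string.

    An opening bracket and its matching closing bracket are
    assigned the same depth (the outermost pair has depth 1).
    """
    stack = []   # currently open brackets
    depths = []  # depth assigned to each token seen so far
    for tok in tokens:
        if tok in PAIRS:
            stack.append(tok)
            depths.append(len(stack))
        elif stack and PAIRS[stack[-1]] == tok:
            depths.append(len(stack))
            stack.pop()
        else:
            # Unknown token, unmatched closer, or wrong bracket type.
            return None
    # Any bracket left open means the string is unbalanced.
    return depths if not stack else None


balanced = dyck2_depths("([])[]")
crossed = dyck2_depths("([)]")
```

Here `balanced` holds one depth per token, while `crossed` is `None` because the two bracket types interleave incorrectly.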

Paper Structure

This paper contains 35 sections, 12 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: The autoencoder paradigm for interpretability. Autoencoders have formed the basis of approaches to disentanglement in the vision domain (higgins2016beta). Utilizing synthetic testbeds, prior work has shown several limitations in this general pipeline (locatello2019challenging; trauble2021disentangled). While SAEs have similarly been used to disentangle hidden representations of language models for interpretability, we aim to perform a study similar to those in vision to understand the (in)abilities of SAEs.
  • Figure 2: A feature matching corresponding opening and closing brackets. Each line represents a pair of brackets, and joins the opening bracket's activation (left) to the closing bracket's (right). We note that the depth and opening activation are sufficient to determine the closing activation, and that the opening and closing activations are sufficient to determine the depth.
  • Figure 3: A feature that activates when exactly one more expression is required. Here, the x-axis is token depth, and the y-axis is token index. The lines connect the operators to their operands.
  • Figure 4: A feature that activates only on adjectives, at any position. Here, depth is represented by the y-axis and position by the x-axis; the lines connect nonterminals to their productions (see App. \ref{sec:grammars_app} for the production rules). The cell color represents the activation magnitude.
  • Figure 5: Behavior of the English model under interventions. We intervene on the model by replacing its hidden representations with the SAE's reconstructions, where an SAE latent (specifically, one corresponding to adjectives) is clamped to a fixed value. These values are selected at uniform intervals from $[-v_\text{max}, v_\text{max}]$, where $v_\text{max}$ is the maximum value taken by that latent (in line with templeton2024scaling). For each value (x-axis), we plot the fraction of each part of speech (nouns, pronouns, adjectives, verbs, adverbs, and conjunctions) in the output (left) and the fraction of outputs that are grammatical (right). We see interventions yield essentially no visible effects.
  • ...and 13 more figures
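
The clamping intervention described in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a purely linear SAE decoder, and the decoder weights, bias, latent vector, and all function names are hypothetical. It only computes the reconstructions that would replace the model's hidden representations; feeding them back into the model is omitted.

```python
from typing import List


def sweep_values(v_max: float, num_steps: int) -> List[float]:
    """Uniformly spaced clamp values covering [-v_max, v_max], endpoints included."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = 2.0 * v_max / (num_steps - 1)
    return [-v_max + i * step for i in range(num_steps)]


def clamp_latent(latents: List[float], index: int, value: float) -> List[float]:
    """Return a copy of the latent vector with one latent fixed to `value`."""
    clamped = list(latents)
    clamped[index] = value
    return clamped


def decode(latents: List[float], decoder: List[List[float]], bias: List[float]) -> List[float]:
    """Assumed linear decoder: x_hat[j] = bias[j] + sum_i latents[i] * decoder[i][j]."""
    return [
        bias[j] + sum(z * row[j] for z, row in zip(latents, decoder))
        for j in range(len(bias))
    ]


# Hypothetical toy SAE: 3 latents, 2 hidden dimensions.
decoder = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
bias = [0.5, -0.5]
latents = [0.25, 0.0, 1.0]  # SAE encoding of one hidden state
v_max = 1.0                 # maximum value observed for latent 2 (assumed)

# One reconstruction per clamp value; each would replace the
# model's hidden representation at the intervened position.
reconstructions = [
    decode(clamp_latent(latents, 2, v), decoder, bias)
    for v in sweep_values(v_max, 5)
]
```

Sweeping the clamp value while holding the other latents fixed isolates the effect of a single latent on the reconstruction, which is what allows the output statistics in Figure 5 to be plotted against the clamp value.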