Priors in Time: Missing Inductive Biases for Language Model Interpretability

Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, Aaron Mueller

TL;DR

This work identifies a fundamental mismatch between the i.i.d. priors implicit in Sparse Autoencoders (SAEs) and the rich, nonstationary temporal dynamics of language model activations. It introduces Temporal Feature Analysis (TFA), a predictive framework that decomposes activations into a slow, context-informed predictable component and a fast, novel residual component, allowing time-aware interpretation of LM representations. Across narrative, garden-path, and in-context dialogue domains, TFA reveals that predictive codes align with event boundaries, temporal structure, and discourse-level relations, while the novel codes capture transient information similar to traditional SAEs. The findings argue for inductive biases that reflect temporal structure in interpretability tools, suggesting that features are best viewed as evolving manifolds rather than independent axes, with potential implications for robust model understanding and intervention.

Abstract

Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions, it is unclear if this assumption can capture the rich temporal structure of language. Specifically, via a Bayesian lens, we demonstrate that Sparse Autoencoders (SAEs) impose priors that assume independence of concepts across time, implying stationarity. Meanwhile, language model representations exhibit rich temporal dynamics, including systematic growth in conceptual dimensionality, context-dependent correlations, and pronounced non-stationarity, in direct conflict with the priors of SAEs. Taking inspiration from computational neuroscience, we introduce a new interpretability objective -- Temporal Feature Analysis -- which possesses a temporal inductive bias to decompose representations at a given time into two parts: a predictable component, which can be inferred from the context, and a residual component, which captures novel information unexplained by the context. Temporal Feature Analyzers correctly parse garden path sentences, identify event boundaries, and more broadly delineate abstract, slow-moving information from novel, fast-moving information, while existing SAEs show significant pitfalls in all the above tasks. Overall, our results underscore the need for inductive biases that match the data in designing robust interpretability tools.

Paper Structure

This paper contains 84 sections, 7 theorems, 43 equations, 60 figures, 6 tables.

Key Result

Proposition 4.1

Consider the SAE maximum a posteriori (MAP) objective from Eq. \ref{eq:sae-sparsecoding}. Since the sparsity constraints are additive over time, this objective has an independent and identically distributed (i.i.d.) prior over time:
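
As a hedged sketch of the factorization this statement describes, assume a standard $\ell_1$-penalized sparse-coding objective with dictionary $\bm{D}$, sparse codes $\bm{z}_t$, noise variance $\sigma^2$, and penalty weight $\lambda$. These symbols are assumptions chosen for illustration and may differ from the paper's own notation. Under a Gaussian likelihood, the MAP objective separates into one term per time step:

$$
\min_{\bm{z}_{1:T}} \; \sum_{t=1}^{T} \left( \frac{1}{2\sigma^2} \left\| \bm{x}_t - \bm{D}\bm{z}_t \right\|_2^2 + \lambda \left\| \bm{z}_t \right\|_1 \right)
\;=\;
\min_{\bm{z}_{1:T}} \; -\log \prod_{t=1}^{T} p(\bm{x}_t \mid \bm{z}_t)\, p(\bm{z}_t) + \text{const},
\qquad
p(\bm{z}_t) \propto \exp\!\left( -\lambda \left\| \bm{z}_t \right\|_1 \right).
$$

Because the penalty is a plain sum over $t$, the implied prior factorizes as $p(\bm{z}_{1:T}) = \prod_{t=1}^{T} p(\bm{z}_t)$ with the same density at every $t$. That is exactly what an i.i.d. prior over time means: no term couples $\bm{z}_t$ to $\bm{z}_{t'}$ for $t \neq t'$.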

Figures (60)

  • Figure 1: The mismatch between SAE assumptions and temporal structure of language. An illustrative sentence describing attributes of Harry Potter is shown. When passed into a language model (LM), it leads to activations $\bm{x}_t$ that include concepts within them (possibly entangled): note the presence of large numbers of shared attributes over time, which manifest as correlations across time of activations. Sparse Autoencoders (SAEs) implicitly have an independence (i.i.d.) prior across time $t$ over their latents and thereby over concepts, which clashes with the true structure of language.
  • Figure 1: Temporal Feature Analysis and SAEs achieve similar NMSE across domains (Simple Stories, Webtext, Code).
  • Figure 2: Temporal structure of LLM activations reveals nonstationarity. We use Pile samples (monology2021pile-uncopyrighted) to analyze the temporal structure of activations from two pretrained LMs, comparing it to a stationary surrogate signal (see App. \ref{app:surrogate}). (a, e) Intrinsic dimension of model activations and stationary surrogate. (b, f) Autocorrelations $A(\bm{x}_t, \bm{x}_{t-\tau})$ as a function of sequence position ($t$) and lag ($\tau$). (c, g) Autocorrelation of the stationary surrogate. (d, h) Variance explained by projecting the current representation $\bm{x}_t$ onto the past context window $\{\bm{x}_{t-1}, \dots, \bm{x}_{t-w}\}$ for different window sizes $w$, along with a baseline. Results consistently show representations getting 'denser' over time and being significantly more structured than a stationary surrogate.
  • Figure 3: Sparsity splits local structure. (left) Temporally correlated inputs can yield geometrically structured activations. (right) If the sparsity budget is lower than the intrinsic dimensionality of the activation geometry, an SAE is incentivized to partition the manifold into local regions such that even nearby points map to disjoint codes and local structure is lost.
  • Figure 4: Schematic of Temporal Feature Analysis. Temporal Feature Analyzers decompose activations $\bm{x}_t$ into two components: a predictable component, obtained by projecting $\bm{x}_t$ onto a context direction (derived from the past $\bm{x}_{<t}$ using attention), and a sparse, novel component orthogonal to the predictable component that captures new information seen at time $t$.
  • ...and 55 more figures
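
The decomposition described in the Figure 4 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a parameter-free softmax attention over past activations, with $\bm{x}_t$ itself as the query, to form the context direction, and it omits both the learned weights and the sparse encoding of the novel component. The function name `temporal_decompose` and the `temperature` parameter are hypothetical.

```python
import numpy as np


def temporal_decompose(X, temperature=1.0):
    """Split each activation x_t into a predictable and a novel part.

    X: array of shape (T, d), one activation vector per time step.
    Returns (pred, novel), both of shape (T, d), with pred + novel == X
    and novel[t] orthogonal to pred[t] for every t.
    """
    X = np.asarray(X, dtype=float)
    T, d = X.shape
    pred = np.zeros_like(X)

    for t in range(1, T):
        past = X[:t]  # (t, d): all strictly earlier activations

        # Assumption: untrained dot-product attention with x_t as the query.
        scores = past @ X[t] / (np.sqrt(d) * temperature)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        weights /= weights.sum()
        context = weights @ past  # (d,): context direction from the past

        # Project x_t onto the context direction.
        norm_sq = context @ context
        if norm_sq > 0:
            pred[t] = (X[t] @ context) / norm_sq * context

    # The residual is orthogonal to the projection by construction.
    novel = X - pred
    return pred, novel


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
pred, novel = temporal_decompose(X)
```

At $t = 0$ there is no context, so the whole activation is treated as novel. A learned analyzer would replace the untrained attention with trained weights and pass the residual through a sparse encoder. The sketch only shows the projection and the orthogonality of the two parts.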

Theorems & Definitions (10)

  • Proposition 4.1: Independence prior over time
  • Corollary 4.1.1: Assumptions of time-invariant sparsity
  • Proposition 4.2: Restrictive Sparsity Budget Leads to Support Switching in SAEs
  • Proposition H.1: Independence priors over time
  • proof
  • Corollary H.1.1: Assumptions of time-invariant sparsity
  • proof
  • Proposition H.2: Restrictive Sparsity Budget Leads to Support Switching in SAEs
  • proof
  • Proposition I.1: SAE Priors on Sparse Code