Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability

Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon

TL;DR

This work tackles the interpretability gap in dictionary-based features learned from language models by introducing Temporal Sparse Autoencoders (T-SAEs). By partitioning latent space into high-level semantic and low-level syntactic features and adding a temporal contrastive loss, T-SAEs enforce consistency of semantic activations across adjacent tokens, yielding smoother, more coherent semantic representations without sacrificing reconstruction quality. Across multiple models and datasets, T-SAEs demonstrate improved semantic and contextual disentanglement while maintaining competitive SAE performance, and they enable practical benefits such as dataset understanding and steering for alignment data. The approach provides a self-supervised pathway to uncover meaningful linguistic concepts with sequence-level interpretability, holding promise for safer and more controllable language models.

Abstract

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability. While recent dictionary learning methods such as Sparse Autoencoders (SAEs) provide a promising route to discover human-interpretable features, they suffer from a variety of problems, including a systematic failure to capture the rich conceptual information that drives linguistic understanding. Instead, they exhibit a bias towards shallow, token-specific, or noisy features, such as "the phrase 'The' at the start of sentences". In this work, we propose that this is due to a fundamental issue with how dictionary learning methods for LLMs are trained. Language itself has a rich, well-studied structure spanning syntax, semantics, and pragmatics; however, current unsupervised methods largely ignore this linguistic knowledge, leading to poor feature discovery that favors superficial patterns over meaningful concepts. We focus on a simple but important aspect of language: semantic content has long-range dependencies and tends to be smooth over a sequence, whereas syntactic information is much more local. Building on this insight, we introduce Temporal Sparse Autoencoders (T-SAEs), which incorporate a novel contrastive loss encouraging consistent activations of high-level features over adjacent tokens. This simple yet powerful modification enables SAEs to disentangle semantic from syntactic features in a self-supervised manner. Across multiple datasets and models, T-SAEs recover smoother, more coherent semantic concepts without sacrificing reconstruction quality. Strikingly, they exhibit clear semantic structure despite being trained without explicit semantic signal, offering a new pathway for unsupervised interpretability in language models.
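
The abstract describes the temporal contrastive loss only in words, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes an InfoNCE-style objective with in-batch negatives, cosine similarity, and a fixed temperature; the function names `split_latents` and `temporal_contrastive_loss`, the `n_high` split point, and the toy activations are all hypothetical and do not come from the paper.

```python
import numpy as np

def split_latents(z, n_high):
    """Partition SAE latents into (high-level, low-level) blocks.

    Assumption: the first n_high coordinates are the high-level features.
    """
    return z[..., :n_high], z[..., n_high:]

def temporal_contrastive_loss(z_prev, z_curr, temperature=0.1, eps=1e-8):
    """InfoNCE-style loss over adjacent-token high-level activations.

    z_prev, z_curr: arrays of shape (batch, n_high) holding activations at
    positions t-1 and t, where row i of both arrays comes from the same
    sequence. Row i of z_curr is pulled toward row i of z_prev; the other
    rows of z_prev act as negatives.
    """
    a = z_prev / (np.linalg.norm(z_prev, axis=1, keepdims=True) + eps)
    b = z_curr / (np.linalg.norm(z_curr, axis=1, keepdims=True) + eps)
    logits = (b @ a.T) / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy latents: 4 sequences, 16 features, of which the first 8 are "high-level".
z_prev_full = np.concatenate([np.eye(4, 8), np.ones((4, 8))], axis=1)
z_curr_smooth_full = z_prev_full.copy()                 # semantics unchanged at t
z_curr_shuffled_full = np.roll(z_prev_full, 1, axis=0)  # semantics mismatched at t

high_prev, _ = split_latents(z_prev_full, n_high=8)
high_smooth, _ = split_latents(z_curr_smooth_full, n_high=8)
high_shuffled, _ = split_latents(z_curr_shuffled_full, n_high=8)

# The loss is low when each token's high-level code matches its predecessor
# and high when it matches a different sequence's predecessor instead.
loss_smooth = temporal_contrastive_loss(high_prev, high_smooth)
loss_shuffled = temporal_contrastive_loss(high_prev, high_shuffled)
print(loss_smooth, loss_shuffled)
```

In a training loop, a term of this kind would be added to the usual SAE reconstruction and sparsity objectives and applied only to the high-level block, leaving the low-level block free to vary from token to token.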

Paper Structure

This paper contains 25 sections, 6 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: A) Human language production involves high-level features such as semantic content and surrounding context, as well as low-level features such as syntactic requirements and specific word choices. B) While existing SAEs mostly recover syntactic information, Temporal SAEs balance recovery of semantics, syntax, and context. C) When decomposing a sequence composed of three passages (Newton’s Principia, an MMLU genetics question, and the Bhagavad Gita), Temporal SAEs (bottom) smoothly detect the semantic shifts in the passage, with highly active features strongly correlating with the true content of the text, whereas existing SAEs (such as Matryoshka, top) are much noisier, varying on almost a per-token basis, and do not clearly show these shifts.
  • Figure 2: t-SNE visualizations of Pythia-160m SAE decompositions of MMLU questions, labeled by question category (left column), question number (middle column), and token part-of-speech (right column). The high-level features from T-SAEs (top row) recover semantic and contextual information. The low-level features of T-SAEs (middle row), as well as Matryoshka SAEs (bottom row), recover syntactic information.
  • Figure 3: Accuracy of probes trained on SAE decompositions for Gemma2-2b, as well as probes trained directly on model latents (orange), with semantic labels (left), contextual labels (middle), and syntactic labels (right), at varying levels of probe sparsity (setup from kantamneni2025sparse). T-SAEs significantly outperform baseline SAEs for semantics and context.
  • Figure 4: Top 8 most active Gemma2-2b Temporal SAE features over a concatenated sequence of text. T-SAE features exhibit clear phase transitions between sequences, are relatively smooth, and have explanations relevant to the semantic content of each component sequence.
  • Figure 5: Left. We study the features describing the HH-RLHF dataset (bai2022training) given by our T-SAE and a Matryoshka SAE provided by Neuronpedia. The Matryoshka SAE appears to find more random features, whereas the T-SAE captures safety-relevant features. Additionally, the T-SAE sheds light on a potential spurious length correlation in the dataset, where rejected completions are longer than chosen completions. Upon investigating the dataset, we find this is true with statistical significance, highlighting the potential of the high-level features for data filtering, steering, and targeted finetuning interventions. Right. We find that high-level features Pareto-dominate baselines in their ability to steer LLMs. Specifically, they are better at preserving coherence while successfully changing the semantics of model generation.
  • ...and 5 more figures