Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai

TL;DR

This work identifies modular, causally relevant components in large language models by constructing inter-layer feature networks from SAE coactivations and pruning to sparse, context-consistent groups that encode concepts and relations. By ablating or amplifying these components, the authors demonstrate predictable, sometimes counterfactual, shifts in outputs and show that combining concept and relation components produces compound effects, indicating compositional knowledge representations. They find a layerwise organization where concrete concepts emerge in early layers while abstract relations concentrate in later layers, and that these components outperform individual SAE features in steering tasks. The approach provides a lightweight, interpretable framework for manipulating and understanding relational reasoning in LLMs, with implications for targeted model control and transparency, while acknowledging limitations related to dataset size, model diversity, and reliance on sparse autoencoders. The methodology offers a scalable path toward modular mechanistic interpretability and safe, controllable LLM deployment.

Abstract

We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on concept-relation prediction tasks, we show that ablating these components for concepts (e.g., countries and words) and relations (e.g., capital city and translation language) changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and concept components yields compound counterfactual outputs. Further analysis reveals that while most concept components emerge from the very first layer, more abstract relation components are concentrated in later layers. Lastly, we show that extracted components more comprehensively capture concepts and relations than individual features while maintaining specificity. Overall, our findings suggest a modular organization of knowledge accessed through compositional operations, and advance methods for efficient, targeted LLM manipulation.
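The pipeline the abstract describes (collect SAE feature coactivations, prune, extract components) can be illustrated with a toy sketch. This is not the authors' implementation. The binary activation masks, the `max_density` and `min_coact` thresholds, the restriction to adjacent layers, and the use of connected components as the grouping step are all assumptions made for illustration, and the function name `coactivation_components` is hypothetical.

```python
from collections import defaultdict
from itertools import product


def coactivation_components(acts, max_density=0.5, min_coact=2):
    """Group features into components from toy coactivation data.

    acts maps (layer, feature_id) -> list of 0/1 flags, one per token.
    Assumed steps (illustrative only, not the paper's exact procedure):
      1. drop features active on more than max_density of the tokens,
      2. link features in adjacent layers that fire together on at
         least min_coact tokens,
      3. return the connected components of the resulting graph.
    """
    # Step 1: prune high-density features.
    kept = {
        node: mask
        for node, mask in acts.items()
        if sum(mask) / len(mask) <= max_density
    }

    # Step 2: build inter-layer edges from coactivation counts.
    by_layer = defaultdict(list)
    for node in kept:
        by_layer[node[0]].append(node)
    neighbors = defaultdict(set)
    for layer in sorted(by_layer):
        for a, b in product(by_layer[layer], by_layer.get(layer + 1, [])):
            coact = sum(x and y for x, y in zip(kept[a], kept[b]))
            if coact >= min_coact:
                neighbors[a].add(b)
                neighbors[b].add(a)

    # Step 3: connected components via depth-first search.
    # Features with no edges are left out of the result.
    seen, components = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbors[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Toy data: two features per layer that track different tokens,
# plus one feature that fires on every token and gets pruned.
acts = {
    (0, 1): [1, 1, 0, 0, 0, 0],
    (1, 7): [1, 1, 0, 0, 0, 0],
    (0, 2): [0, 0, 0, 1, 1, 0],
    (1, 9): [0, 0, 0, 1, 1, 0],
    (0, 3): [1, 1, 1, 1, 1, 1],
}
components = coactivation_components(acts)
for comp in components:
    print(sorted(comp))
```

In this toy setting the always-on feature is removed before any edges are drawn, so it cannot merge the two groups into a single component. That is the intuition behind pruning high-density features, although the paper's actual criterion may differ.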

Paper Structure

This paper contains 56 sections, 1 equation, 17 figures, 22 tables.

Figures (17)

  • Figure 1: (A) We construct inter-layer feature networks from SAE coactivation patterns, prune high-density features, and extract task-relevant components. (B) Components are often consistent across contexts. (C) Selective ablation and amplification of components steers the model toward counterfactual outputs, overriding the original prompt.
  • Figure 2: We extract components from LLM queries to translate love into Spanish, French, and German. Each component is plotted with its feature count on the $x$-axis and the KL divergence between pre- and post-ablation output token distributions on the $y$-axis. Notably, for love and other concepts, only a small number of components exert significant causal effects. For love, the top three components for each relation consistently correspond to either the queried word or language.
  • Figure 3: Components representing China (blue) and country language (green), visualized by network layer.
  • Figure 4: KL divergence between pre- and post-ablation output token distributions for each node in the China, Nigeria, and country fact components, plotted by layer. Linear regression lines plotted in red.
  • Figure 5: Gemma 2 9B China components extracted from capital and currency prompts.
  • ...and 12 more figures
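
Several captions above measure a component's causal effect as the KL divergence between the output token distributions before and after ablation. The sketch below shows that calculation on made-up logits over a three-token vocabulary. The direction KL(pre || post), the `eps` smoothing term, and the helper names are assumptions, since the captions do not specify them.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) (an assumed choice)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical next-token logits over a 3-token vocabulary.
pre_ablation = softmax([4.0, 1.0, 0.0])   # model strongly prefers token 0
post_ablation = softmax([1.0, 4.0, 0.0])  # after ablation it prefers token 1

effect = kl_divergence(pre_ablation, post_ablation)
no_effect = kl_divergence(pre_ablation, pre_ablation)
print(f"KL after ablation: {effect:.3f} nats")
print(f"KL with unchanged output: {no_effect:.3f} nats")
```

A component whose ablation leaves the output distribution unchanged scores zero, so ranking components by this value separates the few with large causal effects from the rest.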