SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han, Wujiang Xu, Mingyu Jin, Mengnan Du

TL;DR

The paper tackles the opacity of large language models by leveraging sparse autoencoders to decompose internal representations into interpretable features. It introduces SAGE, an agentic framework that turns feature explanation into an active, experiment-driven process, maintaining multiple explanations and refining them through empirical activation feedback. Across diverse LLMs and SAE configurations, SAGE achieves substantially higher generative and predictive accuracy than the Neuronpedia baseline, and it naturally uncovers polysemantic feature explanations. This approach improves interpretability and reliability of LLMs, with potential impacts on safety, alignment, and targeted model steering. The work also highlights practical trade-offs and provides guidelines for choosing the number of initial explanations and evaluation metrics.

Abstract

Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

Paper Structure

This paper contains 21 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the SAGE framework. The process begins when an explainer LLM generates initial explanations ($H_i$) from high-activation text derived from the target LLM and SAE. A designer LLM then creates test text ($T_i$) to validate each explanation, initiating a multi-turn explanation refinement loop. Within this loop, an analyzer LLM observes the activations produced when $T_i$ is fed into the target LLM. A reviewer LLM then evaluates this feedback and decides whether to accept, reject, refute, or refine the current explanation. This iterative process continues until an explanation is accepted, culminating in the final explanation synthesis ($H^*$).
  • Figure 2: Ablation study on initial explanation count $k$. Prediction accuracy saturates at $k=10$ while token consumption continues to increase, indicating that $k=10$ offers the best trade-off between accuracy and cost.
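
The refinement loop described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Hypothesis` class, the `refine_loop` function, the `max_turns` cap, and all the toy stand-ins for the designer, analyzer, and reviewer LLMs are hypothetical names introduced here for illustration. The toy "feature" simply fires on text containing a digit, the reviewer's verdicts are collapsed to accept, reject, and refine for brevity, and the final synthesis of $H^*$ is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str                # natural-language explanation H_i
    status: str = "pending"  # pending | accepted | rejected


def refine_loop(
    hypotheses: List[Hypothesis],
    design_test: Callable[[Hypothesis], str],          # designer: H_i -> test text T_i
    get_activation: Callable[[str], float],            # feature activation observed on T_i
    review: Callable[[Hypothesis, str, float], str],   # reviewer: "accept" | "reject" | "refine"
    refine: Callable[[Hypothesis, str, float], Hypothesis],
    max_turns: int = 3,                                # assumed cap on refinement turns
) -> List[Hypothesis]:
    """Test each hypothesis against activations and return the accepted ones."""
    accepted = []
    for h in hypotheses:
        for _ in range(max_turns):
            test_text = design_test(h)
            activation = get_activation(test_text)
            verdict = review(h, test_text, activation)
            if verdict == "accept":
                h.status = "accepted"
                accepted.append(h)
                break
            if verdict == "reject":
                h.status = "rejected"
                break
            # "refine": replace the hypothesis and test it again next turn
            h = refine(h, test_text, activation)
    return accepted


# --- Toy stand-ins (assumptions; a real system would call LLMs and an SAE) ---
TESTS = {
    "mentions of colors": "The sky is blue.",
    "number words": "I have three cats.",
    "written digits": "I have 3 cats.",
}


def toy_design_test(h: Hypothesis) -> str:
    return TESTS[h.text]


def toy_activation(text: str) -> float:
    # Pretend the SAE feature fires only on text containing a digit.
    return 1.0 if any(c.isdigit() for c in text) else 0.0


def toy_review(h: Hypothesis, text: str, activation: float) -> str:
    if activation > 0.5:
        return "accept"
    return "refine" if h.text == "number words" else "reject"


def toy_refine(h: Hypothesis, text: str, activation: float) -> Hypothesis:
    return Hypothesis("written digits")


accepted = refine_loop(
    [Hypothesis("mentions of colors"), Hypothesis("number words")],
    toy_design_test, toy_activation, toy_review, toy_refine,
)
print([h.text for h in accepted])
```

In this toy run the first hypothesis is rejected after one test, while the second is refined once and then accepted. Maintaining several hypotheses in parallel is what allows such a loop to surface more than one accepted explanation for a polysemantic feature.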