AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations

Yifei Yao, Mengnan Du

TL;DR

This work addresses the interpretability bottleneck of large language models by proposing AdaptiveK Sparse Autoencoders, which adapt sparsity per input based on a learned complexity signal. Linear probes show that context complexity is linearly encoded in LLM activations; a complexity predictor then sets the per-context sparsity k_adp through a sigmoid mapping, enabling dynamic TopK activations during training. Across ten LLMs ranging from 70M to 14B parameters, AdaptiveK outperforms fixed-sparsity baselines on reconstruction fidelity, explained variance, cosine similarity, and interpretability metrics while reducing the need for hyperparameter tuning. The results indicate gains in efficiency and semantic disentanglement, with implications for scalable, interpretable representation learning in large language models.

Abstract

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) demonstrate that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the computational burden of extensive hyperparameter tuning.
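The complexity-driven allocation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `adaptive_k` and `topk_activations`, the bounds `k_min` and `k_max`, and the sigmoid parameters `center` and `scale` are all hypothetical, chosen only to show how a scalar complexity score could be mapped through a sigmoid to a per-input k and then applied as a TopK mask.

```python
import math


def adaptive_k(complexity, k_min=20, k_max=320, center=5.0, scale=1.0):
    """Map a complexity score to an integer sparsity level via a sigmoid.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    gate = 1.0 / (1.0 + math.exp(-(complexity - center) / scale))
    return int(round(k_min + (k_max - k_min) * gate))


def topk_activations(pre_acts, k):
    """Keep the k largest pre-activations (passed through ReLU); zero the rest."""
    k = max(0, min(k, len(pre_acts)))
    # Indices of the k largest values, largest first.
    top = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)[:k]
    keep = set(top)
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(pre_acts)]


if __name__ == "__main__":
    # Scores taken from the complexity levels named in Figure 1 (0, 5.5, 9.6).
    for score in (0.0, 5.5, 9.6):
        print(score, adaptive_k(score))
    print(topk_activations([0.5, -1.0, 2.0, 1.5], k=2))
```

Under these assumed settings, simple inputs receive a k near `k_min` and complex inputs a k near `k_max`, which mirrors the qualitative behavior the abstract describes. In a real SAE the complexity score would come from the linear probe on the LLM activations rather than being passed in by hand.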

Paper Structure

This paper contains 52 sections, 15 equations, 28 figures, 10 tables, and 1 algorithm.

Figures (28)

  • Figure 1: Two samples were selected from the test set at each complexity level (simplest = 0, moderate = 5.5, most complex = 9.6). In TopK SAE, the number of active features strictly follows k for k between 20 and 320, but often falls below k when k = 640. For Pythia-160M, fixed TopK SAEs (blue) maintain a constant activation count, while AdaptiveK (red) scales dynamically with text complexity.
  • Figure 2: Overall pipeline of the AdaptiveK SAE. Input text is fed into an LLM to extract internal activations, which are then passed through both a linear probe that predicts text complexity and an SAE for decomposition. During training, the linear probe's complexity score dynamically determines the number of features to activate, allowing more features for complex inputs and fewer for simple ones.
  • Figure 3: Visualization of linear probe performance across different LLM scales. Points represent test contexts, with redder areas indicating higher sample density. The red line depicts predicted complexity trends. Most samples fall within prediction intervals, confirming the linear probe's effectiveness. Spearman correlation and RMSE values (upper left) demonstrate improved prediction accuracy with increasing model scale. More LLM results are in Figure \ref{fig:more linear probe}.
  • Figure 4: Visualization of dynamic feature allocation by text complexity, showing the relationship between complexity scores and allocated feature counts (K values). Average K values per complexity interval (connected by red lines) demonstrate that complex texts receive higher K allocations, with this relationship becoming increasingly linear as LLM scale grows. Horizontal lines indicate fixed standard TopK baselines, with K values on the right. More LLM results are in Figure \ref{fig:more k}.
  • Figure 5: L2 loss Pareto frontier results. More LLM results are in Figure \ref{fig:more pareto_l2}.
  • ...and 23 more figures
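
Figure 3 reports Spearman correlation and RMSE for the linear probe. As a rough illustration of how those two scores could be computed from true and predicted complexity values, here is a standard-library sketch. The helper names `rmse`, `average_ranks`, and `spearman` and the sample numbers are hypothetical, and the paper's own evaluation code may differ.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    true_scores = [0.0, 5.5, 9.6]   # complexity levels named in Figure 1
    probe_preds = [0.4, 5.0, 9.0]   # made-up probe outputs for illustration
    print(spearman(true_scores, probe_preds), rmse(true_scores, probe_preds))
```

Spearman correlation only checks that the probe orders contexts correctly, while RMSE penalizes the size of each error, so the two scores are complementary.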