AdaptiveK Sparse Autoencoders: Dynamic Sparsity Allocation for Interpretable LLM Representations
Yifei Yao, Mengnan Du
TL;DR
This work tackles the interpretability bottleneck of large language models by proposing AdaptiveK Sparse Autoencoders, which adapt the sparsity level to each input using a learned complexity signal. Linear probes show that context complexity is linearly encoded in LLM activations; a complexity predictor then maps this signal through a sigmoid to a per-context sparsity k_adp, which sets how many features the TopK activation keeps during training. Across ten LLMs spanning 70M to 14B parameters, AdaptiveK improves reconstruction fidelity, explained variance, cosine similarity, and interpretability metrics over fixed-sparsity baselines while reducing the need for hyperparameter tuning. The results indicate substantial gains in efficiency and semantic disentanglement, with implications for scalable, interpretable representation learning in large language models.
Abstract
Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models (from 70M to 14B parameters) show that this complexity-driven adaptation significantly outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity, and interpretability metrics while eliminating the computational burden of extensive hyperparameter tuning.
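
To make the mechanism concrete, the following is a minimal sketch of complexity-driven sparsity allocation, not the paper's implementation. It assumes a linear probe that outputs a scalar complexity score, a sigmoid that maps the score into (0, 1), and a linear rescaling of that value to an integer k between two bounds. The names `linear_probe`, `adaptive_k`, `top_k_sparsify`, `k_min`, and `k_max`, along with the exact form of the mapping, are illustrative assumptions.

```python
import math


def linear_probe(activation, weights, bias):
    """Hypothetical linear probe: scalar complexity score from one activation vector."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


def adaptive_k(complexity_score, k_min, k_max):
    """Map a complexity score to an integer sparsity level in [k_min, k_max].

    Assumed form: k = k_min + round(sigmoid(score) * (k_max - k_min)).
    """
    gate = 1.0 / (1.0 + math.exp(-complexity_score))  # sigmoid, value in (0, 1)
    return k_min + round(gate * (k_max - k_min))


def top_k_sparsify(pre_activations, k):
    """Keep the k largest pre-activations and set all others to zero."""
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i],
                   reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


# Toy example: a 3-dimensional activation and a 4-feature SAE latent.
score = linear_probe([1.0, -1.0, 0.5], weights=[0.5, 0.5, 0.0], bias=0.0)
k_adp = adaptive_k(score, k_min=1, k_max=3)
sparse_latent = top_k_sparsify([0.1, 2.0, -1.0, 0.7], k_adp)
```

Under these assumptions, a context the probe scores as more complex receives a larger k and therefore more active features, while a simple context is encoded with fewer; a fixed-sparsity TopK SAE corresponds to setting `k_min` equal to `k_max`.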
