Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique
TL;DR
Discrete neuron-to-concept mappings in large language models fall short because individual neurons are polysemantic. The paper introduces NeuronLens, a range-based interpretability framework that localizes concepts to activation ranges within a neuron, building on the observation that activations follow Gaussian-like, concept-specific distributions. In extensive experiments on encoder- and decoder-based LLMs across multiple classification datasets, NeuronLens significantly reduces unintended interference while still enabling precise manipulation of targeted concepts, outperforming neuron-level attribution. An adaptive dampening variant further improves robustness, preserving language modeling capability and performance on latent tasks. The work advances interpretable control over LLMs and provides a quantitative basis for disentangling concept representations, with potential applications in safety and debiasing.
Abstract
Interpreting the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Prior work has primarily focused on mapping individual neurons to discrete semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. Through a comprehensive analysis of both encoder- and decoder-based LLMs across diverse datasets, we observe that even highly salient neurons, identified via various attribution techniques for specific semantic concepts, consistently exhibit polysemantic behavior. Importantly, activation magnitudes for fine-grained concepts follow distinct, often Gaussian-like distributions with minimal overlap. This observation motivates a shift from neuron-level attribution to range-based interpretation. We hypothesize that interpreting and manipulating neuron activation ranges would enable more precise interpretability and targeted interventions in LLMs. To validate this hypothesis, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference while maintaining precise manipulation of targeted concepts, outperforming neuron-level attribution.
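The contrast between neuron-level and range-level intervention can be illustrated with a toy sketch. This is not the NeuronLens implementation: the abstract does not specify how ranges are computed or how activations are manipulated, so the range definition (mean plus or minus k standard deviations), the value k = 2.0, the zeroing intervention, the helper names, and the synthetic activations below are all assumptions made for illustration. The sketch models one polysemantic neuron whose activations for two concepts follow well-separated Gaussians, then compares zeroing the whole neuron against zeroing only the activations that fall in one concept's range.

```python
import numpy as np


def concept_range(activations, k=2.0):
    """Hypothetical range for one concept: mean +/- k * std of its activations.

    The choice of k is an assumption, not a value taken from the paper.
    """
    mu = float(np.mean(activations))
    sigma = float(np.std(activations))
    return mu - k * sigma, mu + k * sigma


def ablate_neuron(activations):
    """Neuron-level intervention: zero the neuron for every input."""
    return np.zeros_like(activations)


def ablate_range(activations, low, high):
    """Range-level intervention: zero only activations inside [low, high]."""
    inside = (activations >= low) & (activations <= high)
    return np.where(inside, 0.0, activations)


# Synthetic activations of a single neuron for two concepts (assumed data).
rng = np.random.default_rng(0)
concept_a = rng.normal(loc=-2.0, scale=0.3, size=1000)
concept_b = rng.normal(loc=2.0, scale=0.3, size=1000)

# Target concept A only.
low, high = concept_range(concept_a, k=2.0)

# Fraction of activations altered by each intervention.
range_hits_a = float(np.mean(ablate_range(concept_a, low, high) != concept_a))
range_hits_b = float(np.mean(ablate_range(concept_b, low, high) != concept_b))
neuron_hits_b = float(np.mean(ablate_neuron(concept_b) != concept_b))

print(f"range ablation changes  A: {range_hits_a:.2%}, B: {range_hits_b:.2%}")
print(f"neuron ablation changes B: {neuron_hits_b:.2%}")
```

In this toy setting, the range-level intervention alters most of concept A's activations and leaves concept B's untouched, whereas zeroing the whole neuron alters every activation of concept B as well. How cleanly this separation holds in a real model depends on how little the concept distributions overlap, which is the empirical question the paper examines.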
