Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A. B. Siddique

TL;DR

The paper addresses the insufficiency of discrete neuron-to-concept mappings in large language models due to polysemanticity. It introduces NeuronLens, a range-based interpretability framework that localizes concepts to activation ranges within neurons, revealing Gaussian-like, concept-specific activation distributions. Through extensive experiments on encoder- and decoder-based LLMs across multiple classification datasets, NeuronLens achieves significantly reduced unintended interference while enabling precise manipulation of targeted concepts, outperforming traditional attribution. An adaptive dampening variant further improves robustness, preserving language modeling capabilities and latent tasks. This work advances interpretable control over LLMs and provides a quantitative basis for disentangling concept representations, with potential applications in safety and debiasing.

Abstract

Interpreting the internal mechanisms of large language models (LLMs) is crucial for improving their trustworthiness and utility. Prior work has primarily focused on mapping individual neurons to discrete semantic concepts. However, such mappings struggle to handle the inherent polysemanticity in LLMs, where individual neurons encode multiple, distinct concepts. Through a comprehensive analysis of both encoder- and decoder-based LLMs across diverse datasets, we observe that even highly salient neurons, identified via various attribution techniques for specific semantic concepts, consistently exhibit polysemantic behavior. Importantly, activation magnitudes for fine-grained concepts follow distinct, often Gaussian-like distributions with minimal overlap. This observation motivates a shift from neuron attribution to range-based interpretation. We hypothesize that interpreting and manipulating neuron activation ranges would enable more precise interpretability and targeted interventions in LLMs. To validate our hypothesis, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that provides a finer view of neuron activation distributions to localize concept attribution within a neuron. Extensive empirical evaluations demonstrate that NeuronLens significantly reduces unintended interference while maintaining precise manipulation of targeted concepts, outperforming neuron attribution.
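
The range-based idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the synthetic activations, the helper names `concept_range` and `mask_range`, the width parameter `tau = 2.5`, and the choice to overwrite in-range activations with zero are all assumptions made for illustration. The sketch treats one neuron's activations for a concept as roughly Gaussian, takes the interval mean ± tau · std as that concept's range, and suppresses only the activations that fall inside it.

```python
import numpy as np


def concept_range(acts, tau=2.5):
    """Return (low, high) = mean -/+ tau * std of one neuron's activations on one concept.

    tau is a hypothetical width parameter chosen for this sketch.
    """
    acts = np.asarray(acts, dtype=float)
    mu = float(acts.mean())
    sigma = float(acts.std())
    return mu - tau * sigma, mu + tau * sigma


def mask_range(acts, low, high, fill=0.0):
    """Overwrite activations inside [low, high] with `fill`; leave all others unchanged."""
    acts = np.asarray(acts, dtype=float)
    inside = (acts >= low) & (acts <= high)
    return np.where(inside, fill, acts)


# Synthetic stand-in for one polysemantic neuron: two concepts whose
# activations follow well-separated Gaussian-like distributions (assumed data).
rng = np.random.default_rng(0)
concept_a = rng.normal(loc=-2.0, scale=0.3, size=1000)
concept_b = rng.normal(loc=2.0, scale=0.3, size=1000)

# Localize concept A to an activation range, then intervene on that range only.
low, high = concept_range(concept_a, tau=2.5)
masked_a = mask_range(concept_a, low, high)
masked_b = mask_range(concept_b, low, high)

# Fraction of each concept's activations that the intervention touched.
frac_a_suppressed = float(np.mean(masked_a == 0.0))
frac_b_suppressed = float(np.mean(masked_b == 0.0))
print(f"range for A: [{low:.2f}, {high:.2f}]")
print(f"suppressed A: {frac_a_suppressed:.3f}, suppressed B: {frac_b_suppressed:.3f}")
```

In this toy setting, zeroing the whole neuron would remove both concepts, whereas masking only concept A's range leaves concept B's activations untouched. That contrast is the kind of reduced unintended interference the abstract describes, shown here only under the stated assumptions.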

Paper Structure

This paper contains 28 sections, 4 equations, 18 figures, 17 tables.

Figures (18)

  • Figure 1: $\mathsf{NeuronLens}$ leverages distinct, Gaussian-like activation patterns to enable fine-grained concept attribution.
  • Figure 2: Overlap of top 30% salient neurons across classes.
  • Figure 3: Neuronal activation patterns of six neurons on class 1 of the AG-News dataset. Neurons 418 and 447 are the highest activating neurons, neurons 132 and 387 are middle-ranked neurons, and neurons 721 and 365 are the lowest activating neurons.
  • Figure 4: Comparison of neurons 480 and 675 showing class-specific activation patterns and fitted Gaussian curves. Both neurons were among the top 5% salient neurons for all classes on AG-News.
  • Figure 5: Box plot of the activations of 11 polysemantic neurons (i.e., neurons in the top 5% salient group for all classes) for 4 randomly selected classes out of the 14 classes of the DBPedia-14 dataset.
  • ...and 13 more figures