Neuron-Level Knowledge Attribution in Large Language Models

Zeping Yu; Sophia Ananiadou

TL;DR

This work addresses neuron-level attribution in large language models with a scalable static framework for identifying influential neurons. It scores value neurons by the log probability increase $Imp(v^l) = \log p(w|v^l+h^{l-1}) - \log p(w|h^{l-1})$, that is, the gain in the log probability of the predicted token $w$ when the neuron's vector $v^l$ is added to the preceding hidden state $h^{l-1}$. It complements this score with a mechanism, based on inner products with subkeys, for locating the query neurons that activate these value neurons. In experiments on GPT2-large and Llama-7B across six knowledge types, the method outperforms seven static baselines on three metrics. The analysis indicates that both attention and FFN layers store knowledge, that the most impactful neurons sit in deeper layers, and that a relatively small set of neurons drives most of the final predictions. These findings advance mechanistic interpretability and offer directions for knowledge editing and debugging in LLMs.
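To make the log probability increase score concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the authors' implementation: the output distribution is taken to be a plain softmax over an unembedding matrix applied directly to the vector (no layer normalization), and the names `log_prob`, `value_neuron_importance`, and `unembed`, as well as the toy identity unembedding, are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D array of logits.
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def log_prob(vec, unembed, token_id):
    # log p(w | vec): project the vector to vocabulary logits, then normalize.
    # Assumption: no final layer norm is applied before the projection.
    return log_softmax(unembed @ vec)[token_id]

def value_neuron_importance(v, h_prev, unembed, token_id):
    # Imp(v) = log p(w | v + h_prev) - log p(w | h_prev)
    return (log_prob(v + h_prev, unembed, token_id)
            - log_prob(h_prev, unembed, token_id))

# Toy setup (hypothetical): 4 tokens, 4 hidden dimensions, identity unembedding,
# so each hidden dimension feeds exactly one token's logit.
unembed = np.eye(4)
h_prev = np.zeros(4)          # uniform distribution before the neuron is added
token_id = 2

supporting = 2.0 * unembed[token_id]    # pushes the target token's logit up
opposing = -2.0 * unembed[token_id]     # pushes the target token's logit down

imp_supporting = value_neuron_importance(supporting, h_prev, unembed, token_id)
imp_opposing = value_neuron_importance(opposing, h_prev, unembed, token_id)
imp_zero = value_neuron_importance(np.zeros(4), h_prev, unembed, token_id)
```

In this toy setting a neuron vector aligned with the target token's direction receives a positive score, an opposing vector receives a negative score, and a zero vector receives a score of exactly zero. Ranking neurons by this score is one way to read the "value neuron" criterion described above.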

Abstract

Identifying important neurons for final predictions is essential for understanding the mechanisms of large language models. Due to computational constraints, current attribution techniques struggle to operate at the neuron level. In this paper, we propose a static method for pinpointing significant neurons. Compared to seven other methods, our approach demonstrates superior performance across three metrics. Additionally, because most static methods identify only "value neurons" that contribute directly to the final prediction, we propose a method for identifying "query neurons" that activate these "value neurons". Finally, we apply our methods to analyze six types of knowledge across both attention and feed-forward network (FFN) layers. Our method and analysis help explain the mechanisms of knowledge storage and set the stage for future research in knowledge editing. The code is available at https://github.com/zepingyu0512/neuron-attribution.

Paper Structure (31 sections, 13 equations, 10 figures, 14 tables)

Introduction
Related Work
Attribution Methods for Transformers
Mechanistic Interpretability
Methodology
Background
Definition of "neuron".
Distribution Change Caused by Neurons
Importance Score for "Value Neurons"
Importance Score for "Query Neurons"
Experiments
Comparison of Attribution Methods
Dataset.
Models.
Attribution methods.
...and 16 more sections

Figures (10)

Figure 1: (a) Query neurons in shallow FFN layers. (b) Attention query/value neurons in attention heads. (c) Value neurons in deep FFN layers.
Figure 2: Neuron distribution on all layers in Llama-7B.
Figure 3: Curves of log probability increase (left) and probability increase (right) on Llama-7B.
Figure 4: Top10 important "value layers" in GPT2.
Figure 5: Top10 important "value layers" in Llama.
...and 5 more figures
