Neuron-Level Knowledge Attribution in Large Language Models
Zeping Yu, Sophia Ananiadou
TL;DR
This work addresses neuron-level attribution in large language models with a scalable static method for identifying influential neurons. It scores each value neuron by the increase in log probability of the token $w$ when the neuron's vector $v^l$ is added to $h^{l-1}$: $Imp(v^l) = \log p(w|v^l+h^{l-1}) - \log p(w|h^{l-1})$. It complements this score with a way to locate query neurons, identified through inner products with subkeys, which activate the value neurons. In experiments on GPT2-large and Llama-7B across six knowledge types, the method outperforms seven other methods on three metrics. The analysis indicates that both attention and FFN layers store knowledge, that the most impactful neurons sit in deeper layers, and that a relatively small set of neurons drives most of the final prediction. These findings advance mechanistic interpretability and suggest directions for knowledge editing and debugging in LLMs.
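The log probability increase score can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes $p(w|x)$ is obtained by projecting a vector $x$ through an unembedding matrix and applying a softmax, and it omits any final layer normalization. The names `log_prob`, `importance`, `unembed`, `h_prev`, and `neurons`, along with all numeric values, are hypothetical and chosen only for illustration.

```python
import math

def log_prob(x, unembed, token_id):
    """Log-probability of token_id after projecting x onto the vocabulary (assumed softmax head)."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in unembed]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[token_id] - log_z

def importance(v, h_prev, unembed, token_id):
    """Imp(v) = log p(w | v + h_prev) - log p(w | h_prev)."""
    combined = [a + b for a, b in zip(v, h_prev)]
    return log_prob(combined, unembed, token_id) - log_prob(h_prev, unembed, token_id)

# Toy setup (made-up values): 3-token vocabulary, 2-dimensional hidden state.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h_prev = [0.5, 0.5]
neurons = {"n0": [1.0, 0.0], "n1": [0.0, 1.0], "n2": [-0.5, 0.0]}
target = 0  # index of the token w whose probability is being attributed

# Score every candidate neuron vector and rank from most to least helpful.
scores = {name: importance(v, h_prev, unembed, target) for name, v in neurons.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A positive score means that adding the neuron's vector raises the probability of the target token, and a negative score means it lowers it. Because the score only needs one projection per candidate vector, it can be computed without gradient passes, which is consistent with the static character of the method.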
Abstract
Identifying important neurons for final predictions is essential for understanding the mechanisms of large language models. Due to computational constraints, current attribution techniques struggle to operate at the neuron level. In this paper, we propose a static method for pinpointing significant neurons. Compared to seven other methods, our approach demonstrates superior performance across three metrics. Additionally, since most static methods only identify "value neurons" that directly contribute to the final prediction, we propose a method for identifying "query neurons" that activate these "value neurons". Finally, we apply our methods to analyze six types of knowledge across both attention and feed-forward network (FFN) layers. Our method and analysis are helpful for understanding the mechanisms of knowledge storage and set the stage for future research in knowledge editing. The code is available at https://github.com/zepingyu0512/neuron-attribution.
