Unveiling Language Competence Neurons: A Psycholinguistic Approach to Model Interpretability
Xufeng Duan, Xinyu Zhou, Bei Xiao, Zhenguang G. Cai
TL;DR
This study probes neuron-level language competence in GPT-2-XL by applying psycholinguistic tasks (sound-shape, sound-gender, implicit causality) and using accumulative direct effect to identify top contributing neurons. Through targeted ablation and activation manipulation, it demonstrates causal links between specific neurons and human-like performance in the sound-gender and implicit causality tasks, while showing no such specialization for the sound-shape task. The findings suggest that certain linguistic abilities in LLMs are supported by identifiable neurons, advancing interpretability by connecting cognitive phenomena to neural substrates. However, the approach reveals limitations in tasks requiring distributed representations and raises questions about generalization to more capable, modern models.
Abstract
As large language models (LLMs) advance in their linguistic capacity, understanding how they capture aspects of language competence remains a significant challenge. This study therefore employs psycholinguistic paradigms in English, which are well-suited for probing deeper cognitive aspects of language processing, to explore neuron-level representations in a language model (GPT-2-XL) across three tasks: sound-shape association, sound-gender association, and implicit causality. Our findings indicate that while GPT-2-XL struggles with the sound-shape task, it demonstrates human-like abilities in both sound-gender association and implicit causality. Targeted neuron ablation and activation manipulation reveal a crucial relationship: when GPT-2-XL displays a linguistic ability, specific neurons correspond to that competence; conversely, when it lacks an ability, no specialized neurons are found. This study is the first to use psycholinguistic experiments to investigate deep language competence at the neuron level, providing a new level of granularity in model interpretability and insights into the internal mechanisms driving language ability in transformer-based LLMs.
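To make the idea of ranking neurons by their contribution and then ablating the top ones more concrete, here is a minimal sketch on a toy single-layer setting. It is not the authors' implementation and does not reproduce their accumulative direct effect measure, which this section does not define. It assumes a neuron's direct effect can be approximated as its activation times the projection of its output weights onto the difference between two candidate-token unembedding vectors. All names (`direct_effects`, `top_neurons`, `logit_diff`) and all numeric values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def direct_effects(acts, w_out, unembed_diff):
    """Per-neuron contribution to the logit difference between two tokens.

    acts[i]       : activation of neuron i (toy values)
    w_out[i]      : vector neuron i writes into the residual stream
    unembed_diff  : unembedding(token_a) - unembedding(token_b)
    """
    return [a * dot(w, unembed_diff) for a, w in zip(acts, w_out)]


def top_neurons(acts, w_out, unembed_diff, k):
    """Indices of the k neurons with the largest direct effect."""
    effects = direct_effects(acts, w_out, unembed_diff)
    return sorted(range(len(effects)), key=lambda i: effects[i], reverse=True)[:k]


def logit_diff(acts, w_out, unembed_diff, ablated=frozenset()):
    """Logit difference with the given neurons zero-ablated."""
    effects = direct_effects(acts, w_out, unembed_diff)
    return sum(e for i, e in enumerate(effects) if i not in ablated)


# Toy example: 4 neurons writing into a 2-dimensional residual stream.
acts = [2.0, 0.5, 1.0, 0.0]
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]]
unembed_diff = [1.0, -1.0]

top = top_neurons(acts, w_out, unembed_diff, k=1)
baseline = logit_diff(acts, w_out, unembed_diff)
after = logit_diff(acts, w_out, unembed_diff, ablated=frozenset(top))
print(top, baseline, after)
```

In this toy case, removing the single highest-ranked neuron flips the sign of the logit difference, which is the kind of causal signature an ablation experiment looks for. In a real model the same comparison would be run over many stimuli and against randomly chosen control neurons.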
