The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights
Nura Aljaafari, Danilo S. Carvalho, André Freitas
TL;DR
The paper investigates how conceptually grounded interpretations emerge in transformer LLMs by introducing concept editing applied to reverse dictionary tasks. It employs causal tracing to localise when and where internal representations form, examining MLP, MHA, and hidden states across English and Spanish WordNets with GPT-J-6B and BERTIN-GPT-J-6B. Key findings show that MLPs store and retrieve concepts via a key-value mechanism and a Token Integration Mechanism, and that MHA layers perform distributed, compositional semantic processing with a last-token aggregator. Hidden states emphasise the final input token and the top layers, revealing gradual information aggregation. The study generalises across languages and proposes a conceptual interpretation mechanism that informs targeted interpretability interventions, with implications for safe knowledge editing and LM transparency.
Abstract
Locating and editing knowledge in large language models (LLMs) is crucial for enhancing their accuracy, safety, and inference rationale. We introduce "concept editing", an innovative variation of knowledge editing that uncovers conceptualisation mechanisms within these models. Using the reverse dictionary task, inference tracing, and input abstraction, we analyse the Multi-Layer Perceptron (MLP), Multi-Head Attention (MHA), and hidden state components of transformer models. Our results reveal distinct patterns: MLP layers employ a key-value retrieval mechanism and context-dependent processing, which are highly associated with relative input tokens. MHA layers demonstrate a distributed nature with significant higher-level activations, suggesting sophisticated semantic integration. Hidden states emphasise the importance of the last token and top layers in the inference process. We observe evidence of gradual information building and distributed representation. These observations elucidate how transformer models process semantic information, paving the way for targeted interventions and improved interpretability techniques. Our work highlights the complex, layered nature of semantic processing in LLMs and the challenges of isolating and modifying specific concepts within these models.
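
To make the "key-value retrieval" reading of an MLP layer concrete, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a toy layer in which each key vector scores the hidden state and the gated scores weight the corresponding value vectors. The function name `mlp_key_value`, the ReLU gate (GPT-J itself uses a GELU-family activation), and all numbers are illustrative assumptions.

```python
def relu(x):
    # Simple gate used here for readability; an assumption, not GPT-J's activation.
    return max(0.0, x)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mlp_key_value(hidden, keys, values):
    """Toy key-value view of an MLP layer (hypothetical helper).

    Each key is matched against the hidden state; the gated match
    strengths are then used to mix the value vectors into the output.
    """
    coeffs = [relu(dot(k, hidden)) for k in keys]
    dim = len(values[0])
    out = [sum(c * v[d] for c, v in zip(coeffs, values)) for d in range(dim)]
    return coeffs, out


# Hypothetical 2-slot memory over a 3-dimensional hidden state.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
values = [[0.0, 2.0], [3.0, 0.0]]

coeffs, out = mlp_key_value([1.0, -1.0, 0.5], keys, values)
print(coeffs, out)
```

In this toy case only the first key matches the hidden state, so the output is that key's value vector alone. Under this reading, editing a concept amounts to changing which value a given key pattern retrieves.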
