The Mechanics of Conceptual Interpretation in GPT Models: Interpretative Insights

Nura Aljaafari; Danilo S. Carvalho; André Freitas

TL;DR

The paper investigates how conceptually grounded interpretations emerge in transformer LLMs by introducing concept editing, applied to the reverse dictionary task. It uses causal tracing to localise when and where internal representations form, examining MLP, MHA, and hidden states across English and Spanish WordNets with GPT-J-6B and BERTIN-GPT-J-6B. Key findings show that MLPs store and retrieve concepts via a key-value mechanism and a Token Integration Mechanism; MHA layers perform distributed, compositional semantic processing with a last-token aggregator; and hidden states emphasise the final input token and the top layers, revealing gradual information aggregation. The findings generalise across both languages, and the paper proposes a conceptual interpretation mechanism that informs targeted interpretability interventions, with implications for safe knowledge editing and LM transparency.

Abstract

Locating and editing knowledge in large language models (LLMs) is crucial for enhancing their accuracy, safety, and inference rationale. We introduce "concept editing", an innovative variation of knowledge editing that uncovers conceptualisation mechanisms within these models. Using the reverse dictionary task, inference tracing, and input abstraction, we analyse the Multi-Layer Perceptron (MLP), Multi-Head Attention (MHA), and hidden state components of transformer models. Our results reveal distinct patterns: MLP layers employ a key-value retrieval mechanism and context-dependent processing, which are highly associated with relative input tokens. MHA layers demonstrate a distributed nature with significant higher-level activations, suggesting sophisticated semantic integration. Hidden states emphasise the importance of the last token and top layers in the inference process. We observe evidence of gradual information building and distributed representation. These observations elucidate how transformer models process semantic information, paving the way for targeted interventions and improved interpretability techniques. Our work highlights the complex, layered nature of semantic processing in LLMs and the challenges of isolating and modifying specific concepts within these models.
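
The tracing procedure referred to above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy network (`forward`), the linear `readout` used in place of an output-token probability, the noise scale, and every function and parameter name are assumptions chosen for illustration. The sketch follows the general corrupt-then-restore pattern of causal tracing: run a clean pass, corrupt selected input tokens with noise, then restore one clean state at a time and record how much of the clean score returns.

```python
import math
import random


def forward(x, weights, patch=None):
    """Run a toy causal stack over token vectors x (T lists of size d).

    patch, if given, is (layer, token, clean_vector): once that layer has
    been computed, the state at that token is overwritten with clean_vector.
    Returns the last token's final state and the states after every layer.
    """
    h = [row[:] for row in x]
    d = len(x[0])
    states = []
    for layer, w in enumerate(weights):
        new_h = []
        running = [0.0] * d
        for t, row in enumerate(h):
            # Causal mixing: position t sees the mean of positions 0..t.
            running = [r + v for r, v in zip(running, row)]
            mixed = [r / (t + 1) for r in running]
            update = [math.tanh(sum(mixed[i] * w[i][j] for i in range(d)))
                      for j in range(d)]
            new_h.append([v + u for v, u in zip(row, update)])
        if patch is not None and patch[0] == layer:
            new_h[patch[1]] = list(patch[2])
        h = new_h
        states.append([row[:] for row in h])
    return h[-1], states


def causal_trace(x, weights, readout, corrupt_tokens, noise=3.0, seed=0):
    """Return (clean_score, corrupted_score, effects[token][layer])."""
    rng = random.Random(seed)

    def score(vec):
        # Hypothetical stand-in for the probability of the correct output.
        return sum(v * r for v, r in zip(vec, readout))

    clean_out, clean_states = forward(x, weights)
    # Corrupt the chosen input tokens with Gaussian noise.
    noisy = [[v + noise * rng.gauss(0.0, 1.0) for v in row]
             if t in corrupt_tokens else row[:]
             for t, row in enumerate(x)]
    corrupt_out, _ = forward(noisy, weights)
    corrupt_score = score(corrupt_out)

    effects = []
    for t in range(len(x)):
        row = []
        for layer in range(len(weights)):
            # Restore a single clean state inside the corrupted run.
            patched_out, _ = forward(
                noisy, weights, patch=(layer, t, clean_states[layer][t]))
            row.append(score(patched_out) - corrupt_score)
        effects.append(row)
    return score(clean_out), corrupt_score, effects
```

Each entry `effects[token][layer]` is the change in score obtained by restoring that single state. In this toy, restoring the last token at the last layer recovers the clean score exactly, because the output is read from that state. The figures listed below describe restoring a window of 10 MLP layers rather than a single state; the sketch restores one state at a time for brevity.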

Paper Structure (33 sections, 5 equations, 25 figures, 3 tables)

Introduction
NL Definitions & Concept Representation
Proposed Approach
Localisation via Causal tracing
Eliciting the conceptual interpretation model
Empirical Analysis
Selected Datasets & Models
Localisation of conceptual interpretation patterns
MLP: Content association, lexical signalling and adaptive behaviour.
MHA layers: Compositional-distributional function
Hidden states: pivotal influence of top Layers and last tokens
Results validation
Alternative definitions
Results are transferable to other models and languages
Outline of a conceptual interpretation mechanism in LMs
...and 18 more sections

Figures (25)

Figure 1: Example of definitional semantic labelling for the term service.
Figure 2: Overview of causal tracing and conceptual locating in LMs.
Figure 3: (a-c) Causal traces for GPT-J-6B, illustrating the impact of restoring a window of 10 MLP layers. In all cases, the span of importance is short. The state representation differs between definitions, with strong word content associations in (a) and (b) and weaker content associations in (c).
Figure 4: Distribution of layer indices in the top 10 locations across 818 samples in the MLP layers. It follows a bimodal pattern with two clusters: the first grouping the early layers and the second grouping the top layers.
Figure 5: Sample of causal tracing with DSR labelling when restoring a window of 10 MLP layers. The representation highlights the distribution of important states over several layers and the importance of content words, mainly captured in supertype (a) and differentia quality (b) and (c).
...and 20 more figures
