Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs

Michael Keeman

Abstract

Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout experiments, and representational geometry -- and discover two dissociable emotion processing mechanisms. Affect reception -- detecting emotionally significant content -- operates with near-perfect accuracy (AUROC 1.000), consistent with early-layer saturation, and replicates across all six models. Emotion categorization -- mapping affect to specific emotion labels -- is partially keyword-dependent, dropping 1-7 percentage points without keywords and improving with scale. Causal activation patching confirms keyword-rich and keyword-free stimuli share representational space, transferring affective salience rather than emotion-category identity. These findings falsify the keyword-spotting hypothesis, establish a novel mechanistic dissociation, and introduce clinical stimulus methodology as a rigorous standard for testing emotion processing claims in large language models -- with direct implications for AI safety evaluation and alignment. All stimuli, code, and data are released for replication.

Paper Structure (55 sections, 11 figures, 18 tables)

Figures (11)

  • Figure 1: Experimental design overview. Both keyword-rich stimuli (Set A; explicit emotion vocabulary) and clinical vignettes (Set B; situational cues, no emotion words) are processed by the same six LLMs. Four convergent analysis methods---linear probing, activation patching, knockout experiments, and representational geometry---probe internal activations at every layer, yielding evidence for two dissociable mechanisms: affect reception (binary, early-layer, universal) and emotion categorisation (8-class, mid-to-late layer, scale-sensitive).
  • Figure 2: Component decomposition of emotion encoding on Set A (keyword-rich stimuli). Layer-wise 8-class AUROC for three activation types---residual stream ($\mathbf{h}$, solid blue), MHSA output ($\mathbf{a}$, dashed orange), and FFN output ($\mathbf{m}$, dotted red)---across all six models. The MHSA component reaches peak AUROC at mid-layer (0.43--0.75 normalized depth), while the FFN component peaks later (0.43--0.97). All three components converge to ceiling by the final layers. The earlier MHSA peak suggests that multi-head self-attention consolidates emotion information before feed-forward layers process it---a temporal dissociation within the standard transformer sublayer sequence.
  • Figure 3: Layer-wise 8-class probe AUROC for keyword-rich (Set A, solid blue) and keyword-free clinical (Set B, dashed orange) stimuli across all six models at normalized layer depth. The shaded zone marks the early layers where binary (emotional vs. neutral) detection saturates---affect reception completes before categorical classification peaks. Set B curves converge on Set A in larger models (Llama-8B, Gemma-9B) but diverge more in the 1B model, reflecting the scale-dependent architecture shift. Filled dots mark each model's peak AUROC layer. Base models (left column), instruct variants (right column).
  • Figure 4: The affect reception / emotion categorisation dissociation on keyword-free stimuli (Set B). Each row shows one model. Blue filled circles: binary (emotional vs. neutral) AUROC at peak layer, measuring affect reception. Orange open circles: 8-class AUROC at the same layer, measuring emotion categorisation. The annotated gap is the cost of removing keywords: binary detection remains near-perfect across all models (${\geq}\,0.999$), while categorical accuracy drops by 1.1--6.7 percentage points. The gap shrinks with scale (6.7 pp for Llama-1B Instruct vs. 1.1--1.9 pp for 8B--9B models), but the dissociation is present in every model. Chance baseline (8-class) $= 0.125$, off scale.
  • Figure 5: Cross-set probe transfer asymmetry. Each pair of bars compares A$\to$B transfer (keyword-trained probe applied to clinical vignettes; light blue) against B$\to$A transfer (clinical-trained probe applied to keyword-rich stimuli; dark blue) at peak layer, across all six model variants. B$\to$A exceeds A$\to$B by 1.5--11.4 percentage points in every case. The asymmetry is largest for Llama-1B (${\sim}11$ pp) and smallest for larger models (1.5--3.6 pp), indicating that scale drives emotion representations toward greater stimulus-type invariance. A probe trained on keyword-free representations captures a more generalizable signal than a probe trained on keyword-rich text.
  • ...and 6 more figures
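
The captions above report results as AUROC values and as gaps in percentage points (pp). As a reading aid, the following minimal sketch shows how a binary AUROC can be computed from probe scores and how a gap in pp follows from two AUROC values. It is not the authors' released code: the function names `auroc` and `gap_pp` and all scores are hypothetical, invented for illustration. The source does not state how the 8-class AUROC is aggregated, so only the binary case is shown.

```python
from itertools import product


def auroc(scores_pos, scores_neg):
    """Binary AUROC as the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one.
    Ties count as half a win."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def gap_pp(auroc_a, auroc_b):
    """Difference between two AUROC values, in percentage points."""
    return 100.0 * (auroc_a - auroc_b)


# Hypothetical probe scores (NOT data from the paper).
emotional = [0.9, 0.8, 0.7, 0.6]   # scores for emotional stimuli
neutral = [0.65, 0.3, 0.2, 0.1]    # scores for neutral stimuli

imperfect = auroc(emotional, neutral)            # one of 16 pairs is mis-ordered
perfect = auroc([0.9, 0.8], [0.2, 0.1])          # every positive outranks every negative

print(imperfect)                   # 0.9375
print(perfect)                     # 1.0
print(gap_pp(perfect, imperfect))  # 6.25
```

Under this pairwise definition, a score that carries no information about the label gives a binary AUROC of 0.5, and perfect separation gives 1.0.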