Do LLMs "Feel"? Emotion Circuits Discovery and Control

Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, Xiuying Chen

TL;DR

This work investigates whether LLMs contain context-agnostic mechanisms that drive emotional expression and whether these mechanisms can be harnessed for universal emotion control. It introduces the SEV dataset and a three-stage interpretability framework (extracting context-agnostic emotion directions, identifying local neurons and attention heads, and assembling global emotion circuits) that exposes the internal circuitry behind emotion generation. Using causal interventions (ablation and enhancement) to validate the identified components and then modulating the resulting circuits directly, the authors achieve 99.65% emotion-expression accuracy on a held-out test set, outperforming prompting- and steering-based baselines. The findings offer mechanistic insight into emotional expression in LLMs and a principled route toward interpretable, controllable affective generation.

Abstract

As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model's final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.
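To make the idea of a context-agnostic emotion direction concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means estimator over hidden states, which is a common choice in the steering literature and not necessarily the extraction procedure used in the paper. The names `emotion_direction` and `emotion_score` are hypothetical helpers, and the "hidden states" are random vectors rather than activations from a real model.

```python
import numpy as np

def emotion_direction(emotional, neutral):
    """Unit-norm difference of mean hidden states (hypothetical helper)."""
    diff = emotional.mean(axis=0) - neutral.mean(axis=0)
    return diff / np.linalg.norm(diff)

def emotion_score(hidden, direction):
    """Projection of each hidden state onto the direction."""
    return hidden @ direction

rng = np.random.default_rng(0)
d_model = 16  # toy hidden size, an assumption for illustration

# Ground-truth direction used only to build the synthetic data.
true_dir = np.zeros(d_model)
true_dir[3] = 1.0

# Synthetic hidden states: context-dependent noise, plus a shared
# offset along true_dir for the "emotional" samples.
neutral = rng.normal(size=(200, d_model))
emotional = rng.normal(size=(200, d_model)) + 2.0 * true_dir

direction = emotion_direction(emotional, neutral)
mean_emotional = emotion_score(emotional, direction).mean()
mean_neutral = emotion_score(neutral, direction).mean()
```

Averaging over many contexts cancels the context-specific noise, so the recovered `direction` lies close to `true_dir`, and projecting onto it separates the two groups of samples.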

Paper Structure

This paper contains 48 sections, 13 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of emotion circuit modulation. Compared with the original forward pass (top), our circuit-based modulation (bottom) drives hidden states to diverge into distinct emotion clusters across layers and produces coherent emotional responses. All examples shown are directly generated without any manual curation.
  • Figure 2: The first row (a–d) visualizes the last-token hidden states of prompting-based generations across layers 0, 9, 12, and 27. Initially, all samples overlap due to identical input tokens, but representations gradually diverge and form distinct emotion clusters in deeper layers. The second row (e–h) shows the layer-wise evolution of pure emotion vectors, which already display slight separation at layer 0 and become increasingly clustered with depth.
  • Figure 3: (a–b) Ablation: zeroing out the identified emotion-related components sharply decreases emotion scores, while random ablation has minimal effect. (c–d) Enhancement: injecting emotion difference vectors into identified components greatly increases the emotion score $s$. All curves are plotted with 95% confidence intervals.
  • Figure 4: The layer-wise clustering visualizations of hidden states for all samples successfully guided by the prompting-based method.
  • Figure 5: The layer-wise clustering visualizations of pure emotion vectors.
  • ...and 3 more figures
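
The ablation and enhancement comparison described in the Figure 3 caption can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: the component outputs are synthetic, the emotion score is taken to be the projection of the summed outputs onto an assumed unit-norm direction, and `ablate` and `enhance` are hypothetical helpers. The random baseline here draws from components outside the identified set, which is one possible design choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components, d_model = 32, 16  # toy sizes, assumptions for illustration

direction = np.zeros(d_model)
direction[0] = 1.0  # assumed unit-norm emotion direction

# Toy component outputs: small noise everywhere, plus four components
# that write strongly along the emotion direction.
outputs = 0.1 * rng.normal(size=(n_components, d_model))
emotion_ids = np.array([2, 7, 11, 19])
outputs[emotion_ids] += 1.5 * direction

def emotion_score(outputs, direction):
    """Projection of the summed component outputs onto the direction."""
    return float(outputs.sum(axis=0) @ direction)

def ablate(outputs, ids):
    """Zero out the outputs of the selected components."""
    out = outputs.copy()
    out[ids] = 0.0
    return out

def enhance(outputs, ids, direction, alpha=1.0):
    """Add alpha * direction to each selected component's output."""
    out = outputs.copy()
    out[ids] += alpha * direction
    return out

# Identify the top-k components by their own projection onto the direction.
k = 4
top_ids = np.argsort(outputs @ direction)[-k:]
other_ids = np.setdiff1d(np.arange(n_components), top_ids)
random_ids = rng.choice(other_ids, size=k, replace=False)

base = emotion_score(outputs, direction)
targeted = emotion_score(ablate(outputs, top_ids), direction)
random_ablation = emotion_score(ablate(outputs, random_ids), direction)
enhanced = emotion_score(enhance(outputs, top_ids, direction), direction)
```

In this toy setup, ablating the identified components removes most of the score, ablating the same number of other components barely changes it, and enhancement raises the score by `k * alpha`.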