Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs

Ivan Chulo, Ananya Joshi

TL;DR

This paper investigates the cognitive mechanisms underlying Theory of Mind (ToM) in large language models and how activation steering via Contrastive Activation Addition (CAA) modulates them. It decomposes ToM into 45 cognitive actions, trains linear probes on them, and applies CAA steering to measure changes on 1,000 BigToM forward belief scenarios. It reports an accuracy gain of 14.2 percentage points (from 32.5% to 46.7%) and links the improvement to enhanced emotional processing and suppressed analytical processing. The key finding is that emotional understanding and generative hypothesis formation, rather than purely analytical reasoning, mediate ToM performance, suggesting that ToM in LLMs relies on affective representations. The work provides a mechanistic interpretability framework that combines targeted interventions with probe-based decomposition to analyze high-level cognitive abilities, and it informs steering design for improved social reasoning in AI.

Abstract

Recent work shows that activation steering substantially improves language models' Theory of Mind (ToM) (Bortoletto et al. 2024), yet it remains unclear which internal changes lead to the different outputs. We propose decomposing ToM in LLMs by comparing the activations of steered and baseline models using linear probes trained on 45 cognitive actions. We applied Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluated it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023). We find that improved performance on belief attribution tasks (32.5% to 46.7% accuracy) is mediated by activations that process emotional content, namely emotion perception (+2.23) and emotion valuing (+2.20), while analytical processes are suppressed: questioning (-0.78) and convergent thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.
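
The steering and probing steps named in the abstract can be illustrated with a minimal sketch. It assumes the common CAA formulation in which the steering vector is the mean activation difference between two contrastive example sets and is added, scaled, to a hidden state. The function names, the scale `alpha`, the probe weights, and all toy values are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def caa_steering_vector(positive_acts, negative_acts):
    """Mean activation difference between two contrastive example sets."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, steering_vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state (alpha is assumed)."""
    return [h + alpha * s for h, s in zip(hidden_state, steering_vector)]


def probe_score(hidden_state, weights, bias=0.0):
    """Score of a linear probe: dot product of weights and activations plus bias."""
    return sum(w * h for w, h in zip(weights, hidden_state)) + bias


# Toy 2-dimensional activations (hypothetical values, for illustration only).
positive_acts = [[1.0, 2.0], [3.0, 4.0]]
negative_acts = [[0.0, 1.0], [2.0, 1.0]]
steering = caa_steering_vector(positive_acts, negative_acts)

baseline_state = [0.5, -0.5]
steered_state = apply_steering(baseline_state, steering, alpha=2.0)

# Compare one hypothetical probe on the baseline and steered states.
probe_weights = [1.0, -1.0]
probe_shift = probe_score(steered_state, probe_weights) - probe_score(
    baseline_state, probe_weights
)
```

In this sketch, `probe_shift` plays the role of a steered-minus-baseline difference for a single probe; the paper reports such differences per cognitive action, aggregated in a way this toy example does not attempt to reproduce.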

Paper Structure

This paper contains 10 sections, 7 figures.

Figures (7)

  • Figure 1: Radar chart comparing baseline versus steered cognitive action activation patterns across categories. The steered condition is shown in red and the baseline in blue. These findings support the view that LLMs mirror a known cognitive phenomenon: emotional understanding is more important than analytical processes in perspective-taking.
  • Figure 2: Steering effects across all cognitive actions and timepoints (n=1000). Left panels show individual action changes at three timepoints: at question (before answer), after true answer, and after wrong answer. Bars indicate mean layer count difference (steered - baseline) with positive values (right) showing increases and negative values (left) showing decreases. Emotional actions (emotion_perception, emotion_valuing, noticing) consistently increase across timepoints, while analytical actions (questioning, convergent_thinking, understanding) consistently decrease, revealing the emotional foundation of successful ToM.
  • Figure 3: Category-level analysis of steering effects. Negative values indicate more activation in the baseline condition; positive values indicate more activation in the steered condition. Left: mean steering effect by cognitive action category. Right: distribution of effects at answer timepoint.
  • Figure 4: Top 10 cognitive actions with largest increases and decreases between baseline and steered conditions at question level (left) and answer level (right). Emotional actions (emotion_perception, emotion_valuing, noticing) show the strongest increase, while analytical actions (questioning, convergent_thinking, understanding) show the strongest decrease, revealing the cognitive processes most affected by successful ToM steering.
  • Figure 5: Heatmap of cognitive action activation differences (steered - baseline) across timepoints. Each row represents a cognitive action, and columns show the three measurement timepoints: at question, after true answer, and after wrong answer. Green indicates increases and red indicates decreases.
  • ...and 2 more figures
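
The "mean layer count difference (steered - baseline)" plotted in Figure 2 can be sketched as follows. This is an interpretation, not the paper's stated procedure: it assumes that a "layer count" is the number of layers at which a cognitive action's probe score exceeds a threshold, and that the difference is averaged over scenarios. The threshold of 0.5, the function names, and the toy scores are all hypothetical.

```python
from statistics import fmean


def layer_count(probe_scores_by_layer, threshold=0.5):
    """Number of layers whose probe score exceeds the (assumed) threshold."""
    return sum(1 for score in probe_scores_by_layer if score > threshold)


def mean_layer_count_difference(baseline_runs, steered_runs, threshold=0.5):
    """Average of (steered count - baseline count) over paired scenarios."""
    differences = [
        layer_count(steered, threshold) - layer_count(baseline, threshold)
        for baseline, steered in zip(baseline_runs, steered_runs)
    ]
    return fmean(differences)


# Two toy scenarios with four layers each (hypothetical probe scores
# for a single cognitive action).
baseline_runs = [[0.1, 0.6, 0.2, 0.7], [0.4, 0.3, 0.9, 0.1]]
steered_runs = [[0.8, 0.6, 0.9, 0.7], [0.6, 0.7, 0.9, 0.2]]

effect = mean_layer_count_difference(baseline_runs, steered_runs)
```

Under these assumptions, a positive `effect` corresponds to a bar pointing right in Figure 2 (the action is active in more layers after steering), and a negative one to a bar pointing left.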