Decomposing Theory of Mind: How Emotional Processing Mediates ToM Abilities in LLMs
Ivan Chulo, Ananya Joshi
TL;DR
This paper investigates the cognitive mechanisms underlying Theory of Mind (ToM) in large language models and how activation steering via Contrastive Activation Addition (CAA) modulates them. It decomposes ToM into 45 cognitive actions, trains a linear probe for each, and applies CAA steering to measure changes on 1,000 BigToM forward belief scenarios. Accuracy rises by 14.2 percentage points, from $32.5\%$ to $46.7\%$, and the paper links this improvement to enhanced emotional processing alongside suppressed analytical questioning. The key finding is that emotional understanding and generative hypothesis formation, rather than purely analytical reasoning, mediate ToM performance, suggesting that ToM in LLMs relies on affective representations. The work provides a mechanistic interpretability framework that combines targeted interventions with probe-based decomposition to analyze high-level cognitive abilities, and it informs the design of steering methods for improved social reasoning in AI.
Abstract
Recent work shows that activation steering substantially improves language models' Theory of Mind (ToM) (Bortoletto et al. 2024), yet it remains unclear which internal changes lead to the different outputs. We propose decomposing ToM in LLMs by comparing the activations of steered and baseline models using linear probes trained on 45 cognitive actions. We apply Contrastive Activation Addition (CAA) steering to Gemma-3-4B and evaluate it on 1,000 BigToM forward belief scenarios (Gandhi et al. 2023). We find that improved performance on belief attribution tasks (32.5\% to 46.7\% accuracy) is mediated by activations that process emotional content: emotion perception (+2.23) and emotion valuing (+2.20) increase, while analytical processes are suppressed: questioning (-0.78) and convergent thinking (-1.59). This suggests that successful ToM abilities in LLMs are mediated by emotional understanding, not analytical reasoning.
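To make the pipeline described in the abstract concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes the common formulation of CAA, in which the steering vector is the mean difference between activations on contrastive prompt pairs and is added to a layer's hidden state, and it assumes each cognitive action has a linear probe with weights `w` and bias `b`. All names (`caa_steering_vector`, `steer`, `probe_logit`, `alpha`) and the toy dimensions are hypothetical.

```python
import numpy as np

def caa_steering_vector(pos_acts, neg_acts):
    """Mean activation difference over contrastive examples (assumed CAA form)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every hidden state."""
    return hidden + alpha * vector

def probe_logit(hidden, w, b):
    """Linear probe score for one cognitive action (hypothetical probe)."""
    return hidden @ w + b

rng = np.random.default_rng(0)
d = 16                                   # toy hidden size, not Gemma's
direction = np.zeros(d)
direction[0] = 1.0                       # synthetic "concept" direction

# Synthetic activations for the two sides of the contrastive pairs.
pos = rng.normal(size=(32, d)) + 2.0 * direction
neg = rng.normal(size=(32, d)) - 2.0 * direction
v = caa_steering_vector(pos, neg)

# Baseline vs. steered activations on held-out synthetic inputs.
baseline = rng.normal(size=(100, d))
steered = steer(baseline, v, alpha=1.0)

# Shift in mean probe score between steered and baseline runs.
w, b = direction, 0.0
delta = probe_logit(steered, w, b).mean() - probe_logit(baseline, w, b).mean()
print(f"mean probe shift: {delta:+.2f}")
```

Because both the steering step and the probe are linear in this sketch, the mean probe shift equals `alpha * (v @ w)`. Signed values of this kind are one way to read deltas such as the +2.23 reported for emotion perception, although the exact statistic the paper reports may be computed differently.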
