Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
TL;DR
This work examines how large language models encode beliefs about self and others (Theory of Mind, ToM) and whether these representations are genuine or incidental. By probing 12 LM variants on the BigToM dataset, the authors show that belief representations emerge with model size and fine-tuning, are structured yet sensitive to prompts, and can be strengthened via activation steering. They apply Contrastive Activation Addition (CAA) as an efficient, post-hoc method to enhance ToM performance across tasks, outperforming prior activation-editing approaches. The findings have implications for alignment, safety, and practical steering of social reasoning in LM systems; the work also outlines limitations and avenues for embedding perspective-taking circuitry.
Abstract
Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical, not only to move beyond surface-level performance but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts, using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured (not mere by-products of spurious correlations) yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
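To make "targeted edits to model activations" concrete, the following is a minimal sketch of the general contrastive activation addition idea: a steering vector is computed as the mean difference between activations for two contrasting sets of examples, then added (scaled) to activations at inference time. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensionality, the scaling factor `alpha`, and the random arrays standing in for a model's hidden states are all hypothetical.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean activation difference between contrasting example sets.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every hidden state (row) in `hidden`."""
    return hidden + alpha * vector

# Toy stand-ins for activations collected at one layer (assumption: in a real
# setup these would be read from the model, e.g. via forward hooks).
rng = np.random.default_rng(0)
d = 8                                   # hypothetical hidden size
direction = np.zeros(d)
direction[0] = 1.0                      # synthetic "belief" direction
pos_acts = rng.normal(size=(32, d)) + 2.0 * direction
neg_acts = rng.normal(size=(32, d)) - 2.0 * direction

v = steering_vector(pos_acts, neg_acts)
hidden = rng.normal(size=(5, d))        # activations for a new prompt
steered = apply_steering(hidden, v, alpha=1.5)
```

In this toy setup the recovered vector points mainly along the synthetic direction, and `alpha` controls how strongly the activations are shifted. Which layer to edit and how large to make the shift are choices this sketch leaves open.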
