
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling

TL;DR

This work examines how large language models encode beliefs about self and others (Theory of Mind) and whether these representations are genuine or incidental. Through extensive probing across 12 LM variants and the BigToM dataset, the authors show that belief representations emerge with model size and fine-tuning, are structured yet sensitive to prompts, and can be strengthened via activation steering. They apply Contrastive Activation Addition (CAA), an efficient post-hoc steering method, to enhance ToM performance across tasks, outperforming prior activation-editing approaches. The findings have implications for alignment, safety, and practical steering of social reasoning in LM systems, while also outlining limitations and avenues for embedding perspective-taking circuitry.

Abstract

Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical - not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts - using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs, which are structured - not mere by-products of spurious correlations - yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
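The "targeted edits to model activations" mentioned above can be illustrated with a minimal sketch of contrastive activation addition: a steering vector is taken as the mean difference between activations from two contrasting sets of examples and is then added, scaled by a coefficient, to an activation at inference time. Everything below is an assumption for illustration only. The function names, the toy three-dimensional activations, and the coefficient `alpha` are hypothetical, and the layers, token positions, and coefficients actually used by the authors are not stated in this section.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(positive: List[Vector], negative: List[Vector]) -> Vector:
    """Mean activation of the 'positive' set minus mean of the 'negative' set."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def steer(activation: Vector, vector: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to a single activation."""
    return [a + alpha * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical): in practice these would be hidden states
# collected at one layer for contrasting completions of the same prompts.
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

v = steering_vector(pos_acts, neg_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
print(v)        # [2.0, 0.0, -2.0]
print(steered)  # [1.5, 0.5, -0.5]
```

Because the vector is computed once from stored activations and applied with a single addition, this kind of edit requires no gradient updates to the model, which is what makes it a post-hoc method.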

Paper Structure

This paper contains 36 sections, 2 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: ToM tasks are challenging for LMs, but correct predictions can sometimes be recovered by probing their internal representations. We study how internal representations of beliefs of self and others emerge in 12 LMs, and show that these representations are structured yet brittle to prompts, and can be strengthened with a steering vector to fix incorrect ToM inferences.
  • Figure 2: Example of false belief from our probing datasets. The labels $z_p$ and $z_o$ correspond to $\mathcal{D}_p^P$ and $\mathcal{D}_o^P$, respectively. By manipulating the protagonist's percepts after the causal event, we obtain two scenarios: true belief and false belief.
  • Figure 3: Belief probing accuracy shows similar patterns across all models: oracle belief representations generally form already in the first layers, while protagonist belief representations emerge at the intermediate layers. Moreover, probing accuracy increases with model size and, more crucially for smaller models, with fine-tuning.
  • Figure 4: We compare the probing accuracy obtained by using the original set of activations (All) with the accuracy obtained by considering only the first $k \in \{2, 10, 100, 1000\}$ principal components. Results are for protagonist beliefs (for oracle see \ref{fig:pca-oracle}). In general, it is possible to recover most of the original accuracy by training probes on a smaller number $k$ of principal components of the activations.
  • Figure 5: Sensitivity of protagonist belief probing accuracy to different prompt variations. Results for Pythia are shown in \ref{fig:prompt-pythia}. Representations are brittle to prompt variations.
  • ...and 7 more figures
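
The captions above report "probing accuracy" per layer. As a rough illustration of what that measures, the sketch below trains a linear probe (a logistic-regression classifier) on activation vectors to predict a binary belief label and scores it on held-out examples. It is a toy under stated assumptions, not the authors' setup: the synthetic activations, the 4-dimensional size, the learning rate, and all function names are hypothetical, and the real probes would be trained on hidden states extracted from each LM layer.

```python
import math
import random
from typing import List, Tuple

Example = Tuple[List[float], int]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def make_toy_activations(n: int, dim: int, rng: random.Random) -> List[Example]:
    """Synthetic stand-in for layer activations: the label shifts dimension 0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        x[0] += 2.0 if label == 1 else -2.0
        data.append((x, label))
    return data


def train_probe(data: List[Example], dim: int, lr: float = 0.1,
                epochs: int = 200) -> Tuple[List[float], float]:
    """Fit logistic-regression weights with full-batch gradient descent."""
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w: List[float], b: float, data: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for x, y in data:
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(predicted == y)
    return correct / len(data)


rng = random.Random(0)
train_set = make_toy_activations(200, 4, rng)
held_out = make_toy_activations(100, 4, rng)
w, b = train_probe(train_set, dim=4)
accuracy = probe_accuracy(w, b, held_out)
print(f"held-out probing accuracy: {accuracy:.2f}")
```

High held-out accuracy from such a simple classifier is what suggests the label is linearly decodable from the activations; the control tasks mentioned in the abstract address whether that decodability reflects structure or the probe merely memorising.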