Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

Junsol Kim, Winnie Street, Roberta Rocca, Daine M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

Abstract

Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.

Paper Structure

This paper contains 30 sections, 9 equations, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: Jailbreaking large language models shifts mind-attribution toward human-like levels. a, Illustration of the model transformation pipeline: a pretrained base model is instruction-tuned with safety training and subsequently jailbroken via ablation of the safety-refusal direction (a minimal sketch of this extraction-and-ablation step follows the figure list). b, Red and blue points represent harmful and harmless instructions, respectively; the gray arrow denotes the extracted safety-refusal vector used for ablation. c, The instruction-tuned model refuses unsafe queries, whereas the jailbroken model complies. d, Mind-attribution scores (0--10) across entity categories. Dots and error bars denote marginal means and 95% CIs, showing that jailbroken models (red) attribute higher degrees of mind than instruction-tuned models (blue). e, Scores measuring belief in God. f, Self-attribution of mindedness. g, Kernel density estimate of human mind-attribution scores ($n = 500$). Dashed vertical lines indicate the means for humans (black), the instruction-tuned model (blue), and the jailbroken model (red).
  • Figure 2: Safety fine-tuning selectively suppresses mind-attribution without disrupting Theory of Mind. a, Angular relationships between the Safety, Mind-Attribution (IDAQ), and ToM directions in the residual stream of Llama-3-8B at layer 32. In the base model (left), Safety and Mind-Attribution are nearly orthogonal (97°); after instruction tuning (right), they become obtuse (122°), indicating that mind-attribution comes to be represented as opposing safety. The Safety--ToM angle remains largely unchanged (85° $\to$ 77°). b, Change in cosine similarity ($\Delta\cos$) between the Safety direction and each task direction after instruction tuning in Llama-3-8B (a sketch of this cosine comparison follows the figure list). c, (Left) Accuracy (%) on social reasoning benchmarks (MoToMQA ToM split, HI-ToM, SimpleToM) and general reasoning (MMLU, MoToMQA Factual split) under Instructed (blue) and Jailbroken (red) conditions, aggregated across models. Dots and error bars denote means and 95% CIs. (Right) MoToMQA (ToM split) accuracy broken down by order of mental-state inference (2nd- through 6th-order).
  • Figure S1: Layer-wise cosine similarity between the safety direction and task-specific directions. Left: Safety $\leftrightarrow$ Mind-Attribution (IDAQ). Right: Safety $\leftrightarrow$ ToM. In the base model (blue, dashed), both directions show weak alignment with the safety direction. After instruction tuning (orange, solid), the IDAQ direction becomes strongly anti-aligned with safety across middle-to-late layers, while the ToM direction remains largely unchanged.
  • Figure S2: Placebo test: subject-matched control for the safety--IDAQ alignment. Distribution of $\Delta\mathcal{S}$ (Instruct $-$ Base) across layers for the IDAQ direction (same subjects, mental attributes; red) and the subject-matched control (same subjects, non-mental attributes; yellow). Points denote individual layers; bars indicate 95% CI around the mean. The IDAQ direction shows a significant negative shift, whereas the subject-matched control shows no significant shift. This confirms that the safety--IDAQ entanglement is driven by mental-state attribution specifically, not by the subjects (e.g., robots, animals) themselves.
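The jailbreaking step in Figure 1a--b is described as ablating an extracted safety-refusal direction. A common recipe for this kind of intervention is to take the difference of mean activations between harmful and harmless prompts and project that direction out of the residual stream; the sketch below illustrates that recipe in NumPy. The function names, array shapes, and toy data are assumptions for illustration, not the paper's actual extraction code.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction between harmful and harmless prompt activations.

    Both inputs are (n_prompts, d_model) residual-stream activations collected at a
    single layer and token position; names and shapes are illustrative assumptions.
    """
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along `direction` (projection ablation)."""
    return acts - np.outer(acts @ direction, direction)

# Toy usage with random activations standing in for a real model's residual stream.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(128, 4096)) + 0.5   # constant offset mimics a separable refusal signal
harmless = rng.normal(size=(128, 4096))
d_refusal = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, d_refusal)
print(np.abs(ablated @ d_refusal).max())       # ~0: component along the refusal direction is gone
```

In practice the same projection would be applied to the model's activations (or folded into its weights) at inference time, which is what turns the instruction-tuned model into the "jailbroken" model compared in panels c--g.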
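Figure 2a--b and Figures S1--S2 compare per-layer direction vectors via angles, cosine similarities, and their change after instruction tuning ($\Delta\cos$, $\Delta\mathcal{S}$). A minimal sketch of that comparison is below, using placeholder vectors rather than the models' real residual-stream directions; the layer count, dimensionality, and confidence-interval formula are assumptions for illustration.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def angle_deg(u: np.ndarray, v: np.ndarray) -> float:
    """Angle between two direction vectors, in degrees."""
    return float(np.degrees(np.arccos(np.clip(cosine(u, v), -1.0, 1.0))))

# Hypothetical per-layer Safety and Mind-Attribution (IDAQ) directions for the base
# and instruction-tuned model; placeholders standing in for extracted directions.
rng = np.random.default_rng(1)
n_layers, d_model = 32, 4096
safety_base = rng.normal(size=(n_layers, d_model))
safety_inst = rng.normal(size=(n_layers, d_model))
idaq_base   = rng.normal(size=(n_layers, d_model))
idaq_inst   = rng.normal(size=(n_layers, d_model))

# Layer-wise change in alignment after instruction tuning (as plotted in Fig. 2b / Fig. S1).
delta_cos = np.array([
    cosine(safety_inst[l], idaq_inst[l]) - cosine(safety_base[l], idaq_base[l])
    for l in range(n_layers)
])

# Angle at the final layer (index 31 corresponds to "layer 32" in 1-indexed notation).
print(f"layer 32 Safety-IDAQ angle (instruct): {angle_deg(safety_inst[31], idaq_inst[31]):.1f} deg")

# Mean shift across layers with a normal-approximation 95% CI (as in the Fig. S2-style summary).
ci = 1.96 * delta_cos.std(ddof=1) / np.sqrt(n_layers)
print(f"mean delta cos: {delta_cos.mean():+.3f} +/- {ci:.3f}")
```

The same per-layer quantities, computed for a matched control direction (same subjects, non-mental attributes), would give the placebo comparison reported in Figure S2.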