Re-evaluating Theory of Mind evaluation in large language models

Jennifer Hu, Felix Sosa, Tomer Ullman

TL;DR

This paper argues that inconsistent conclusions about ToM in LLMs stem from two sources: ambiguity over whether ToM should be judged by behavior or by the underlying computations, and evaluation practices that may probe auxiliary task demands rather than mental state attribution itself. It advocates shifting toward computation-centric benchmarks grounded in cognitive theory (e.g., inverse planning) that explicitly account for auxiliary demands and pragmatics. The authors review positive and negative findings on LLM ToM, critique training-on-test and adversarial methods, and propose future directions: the link between pragmatics and ToM, learning trajectories, spontaneous versus prompted ToM, and mechanistic interpretability. The work aims to enable more precise, validity-grounded ToM assessments that advance understanding of both human cognition and artificial agents, with practical implications for safe and robust social AI.

Abstract

The question of whether large language models (LLMs) possess Theory of Mind (ToM) -- often defined as the ability to reason about others' mental states -- has sparked significant scientific and public interest. However, the evidence as to whether LLMs possess ToM is mixed, and the recent growth in evaluations has not resulted in a convergence. Here, we take inspiration from cognitive science to re-evaluate the state of ToM evaluation in LLMs. We argue that a major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors, or the computations underlying those behaviors. We also highlight ways in which current evaluations may be deviating from "pure" measurements of ToM abilities, which also contributes to the confusion. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication, which could advance our understanding of artificial systems as well as human cognition.

Paper Structure

This paper contains 21 sections, 1 figure.

Figures (1)

  • Figure 1: What does it mean for a model to "have" Theory of Mind? Given an input observation (action $A$), there is a distinction between asking whether humans and models arrive at the same output (beliefs or predictions about latent mental states $M$), versus asking whether humans and models use the same kinds of computations to map from $A$ to $M$. The first question (Q1) is concerned with whether $M == M'$. The second question (Q2) is concerned with whether $f == f'$.
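The distinction in Figure 1 can be made concrete with a toy sketch. Everything below is hypothetical and is not taken from the paper: `f_rule` stands in for one mapping $f$ from an action $A$ to a belief $M$, `f_lookup` stands in for a different mapping $f'$, and the action strings are invented. The point is only that agreement of outputs on a fixed benchmark (Q1) does not by itself establish that the two mappings are the same (Q2).

```python
# Toy illustration of Q1 (same outputs) versus Q2 (same computation).
# All names and data here are hypothetical, not from the paper.

PREFIX = "looks in "

def f_rule(action: str) -> str:
    """Stand-in for f: derive the belief from the structure of the action."""
    location = action[len(PREFIX):]
    return f"believes the object is in {location}"

# Stand-in for f': a table memorised from benchmark items only.
MEMORISED = {
    "looks in the basket": "believes the object is in the basket",
    "looks in the box": "believes the object is in the box",
}

def f_lookup(action: str) -> str:
    """Return the memorised belief, or 'unknown' for unseen actions."""
    return MEMORISED.get(action, "unknown")

def outputs_match(f, g, actions) -> bool:
    """Q1-style check: do f and g yield the same M for every action A?"""
    return all(f(a) == g(a) for a in actions)

benchmark = ["looks in the basket", "looks in the box"]
held_out = ["looks in the drawer"]

# The two mappings agree on the benchmark but diverge on a new action,
# so matching benchmark outputs did not show that f and f' are the same.
q1_on_benchmark = outputs_match(f_rule, f_lookup, benchmark)
q1_on_held_out = outputs_match(f_rule, f_lookup, held_out)
```

Under these assumptions, a behavioral benchmark built only from the two memorised items would score both mappings identically, which is one way to read why a Q1-style comparison alone can leave Q2 open.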