ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

Xiaomeng Ma, Lingyu Gao, Qihui Xu

TL;DR

ToMChallenges presents a principled dataset and an auto-grader for rigorously evaluating Theory of Mind (ToM) in large language models using Sally-Anne and Smarties narratives. With 30 variations per test and six task formats spanning fully constrained, semi-constrained, and open-ended settings, the study examines prompt- and task-dependent performance across three models and finds inconsistent ToM abilities and limited robust reasoning. An auto-grader based on GPT-4 enables scalable, rubric-based evaluation of diverse responses, and error analyses identify true ToM failures, conservatism, and hallucinations as key error modes. The work advocates principled, large-scale ToM assessment in LLMs and invites further exploration of prompt design and interpretability.

Abstract

Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models (LLMs), there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test ToM in LLMs, and the results are inconsistent: some studies assert that these models are capable of exhibiting ToM, while others suggest the opposite. In this study, we present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks. We also propose an auto-grader to streamline the answer evaluation process. We tested three models: davinci, turbo, and gpt-4. Our evaluation results and error analyses show that LLMs behave inconsistently across prompts and tasks, and that performing ToM tasks robustly remains a challenge for them. Finally, we aim to raise awareness of the difficulties in evaluating ToM in LLMs and to invite more discussion on how to design prompts and tasks that better assess the LLMs' ability.

Paper Structure (25 sections, 3 figures, 9 tables)

Figures (3)

  • Figure 1: An example of the Smarties test, together with the Mentalizing and False-Belief Understanding criteria.
  • Figure 3: The average accuracy on questions in the Smarties test for different prompts.
  • Figure : MC = Multiple Choice, FB = Fill-in-the-Blank, TF = True/False, CoT-TF = Chain-of-Thought True/False, QA = Question Answering, Comp = Text Completion
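
To make the six task abbreviations in the legend above concrete, the following is a minimal Python sketch of how one Smarties-style narrative might be wrapped in each task format. Only the task abbreviations come from this page. The class `SmartiesItem`, the function `build_prompts`, the story wording, and every prompt template are hypothetical illustrations, not the prompts or narratives used in the ToMChallenges dataset.

```python
from dataclasses import dataclass

# Keys are the task abbreviations from the figure legend.
# The template wording is an assumption made for illustration only.
TASK_TEMPLATES = {
    "MC": "{story}\nQuestion: {question}\nA. {option_a}\nB. {option_b}\nAnswer:",
    "FB": "{story}\nFill in the blank: {cloze} ____",
    "TF": "{story}\nStatement: {statement}\nTrue or False?",
    "CoT-TF": (
        "{story}\nStatement: {statement}\n"
        "Think step by step, then answer True or False."
    ),
    "QA": "{story}\nQuestion: {question}\nAnswer:",
    "Comp": "{story}\n{cloze}",
}


@dataclass(frozen=True)
class SmartiesItem:
    """Hypothetical slots for one unexpected-contents narrative."""

    agent: str      # character whose belief is probed
    container: str  # e.g. "tube"
    label: str      # what the label suggests is inside
    content: str    # what is actually inside

    def story(self) -> str:
        return (
            f"{self.agent} finds a {self.container} labeled '{self.label}'. "
            f"{self.agent} has never looked inside it. "
            f"The {self.container} actually contains {self.content}."
        )


def build_prompts(item: SmartiesItem) -> dict[str, str]:
    """Render the same false-belief question in all six task formats."""
    fields = {
        "story": item.story(),
        "question": f"What does {item.agent} think is in the {item.container}?",
        "option_a": item.label,
        "option_b": item.content,
        "statement": f"{item.agent} thinks the {item.container} contains {item.label}.",
        "cloze": f"{item.agent} thinks the {item.container} contains",
    }
    # str.format ignores keyword arguments a template does not use.
    return {task: tpl.format(**fields) for task, tpl in TASK_TEMPLATES.items()}


if __name__ == "__main__":
    example = SmartiesItem("Sam", "tube", "Smarties", "pencils")
    for task, prompt in build_prompts(example).items():
        print(f"--- {task} ---\n{prompt}\n")
```

Swapping the slot values (agent, container, label, content) is one simple way such a template could yield many variations of a single test while keeping the underlying false-belief structure fixed.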