ToMBench: Benchmarking Theory of Mind in Large Language Models

Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang

TL;DR

ToMBench addresses the shortcomings of prior ToM evaluations with a systematic, automated, bilingual benchmark covering 8 tasks and 31 abilities, built from scratch to avoid data leakage. It supports both task-level and ability-level analysis across 10 representative LLMs and a 20-person human baseline, revealing a persistent gap: GPT-4 remains more than 10 percentage points below human performance, and chain-of-thought (CoT) prompting yields little benefit. The results suggest that LLM ToM relies more on surface semantic cues than on human-like mental-state reasoning, and that bilingual evaluation introduces measurable differences. By providing a scalable, automated evaluation framework together with public data and code, ToMBench aims to drive the development of socially intelligent LLMs and to motivate future multimodal and multilingual ToM research.

Abstract

Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs, such as GPT-4, lag behind human performance by over 10 percentage points, indicating that LLMs have not yet achieved a human-level theory of mind. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.

Paper Structure

This paper contains 55 sections, 6 figures, and 22 tables.

Figures (6)

  • Figure 1: ToMBench is a systematic, automated, and original bilingual ToM benchmark for LLMs, covering 8 tasks and 31 abilities. ToMBench contains 2,860 testing samples involving diverse real-world social scenarios.
  • Figure 2: The mapping between the 8 tasks and the 31 ATOMS abilities. The suffix after each ability indicates the tasks in which it occurs; abilities marked with "#" are not covered by any task and are evaluated with extra test samples.
  • Figure 3: Topics of social scenarios in ToMBench. Under 9 primary topics, we highlight the top-5 sub-topics with the highest frequency.
  • Figure 4: The performance variance under the coherent test. Full results are presented in the appendix.
  • Figure 5: The difference between human and LLM attention. Color intensity denotes attention weights.
  • ...and 1 more figure