HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models

Yinghui He; Yufan Wu; Yilin Jia; Rada Mihalcea; Yulong Chen; Naihao Deng

TL;DR

Hi-ToM introduces the first benchmark tailored to higher-order Theory of Mind (ToM) reasoning in large language models, extending beyond prior first- and second-order tasks by incorporating zeroth to fourth-order questions and agent deception. The dataset comprises Sally-Anne–style stories with rooms, objects, containers, and five agents, including multi-chapter narratives and public/private communications to challenge recursive belief reasoning. Experimental results show that state-of-the-art LLMs, even with chain-of-thought prompting, exhibit substantial performance drops as ToM order rises, with deception further deteriorating accuracy. The authors analyze error types and behavioral patterns, arguing for human-inspired and symbolic-augmented approaches to strengthen ToM capabilities and inform future NLP systems. These findings highlight fundamental limits of current LLMs in complex social reasoning and point toward hybrid methods to improve real-world language understanding and interaction tasks.

Abstract

Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first and second-order ToM, we explore higher-order ToM, which involves recursive reasoning on others' beliefs. We introduce HI-TOM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of different failure cases of LLMs, and share our thoughts on the implications of our findings on the future of NLP.
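To make "recursive reasoning on others' beliefs" concrete, the following is a minimal sketch of nested belief tracking in a Sally-Anne-style story. It is not the authors' data generator or evaluation code. It assumes a simplified world (one room, one object, every agent sees the initial location, nobody re-enters, no communication or deception), and the names `Move` and `nested_belief` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Move:
    """One relocation of the object (hypothetical structure for illustration)."""
    mover: str
    container: str
    present: FrozenSet[str]  # agents in the room when the move happens


def nested_belief(initial: str, moves: List[Move], chain: List[str]) -> str:
    """Return where the object is according to the nested belief in `chain`.

    chain=[] is the zeroth-order (reality) question.
    chain=["Anne", "Sally"] asks where Anne thinks Sally thinks the object is.
    Simplifying assumption: a move updates the nested belief only if every
    agent in the chain was present to witness it.
    """
    location = initial
    for move in moves:
        if all(agent in move.present for agent in chain):
            location = move.container
    return location


# Classic Sally-Anne setup: Sally leaves, then Anne moves the object.
story = [Move(mover="Anne", container="box", present=frozenset({"Anne"}))]

reality = nested_belief("basket", story, [])
sally_first_order = nested_belief("basket", story, ["Sally"])
anne_about_sally = nested_belief("basket", story, ["Anne", "Sally"])
```

Under these assumptions, each additional agent in `chain` raises the ToM order by one, so a fourth-order question corresponds to a chain of four agents.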

Paper Structure (34 sections, 15 figures, 8 tables, 2 algorithms)

Introduction
The Hi-ToM Dataset
Dataset Design
Story Design.
Question-Answer Design.
Data Generation
Dataset Characteristics
Experimental Setup
Models
Methods
Evaluation
Experimental Results
CoTP prompting yields insignificant performance gains.
Increased ToM order leads to decreased performances.
LLMs' performance decreases as there are more deception communications involved.
...and 19 more sections

Figures (15)

Figure 1: A Hi-ToM story containing communications among agents, adapted from the Sally-Anne story (Baron-Cohen et al., 1985). The four questions at the bottom correspond to ToM orders zeroth (reality) through third.
Figure 2: Joint accuracy of GPT-4 and GPT-3.5 on Hi-ToM stories with or without deceptive agent communications. The $x$-axis shows ToM order, and the $y$-axis shows story length (number of chapters). CoTP and VP denote the chain-of-thought and multiple-choice-without-explanation prompting styles, respectively. The devil sign marks accuracy results on stories with deception; all other results are for non-deceptive stories.
Figure 3: Joint accuracy of GPT-4 on Hi-ToM stories with 0 to 4 sentences of deceptive agent communication. 0th-order (reality) accuracy is not included, since the answer to the real room of the objects is not affected by deceptive communications.
Figure 4: Frequency of GPT-4 correctly or incorrectly answering a question of a three-chapter story, based on whether or not the correct answer is the last or first container mentioned in the story. "Last"/"First" and "$\neg$Last"/"$\neg$First" indicate whether or not the correct answer lies at the last/first container.
Figure 5: Standard accuracy of GPT-4 on 2nd, 3rd, and 4th-order questions, categorized by whether the correct answer matches the corresponding 1st-order answer.
...and 10 more figures
