Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, Naihao Deng
TL;DR
Hi-ToM introduces the first benchmark tailored to higher-order Theory of Mind (ToM) reasoning in large language models. It extends prior first- and second-order tasks with zeroth- to fourth-order questions and agent deception. The dataset comprises Sally-Anne–style stories built from rooms, objects, containers, and five agents, and it includes multi-chapter narratives and public/private communications that challenge recursive belief reasoning. Experiments show that the accuracy of state-of-the-art LLMs, even with chain-of-thought prompting, drops substantially as ToM order rises, and that deception lowers it further. The authors analyze error types and behavioral patterns and argue for human-inspired and symbolically augmented approaches to strengthen ToM capabilities. These findings highlight limits of current LLMs in complex social reasoning and point toward hybrid methods for improving language understanding and interaction tasks.
Abstract
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. ToM plays a critical role in the development of intelligence, language understanding, and cognitive processes. While previous work has primarily focused on first- and second-order ToM, we explore higher-order ToM, which involves recursive reasoning about others' beliefs. We introduce Hi-ToM, a Higher-Order Theory of Mind benchmark. Our experimental evaluation of various Large Language Models (LLMs) shows that performance declines on higher-order ToM tasks, demonstrating the limitations of current LLMs. We conduct a thorough analysis of the different failure cases of LLMs and discuss the implications of our findings for the future of NLP.
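
To make "recursive reasoning about others' beliefs" concrete, the following is a minimal Python sketch of how a k-th-order belief question could be answered in a Sally-Anne–style story. It is an illustration only, not the benchmark's generator or the authors' method. The names Event, believed_location, and story are hypothetical, and the sketch assumes a single object, no communication or deception, and that every agent knows the object's initial container. Under those assumptions, a nested belief such as "where does Anne think Tom thinks the object is?" is updated only by moves that every agent in the chain witnessed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Event:
    kind: str                        # "enter", "exit", or "move"
    agent: str                       # the agent who acts
    container: Optional[str] = None  # destination container, only for "move"


def believed_location(events: Sequence[Event], chain: Sequence[str], initial: str) -> str:
    """Return the object's location according to a nested belief chain.

    chain = ["Anne", "Tom"] asks: where does Anne think Tom thinks the object is?
    The length of the chain is the ToM order; an empty chain gives the true location.
    Simplifying assumption: a move updates the nested belief only if every agent
    in the chain is in the room when it happens.
    """
    present = set()
    belief = initial
    for ev in events:
        if ev.kind == "enter":
            present.add(ev.agent)
        elif ev.kind == "exit":
            present.discard(ev.agent)
        elif ev.kind == "move" and all(a in present for a in chain):
            belief = ev.container
    return belief


# Hypothetical story: the object starts in the basket.
story = [
    Event("enter", "Sally"),
    Event("enter", "Anne"),
    Event("enter", "Tom"),
    Event("exit", "Sally"),
    Event("move", "Anne", "box"),
    Event("exit", "Tom"),
    Event("move", "Anne", "drawer"),
]

print(believed_location(story, [], "basket"))                        # zeroth order
print(believed_location(story, ["Tom"], "basket"))                   # first order
print(believed_location(story, ["Anne", "Tom"], "basket"))           # second order
print(believed_location(story, ["Anne", "Tom", "Sally"], "basket"))  # third order
```

In this story the four queries yield drawer, box, box, and basket: each additional agent in the chain can only restrict the set of moves that count, which is why the answer drifts back toward earlier locations as the order grows. The benchmark's stories add communication and deception, which this sketch does not model.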
