OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models

Hainiu Xu; Runcong Zhao; Lixing Zhu; Jinhua Du; Yulan He

TL;DR

The paper introduces OpenToM, a neural Theory-of-Mind (N-ToM) benchmark that addresses ambiguities in prior datasets by using longer narratives with explicit personality traits and intention-driven actions to test both physical and psychological ToM in LLMs. It outlines a four-stage data-generation pipeline (personality assignment, intention/enaction generation, narrative construction, human refinement) and a task design with 696 narratives, each followed by location (Loc), multi-hop (MHop), and attitude (Att) questions. The authors evaluate a range of LLMs with CoT and SimToM prompting and find strong performance on physical ToM components but notable gaps in reasoning about psychological states. The benchmark is intended as a resource for measuring and guiding future improvements in ToM and social commonsense reasoning in AI systems.

Abstract

Neural Theory-of-Mind (N-ToM), a machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including the presence of ambiguous and artificial narratives, the absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.

Paper Structure (44 sections, 3 equations, 8 figures, 16 tables, 1 algorithm)

Introduction
The OpenToM Dataset
OpenToM Construction
OpenToM Overview
Task Formulation
Question Genres
Mitigating Spurious Correlation
Dataset Validation
Experiments
Baseline Models
Prompting Techniques
Overall Results
Detailed Result Analysis
Faithfulness in Loc Questions
Performance Gap in Character Roles
...and 29 more sections

Figures (8)

Figure 1: Illustration of a simplified story from OpenToM and the corresponding first-order ToM questions. This story features two protagonists: Sam (observer) and Amy (mover); and an entity-of-interest: rubber duck. There are two containers involved: a basket and Amy's backpack. Each narrative within OpenToM is followed by three types of questions, namely questions regarding the location (Loc) of an entity, questions that involve multi-hop reasoning (MHop), and questions about the characters' attitude (Att).
Figure 2: The data-generating process of the OpenToM dataset. Using the story in Figure 1 as an example, the features created in the personification process are shown in Part (A), which include character preference ($\varheart$), belief of the other character's preference ($\clubsuit$), the perturbed mover's preference belief ($\spadesuit$), the mover's personality trait ($\bigstar$), and the mover's intention and action ($\blacklozenge$). The use of this information in the OpenToM plot is shown in Part (B) next to the paragraph indicator. See the Appendix for a detailed description of the Human Annotation and Rule-Based Label Generation process.
Figure 3: A Bayesian Network representation of the dependencies among preference ($P$), personality trait ($T$), intention ($Int$), action ($Act$), and attitude ($Att$). The causal relations are represented by solid arrows. The spurious correlations are represented by dashed arrows. The grey-shaded variables are observable by the observer and the unshaded variables are latent to the observer.
Figure 4: Faithfulness of LLMs in answering Loc questions. The x-axis displays the evaluation model and the y-axis displays the Unfaithful Rate.
Figure A1: Illustration of the reasoning tree employed to answer the Fullness questions.
...and 3 more figures
