ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind

Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, Kuniko Saito

TL;DR

ToMATO is a Theory-of-Mind benchmark generated from LLM-LLM conversations that combine inner speech prompting, information asymmetry, and varied personality traits. It assesses first- and second-order mental states across belief, intention, desire, emotion, and knowledge, and covers false beliefs (ToMATO-FB). ToMATO comprises 5.4k questions over 753 conversations and 15 personality trait patterns. The study shows that even strong LLMs fall short of human performance, especially on false-belief questions, and that performance varies with the mental state category and the personality traits involved, indicating robustness gaps. By explicitly modeling hidden thoughts and diverse character traits, ToMATO aims to provide a more realistic, debuggable, and actionable benchmark for deploying ToM-capable systems in real-world communication. The results suggest directions for improving ToM reasoning in LLMs, including extended training, better handling of false beliefs, and greater robustness to personality variation.

Abstract

Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.
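The construction idea described in the abstract (thoughts verbalized before each utterance and hidden from the other speaker) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` record, `visible_history`, `simulate`, and the `stub_llm` placeholder are all hypothetical names, and the thought prefixes are only modeled on the examples given in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical LLM interface: (persona, visible history, thought prefix)
# -> (completion of the thought, public utterance).
LLM = Callable[[str, str, str], Tuple[str, str]]


@dataclass
class Turn:
    speaker: str
    thought: str    # private: only the speaker sees it later
    utterance: str  # public: both speakers see it


def visible_history(turns: List[Turn], viewer: str) -> str:
    """Render the transcript as `viewer` sees it (information asymmetry)."""
    lines = []
    for t in turns:
        if t.speaker == viewer:
            lines.append(f"{t.speaker}: ({t.thought}) {t.utterance}")
        else:
            lines.append(f"{t.speaker}: {t.utterance}")  # thought is hidden
    return "\n".join(lines)


def simulate(names: List[str], personas: Dict[str, str], llm: LLM,
             prefixes: List[str], n_turns: int) -> List[Turn]:
    """Alternate speakers; each one verbalizes a thought before speaking."""
    turns: List[Turn] = []
    for i in range(n_turns):
        speaker = names[i % len(names)]
        prefix = prefixes[i % len(prefixes)]
        history = visible_history(turns, speaker)
        completion, utterance = llm(personas[speaker], history, prefix)
        turns.append(Turn(speaker, f"{prefix} {completion}", utterance))
    return turns


def stub_llm(persona: str, history: str, prefix: str) -> Tuple[str, str]:
    """Stand-in for a real model call; returns fixed text for illustration."""
    return (f"secret-of-{persona}", "What do you have in mind?")


turns = simulate(
    names=["A", "B"],
    personas={"A": "pA", "B": "pB"},
    llm=stub_llm,
    prefixes=["I feel", "I think that she feels"],
    n_turns=2,
)
# Each stored thought can later serve as the answer to a question about
# that speaker's mental state at that turn.
print(visible_history(turns, "B"))
```

In this sketch the second speaker's view contains the first speaker's utterance but not the first speaker's thought, which is the asymmetry the abstract credits with inducing false beliefs.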

Paper Structure (51 sections, 1 equation, 10 figures, 15 tables)

Figures (10)

  • Figure 1: (a) Conversation between two role-playing LLMs with information asymmetry. Before each agent speaks to the other, our Inner Speech prompting (e.g., "I feel" or "I think that he/she/they feels") prompts the agent to verbalize its first- and second-order mental states as thoughts. The verbalized thoughts are used as the answers to the questions in ToMATO. (b) To detect false beliefs, both GPT-4o mini and human annotators judge whether character B misunderstands A's mental state at each turn.
  • Figure 2: Statistical word-level correlation analysis (Gardner et al., 2021) between the generated thoughts and the personality traits given in system prompts.
  • Figure 3: Statistical word-level correlation analysis (Gardner et al., 2021) on four benchmarks. Among the four, ToMATO (ours) contains the fewest word-level spurious correlations in options, indicating that sophisticated solutions are needed to achieve higher scores than the random baseline on ToMATO.
  • Figure 4: Conversation annotation.
  • Figure 5: QA pair annotation.
  • ...and 5 more figures
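
The word-level correlation analysis named in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a simplified illustration and may differ from the paper's exact procedure: it assumes the standard one-proportion z-statistic, whitespace tokenization, and a chance rate `p0` supplied by the caller (for example 1/k for k answer options). The function name and the toy options are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def word_z_scores(options: List[Tuple[str, bool]], p0: float,
                  min_count: int = 1) -> Dict[str, float]:
    """z-statistic per word for 'option containing this word is correct'.

    options: (option text, is_correct) pairs pooled over all questions.
    p0: chance rate of an option being correct (assumed, e.g. 1/k).
    """
    total: Counter = Counter()
    correct: Counter = Counter()
    for text, is_correct in options:
        # Count each word at most once per option.
        for word in set(text.lower().split()):
            total[word] += 1
            if is_correct:
                correct[word] += 1

    scores: Dict[str, float] = {}
    for word, n in total.items():
        if n < min_count:
            continue
        p_hat = correct[word] / n
        # Large |z| suggests the word alone is predictive of correctness.
        scores[word] = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return scores


# Toy data: one question with four options, so p0 = 0.25.
toy_options = [
    ("she feels happy", True),
    ("she feels sad", False),
    ("she feels angry", False),
    ("she feels calm", False),
]
scores = word_z_scores(toy_options, p0=0.25)
print(sorted(scores.items(), key=lambda kv: -abs(kv[1])))
```

Words shared by all options, such as "she" in the toy data, score zero, while a word that appears only in correct options scores high. A benchmark with few high-scoring words offers fewer lexical shortcuts.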