Table of Contents
Fetching ...

CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan

Abstract

Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.

CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Abstract

Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.
Paper Structure (28 sections, 7 figures, 6 tables)

This paper contains 28 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An example of second-order belief–desire reasoning. The story is presented across multiple rounds, with questions posed at different stages as the narrative unfolds. Images are provided for each round to enhance illustration and contextual understanding. In addition to the story content, we supply adaptive hints for LLMs. For example, if the model answers the first question correctly, we add the feedback: "That’s right—Mom thinks Leo still wants the red race car." If the answer is incorrect, we provide a corrective hint: "Remember, Mom didn’t hear Leo say that he changed his mind, so she still believes he wants the first toy he picked."
  • Figure 2: A scenario analogous to a reference (hard) task from the ToM booklet task. The child asks for the big apple; however, because the leaves block his view, Grandpa cannot see the largest apple at the top of the tree. As a result, he believes the medium apple is the biggest and therefore picks it for the child.
  • Figure 3: Statistics of CoMMET. To improve interpretability, we group related tasks from the Theory of Mind booklet and map them to the mental states defined in ATOMS. In addition to the ATOMS mental states, the ToM booklet tasks also encompass moral reasoning.
  • Figure 4: A scenario from the Reference (Hard) task in the ToM Booklet is illustrated. Although the child asks for the big cookie, the father—due to his viewing position—can see only the small and medium cookies. Consequently, he believes the medium cookie is the largest available and therefore packs it for the child.
  • Figure 5: The image for spatial perspective task. ID: story_5_2025-12-27 11_22_07.738437_1
  • ...and 2 more figures