Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses
Maryam Amirizaniani, Elias Martin, Maryna Sivachenko, Afra Mashhadi, Chirag Shah
TL;DR
This work investigates whether large language models (LLMs) can exhibit human-like Theory of Mind (ToM) in open-ended reasoning by testing them on Reddit ChangeMyView prompts. It combines human-based evaluation with metric-based assessments, using three LLMs (Zephyr-7B, Llama2-Chat-13B, GPT-4) and a two-round protocol that adds mental-state prompts (sentiment, emotion, intention). The results reveal clear gaps between LLM and human ToM reasoning in open-ended questions. Although prompt tuning with mental-state information improves performance across multiple metrics, it does not achieve human-level ToM. The findings highlight enduring limitations in social reasoning for LLMs and point to prompt engineering as a partial, but not complete, route toward more human-like ToM capabilities in open-ended contexts.
Abstract
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts, which is vital for guiding one's own thought processes. Although large language models (LLMs) excel in tasks such as summarization, question answering, and translation, they still face challenges with ToM reasoning, especially in open-ended questions. Despite advancements, the extent to which LLMs truly understand ToM reasoning and how closely it aligns with human ToM reasoning remains inadequately explored in open-ended scenarios. Motivated by this gap, we assess the abilities of LLMs to perceive and integrate human intentions and emotions into their ToM reasoning processes within open-ended questions. Our study utilizes posts from Reddit's ChangeMyView platform, which demands nuanced social reasoning to craft persuasive responses. Our analysis, comparing semantic similarity and lexical overlap metrics between responses generated by humans and LLMs, reveals clear disparities in ToM reasoning capabilities in open-ended questions, with even the most advanced models showing notable limitations. To enhance LLM capabilities, we implement a prompt tuning method that incorporates human intentions and emotions, resulting in improvements in ToM reasoning performance. However, despite these improvements, the enhancement still falls short of fully achieving human-like reasoning. This research highlights the deficiencies in LLMs' social reasoning and demonstrates how integrating human intentions and emotions can boost their effectiveness.
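To make the metric-based comparison concrete, the following is a minimal sketch of how a human response and an LLM response might be scored against each other. It is not the authors' evaluation code: the abstract does not name the exact metrics, so the unigram F1 score here stands in for a lexical overlap metric, and the term-frequency cosine is only a crude, token-count proxy for semantic similarity (an embedding-based measure would normally be used instead). The function names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def unigram_f1(reference, candidate):
    """Lexical overlap: F1 over unigram counts shared by the two texts."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def cosine_tf(reference, candidate):
    """Cosine similarity of raw term-frequency vectors.

    Assumption: a stand-in for semantic similarity; it only sees
    shared tokens, not meaning.
    """
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    dot = sum(ref_counts[tok] * cand_counts[tok] for tok in ref_counts)
    ref_norm = math.sqrt(sum(v * v for v in ref_counts.values()))
    cand_norm = math.sqrt(sum(v * v for v in cand_counts.values()))
    norm = ref_norm * cand_norm
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical texts, not drawn from the ChangeMyView data.
    human_reply = "I see why you feel that way, but consider the cost to renters."
    llm_reply = "There are several reasons to consider the cost to renters."
    print("unigram F1:", round(unigram_f1(human_reply, llm_reply), 3))
    print("TF cosine:", round(cosine_tf(human_reply, llm_reply), 3))
```

Under this sketch, scoring each LLM response against the human response to the same post, once with and once without the mental-state information in the prompt, would give the kind of before-and-after comparison the abstract describes.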
