PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues

Fangxu Yu, Lai Jiang, Shenyi Huang, Zhen Wu, Xinyu Dai

TL;DR

PersuasiveToM introduces a benchmark for evaluating Large Language Models' ability to reason about dynamic mental states in persuasive dialogues. It defines two core tasks, ToM Reasoning (Desire, Belief, Intention) and ToM Application (Persuasion Strategy Prediction and Judgement), and evaluates them on a dataset derived from DailyPersuasion spanning 35 domains. Experiments with eight state-of-the-art LLMs show that models struggle to track evolving desires and beliefs over the course of a dialogue, and that chain-of-thought prompting helps more with strategy prediction than with ToM reasoning. Humans outperform models on all tasks, highlighting a gap in current ToM capabilities and motivating future work on memory augmentation and more richly annotated datasets for more robust social reasoning in AI.

Abstract

The ability to understand and predict the mental states of oneself and others, known as Theory of Mind (ToM), is crucial for effective social interaction. Although recent studies have evaluated ToM in Large Language Models (LLMs), existing benchmarks focus on simplified settings (e.g., Sally-Anne-style tasks) and overlook the complexity of real-world social interactions. To address this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework contains two core tasks: ToM Reasoning, which tests tracking of evolving desires, beliefs, and intentions; and ToM Application, which assesses the use of inferred mental states to predict and evaluate persuasion strategies. Experiments across eight leading LLMs reveal that while models perform well on many questions, they struggle with tasks that require tracking the dynamics and shifts of mental states and comprehensively understanding mental states across the whole dialogue. With PersuasiveToM, we aim to enable effective evaluation of the ToM reasoning abilities of LLMs, with greater focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
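The two task families can be made concrete with a small scoring sketch. It assumes, hypothetically, that each question is multiple choice and is scored correct when the model's option label matches the gold label. The `Item` record, the task-name strings, and `accuracy_by_task` are illustrative names only, not the data format or evaluation code of the PersuasiveToM repository.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical task identifiers mirroring the two task families.
TOM_REASONING = ("desire", "belief", "intention")
TOM_APPLICATION = ("strategy_prediction", "strategy_judgement")


@dataclass
class Item:
    """One hypothetical benchmark question (not the official data format)."""
    task: str  # one of TOM_REASONING + TOM_APPLICATION
    gold: str  # gold option label, e.g. "B"
    pred: str  # option label predicted by the model


def accuracy_by_task(items):
    """Return per-task accuracy plus a macro average for each task family.

    A family average is computed only over tasks that actually occur in `items`.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        total[it.task] += 1
        # Case- and whitespace-insensitive match of option labels.
        correct[it.task] += int(it.pred.strip().upper() == it.gold.strip().upper())

    scores = {task: correct[task] / total[task] for task in total}
    for family, tasks in (("tom_reasoning", TOM_REASONING),
                          ("tom_application", TOM_APPLICATION)):
        present = [scores[t] for t in tasks if t in scores]
        if present:
            scores[family] = sum(present) / len(present)
    return scores
```

For example, two Desire items with one correct answer and one fully correct Belief item give a Desire accuracy of 0.5 and a ToM Reasoning macro average of 0.75. Macro-averaging keeps a task with many questions from dominating its family's score; a micro average over all items would be an equally reasonable choice.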

Paper Structure

This paper contains 43 sections, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An example in PersuasiveToM. Bob is persuading Alice to join the botanical garden tour.
  • Figure 2: Domains of PersuasiveToM: 35 domains in total, grouped under 6 primary topics.
  • Figure 3: Distribution of errors on Desire questions across stages of dialogue progress. The left panel corresponds to the persuader and the right panel to the persuadee.
  • Figure 4: Model errors on Belief questions about the persuader.
  • Figure 5: Model errors on Belief questions about the persuadee.
  • ...and 5 more figures
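Figure 3 breaks down Desire-question errors by stage of dialogue progress. The sketch below shows one way such a breakdown could be computed. It assumes a hypothetical error record of `(role, turn_index, num_turns)` and splits each dialogue into three equal-length stages by relative turn position; the names `dialogue_stage`, `error_distribution`, and the early/middle/late split are illustrative assumptions and may not match how the paper defines stages.

```python
from collections import Counter

# Hypothetical stage labels: equal thirds of each dialogue by turn position.
STAGES = ("early", "middle", "late")


def dialogue_stage(turn: int, num_turns: int) -> int:
    """Map a 0-based turn index to a stage index by its relative position."""
    if num_turns <= 0 or not 0 <= turn < num_turns:
        raise ValueError("turn must satisfy 0 <= turn < num_turns")
    # Since turn < num_turns, the result is always in range(len(STAGES)).
    return turn * len(STAGES) // num_turns


def error_distribution(errors, role):
    """Return the fraction of one role's errors that falls in each stage.

    `errors` is an iterable of (role, turn_index, num_turns) tuples
    (hypothetical format); `role` is e.g. "persuader" or "persuadee".
    """
    counts = Counter()
    for r, turn, num_turns in errors:
        if r == role:
            counts[STAGES[dialogue_stage(turn, num_turns)]] += 1
    total = sum(counts.values())
    return {s: (counts[s] / total if total else 0.0) for s in STAGES}
```

Normalizing by relative position lets dialogues of different lengths share one set of stages; computing the distribution separately per role mirrors the persuader/persuadee split of the two panels.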