RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems

Mengfan Li, Xuanhua Shi, Yang Deng

TL;DR

RecToM is a benchmark for evaluating Machine Theory of Mind (ToM) in LLM-based conversational recommender systems (CRS). It addresses a gap in existing ToM benchmarks, which focus on retrospective or synthetic scenarios and neglect behavioral guidance for future interactions. The benchmark defines two core capabilities, Cognitive Inference (desire, belief, and intention reasoning) and Behavioral Prediction (prediction and judgment of dialogue strategies), and employs a multi-turn, multi-dimensional annotation scheme derived from ReDial dialogues. Experiments across state-of-the-art LLMs show that while models can handle coarse ToM tasks such as belief and desire reasoning, they struggle with fine-grained intention inference, multi-dimensional beliefs, and consistent strategic reasoning; chain-of-thought prompting yields limited and inconsistent gains. The results reveal systematic biases toward pleasing responses and point to the need for more robust reasoning architectures and prompting strategies to achieve human-like ToM in realistic CRS settings.

Abstract

Large language models are revolutionizing conversational recommender systems through their impressive capabilities in instruction comprehension, reasoning, and human interaction. A core factor underlying effective recommendation dialogue is the ability to infer and reason about users' mental states (such as desire, intention, and belief), a cognitive capacity commonly referred to as Theory of Mind. Despite growing interest in evaluating ToM in LLMs, current benchmarks predominantly rely on synthetic narratives inspired by the Sally-Anne test, which emphasize physical perception and fail to capture the complexity of mental state inference in realistic conversational settings. Moreover, existing benchmarks often overlook a critical component of human ToM: behavioral prediction, the ability to use inferred mental states to guide strategic decision-making and select appropriate conversational actions for future interactions. To better align LLM-based ToM evaluation with human-like social reasoning, we propose RecToM, a novel benchmark for evaluating ToM abilities in recommendation dialogues. RecToM focuses on two complementary dimensions: Cognitive Inference and Behavioral Prediction. The former focuses on understanding what has been communicated by inferring the underlying mental states. The latter emphasizes what should be done next, evaluating whether LLMs can leverage these inferred mental states to predict, select, and assess appropriate dialogue strategies. Extensive experiments on state-of-the-art LLMs demonstrate that RecToM poses a significant challenge. While the models exhibit partial competence in recognizing mental states, they struggle to maintain coherent, strategic ToM reasoning throughout dynamic recommendation dialogues, particularly in tracking evolving intentions and aligning conversational strategies with inferred mental states.

Paper Structure

This paper contains 26 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An example dialogue in RecToM.
  • Figure 2: Coarse-Grained and Fine-Grained Intention Classification for Recommenders (Left) and Seekers (Right) in the RecToM Benchmark, with segment sizes in the doughnut charts reflecting the frequency of each intention category.
  • Figure 3: Intention reasoning compared across 10 models (accuracy in %). Fine2Coarse intention reflects the accuracy of mapping fine-grained intentions to their predefined coarse-grained categories. The upper section reports results for the recommender; the lower section reports results for the seeker.
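
The Fine2Coarse accuracy named in the Figure 3 caption can be illustrated with a minimal sketch. It assumes the metric counts an item as correct when the coarse category chosen by a model equals the category that a predefined mapping assigns to the gold fine-grained intention. The label names, the `FINE_TO_COARSE` table, and the `fine2coarse_accuracy` function are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical fine-to-coarse mapping; these label names are illustrative only
# and are not the categories used in RecToM.
FINE_TO_COARSE: Dict[str, str] = {
    "ask_preference": "inquire",
    "ask_feedback": "inquire",
    "give_recommendation": "recommend",
    "explain_recommendation": "recommend",
}


def fine2coarse_accuracy(
    predicted_coarse: List[str],
    gold_fine: List[str],
    mapping: Dict[str, str],
) -> float:
    """Return the percentage of items whose predicted coarse category
    matches the category the mapping assigns to the gold fine label."""
    if len(predicted_coarse) != len(gold_fine):
        raise ValueError("predictions and gold labels must have equal length")
    if not gold_fine:
        return 0.0
    correct = sum(
        1 for pred, fine in zip(predicted_coarse, gold_fine)
        if mapping[fine] == pred
    )
    return 100.0 * correct / len(gold_fine)


# Assumed example data: four gold fine-grained intentions and a model's
# coarse-category answers for them.
gold = ["ask_preference", "give_recommendation", "ask_feedback", "explain_recommendation"]
pred = ["inquire", "recommend", "recommend", "recommend"]
score = fine2coarse_accuracy(pred, gold, FINE_TO_COARSE)
print(f"Fine2Coarse accuracy: {score:.1f}%")
```

In this assumed example, three of the four coarse predictions agree with the mapping, so the score is 75.0%. The paper's actual scoring protocol may differ, for instance in how multi-label intentions are handled.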