Enhancing Conversational Agents with Theory of Mind: Aligning Beliefs, Desires, and Intentions for Human-Like Interaction

Mehdi Jafari, Devin Yuncheng Hua, Hao Xue, Flora Salim

TL;DR

The paper investigates whether Theory of Mind (ToM) information is encoded in the internal activations of open-source LLMs and whether explicit ToM manipulation can improve alignment in dialogue. It reads ToM from model activations via linear probing and LatentQA, and introduces a ToM-controlled generation mechanism based on the Belief–Desire–Intention (BDI) framework to steer outputs. Empirical results across CaSiNo, CraigslistBargain, FanToM, and NegotiationToM show that ToM-informed alignment can enhance response quality, achieving win rates of $67.15\%$ for the 3B model and $63.25\%$ for the 8B model in controllability tasks. The work highlights both the viability and limitations of ToM-based alignment, and points to future integration into real-world, multi-model agents with careful attention to evaluation and ethics.

Abstract

Natural language interaction with agentic Artificial Intelligence (AI), driven by Large Language Models (LLMs), is expected to remain a dominant paradigm in the near future. While humans instinctively align their communication with mental states, an ability known as Theory of Mind (ToM), current LLM-powered systems exhibit significant limitations in this regard. This study examines the extent to which open-source language models (LLaMA) can capture and preserve ToM-related information and how effectively it contributes to consistent ToM reasoning in generated responses. We further investigate whether explicit manipulation of ToM-related components, such as beliefs, desires, and intentions, can enhance response alignment. Experiments on two LLaMA 3 variants demonstrate that incorporating ToM-informed alignment improves response quality, achieving win rates of 67 and 63 percent for the 3B and 8B models, respectively. These findings highlight the potential of ToM-driven strategies to improve alignment in LLM-based conversational agents.

Paper Structure

This paper contains 58 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: An overview of the challenges faced by current conversational AI agents lacking ToM alignment, highlighted in violet. The alternative approach, using ToM-informed agents, is shown in teal. Prior to each response generation, a ToM alignment phase (highlighted in gray) is introduced to ensure better context understanding and alignment. Selected examples demonstrating the practical impact of ToM alignment can be found in the appendix of sample responses.
  • Figure 2: The LatentQA interpretability pipeline is employed for ToM alignment. In this setup, yellow illustrates how to interpret ToM from a conversation, while cyan demonstrates how to use a steered model to generate aligned utterances. The backpropagation paths for each component are highlighted with dashed arrows; they are active only during training, not during inference.
  • Figure 3: The win rate of ToM-aligned model responses compared to that of the out-of-the-box model across various experiments. Each subsection focuses on a specific ToM component aligned with the ground truth of the conversation. The name of the altered ToM component for each experiment is displayed below each bar, and the number of samples is indicated above it. Successful examples are detailed in the appendix of sample responses, while general trends observed in failure cases are presented in the appendix of failure cases.
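
To make the win-rate metric in Figure 3 concrete, the following is a minimal sketch of how a per-experiment win rate could be computed from pairwise judgments. It is not the authors' evaluation code: the function name `win_rate`, the labels `"aligned"`, `"baseline"`, and `"tie"`, and the `tie_weight` parameter are all assumptions made for illustration, and the page does not state how ties are handled in the paper.

```python
from collections import Counter

# Hypothetical labels for a pairwise comparison between the ToM-aligned
# response and the out-of-the-box response (names are assumptions).
VALID_LABELS = {"aligned", "baseline", "tie"}


def win_rate(judgments, tie_weight=0.0):
    """Fraction of comparisons won by the ToM-aligned response.

    judgments:  iterable of labels from VALID_LABELS, one per sample.
    tie_weight: credit given to the aligned model for a tie
                (0.0 counts ties as losses, 0.5 splits them evenly).
    """
    counts = Counter(judgments)
    unknown = set(counts) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("win rate is undefined for zero samples")
    return (counts["aligned"] + tie_weight * counts["tie"]) / total


if __name__ == "__main__":
    # Toy judgments, invented purely to exercise the function.
    toy = {
        "belief": ["aligned", "aligned", "baseline", "tie"],
        "desire": ["aligned", "baseline"],
    }
    for component, labels in toy.items():
        print(f"{component}: n={len(labels)}, win rate={win_rate(labels):.2f}")
```

Reporting the sample count next to each rate, as the figure does above each bar, matters because a rate computed from a handful of comparisons is far less reliable than one computed from hundreds.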