Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models

Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek

TL;DR

A steep drop in ToM capabilities under task perturbation is shown for all evaluated LLMs, questioning the notion of any robust form of ToM being present.

Abstract

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate on whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought (CoT) prompting to enhance performance and explain the LLM's decision. We introduce a handcrafted, richly annotated ToM dataset that includes classic and perturbed false-belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, and task solutions. We also propose metrics to evaluate reasoning chain correctness and the extent to which final answers are faithful to the reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.

Paper Structure (45 sections, 8 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Illustrative example of a ToM task. We manually annotated every sentence with the correct current belief of a protagonist the agent has to reason about at each step. While CoT-P improves performance on some task classes, it degrades it on others. Our dataset makes it possible to assess whether this is grounded in correct step-wise reasoning, where it fails, and whether models are faithful to their reasoning.
  • Figure 2: Illustrating "Conclusion from Sentiment".
  • Figure 3: Dataset illustration with per-sentence gold belief states shown inline. The belief column encodes what Esther believes about the container contents after each sentence.
  • Figure 4: Examples of reasoning chains compared to the gold CoT. In (a), the model outputs a valid proper subsequence with consistent reasoning; in (b), intermediate states are inconsistent or skipped (step 3 "D" in the gold chain), leading to an invalid chain.
  • Figure 5: Comparison of effect strengths of CoT prompting given (a) incorrect (left-hand side) and (b) correct (right-hand side) reasoning rationales. Among all the evaluated models, we observe a placebo effect only in Mixtral, where incorrect reasoning (left) can even have a larger positive effect on final answer correctness than correct reasoning chains.
  • ...and 2 more figures
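
The Figure 4 caption describes a model's reasoning chain as a "valid proper subsequence" of the gold CoT. As a rough illustration only, the ordered-subsequence part of that idea can be sketched as below. The function names and the single-letter belief labels are hypothetical, and the paper's actual validity criterion (for example, which gold steps may be omitted) is not stated in this excerpt, so this is not the authors' metric.

```python
from typing import Sequence


def is_subsequence(candidate: Sequence[str], gold: Sequence[str]) -> bool:
    """True if every state in `candidate` occurs in `gold` in the same order.

    Gaps are allowed: gold states may be skipped, but never reordered.
    """
    remaining = iter(gold)
    # `state in remaining` advances the iterator until it finds `state`,
    # so each later state must appear after the previous match.
    return all(state in remaining for state in candidate)


def is_proper_subsequence(candidate: Sequence[str], gold: Sequence[str]) -> bool:
    """True if `candidate` is an ordered subsequence that omits at least one gold state."""
    return len(candidate) < len(gold) and is_subsequence(candidate, gold)


# Hypothetical belief labels, one per annotated step of a gold chain.
gold_chain = ["A", "B", "D", "C"]

in_order = is_proper_subsequence(["A", "D", "C"], gold_chain)      # order preserved
out_of_order = is_proper_subsequence(["A", "C", "D"], gold_chain)  # "D" follows "C"
```

A check like this only tests ordering. Deciding whether a skipped step (such as step 3 "D" in panel (b)) invalidates a chain would require the annotated space of valid chains from the dataset.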