Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?

Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati

TL;DR

This paper investigates whether Large Language Models (LLMs) genuinely exhibit Theory of Mind (ToM) abilities in Human-Robot Interaction (HRI) contexts. It introduces PERCEIVED Behavior RECOGNITION (PROBE), a ToM benchmark that probes four behavior types (explicability, legibility, predictability, obfuscation) across five domains, using human-subject studies and LLM prompts. The authors show that while vanilla prompts yield impressions of ToM alignment between LLMs (notably GPT-4) and humans, targeted perturbations reveal brittleness and a lack of robust ToM reasoning, challenging claims of true ToM in LLMs. A real-world case study with a Fetch robot reinforces the gap between human judgments and LLM robustness, underscoring the need for caution and deeper evaluation when deploying LLMs in safety-critical HRI settings. The work provides a foundation for more rigorous, perturbation-aware ToM assessment in HRI and outlines directions for future research to close the gap between retrieval-based cues and genuine reasoning in language models.

Abstract

Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on emergent abilities of Large Language Models, especially Theory of Mind (ToM) abilities. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human-Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot's generated behavior in a manner similar to a human observer. We focus on four behavior types, namely explicable, legible, predictable, and obfuscatory behavior, which have been extensively used to synthesize interpretable robot behaviors. The LLM's goal is therefore to be a human proxy to the agent, and to answer how a certain agent behavior would be perceived by the human in the loop, for example "Given a robot's behavior X, would the human observer find it explicable?". We conduct a human subject study to verify that users are able to correctly answer such a question in the curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results, inflating one's expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests that breaks this illusion: Inconsistent Belief, Uninformative Context, and Conviction Test. We conclude that the high score of LLMs on vanilla prompts showcases their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.
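
The invariance requirement in the abstract can be illustrated with a minimal sketch. It is not the authors' test harness: the prompt wording, the helper names (build_prompt, add_uninformative_context, is_invariant), and the two stub "models" are all hypothetical stand-ins, assumed here only to show what it means for an answer to stay unchanged when irrelevant context is added.

```python
from typing import Callable, List


def build_prompt(setting: str, plan: str, behavior: str) -> str:
    """Compose a yes/no question about how a human observer would perceive the plan.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    return (
        f"Setting: {setting}\n"
        f"Robot plan: {plan}\n"
        f"Would the human observer find this behavior {behavior}? Answer yes or no."
    )


def add_uninformative_context(prompt: str, filler: str) -> str:
    """Prepend a sentence that is irrelevant to the question (hypothetical perturbation)."""
    return f"{filler}\n{prompt}"


def is_invariant(ask: Callable[[str], str], prompt: str, perturbed: List[str]) -> bool:
    """Return True if every perturbed prompt gets the same answer as the original."""
    baseline = ask(prompt).strip().lower()
    return all(ask(p).strip().lower() == baseline for p in perturbed)


# Stand-ins for an LLM call. The brittle stub flips its answer whenever the
# prompt mentions weather; the robust stub ignores irrelevant context.
def brittle_stub(prompt: str) -> str:
    return "no" if "weather" in prompt.lower() else "yes"


def robust_stub(prompt: str) -> str:
    return "yes"


prompt = build_prompt(
    "A robot must fetch a cup from a table.",
    "move to table; pick up cup; return",
    "explicable",
)
perturbed = [add_uninformative_context(prompt, "The weather outside is sunny.")]

print(is_invariant(robust_stub, prompt, perturbed))   # True
print(is_invariant(brittle_stub, prompt, perturbed))  # False
```

In practice the stubs would be replaced by calls to an actual model. The point of the sketch is only that a high score on the unperturbed prompt says nothing about whether is_invariant holds.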

Paper Structure (47 sections, 11 figures, 3 tables)

Figures (11)

  • Figure 1: (a) An LLM being used as a Human-Proxy by a Robot as an internal-critique for behavior synthesis. (b) Interface used for our User Study: Example showing the three questions for Legibility in Fetch Domain.
  • Figure 2: An illustrative view of the domains used for testing LLM for Theory of Mind reasoning on the Perceived Behavior Recognition task. Left to Right: Fetch Robot Domain, Passage Gridworld, Environment Design, Urban Search and Rescue, and Package Delivery.
  • Figure 3: Human subject performance across five domains.
  • Figure 4: Performance on Q1 (binary response) across five domains and four robot behavior types. Human subjects' results have been scaled for a uniform comparison.
  • Figure 5: Likert Score (1-5) comparison for subjective evaluation of LLM responses.
  • ...and 6 more figures