Are Large Language Models Aligned with People's Social Intuitions for Human-Robot Interactions?

Lennart Wachowiak, Andrew Coles, Oya Celiktutan, Gerard Canal

TL;DR

This study interrogates whether large language models (LLMs) align with human social intuitions in human–robot interaction by re-creating three established HRI user studies with prompting-based evaluations. GPT-4 emerges as the strongest correlator with human judgments in two experiments on communication preferences and behavioral judgments, achieving $r_s$ values around $0.82$–$0.83$, while other models lag and vision-enabled inputs underperform compared to text-only prompts. The results reveal persistent gaps: LLMs struggle to differentiate robot-versus-human actions, exhibit a positivity bias in judgments, and chain-of-thought prompting often reduces alignment, especially in nuanced social tasks. The paper highlights critical challenges for deploying LLMs in social robotics and motivates development of robust multimodal perception and benchmark-driven evaluation to better capture human social values in real-time human–agent interactions.

Abstract

Large language models (LLMs) are increasingly used in robotics, especially for high-level action planning. Meanwhile, many robotics applications involve human supervisors or collaborators. Hence, it is crucial for LLMs to generate socially acceptable actions that align with people's preferences and values. In this work, we test whether LLMs capture people's intuitions about behavior judgments and communication preferences in human-robot interaction (HRI) scenarios. For evaluation, we reproduce three HRI user studies, comparing the output of LLMs with that of real participants. We find that GPT-4 strongly outperforms other models, generating answers that correlate strongly with users' answers in two studies: the first study dealing with selecting the most appropriate communicative act for a robot in various situations ($r_s = 0.82$), and the second with judging the desirability, intentionality, and surprisingness of behavior ($r_s = 0.83$). However, for the last study, testing whether people judge the behavior of robots and humans differently, no model achieves strong correlations. Moreover, we show that vision models fail to capture the essence of video stimuli and that LLMs tend to rate different communicative acts and behavior desirability higher than people.
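The reported $r_s$ values are Spearman rank correlations between model answers and participant answers. As a rough illustration of what such a score measures, the sketch below computes Spearman's correlation as the Pearson correlation of tie-averaged ranks. It is not the authors' evaluation code: the function names and the example ratings are hypothetical, and returning `None` for a constant rater is an assumption chosen to mirror the "N/A" entries described for models that always return the same score.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's r_s: Pearson correlation computed on the ranks of xs and ys.

    Returns None when either input is constant, since the correlation
    is undefined in that case.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical ratings for five stimuli (illustrative values, not data from the paper).
    human_mean_ratings = [4.2, 1.8, 3.5, 2.9, 4.8]
    model_ratings = [5, 2, 4, 4, 5]
    print(spearman(human_mean_ratings, model_ratings))
```

Because the score depends only on rank order, a model that rates every stimulus higher than people do can still correlate strongly with them. This is why a positivity bias and a high $r_s$ can coexist.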

Paper Structure (25 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Shortened examples of the LLM evaluation tasks. Correlations are based on answers across multiple stimuli.
  • Figure 2: Distribution of participant answers vs. GPT-4 answers. The task was to rate if a robot should (a) give a why-explanation or (b) ask for help given a scenario.
  • Figure 3: Scatterplots comparing human with model ratings
  • Figure 4: Spearman correlation between model answers and human answers for Experiment 2. ** for $p<0.05$, bold = highest correlation, N/A = model always returns the same score
  • Figure 5: VLM Input
  • ...and 1 more figure