Few-Shot Inference of Human Perceptions of Robot Performance in Social Navigation Scenarios

Qiping Zhang, Nathan Tsoi, Mofeed Nagib, Hao-Tien Lewis Chiang, Marynel Vázquez

TL;DR

The paper investigates using few-shot in-context learning with large language models to predict human perceptions of robot performance in social navigation. By augmenting the SEAN TOGETHER dataset, it demonstrates that LLMs can match or exceed traditional supervised models with an order of magnitude fewer labeled examples, and that prediction improves with more in-context demonstrations. It also analyzes which sensor-based observations drive predictions and shows personalized demonstrations further enhance accuracy, highlighting a scalable, user-centered pathway for evaluating and improving robot behavior in real-world settings. The work points to future extensions with multimodal data and adaptive robot policies that respond to predicted user perceptions.

Abstract

Understanding how humans evaluate robot behavior during human-robot interactions is crucial for developing socially aware robots that behave according to human expectations. While the traditional approach to capturing these evaluations is to conduct a user study, recent work has proposed utilizing machine learning instead. However, existing data-driven methods require large amounts of labeled data, which limits their use in practice. To address this gap, we propose leveraging the few-shot learning capabilities of Large Language Models (LLMs) to improve how well a robot can predict a user's perception of its performance, and study this idea experimentally in social navigation tasks. To this end, we extend the SEAN TOGETHER dataset with additional real-world human-robot navigation episodes and participant feedback. Using this augmented dataset, we evaluate the ability of several LLMs to predict human perceptions of robot performance from a small number of in-context examples, based on observed spatio-temporal cues of the robot and surrounding human motion. Our results demonstrate that LLMs can match or exceed the performance of traditional supervised learning models while requiring an order of magnitude fewer labeled instances. We further show that prediction performance can improve with more in-context examples, confirming the scalability of our approach. Additionally, we investigate what kind of sensor-based information an LLM relies on to make these inferences by conducting an ablation study on the input features considered for performance prediction. Finally, we explore the novel application of personalized examples for in-context learning, i.e., drawn from the same user being evaluated, finding that they further enhance prediction accuracy. This work paves the path to improving robot behavior in a scalable manner through user-centered feedback.

Paper Structure

This paper contains 12 sections, 2 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: We investigate to what extent Large Language Models (LLMs) can infer human perceptions of a mobile robot in navigation scenarios where a person -- the "follower" -- was guided by the robot to an indoor location. The inferences are made based on a few examples only using In-Context Learning (ICL). For each example, the input consists of sensor-based observations from the robot and the output is a binary performance level (e.g., indicating competent behavior).
  • Figure 2: ICL overview: An LLM predicts a person's perception of a robot on an evaluation example given a set of demonstrations in the prompt. In (a), demonstrations are gathered from interactions with users who are different from the person who generated the evaluation example. In (b), the demonstrations include examples from the same user who provided the evaluation example.
  • Figure 3: Prompt structure (a), including the structure for an example (b). The LLM is asked to predict robot competence.
  • Figure 4: Model accuracy for RQ1. (****), (**), and (*) denote $p < 0.0001$, $p < 0.01$, and $p < 0.05$. Error bars are std. err. and are small.
  • Figure 5: Results for RQ2. Average accuracy for Gemini 2.0 Flash No CoT with $K=4$. The model always takes as input the goal location, but the other spatial observations are ablated. Error bars are std. err. The symbols (****), (***), (**), and (*) denote $p < 0.0001$, $p < 0.001$, $p < 0.01$, and $p < 0.05$.
  • ...and 1 more figure
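
The in-context learning setup described in Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Example` fields, the prompt wording, the helper names `select_demonstrations` and `build_prompt`, and the rule of taking the first K candidates (rather than sampling) are all assumptions made for illustration. The sketch only shows the two demonstration regimes from Figure 2: (a) demonstrations drawn from other users, and (b) demonstrations that include examples from the same user as the evaluation example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Example:
    """One interaction sample (hypothetical schema, not from the paper)."""
    user_id: str
    observations: str            # text serialization of sensor-based observations
    label: Optional[int] = None  # 1 = competent, 0 = not competent, None = to predict


def select_demonstrations(pool: Sequence[Example], query: Example, k: int,
                          personalized: bool = False) -> List[Example]:
    """Pick k demonstrations for the prompt.

    Assumption: candidates are taken in pool order; a real pipeline
    would more likely sample them at random.
    """
    same_user = [e for e in pool if e.user_id == query.user_id and e is not query]
    other_users = [e for e in pool if e.user_id != query.user_id]
    if personalized:
        # Regime (b): prefer the evaluated user's own examples, then fill up.
        return (same_user + other_users)[:k]
    # Regime (a): only examples from users other than the evaluated one.
    return other_users[:k]


def build_prompt(demos: Sequence[Example], query: Example) -> str:
    """Assemble a K-shot prompt that ends where the LLM should write the label."""
    parts = ["Task: predict whether the follower perceived the robot "
             "as competent (1) or not competent (0)."]
    for i, demo in enumerate(demos, start=1):
        parts.append(f"Example {i}:\nObservations: {demo.observations}\n"
                     f"Label: {demo.label}")
    parts.append(f"Query:\nObservations: {query.observations}\nLabel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    pool = [
        Example("u1", "robot speed 0.5 m/s, follower distance 1.2 m", 1),
        Example("u2", "robot speed 0.9 m/s, follower distance 3.0 m", 0),
        Example("u2", "robot speed 0.4 m/s, follower distance 1.0 m", 1),
    ]
    query = Example("u2", "robot speed 0.6 m/s, follower distance 1.5 m")
    demos = select_demonstrations(pool, query, k=2, personalized=True)
    print(build_prompt(demos, query))
```

The resulting string would be sent to an LLM of choice, and the returned token parsed as the binary performance level. Keeping demonstration selection separate from prompt formatting makes it straightforward to vary K or switch between the two regimes without touching the prompt template.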