Predicting Human Perceptions of Robot Performance During Navigation Tasks

Qiping Zhang, Nathan Tsoi, Mofeed Nagib, Booyeon Choi, Jie Tan, Hao-Tien Lewis Chiang, Marynel Vázquez

TL;DR

The paper studies how to predict human perceptions of robot navigation performance from implicit, nonverbal cues collected in VR, and introduces the SEAN TOGETHER dataset. It compares human predictions against multiple machine learning models and shows that the models, especially those using spatial navigation features, outperform humans in most cases and generalize to unseen users when the task is framed as binary classification. A real-world demonstration, in which a mobile robot predicts how a person following it perceives it, points to sim-to-real transfer potential. The findings support using learned perception models to adapt robot behavior in social navigation tasks.

Abstract

Understanding human perceptions of robot performance is crucial for designing socially intelligent robots that can adapt to human expectations. Current approaches often rely on surveys, which can disrupt ongoing human-robot interactions. As an alternative, we explore predicting people's perceptions of robot performance using non-verbal behavioral cues and machine learning techniques. We contribute the SEAN TOGETHER Dataset consisting of observations of an interaction between a person and a mobile robot in Virtual Reality, together with perceptions of robot performance provided by users on a 5-point scale. We then analyze how well humans and supervised learning techniques can predict perceived robot performance based on different observation types (like facial expression and spatial behavior features). Our results suggest that facial expressions alone provide useful information, but in the navigation scenarios that we considered, reasoning about spatial features in context is critical for the prediction task. Also, supervised learning techniques outperformed humans' predictions in most cases. Further, when predicting robot performance as a binary classification task on unseen users' data, the F1-Score of machine learning models more than doubled that of predictions on a 5-point scale. This suggested good generalization capabilities, particularly in identifying performance directionality over exact ratings. Based on these findings, we conducted a real-world demonstration where a mobile robot uses a machine learning model to predict how a human who follows it perceives it. Finally, we discuss the implications of our results for implementing these supervised learning models in real-world navigation. Our work paves the path to automatically enhancing robot behavior based on observations of users and inferences about their perceptions of a robot.
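
The comparison between 5-point and binary prediction described in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the threshold used to binarize ratings, the use of macro-averaged F1, and the toy ratings below are all assumptions made for illustration only.

```python
# Minimal sketch (assumptions: threshold-based binarization, macro-averaged F1,
# and made-up ratings; none of these are taken from the paper).

def to_binary(rating, threshold=3):
    """Map a 5-point rating to 1 (rating >= threshold) or 0 (otherwise)."""
    return 1 if rating >= threshold else 0

def f1_for_label(y_true, y_pred, label):
    """F1-Score for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-Scores."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Hypothetical ground-truth and predicted ratings on the 5-point scale.
true_5 = [1, 2, 3, 4, 5, 4, 2, 5]
pred_5 = [2, 2, 4, 5, 5, 3, 1, 4]

# The same data collapsed to two classes.
true_2 = [to_binary(r) for r in true_5]
pred_2 = [to_binary(r) for r in pred_5]

f1_five_point = macro_f1(true_5, pred_5, labels=[1, 2, 3, 4, 5])
f1_binary = macro_f1(true_2, pred_2, labels=[0, 1])

print(f"5-point macro F1: {f1_five_point:.2f}")
print(f"binary macro F1:  {f1_binary:.2f}")
```

In this toy data most predictions miss the exact rating by one point but fall on the correct side of the threshold, so the binary score is much higher than the 5-point score. This mirrors the abstract's point about directionality versus exact ratings; the numbers themselves are illustrative and do not reproduce the paper's results.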

Paper Structure (26 sections, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Data collection. Humans controlled an avatar in the simulation with VR (a) while they were guided by a Fetch robot (b). The screen on the desk shows what the user saw.
  • Figure 2: a) It is typical to gather explicit human feedback about robot performance using surveys after human-robot interactions conclude because interruptions by the experimenters can easily bias human-robot social encounters. Unfortunately, the feedback from surveys tends to be very limited, making it difficult to understand robot performance at a granular level. Alternatively, participants may complete video annotations of their experiences [zhang2023sean], but this can be time-consuming and taxing, especially in continuous navigation tasks. b) In this work, we first collect a dataset of human perceptions of a robot's performance by prompting participants during interactions using VR (Training Step in the diagram). Then, we use this explicit feedback to train models that infer human perceptions of robot performance based on observations of the interactions, especially observations of human implicit feedback. The value of such a model is that once it is trained, it can be reused to estimate robot performance during new interactions (Deployment Step), without having to ask humans for explicit feedback as in the training step.
  • Figure 3: A data sample from the Nav.+Facial condition. The left plot shows gaze, spatial behavior, goal, and occupancy features, each drawn with its own marker: the robot's pose; the pose of the participant following the robot during the VR interaction; the gaze of the participant; the poses of algorithmically controlled avatars; the destination position that the robot navigated towards; and occupancy in the environment, indicated by black pixels (occupied) and white pixels (unoccupied). The right visualization shows a rendering of the facial expression features of the participant.
  • Figure 4: Layout of the interfaces used for video annotation for the human baseline. Left: Layout used for the Nav.-Only annotation condition, showing the navigation rendering on the left and the questions on the right. Right: Layout for the Facial-Only condition.
  • Figure 5: Errors for annotators' predictions by Annotation Conditions (left) and Before/After Robot Behavior Change (right). (**) and (*) denote $p < 0.0001$ and $p < 0.05$, respectively.
  • ...and 7 more figures