How Users Understand Robot Foundation Model Performance through Task Success Rates and Beyond

Isaac Sheidlower; Jindan Huang; James Staley; Bingyu Wu; Qicong Chen; Reuben Aronson; Elaine Short

TL;DR

The paper investigates how non-experts interpret performance information for robot foundation models (RFMs), focusing on task success rate (TSR) and the value of failure descriptions. Through online (n=112) and in-person (n=14) studies using real evaluation data, it defines four information types (ETSR, EFC, RT-TSR, RT-FC) and demonstrates that users rely on TSR while also valuing failure information and real-data history. Findings show that non-experts interpret TSR in line with expert expectations but benefit from additional information types, and that they prefer access to both real data and task-based estimates to gauge capabilities on novel tasks. The work advocates user-centered, interpretable evaluations and deployments, including standardized failure reporting and mechanisms to forecast RFM performance on unseen tasks, with implications for safer and more trustworthy home robots.

Abstract

Robot Foundation Models (RFMs) represent a promising approach to developing general-purpose home robots. Given the broad capabilities of RFMs, users will inevitably ask an RFM-based robot to perform tasks that the RFM was not trained or evaluated on. In these cases, it is crucial that users understand the risks associated with attempting novel tasks due to the relatively high cost of failure. Furthermore, an informed user who understands an RFM's capabilities will know what situations and tasks the robot can handle. In this paper, we study how non-roboticists interpret performance information from RFM evaluations. These evaluations typically report task success rate (TSR) as the primary performance metric. While TSR is intuitive to experts, it is necessary to validate whether novices also use this information as intended. Toward this end, we conducted a study in which users saw real evaluation data, including TSR, failure case descriptions, and videos from multiple published RFM research projects. The results highlight that non-experts not only use TSR in a manner consistent with expert expectations but also highly value other information types, such as failure cases that are not often reported in RFM evaluations. Furthermore, we find that users want access to both real data from previous evaluations of the RFM and estimates from the robot about how well it will do on a novel task.

Paper Structure (24 sections, 15 figures, 5 tables)

This paper contains 24 sections, 15 figures, 5 tables.

Introduction
Related Works
Methodology: Defining information types
User Experience with Different Information Types (online study)
Procedure
Results
Exploring User Information Needs with an Embodied Robot (in-person study)
Procedure
Results
Discussion
Conclusion
Task list
Codebooks
Code Counts
Common Demographic Information (Online Study)
...and 9 more sections

Figures (15)

Figure 1: This work investigates how people interpret commonly reported robot foundation model performance information. We verify that non-experts value information like task success rate, while also wanting other information types less readily available. This work serves to inform future robot foundation model evaluations as well as what information should be available to end-users during deployments and when they request tasks.
Figure 2: Overview of the study procedure. Users saw a successful or failed trajectory based on a probabilistic sample from the real evaluation success rate for that task. For each participant, the 16 information combinations were paired with the 16 tasks.
Figure 3: Responses to the pre-task and post-task Likert questions of information sufficiency under different conditions. p-values and $\eta_p^2$ effect sizes from an RM-ANOVA are shown. Legend: Strongly Disagree/1, Disagree/2, Neutral/3, Agree/4, Strongly Agree/5
Figure 4: The in-person study was conducted in a University building lobby. The robot repeatedly attempted to put the cans on the shelf.
Figure 5: Users in the online study generally reported each information type as useful.
...and 10 more figures
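
The sampling and pairing described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the 16 information combinations are the 2^4 subsets of the four information types (ETSR, EFC, RT-TSR, RT-FC), that the pairing with tasks is a random shuffle per participant, and that the shown outcome is a single Bernoulli draw at the task's reported success rate. The task names and success rates below are hypothetical placeholders.

```python
import itertools
import random

# The four information types named in the paper summary.
INFO_TYPES = ["ETSR", "EFC", "RT-TSR", "RT-FC"]


def all_information_combinations():
    """Return every subset of the four information types.

    Assumption: the 16 combinations are the 2**4 subsets,
    including the empty subset (no information shown).
    """
    return [
        combo
        for size in range(len(INFO_TYPES) + 1)
        for combo in itertools.combinations(INFO_TYPES, size)
    ]


def assign_conditions(task_success_rates, rng):
    """Pair each task with one information combination for one participant.

    task_success_rates maps a task name to its reported success rate in [0, 1].
    Assumption: the pairing is a random permutation, and the outcome shown to
    the participant is one Bernoulli draw at the task's success rate.
    """
    tasks = list(task_success_rates)
    combos = all_information_combinations()
    if len(tasks) != len(combos):
        raise ValueError("expected exactly one task per information combination")
    rng.shuffle(combos)
    plan = []
    for task, combo in zip(tasks, combos):
        # rng.random() is in [0, 1), so a rate of 1.0 always yields success
        # and a rate of 0.0 always yields failure.
        shows_success = rng.random() < task_success_rates[task]
        plan.append({"task": task, "info": combo, "shows_success": shows_success})
    return plan


# Hypothetical tasks and success rates, for illustration only.
rates = {f"task_{i}": i / 15 for i in range(16)}
plan = assign_conditions(rates, random.Random(0))
```

Each entry of `plan` records which information a participant would see for a task and whether the accompanying video shows a success or a failure. Passing a seeded `random.Random` keeps one participant's assignment reproducible.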
