A Human-in-the-Loop Confidence-Aware Failure Recovery Framework for Modular Robot Policies

Rohan Banerjee; Krishna Palempalli; Bohan Yang; Jiaying Fang; Alif Abdullah; Tom Silver; Sarah Dean; Tapomayukh Bhattacharjee

TL;DR

This work tackles failure recovery in unstructured human-robot collaboration by introducing a human-in-the-loop framework for modular policies that jointly decides which module to query about and when to query a human. It fuses calibrated module-level uncertainty with a workload model to minimize a horizoned objective $J(\psi_{ms}, \psi_q)$, balancing recovery efficiency against human effort. The authors evaluate multiple module selectors and querying algorithms in synthetic simulations and deploy the method on a robot-assisted bite acquisition system, demonstrating improved recovery success with reduced user workload in studies with emulated and real mobility limitations. The approach generalizes to other full-stack modular robots and offers a principled, scalable strategy for workload-aware failure recovery in collaborative autonomy.

Abstract

Robots operating in unstructured human environments inevitably encounter failures, especially in robot caregiving scenarios. While humans can often help robots recover, excessive or poorly targeted queries impose unnecessary cognitive and physical workload on the human partner. We present a human-in-the-loop failure-recovery framework for modular robotic policies, where a policy is composed of distinct modules such as perception, planning, and control, any of which may fail and often require different forms of human feedback. Our framework integrates calibrated estimates of module-level uncertainty with models of human intervention cost to decide which module to query and when to query the human. It separates these two decisions: a module selector identifies the module most likely responsible for failure, and a querying algorithm determines whether to solicit human input or act autonomously. We evaluate several module-selection strategies and querying algorithms in controlled synthetic experiments, revealing trade-offs between recovery efficiency, robustness to system and user variables, and user workload. Finally, we deploy the framework on a robot-assisted bite acquisition system and demonstrate, in studies involving individuals with both emulated and real mobility limitations, that it improves recovery success while reducing the workload imposed on users. Our results highlight how explicitly reasoning about both robot uncertainty and human effort can enable more efficient and user-centered failure recovery in collaborative robots. Supplementary materials and videos can be found at: http://emprise.cs.cornell.edu/modularhil
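The separation between the two decisions described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Module` fields, the lowest-confidence selection rule, the expected-cost threshold, and the `failure_cost` parameter are hypothetical stand-ins and are not necessarily the module selectors or querying algorithms evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    confidence: float  # calibrated probability that the module's output is correct
    query_cost: float  # assumed workload cost of asking the human about this module


def select_module(modules):
    """Hypothetical module selector: blame the least confident module."""
    return min(modules, key=lambda m: m.confidence)


def should_query(module, failure_cost):
    """Hypothetical querying rule: ask the human only when the expected cost of
    acting autonomously exceeds the workload cost of the query."""
    expected_failure_cost = (1.0 - module.confidence) * failure_cost
    return expected_failure_cost > module.query_cost


# Illustrative values only; they are not taken from the paper.
modules = [
    Module("perception", confidence=0.9, query_cost=1.0),
    Module("skill_selector", confidence=0.4, query_cost=2.0),
    Module("control", confidence=0.8, query_cost=3.0),
]

suspect = select_module(modules)                     # which module to query about
decision = should_query(suspect, failure_cost=5.0)   # whether to ask or act autonomously
print(suspect.name, decision)
```

Keeping the two functions independent mirrors the framework's structure: a different selector or a different stopping rule could be swapped in without changing the other.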

Paper Structure (50 sections, 10 equations, 11 figures, 2 algorithms)

Introduction
Related Work
Problem Formulation
Module Selectors
Proxy Objective Optimization
Binary Tree Query
Graph Query
Querying Algorithms
Synthetic Simulation: Systematic Ablations
Varying number of modules $N$
Varying graph structure $G_M$
Varying confidence values $c_i$
Varying query costs $q_i$
Synthetic Simulation: Module Heterogeneity
Robot-Assisted Bite Acquisition Experiments
...and 35 more sections

Figures (11)

Figure 1: Overall human-in-the-loop decision failure recovery framework, grounded in the robot-assisted bite acquisition domain. The recovery framework first calls a module selector to decide which of the modules to query for (e.g. the skill selector). The framework then calls a querying algorithm, which decides whether to ask the user for help or act autonomously.
Figure 2: (left) BinaryTreeQuery graph example for $N{=}3$, (middle) GraphQuery graph example for $N{=}3$, (right) Querying algorithms and Query-For-All baseline, which decide when to query (calling module selector $\psi_{ms}$) and when to execute actions (calling ForwardPass to get action $a$, then calling Execute). Querying algorithms include Execute-First, which executes once prior to starting to query, Query-then-Execute, which alternates between querying and execution, and Query-until-Confident/Query-until-Confident-Workload-Aware, both of which repeatedly query until a stopping condition is met. The Query-For-All baseline queries for all modules before execution.
Figure 3: (a) Systematic ablation experiments, showing the 4 independent variables in our simulations: (1) number of modules, (2) graph redundancies, (3) confidences, (4) query costs, with median values across 100 trials (mean for Task Cost). We find that the BruteForce and GraphQuery module selectors (along with the ConfidenceQuery baseline) are the most robust to varying redundancy, confidences, and query costs, with BruteForce and ConfidenceQuery being the most scalable. Additionally, we find that the Query-until-Confident and Query-until-Confident-Workload-Aware querying algorithms are the most robust across redundancy and query costs, with Query-until-Confident having the best scalability and robustness to confidences; (b) Module heterogeneity experiments. We find that GraphQuery outperforms ConfidenceQuery, particularly when module confidences overlap and workload variance $\beta$ is high. Detailed results are given in the appendix on the synthetic experiments.
Figure 4: Experimental setup. (left) Robot and user study setup; (middle, top) Meal plates used in user studies, including "Thanksgiving meal", "savory salad", and "mixed salad"; (middle, bottom) Users with mobility limitations from in-home user study; (right) Users with emulated mobility limitations from two in-lab user studies.
Figure 5: User study metrics. In-lab real-robot study: (a) subjective scores, (b) objective scores; In-lab real-robot study with module heterogeneity: (c) subjective scores, (d) objective scores; In-home real-robot study: (e) subjective scores, (f) objective scores ($*$ indicates statistical significance, $p < 0.05$).
...and 6 more figures
