Not All Errors Are Made Equal: A Regret Metric for Detecting System-level Trajectory Prediction Failures

Kensuke Nakamura, Ran Tian, Andrea Bajcsy

TL;DR

The very presence of high-regret data during human predictor fine-tuning is highly predictive of robot re-deployment performance improvements, indicating a promising avenue for efficiently mitigating system-level human-robot interaction failures.

Abstract

Robot decision-making increasingly relies on data-driven human prediction models when operating around people. While these models are known to mispredict in out-of-distribution interactions, only a subset of prediction errors impact downstream robot performance. We propose characterizing such "system-level" prediction failures via the mathematical notion of regret: high-regret interactions are precisely those in which mispredictions degraded closed-loop robot performance. We further introduce a probabilistic generalization of regret that calibrates failure detection across disparate deployment contexts and renders regret compatible with reward-based and reward-free (e.g., generative) planners. In simulated autonomous driving interactions and social navigation interactions deployed on hardware, we showcase that our system-level failure metric can be used offline to automatically extract closed-loop human-robot interactions that state-of-the-art generative human predictors and robot planners previously struggled with. We further find that the very presence of high-regret data during human predictor fine-tuning is highly predictive of robot re-deployment performance improvements. Fine-tuning with the informative but significantly smaller high-regret data (23% of deployment data) is competitive with fine-tuning on the full deployment dataset, indicating a promising avenue for efficiently mitigating system-level human-robot interaction failures. Project website: https://cmu-intentlab.github.io/not-all-errors/
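To make the reward-based notion of regret concrete, here is a minimal sketch, assuming a discrete set of candidate robot actions and a known reward function. Regret is computed as the reward of the hindsight-optimal action, given the human behavior that was actually observed, minus the reward of the action the robot executed. The `Interaction` schema, the toy reward table, and the 0.5 threshold are illustrative assumptions, not taken from the paper, and the probabilistic generalization for reward-free planners is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Interaction:
    """One logged robot decision plus the human behavior observed afterward.

    Hypothetical schema for illustration only.
    """
    name: str
    executed_action: str
    observed_human: str


def regret(
    interaction: Interaction,
    candidate_actions: Sequence[str],
    reward: Callable[[str, str], float],
) -> float:
    """Hindsight-optimal reward minus executed reward, given the observed human."""
    human = interaction.observed_human
    executed = reward(interaction.executed_action, human)
    # Including the executed action in the max keeps regret non-negative.
    best = max([executed] + [reward(a, human) for a in candidate_actions])
    return best - executed


def high_regret(
    interactions: Sequence[Interaction],
    candidate_actions: Sequence[str],
    reward: Callable[[str, str], float],
    threshold: float,
) -> List[Interaction]:
    """Keep only interactions whose regret exceeds an (assumed) threshold."""
    return [
        i for i in interactions
        if regret(i, candidate_actions, reward) > threshold
    ]


# Toy reward table (assumed values): keys are (robot action, human behavior).
REWARDS = {
    ("go", "stays"): 1.0,
    ("wait", "stays"): 0.25,
    ("go", "crosses"): -10.0,
    ("wait", "crosses"): 0.5,
}


def toy_reward(action: str, human: str) -> float:
    return REWARDS[(action, human)]


ACTIONS = ["go", "wait"]
LOG = [
    Interaction("harmless", executed_action="go", observed_human="stays"),
    Interaction("overcautious", executed_action="wait", observed_human="stays"),
    Interaction("collision", executed_action="go", observed_human="crosses"),
]

regrets = {i.name: regret(i, ACTIONS, toy_reward) for i in LOG}
flagged = [i.name for i in high_regret(LOG, ACTIONS, toy_reward, threshold=0.5)]
```

In this toy log, the "harmless" interaction has zero regret because the executed action was also the best one in hindsight, whatever the predictor believed at the time. Only the other two interactions would be flagged as system-level failures and kept for fine-tuning.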

Paper Structure (17 sections, 5 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: All scenarios have component-level prediction failures: mispredicting that parked cars will turn (left), that a marooned truck will move (center), and that nearby cars will change lanes (right). However, only the center and right scenarios have system-level prediction failures that impact robot performance.
  • Figure 2: Illustrative Example. Left column: robot's predictions and deployment-time decision. Right column: the counterfactual analysis given observed human behavior. Each hindsight-optimal action and executed action's reward is on the right.
  • Figure 3: Left: The robot correctly predicts the human will block its goal and proceeds straight. Middle: The robot incorrectly predicts that the human will walk straight ahead; however, it is able to reach its goal despite the misprediction. Right: The robot mispredicts that the human will block its goal and proceeds straight, colliding with the human who also walked straight ahead. The robot's executed trajectory is unlikely conditioned on the human's true actions and is assigned high regret.
  • Figure 4: Qualitative comparison between scenarios uniquely identified by each metric.
  • Figure 5: As $P_\theta$ is fine-tuned on more high-regret data, closed-loop performance improves.
  • ...and 1 more figure