Decision-Aware Uncertainty Evaluation of Vision-Language Model-Based Early Action Anticipation for Human-Robot Interaction

Zhaoda Du, Michael Bowman, Qiaojie Zheng, Xiaoli Zhang

TL;DR

This study presents the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction, and introduces a temporal-prefix evaluation protocol and metrics for calibration and selective prediction.

Abstract

Robots in shared workspaces must interpret human actions from partial, ambiguous observations, where overconfident early predictions can lead to unsafe or disruptive interaction. This challenge is amplified in egocentric views, where viewpoint changes and occlusions increase perceptual noise and ambiguity. As a result, downstream human-robot interaction modules require not only an action hypothesis but also a trustworthy estimate of confidence under partial observation. Recent vision-language model-based approaches have been proposed for short-term action recognition due to their open-vocabulary and context-aware reasoning, but their uncertainty reliability in the temporal-prefix regime is largely uncharacterized. We present the first systematic evaluation of uncertainty in vision-language model-based short-term action recognition for human-robot interaction. We introduce a temporal-prefix evaluation protocol and metrics for calibration and selective prediction. We also characterize miscalibration patterns and failure modes under partial observations. Our study provides the missing reliability evidence needed to use vision-language model predictions in confidence-gated human-robot interaction modules.

Paper Structure

This paper contains 30 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison between traditional accuracy-based Top-K selection and our confidence-aware selection framework for short-term action anticipation. Given a short observed context, a VLM produces Top-K action hypotheses. Traditional methods select the highest-scoring action and commit early to a single hypothesis. In contrast, our approach models confidence and preserves multiple plausible intents, enabling delayed commitment, active clarification, and uncertainty-calibrated execution.
  • Figure 2: From stochastic variability to structured uncertainty for HRI decision-making. Multiple stochastic decoding runs produce unstable Top-K action hypotheses under partial observation. Aggregation operators transform these discrete prediction sets into a structured confidence distribution over candidate actions. Different aggregation strategies reshape the geometry of confidence in distinct ways, influencing calibration behavior and confidence-based decision gating in downstream human–robot interaction systems.
  • Figure 3: Comparison of aggregation strategies on two egocentric action benchmarks (top: EGTEA Gaze+; bottom: EPIC-KITCHENS-100). From left to right: Recall@K, selective accuracy and coverage under confidence thresholding, set-level calibration, and normalized Top-10 entropy. Ranking performance differences are modest. Calibration behavior and confidence geometry exhibit clearer structural differences across aggregation strategies, impacting selective decision characteristics.
  • Figure 4: Rank-wise confidence distributions across aggregation strategies on EGTEA Gaze+ (top) and EPIC-KITCHENS-100 (bottom). Boxplots show the distribution of confidence assigned to each rank position within the Top-K set. PairRank sharply concentrates confidence on top ranks, whereas consistency and confidence-weighted methods yield smoother, higher-entropy distributions.
  • Figure 5: Illustrative example of Top-$K$ ($K=5$) prediction with confidence thresholding ($\tau = 0.39$) for a temporal prefix, used to examine interaction responses in HRI. When only Top-$K$ rankings are considered, the aggregation strategies appear behaviorally similar. However, incorporating confidence via threshold-based selection leads to markedly different clarification behaviors. Sharp distributions (e.g., PairRank) may produce overconfident mispredictions, whereas smoother aggregations (e.g., Consistency) expand the clarification space and increase interaction burden. Low-confidence single-run outputs may suppress interaction under thresholding. This example highlights the importance of uncertainty-aware evaluation for interaction-level decision-making.