Shifting Uncertainty to Critical Moments: Towards Reliable Uncertainty Quantification for VLA Model

Yanchuan Tang, Taowen Wang, Yuefei Chen, Boxuan Zhang, Qiang Guan, Ruixiang Tang

Abstract

Vision-Language-Action (VLA) models enable general-purpose robotic policies by mapping visual observations and language instructions to low-level actions, but they often lack reliable introspection. A common practice is to compute a token-level uncertainty signal and take its mean over a rollout. However, mean aggregation can dilute short-lived but safety-critical uncertainty spikes in continuous control. In particular, successful rollouts may contain localized high-entropy segments due to benign noise or non-critical micro-adjustments, while failure rollouts can appear low-entropy for most timesteps and only exhibit brief spikes near the onset of failure. We propose a unified uncertainty quantification approach for predicting rollout success versus failure that (1) uses max-based sliding window pooling to preserve transient risk signals, (2) applies motion-aware stability weighting to emphasize high-frequency action oscillations associated with unstable behaviors, and (3) performs DoF-adaptive calibration via Bayesian Optimization to prioritize kinematically critical axes. Experiments on the LIBERO benchmark show that our method substantially improves failure prediction accuracy and yields more reliable signals for failure detection, which can support downstream human-in-the-loop interventions.
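The dilution effect described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the window definition (mean within each window, then max across windows), and the toy entropy traces are all assumptions made for illustration.

```python
from typing import Sequence


def mean_score(entropy: Sequence[float]) -> float:
    """Global mean of a per-step uncertainty signal over one rollout."""
    return sum(entropy) / len(entropy)


def sliding_window_max_score(entropy: Sequence[float], window: int) -> float:
    """Assumed form of max-based sliding window pooling:
    average inside each window, then take the max across windows."""
    # Clip the window so short rollouts still yield at least one window.
    w = max(1, min(window, len(entropy)))
    window_means = [
        sum(entropy[i:i + w]) / w for i in range(len(entropy) - w + 1)
    ]
    return max(window_means)


# Toy traces (hypothetical values, not data from the paper).
# Success: mildly noisy throughout. Failure: calm, then a brief spike.
success = [0.5] * 20
failure = [0.2] * 17 + [2.0] * 3

# Mean aggregation ranks the failure rollout as *less* uncertain.
print(mean_score(success), mean_score(failure))
# Window-max pooling keeps the spike and ranks the failure rollout higher.
print(sliding_window_max_score(success, 3), sliding_window_max_score(failure, 3))
```

In this toy case the three-step spike is averaged away by the global mean but dominates the window-max score, which is the behavior the abstract attributes to component (1).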

Paper Structure (47 sections, 10 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Overview of the proposed uncertainty quantification framework. Standard global averaging often masks failure signals. We propose: (A) Sliding Window Pooling (SW) to capture transient uncertainty spikes; (B) Action Transfer Reweighting (ATR) to prioritize uncertainty during oscillatory actions; and (C) Bayesian Optimization (BO) to learn adaptive weights for kinematically critical DoFs.
  • Figure 2: Empirical evidence of the Averaging Trap on LIBERO-10. The plots show the probability density of the global mean entropy $S_{\text{Avg}}(\tau)$ for success (blue) and failure (red) rollouts. The significant overlap between the two distributions, coupled with near-random AUROC scores (0.51 on train and 0.47 on test), demonstrates that global averaging masks critical failure signals and fails to distinguish between successful and failed executions.
  • Figure 3: Ablation studies. Panels (a)-(c) use LIBERO-10: (a) SW window size; (b) ATR stability contrast $\alpha$; (c) joint SW+ATR, where the red star marks the best-performing $(w,\alpha)$ pair. Panel (d) shows the effect of DoF-adaptive calibration, comparing SW+ATR with and without Bayesian Optimization across LIBERO suites.
  • Figure A1: DoF importance learned by Bayesian Optimization. We visualize the optimized DoF weights $\beta^\star$ across LIBERO suites. Gripper and $\Delta z$ consistently receive high weights, while $\Delta\text{pitch}$ is emphasized in LIBERO-OBJECT and LIBERO-GOAL, supporting DoF-adaptive calibration.
  • Figure A2: Per-task success rates (%) of OpenVLA baseline across LIBERO suites. Darker cells indicate higher success. Task difficulty varies significantly both across and within suites, motivating the need for reliable failure prediction via uncertainty quantification.
  • ...and 4 more figures
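
For readers interpreting the AUROC values quoted in the Figure 2 caption, the following minimal sketch computes AUROC directly from its pairwise-ranking definition. The label convention (1 = failure, 0 = success) and the function name are assumptions for illustration; the paper's evaluation code may differ.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random failure rollout scores higher than a
    random success rollout, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # failures (assumed positive class)
    neg = [s for s, y in zip(scores, labels) if y == 0]  # successes
    if not pos or not neg:
        raise ValueError("need at least one failure and one success rollout")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: perfectly separated, then completely uninformative.
print(auroc([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0]))
print(auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))
```

Under this definition a score of 0.5 means the uncertainty signal ranks failures above successes no better than chance, which is why values such as 0.51 and 0.47 are described as near-random.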