Confidence Calibration in Vision-Language-Action Models

Thomas P Zollo, Richard Zemel

TL;DR

This work presents a first study of confidence calibration for vision-language-action (VLA) models. It establishes baseline calibration metrics and introduces two practical remedies, prompt ensembles and action-wise Platt scaling, that improve the reliability of confidence estimates without altering the actions taken. Through simulation experiments across multiple VLA architectures and LIBERO task suites, the authors show that ensembling over rewordings of the instruction reduces calibration error (often by ~20%) and that per-dimension recalibration can outperform global approaches, highlighting the need for domain-specific uncertainty tools in VLA systems. They also show that calibration changes over the task horizon, typically improving mid-task before leveling off, which suggests horizon-aware intervention strategies and context-aware monitoring. The work lays groundwork for trustworthy VLA systems by demonstrating concrete methods to quantify and improve uncertainty estimates in multimodal robotic control, and it identifies real-world validation and broader architectural coverage as next steps.
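To make the prompt-ensemble idea concrete, here is a minimal sketch that averages a scalar confidence over several rewordings of one instruction. It is an illustration under stated assumptions, not the authors' implementation: `ensemble_confidence`, `confidence_fn`, and the toy scores are hypothetical names and values, and a plain mean is only one plausible way to aggregate.

```python
from statistics import mean
from typing import Callable, Sequence


def ensemble_confidence(
    confidence_fn: Callable[[str], float],
    paraphrases: Sequence[str],
) -> float:
    """Average a scalar confidence over several wordings of the same instruction.

    confidence_fn is a hypothetical stand-in for querying a VLA's confidence
    given one instruction string (the image input is omitted for brevity).
    """
    if not paraphrases:
        raise ValueError("need at least one instruction wording")
    return mean(confidence_fn(p) for p in paraphrases)


# Toy stand-in for model queries; the scores are made up for illustration.
toy_scores = {
    "put the wine bottle on the rack": 0.9,
    "place the wine bottle onto the rack": 0.7,
    "set the wine bottle on the rack": 0.8,
}
ensembled = ensemble_confidence(toy_scores.__getitem__, list(toy_scores))
```

In this sketch only the confidence estimate is ensembled. The executed action could still come from the original instruction, which is one way to read "without altering actions".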

Abstract

Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.

Paper Structure

This paper contains 40 sections, 10 equations, 19 figures, 7 tables.

Figures (19)

  • Figure 1: To be trustworthy, a robotic system must be able to reliably express its confidence in its ability to perform a task, especially in high-stakes and open-world domains. A well-calibrated robot policy produces confidence estimates that align with its probability of task success. For example, the robot should succeed on 95% of instances for which it expresses 95% confidence.
  • Figure 2: Given an input image and text instruction, popular VLAs such as OpenVLA and RT-2 generate a distribution over discrete action tokens for each of the robot's degrees of freedom. Confidence in each dimension's prediction can be estimated using the probability assigned to the predicted token; a single estimate can be produced by averaging across dimensions. Given an uncalibrated confidence estimate, recalibration methods such as Platt scaling use a small calibration dataset to learn a map from uncalibrated confidence estimates to calibrated ones. Calibration error can be measured by comparing confidence estimates to actual task success rates.
  • Figure 3: Visualization of task error rates compared against 4 different calibration error measurements for 4 VLA variants (OpenVLA, MolmoAct, UniVLA, and NORA) and 4 LIBERO task suites (Spatial, Object, Goal, 10), as well as OpenVLA 8- and 4-bit versions on Spatial, Object, and Goal. All models exhibit a roughly monotonic relationship between task error and the discriminative measures (Brier score and NLL). ECE shows differences between model families, potentially due to architecture and objective differences.
  • Figure 4: Empirical study of calibration error across the task time horizon. In the top row (a), the left two plots show how calibration evolves with task progress, while the right two plots show the average confidence by task time, grouped by successful and failed trials. The bottom two rows (b, c) offer a sample of the reliability diagrams produced by different methods for aggregating confidence estimates across time. Overall, these results illustrate that calibration can improve as the task progresses and more information is gathered, suggesting opportunities for context-aware uncertainty interventions.
  • Figure 5: Qualitative examples of a context-aware confidence monitoring strategy applied to a task from the Goal suite. Here, the task is to "put the wine bottle on the rack". The red dashed line represents the 10% quantile of the confidence estimates output by the model across the task time horizon, offering a potential threshold below which the robot may abstain from performing the task.
  • ...and 14 more figures
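The quantities described in the captions above can be sketched in a few lines: a per-step confidence from the predicted tokens' probabilities (Figure 2), two calibration measures (expected calibration error and Brier score), and a low-quantile abstention threshold in the spirit of Figure 5. This is a minimal sketch under assumptions: all function names are hypothetical, equal-width binning is one common ECE variant, and the quantile uses a simple nearest-rank rule without interpolation.

```python
def step_confidence(token_probs_per_dim):
    """Average, over action dimensions, the probability of the predicted token.

    token_probs_per_dim[d] is the distribution over discrete action tokens
    for dimension d; the predicted token is taken to be the argmax.
    """
    top = [max(probs) for probs in token_probs_per_dim]
    return sum(top) / len(top)


def expected_calibration_error(confs, successes, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - success rate|."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, successes):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(avg_conf - success_rate)
    return ece


def brier_score(confs, successes):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confs, successes)) / len(confs)


def quantile_threshold(confs, q=0.10):
    """Nearest-rank lower quantile of confidences (no interpolation)."""
    ordered = sorted(confs)
    return ordered[int(q * (len(ordered) - 1))]


# Toy numbers, made up for illustration.
conf = step_confidence([[0.1, 0.7, 0.2], [0.5, 0.5]])
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
brier = brier_score([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
threshold = quantile_threshold(
    [0.5, 0.9, 0.8, 0.7, 0.95, 0.6, 0.85, 0.75, 0.65, 0.55, 0.4]
)
```

A monitor built on this could flag time steps whose confidence falls below `threshold` as candidates for abstaining or asking for help. How the threshold is chosen in practice is left open here.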