Confidence Calibration in Vision-Language-Action Models
Thomas P Zollo, Richard Zemel
TL;DR
This work presents a first study of confidence calibration in vision-language-action (VLA) models. It establishes baseline calibration metrics and introduces two practical remedies, prompt ensembles and action-wise Platt scaling, that improve the reliability of confidence estimates without altering the actions taken. In simulation experiments across multiple VLA architectures and LIBERO task suites, the authors show that ensembling over rewordings of the instruction reduces calibration error (often by ~20%) and that per-dimension recalibration can outperform global approaches, highlighting the need for domain-specific uncertainty tools in VLA systems. They also find that calibration changes over the task horizon, typically improving mid-task before leveling off, which suggests horizon-aware intervention strategies and context-aware monitoring. The work lays groundwork for trustworthy VLA systems by demonstrating concrete methods to quantify and improve uncertainty estimates in multimodal robotic control, and it points toward real-world validation and broader architectural coverage as next steps.
Abstract
Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present a first-of-its-kind study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural language instructions to low-level robot motor commands. We establish a confidence baseline for VLAs, examine how task success relates to calibration error and how calibration evolves over time, and introduce two lightweight techniques to remedy the miscalibration we observe: prompt ensembles and action-wise Platt scaling. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
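To make the terms in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of the three ingredients it names: a binned expected calibration error, a prompt ensemble that averages confidence over rephrasings of one instruction, and Platt scaling fitted by gradient descent. All function names, the equal-width binning, the plain averaging rule, and the optimizer settings are assumptions made for illustration. An action-wise variant would, under the same assumptions, fit one `(a, b)` pair per action dimension instead of a single global pair.

```python
import math

def expected_calibration_error(confidences, successes, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, s in zip(confidences, successes):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the last bin
        bins[idx].append((c, s))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(s for _, s in b) / len(b)
        ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

def ensemble_confidence(per_prompt_confidences):
    """Assumed ensemble rule: average confidence over rephrased instructions."""
    return sum(per_prompt_confidences) / len(per_prompt_confidences)

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def _logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clip to keep the log finite
    return math.log(p / (1.0 - p))

def fit_platt(confidences, successes, lr=0.1, steps=2000):
    """Fit p' = sigmoid(a * logit(p) + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0  # identity map as the starting point
    xs = [_logit(p) for p in confidences]
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, successes):
            err = _sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def apply_platt(p, a, b):
    """Recalibrate one confidence value; the chosen action is untouched."""
    return _sigmoid(a * _logit(p) + b)

# Toy data (made up): an overconfident model reports 0.9 but succeeds half the time.
confs = [0.9] * 10
succ = [1, 0] * 5
a, b = fit_platt(confs, succ)
recalibrated = [apply_platt(p, a, b) for p in confs]
ece_before = expected_calibration_error(confs, succ)
ece_after = expected_calibration_error(recalibrated, succ)
```

On this toy data the recalibrated confidence moves toward the observed success rate, so `ece_after` is smaller than `ece_before`. In practice the Platt parameters would be fitted on held-out episodes rather than on the data used for evaluation.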
