Uncertainty in Action: Confidence Elicitation in Embodied Agents
Tianjiao Yu, Vedant Shah, Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar, Tal August, Ismini Lourentzou
TL;DR
This work tackles the problem of expressing and calibrating uncertainty in embodied agents operating in open-ended, multimodal environments by introducing a dual-policy framework: Elicitation Policies to verbalize and refine uncertainty and Execution Policies to broaden the reasoning and action space. In Minecraft, across three backbones (GPT-4V, MineLLM, STEVE) and 30 tasks of varying difficulty, structured reasoning approaches such as Chain-of-Thought (CoT) and Plan & Solve (P&S) consistently improve confidence calibration, as measured by ECE, and failure prediction via AUROC, with Top-K posing greater challenges in abductive settings. Execution Policies like Action Sampling, Scenario Reinterpretation, and Hypothetical Reasoning further enhance reliability by diversifying trajectories and hypothetical considerations, though some policies may degrade calibration in certain models. Overall, the paper demonstrates that combining elicitation and execution-aware strategies yields more reliable confidence estimates in embodied agents, while highlighting persistent obstacles in abductive reasoning and model-specific calibration. This work advances safe, reliable deployment of embodied AI in complex real-world domains such as robotics and interactive environments, and lays a foundation for scalable confidence elicitation across diverse embodied architectures.
Abstract
Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents on calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thought, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
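To make the two evaluation notions concrete, the following is a minimal sketch of how calibration (expected calibration error, ECE) and failure prediction (AUROC) are commonly computed from per-task confidences and binary success labels. It is not the authors' evaluation code: the function names, the choice of 10 equal-width bins, and the toy inputs are illustrative assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    successes: Sequence[int],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Weighted average of |success rate - mean confidence| over confidence bins."""
    if len(confidences) != len(successes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, s in zip(confidences, successes):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, s))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(s for _, s in members) / len(members)
        ece += (len(members) / total) * abs(success_rate - mean_conf)
    return ece


def auroc(confidences: Sequence[float], successes: Sequence[int]) -> float:
    """Probability that a successful task received higher confidence than a
    failed one, counting ties as one half (pairwise Mann-Whitney form)."""
    pos = [c for c, s in zip(confidences, successes) if s == 1]
    neg = [c for c, s in zip(confidences, successes) if s == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: four tasks with verbalized confidences and outcomes.
toy_confidences = [0.9, 0.8, 0.7, 0.3]
toy_successes = [1, 1, 0, 0]
toy_ece = expected_calibration_error(toy_confidences, toy_successes)
toy_auroc = auroc(toy_confidences, toy_successes)
```

A lower ECE means stated confidence tracks the observed success rate more closely, while an AUROC near 1 means confidence separates successful from failed tasks well. The two can diverge: an agent may rank its failures correctly and still be systematically overconfident.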
