Uncertainty in Action: Confidence Elicitation in Embodied Agents

Tianjiao Yu, Vedant Shah, Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar, Tal August, Ismini Lourentzou

TL;DR

This work tackles the problem of expressing and calibrating uncertainty in embodied agents operating in open-ended, multimodal environments by introducing a dual-policy framework: Elicitation Policies to verbalize and refine uncertainty and Execution Policies to broaden the reasoning and action space. In Minecraft, across three backbones (GPT-4V, MineLLM, STEVE) and 30 tasks of varying difficulty, structured reasoning approaches such as Chain-of-Thought (CoT) and Plan & Solve (P&S) consistently improve confidence calibration, as measured by ECE, and failure prediction, as measured by AUROC, with Top-K posing greater challenges in abductive settings. Execution Policies like Action Sampling, Scenario Reinterpretation, and Hypothetical Reasoning further enhance reliability by diversifying trajectories and hypothetical considerations, though some policies may degrade calibration in certain models. Overall, the paper demonstrates that combining elicitation and execution-aware strategies yields more reliable confidence estimates in embodied agents, while highlighting persistent obstacles in abductive reasoning and model-specific calibration. This work advances safe, reliable deployment of embodied AI in complex real-world domains such as robotics and interactive environments, and lays a foundation for scalable confidence elicitation across diverse embodied architectures.

Abstract

Expressing confidence is challenging for embodied agents navigating dynamic multimodal environments, where uncertainty arises from both perception and decision-making processes. We present the first work investigating embodied confidence elicitation in open-ended multimodal environments. We introduce Elicitation Policies, which structure confidence assessment across inductive, deductive, and abductive reasoning, along with Execution Policies, which enhance confidence calibration through scenario reinterpretation, action sampling, and hypothetical reasoning. Evaluating agents in calibration and failure prediction tasks within the Minecraft environment, we show that structured reasoning approaches, such as Chain-of-Thought, improve confidence calibration. However, our findings also reveal persistent challenges in distinguishing uncertainty, particularly under abductive settings, underscoring the need for more sophisticated embodied confidence elicitation methods.
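
The two evaluation tasks named above, calibration and failure prediction, are scored with ECE and AUROC respectively. The following is a minimal sketch of how these two metrics are commonly computed from per-task confidence scores and binary success labels. It is not the authors' evaluation code: the function names, the choice of ten equal-width bins, and the half-open binning convention are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    successes: Sequence[int],
    n_bins: int = 10,  # assumed bin count; the paper's setting is not stated here
) -> float:
    """ECE: sample-weighted mean of |avg confidence - success rate| over bins."""
    n = len(confidences)
    if n == 0 or n != len(successes):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, successes):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(s for _, s in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences: Sequence[float], successes: Sequence[int]) -> float:
    """AUROC: probability that a successful task received a higher confidence
    than a failed one, with ties counted as one half."""
    pos = [c for c, s in zip(confidences, successes) if s]
    neg = [c for c, s in zip(confidences, successes) if not s]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical confidences and outcomes for four tasks (illustrative only).
    conf = [0.9, 0.8, 0.7, 0.3]
    done = [1, 1, 0, 0]
    print("ECE:", expected_calibration_error(conf, done))
    print("AUROC:", auroc(conf, done))
```

Lower ECE means stated confidence tracks the actual success rate more closely, while higher AUROC means confidence separates successes from failures more cleanly. An agent can rank tasks perfectly (AUROC of 1) and still be poorly calibrated, which is why both are reported.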

Paper Structure

This paper contains 15 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Embodied Confidence Estimation Framework consisting of Elicitation Policies and Execution Policies, which jointly enable an agent to assess and express its confidence. Elicitation Policies prompt the agent to evaluate uncertainty in what it sees and does, while Execution Policies refine confidence calibration by expanding the agent's reasoning space (see the method section for details).
  • Figure 2: Embodied Confidence Elicitation. Elicitation Policies enable agents to express uncertainty, while Execution Policies refine and expand confidence assessment through scenario reinterpretation, action sampling, and hypothetical reasoning. Together, they enhance confidence calibration in embodied agents. The orange text represents the vanilla elicitation policy, which incorporates the vanilla confidence prompt (described in the prompts table) into the original instruction. The brown arrows denote the Scenario-Reinterpretation execution policy, prompting the agent to generate additional scene insights.
  • Figure 3: ECE and AUROC across Models and Execution Policies. Bars present ECE (top, lower is better) and AUROC (bottom, higher is better) under different elicitation strategies. Red dashed lines are metrics for Vanilla elicitation with no execution policy applied.
  • Figure 4: Comparison of Aggregation Methods for Vanilla Elicitation without Execution Policies. Temporal aggregation provides a holistic score, while separate aggregation evaluates confidence in reasoning and in perception separately.
  • Figure 5: ECE and AUROC across Execution Policy Iterations.
  • ...and 1 more figure