Towards Reliable LLM-based Robot Planning via Combined Uncertainty Estimation

Shiyuan Yin, Chenjia Bai, Zihao Zhang, Junwei Jin, Xinxin Zhang, Chi Zhang, Xuelong Li

TL;DR

This work tackles the challenge of unreliable planning by LLMs in robotics due to hallucinations and instruction ambiguity. It introduces CURE, a plug-and-play framework that decomposes planning uncertainty into epistemic (task clarity and task familiarity) and intrinsic (expected success rate) components, estimated via RND and MLP heads driven by LLM features. The approach is validated on kitchen manipulation and tabletop rearrangement tasks, showing stronger correlations between estimated uncertainty and actual execution outcomes than baselines, and yielding substantial improvements in the SR-HR-AUC metric. The results demonstrate the practical value of granular uncertainty modeling for safer, more reliable embodied planning with minimal integration overhead. Future work will address generalization to broader task sets and integrate physical reasoning to further enhance robustness.

Abstract

Large language models (LLMs) demonstrate advanced reasoning abilities, enabling robots to understand natural language instructions and generate high-level plans with appropriate grounding. However, LLM hallucinations present a significant challenge, often leading to overconfident yet potentially misaligned or unsafe plans. While researchers have explored uncertainty estimation to improve the reliability of LLM-based planning, existing studies have not sufficiently differentiated between epistemic and intrinsic uncertainty, limiting the effectiveness of uncertainty estimation. In this paper, we present Combined Uncertainty estimation for Reliable Embodied planning (CURE), which decomposes the uncertainty into epistemic and intrinsic uncertainty, each estimated separately. Furthermore, epistemic uncertainty is subdivided into task clarity and task familiarity for more accurate evaluation. The overall uncertainty assessments are obtained using random network distillation and multi-layer perceptron regression heads driven by LLM features. We validated our approach in two distinct experimental settings: kitchen manipulation and tabletop rearrangement experiments. The results show that, compared to existing methods, our approach yields uncertainty estimates that are more closely aligned with the actual execution outcomes.

Paper Structure

This paper contains 32 sections, 16 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of the proposed uncertainty-aware LLM planning framework. Given a natural language instruction (e.g., Give me sth to drink) and environmental context (e.g., Coke, Sprite, Apple), the LLM planner generates a high-level plan (e.g., I will give user Coke). Our framework estimates the overall planning uncertainty using the CURE module, which decomposes uncertainty into epistemic and intrinsic components. Epistemic uncertainty encompasses task familiarity and task clarity, while intrinsic uncertainty is represented by the expected success rate of the generated plan. The final uncertainty score then guides the decision to proceed, halt, or request clarification, thereby enhancing planning reliability in uncertain or ambiguous scenarios.
  • Figure 2: The process of task familiarity assessment
  • Figure 3: The process of UAN. Llama remains frozen throughout training. For tasks with clear objectives, both task clarity and expected success rate are trained. For tasks with ambiguous goals, only task clarity is trained.
  • Figure 4: The help rate-success rate curve of CURE and KnowNo for the Mobile Manipulator in a Kitchen setting. Two variants of the CURE algorithm were executed 11 times. The lighter-colored region represents the $2\sigma$ confidence interval.
  • Figure 5: The SR-HR-AUC metric is used to evaluate the performance of uncertainty estimation. The figure shows three curves: the actual uncertainty evaluation curve (orange solid line), the random evaluation curve (purple dashed line), and the perfect uncertainty evaluation curve (green solid line). The horizontal axis represents the help rate (HR), and the vertical axis represents the success rate (SR). By calculating the area difference between the actual curve and the random curve, and dividing by the area difference between the perfect curve and the random curve, a normalized AUC value is obtained. This metric effectively eliminates the influence of the baseline success rate, providing a more accurate reflection of the quality of uncertainty estimation.
  • ...and 3 more figures
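
The normalized SR-HR-AUC described in the Figure 5 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on assumptions that the caption does not state: tasks are sent for help in order of decreasing estimated uncertainty, a task that receives help counts as a success, the random curve rises linearly from the baseline success rate to 1, and the perfect curve helps every failing task first. The function names `sr_hr_curve` and `normalized_sr_hr_auc` are hypothetical.

```python
import numpy as np


def _trapezoid(y, x):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def sr_hr_curve(uncertainty, success):
    """Success rate as a function of help rate.

    Assumption: the k most uncertain tasks receive help, and a helped
    task always counts as a success.
    """
    u = np.asarray(uncertainty, dtype=float)
    s = np.asarray(success, dtype=float)
    n = len(s)
    order = np.argsort(-u, kind="stable")  # most uncertain first
    s_sorted = s[order]
    # remaining[k] = number of successes among tasks that are NOT helped
    remaining = np.concatenate([np.cumsum(s_sorted[::-1])[::-1], [0.0]])
    k = np.arange(n + 1)
    hr = k / n
    sr = (k + remaining) / n
    return hr, sr


def normalized_sr_hr_auc(uncertainty, success):
    """(A_actual - A_random) / (A_perfect - A_random), as in the caption."""
    s = np.asarray(success, dtype=float)
    hr, sr = sr_hr_curve(uncertainty, success)
    base = s.mean()                          # success rate with no help
    random_sr = base + hr * (1.0 - base)     # assumed random baseline
    perfect_sr = np.minimum(1.0, base + hr)  # assumed oracle: failures first
    a_actual = _trapezoid(sr, hr)
    a_random = _trapezoid(random_sr, hr)
    a_perfect = _trapezoid(perfect_sr, hr)
    denom = a_perfect - a_random
    if denom <= 0:  # all tasks succeed or all fail: metric is undefined
        return float("nan")
    return (a_actual - a_random) / denom


# Toy data: two successes, two failures.
success = [1, 1, 0, 0]
ideal = normalized_sr_hr_auc([0.1, 0.2, 0.9, 0.8], success)     # failures ranked most uncertain
inverted = normalized_sr_hr_auc([0.9, 0.8, 0.1, 0.2], success)  # successes ranked most uncertain
```

Under these assumptions, an estimator that ranks every failing task above every succeeding one scores 1, a random ranking scores about 0, and a fully inverted ranking scores below 0. Because both the numerator and the denominator subtract the random curve, the baseline success rate drops out, which is the normalization property the caption highlights.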