Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators

Linfeng Zhao, Willie McClinton, Aidan Curtis, Nishanth Kumar, Tom Silver, Leslie Pack Kaelbling, Lawson L. S. Wong

TL;DR

This work addresses robust long-horizon robotic manipulation under partial observability by integrating belief-space planning with vision-language models as uncertainty estimators. The core idea is to represent uncertainty with three-valued predicates ($K_P$, $K_{\neg P}$, Unknown) and to interleave manipulation with information-gathering actions through an online replanning loop, grounding goals via VLMs and validating plans with a determinized planning domain. The approach, termed BKLVA, demonstrates improved task success and efficiency over baselines in both synthetic tasks with real images and real-robot experiments on Spot, highlighting the potential of VLM-grounded belief representations for uncertainty-aware planning. The results indicate that combining symbolic belief-space reasoning with perceptual grounding enables scalable, open-world robotic systems capable of strategic perception and robust long-horizon execution. This framework lays groundwork for future automation of operators, better perception-to-planning integration, and deeper coupling with low-level control in uncertain environments.
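To make the three-valued predicate idea concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Belief` enum, the `update` rule, and the `goal_satisfied` helper are hypothetical names chosen for illustration, and the observation is assumed to arrive as `True`, `False`, or `None` (uninformative) rather than from a VLM.

```python
from enum import Enum
from typing import Dict, Optional


class Belief(Enum):
    """Three-valued status of a predicate P (names are illustrative)."""
    KNOWN_TRUE = "K_P"       # the agent knows P holds
    KNOWN_FALSE = "K_notP"   # the agent knows P does not hold
    UNKNOWN = "Unknown"      # the agent cannot tell yet


def update(prior: Belief, observation: Optional[bool]) -> Belief:
    """Assumed update rule: an uninformative observation keeps the prior."""
    if observation is None:
        return prior
    return Belief.KNOWN_TRUE if observation else Belief.KNOWN_FALSE


def goal_satisfied(belief: Dict[str, Belief], goal: Dict[str, Belief]) -> bool:
    """A belief goal holds only if every required fact is known as required."""
    return all(belief.get(fact, Belief.UNKNOWN) is want
               for fact, want in goal.items())


# Hypothetical example: the cup must be *known* to be non-empty.
belief = {"Empty(cup1)": Belief.UNKNOWN}
goal = {"Empty(cup1)": Belief.KNOWN_FALSE}
before = goal_satisfied(belief, goal)          # Unknown does not satisfy the goal
belief["Empty(cup1)"] = update(belief["Empty(cup1)"], False)
after = goal_satisfied(belief, goal)
```

The point of the third value is that an Unknown fact never satisfies a goal, so the planner has a reason to schedule an information-gathering action before committing to manipulation.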

Abstract

Generalizable robotic mobile manipulation in open-world environments poses significant challenges due to long horizons, complex goals, and partial observability. A promising approach to address these challenges involves planning with a library of parameterized skills, where a task planner sequences these skills to achieve goals specified in structured languages, such as logical expressions over symbolic facts. While vision-language models (VLMs) can be used to ground these expressions, they often assume full observability, leading to suboptimal behavior when the agent lacks sufficient information to evaluate facts with certainty. This paper introduces a novel framework that leverages VLMs as a perception module to estimate uncertainty and facilitate symbolic grounding. Our approach constructs a symbolic belief representation and uses a belief-space planner to generate uncertainty-aware plans that incorporate strategic information gathering. This enables the agent to effectively reason about partial observability and property uncertainty. We demonstrate our system on a range of challenging real-world tasks that require reasoning in partially observable environments. Simulated evaluations show that our approach outperforms both vanilla VLM-based end-to-end planning and VLM-based state estimation baselines by planning for and executing strategic information gathering. This work highlights the potential of VLMs to construct belief-space symbolic scene representations, enabling downstream tasks such as uncertainty-aware planning.

Paper Structure

This paper contains 35 sections, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Example tasks demonstrating various uncertainty levels. (1) Cup Pick-Place: a fully observable tabletop manipulation task with multiple cups. (2) Empty Cup Removal: requires inspecting cups from above to determine if they are empty before removal. (3) Drawer Cleaning: involves opening drawers to discover and remove objects inside. (4) Sort Weight: requires weighing sealed boxes on a scale to identify and dispose of empty ones. These tasks demonstrate increasing complexity in information gathering, from fully observable scenarios to those requiring strategic inspection and manipulation.
  • Figure 2: Example plan. A task to put any object in the drawer into a paper bin. Because the drawer is closed, the robot needs to maintain uncertainty about the environment and plan under that uncertainty to achieve the belief goal: KEmpty+(drawer) and Inside(block, box). The sequence shows: (1) initial reach to the closed drawer (without knowing whether the drawer is empty), (2) opening the drawer to reveal a blue block inside and updating the belief, (3) grasping the block from the drawer, (4) moving the block over the paper bin, and (5) successfully placing the block into the bin. This demonstrates how the robot handles uncertainty by interleaving information gathering (opening the drawer to check its contents) with manipulation actions (grasping and placing the block).
  • Figure 3: Pipeline overview. Our system integrates perception, belief-state update, and planning. The example shows a task of moving empty cups to a bin, where the system must evaluate cup properties and plan appropriate manipulation actions. Before runtime, a text goal is first translated into symbolic specifications, which, along with the actions, are determinized for the task planner. During a belief-state update step at runtime, given an observation (images and sensor inputs from a robot), the system runs two processes in parallel: (1) object pointing and segmentation to maintain a spatial memory of objects, and (2) predicate evaluation to ground belief predicates (e.g., Empty, On). The planner generates a symbolic plan based on the symbolic belief state, and the first action is executed to generate a new observation. The belief state is updated based on the new observation, and the process repeats until the goal is satisfied.
  • Figure 4: Sort Weight. An example task in our synthetic environment with real images. The agent needs to open the drawer and retrieve sealed boxes to weigh them. The agent cannot open the boxes; their contents can only be measured indirectly with a scale. The goal is to find the empty boxes and move them to a bin; a few notable states are shown in the figure. The optimal path takes 14 steps.
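
The plan-execute-observe-update cycle described in the Figure 2 and Figure 3 captions can be sketched as a short loop. This is a toy under stated assumptions, not the BKLVA system: `ToyDrawerWorld`, `plan`, and `run` are hypothetical names, the hand-written `plan` function stands in for the belief-space planner, and the environment reports ground truth where the real pipeline would query a VLM. Beliefs are encoded as `True` (known true), `False` (known false), or `None` (unknown).

```python
from typing import Dict, List, Optional

UNKNOWN = None  # three-valued beliefs: True, False, or None (unknown)


class ToyDrawerWorld:
    """Hypothetical environment: a closed drawer that may hide one block."""

    def __init__(self, has_block: bool) -> None:
        self.has_block = has_block
        self.drawer_open = False

    def observe(self) -> Dict[str, Optional[bool]]:
        # The drawer's contents are only visible once it has been opened.
        empty = (not self.has_block) if self.drawer_open else UNKNOWN
        return {"Empty(drawer)": empty}

    def step(self, action: str) -> Dict[str, Optional[bool]]:
        if action == "open_drawer":
            self.drawer_open = True
        elif action == "move_block_to_bin" and self.drawer_open:
            self.has_block = False
        return self.observe()


def plan(belief: Dict[str, Optional[bool]]) -> List[str]:
    """Stand-in planner: gather information first, then manipulate."""
    empty = belief.get("Empty(drawer)", UNKNOWN)
    if empty is True:
        return []                      # belief goal already satisfied
    if empty is UNKNOWN:
        return ["open_drawer"]         # information-gathering action
    return ["move_block_to_bin"]       # manipulation action


def run(world: ToyDrawerWorld, max_steps: int = 10) -> List[str]:
    """Replan every step: execute the first action, observe, update the belief."""
    belief = dict(world.observe())
    executed: List[str] = []
    for _ in range(max_steps):
        actions = plan(belief)
        if not actions:
            break
        observation = world.step(actions[0])
        for fact, value in observation.items():
            if value is not UNKNOWN:   # uninformative readings keep the old belief
                belief[fact] = value
        executed.append(actions[0])
    return executed


trace_with_block = run(ToyDrawerWorld(has_block=True))
trace_empty = run(ToyDrawerWorld(has_block=False))
```

In this toy, the robot always opens the drawer first because the goal fact is unknown, and it moves a block only if the observation reveals one. Executing just the first planned action before replanning is what lets new observations redirect the rest of the plan.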