Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation

Utkarsh Bajpai, Julius Rückin, Cyrill Stachniss, Marija Popović

TL;DR

This work tackles ObjectNav under semantic uncertainty by integrating a probabilistic semantic relevance model, online probabilistic geometric-semantic mapping, and an uncertainty-aware frontier planner. Semantic relevance scores are derived from a vision-language model via a prompt ensemble, and per-pixel uncertainties are fused into a Bayesian-style map. This enables frontier-based planning that balances exploitation of known relevant regions with exploration of uncertain areas. The approach achieves ObjectNav performance competitive with state-of-the-art open-vocabulary methods without fixed hand-crafted prompts, and it shows that baseline methods are brittle with respect to prompt phrasing. The proposed training-free, uncertainty-informed framework advances robust open-vocabulary perception for indoor robot navigation and points to future real-world deployments.
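To make the "prompt ensemble plus Bayesian-style map" idea concrete, here is a minimal sketch for a single map cell. It assumes the per-prompt relevance scores are summarised as a Gaussian (mean and variance) and fused with the cell's current Gaussian belief by a precision-weighted, Kalman-style update. The names `CellBelief`, `ensemble_measurement`, and `fuse` are hypothetical, and the paper's actual sensor model and map update may differ.

```python
from dataclasses import dataclass
from statistics import fmean, variance


@dataclass
class CellBelief:
    """Gaussian belief over the semantic relevance of one map cell (hypothetical type)."""
    mean: float
    var: float


def ensemble_measurement(scores):
    """Summarise relevance scores from several prompts as (mean, variance).

    Assumption: the spread across prompts is used as the measurement noise.
    A small floor keeps the variance strictly positive.
    """
    return fmean(scores), max(variance(scores), 1e-6)


def fuse(belief, meas_mean, meas_var):
    """Fuse a Gaussian measurement into a Gaussian belief (precision-weighted update)."""
    gain = belief.var / (belief.var + meas_var)
    return CellBelief(
        mean=belief.mean + gain * (meas_mean - belief.mean),
        var=(1.0 - gain) * belief.var,
    )


# Illustrative values only: an uninformative prior and three made-up prompt scores.
prior = CellBelief(mean=0.5, var=1.0)
meas_mean, meas_var = ensemble_measurement([0.6, 0.7, 0.8])
posterior = fuse(prior, meas_mean, meas_var)
print(posterior)
```

In this sketch, prompts that agree closely produce a small measurement variance and therefore pull the cell's belief strongly toward the ensemble mean, while prompts that disagree leave the belief closer to its prior.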

Abstract

Mobile robots exploring indoor environments increasingly rely on vision-language models to perceive high-level semantic cues in camera images, such as object categories. Such models offer the potential to substantially advance robot behaviour for tasks such as object-goal navigation (ObjectNav), where the robot must locate objects specified in natural language by exploring the environment. Current ObjectNav methods heavily depend on prompt engineering for perception and do not address the semantic uncertainty induced by variations in prompt phrasing. Ignoring semantic uncertainty can lead to suboptimal exploration, which in turn limits performance. Hence, we propose a semantic uncertainty-informed active perception pipeline for ObjectNav in indoor environments. We introduce a novel probabilistic sensor model for quantifying semantic uncertainty in vision-language models and incorporate it into a probabilistic geometric-semantic map to enhance spatial understanding. Based on this map, we develop a frontier exploration planner with an uncertainty-informed multi-armed bandit objective to guide efficient object search. Experimental results demonstrate that our method achieves ObjectNav success rates comparable to those of state-of-the-art approaches, without requiring extensive prompt engineering.
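The abstract does not spell out the bandit objective, so the following is only an illustrative sketch of how an uncertainty-informed frontier choice could look. It assumes an upper-confidence-bound (UCB) style score, mean relevance plus a weighted standard deviation, which is a standard multi-armed bandit heuristic and not necessarily the objective used in the paper. The function name `select_frontier`, the weight `beta`, and the frontier labels are hypothetical.

```python
import math


def select_frontier(frontiers, beta=1.0):
    """Pick the frontier with the highest UCB-style score.

    frontiers: dict mapping a frontier id to (mean_relevance, variance).
    beta: exploration weight (assumed parameter); beta=0 is pure exploitation.
    """
    def score(frontier_id):
        mean, var = frontiers[frontier_id]
        return mean + beta * math.sqrt(var)

    return max(frontiers, key=score)


# Made-up relevance estimates for two frontiers.
frontiers = {
    "hallway": (0.60, 0.01),  # well-observed, moderately relevant
    "office": (0.50, 0.09),   # less certain, possibly more relevant
}
print(select_frontier(frontiers, beta=0.0))
print(select_frontier(frontiers, beta=1.0))
```

With `beta=0` the planner greedily follows the highest mean relevance, whereas a larger `beta` favours frontiers whose relevance is still uncertain, which is the exploration-exploitation trade-off the abstract refers to.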

Paper Structure

This paper contains 14 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: We develop an uncertainty-informed, open-vocabulary ObjectNav pipeline for locating arbitrary objects in indoor environments. The figure visualises our approach: given a target object, the robot needs to navigate to it (green arrow) in an initially unknown environment. At each timestep, the robot actively selects one frontier to explore among all available frontiers (yellow rectangles) using our multi-armed bandit frontier planner, which is informed by semantic relevance estimates for each frontier (blue Gaussians) from our probabilistic geometric-semantic map (purple).
  • Figure 2: Our ObjectNav approach consists of the object detection and active semantic exploration modules. If the target object is detected in the currently recorded frame, the robot navigates directly to it using point-goal navigation, as discussed in Sec. \ref{sec:object_detection}. Otherwise, it explores the environment using our uncertainty-informed frontier planner guided by our probabilistic semantic relevance map. Incorporating semantic cues and uncertainty into our map allows us to intelligently explore regions with a higher probability of containing the target object.
  • Figure 3: We display the Quantile-Quantile (QQ) plot of 100 VLM-predicted semantic relevance scores, which are generated from 100 unique prompts around the target object name "printer". The quantiles of the semantic relevance scores (vertical) are contrasted with the theoretical quantiles of the standard normal distribution (horizontal). The linear relationship suggests that the semantic relevance scores are approximately normally distributed.
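The normality check described in the Figure 3 caption can be reproduced in spirit with the short sketch below. It pairs sorted scores with theoretical standard-normal quantiles and measures how linear the relationship is via the Pearson correlation. The scores here are synthetic placeholders drawn from a Gaussian, not the paper's VLM outputs; the plotting positions `(i + 0.5) / n` and the helper names `qq_points` and `pearson` are assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean


def qq_points(scores):
    """Return (theoretical_quantiles, sample_quantiles) for a normal QQ plot."""
    n = len(scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: midpoints of n equal-probability bins.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(scores)


def pearson(xs, ys):
    """Pearson correlation coefficient; values near 1 indicate a near-linear QQ plot."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    norm_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (norm_x * norm_y)


# Synthetic stand-in for 100 per-prompt relevance scores (NOT data from the paper).
rng = random.Random(0)
scores = [rng.gauss(0.3, 0.05) for _ in range(100)]
theoretical, sample = qq_points(scores)
print(pearson(theoretical, sample))
```

A correlation close to 1 corresponds to the roughly straight line in a QQ plot, which is the visual evidence the caption cites for treating the scores as approximately Gaussian.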