Perception Matters: Enhancing Embodied AI with Uncertainty-Aware Semantic Segmentation

Sai Prasanna, Daniel Honerkamp, Kshitij Sirohi, Tim Welschehold, Wolfram Burgard, Abhinav Valada

TL;DR

The paper tackles the substantial gap between ground-truth perception and pretrained, potentially overconfident semantic predictions in sequential embodied AI tasks like ObjectNav. It proposes an uncertainty-aware pipeline that calibrates perception via temperature scaling, computes per-pixel uncertainty, and performs uncertainty-weighted map aggregation along with a map-uncertainty-based found decision, integrated into modular perception-mapping-policy architectures. Across multiple perception models (e.g., Mask R-CNN, SegFormer, EMSANet) and policies (shortest-path and RL), the approach reduces false found decisions and improves success rates and SPL on HM3D Habitat ObjectNav, demonstrating robust gains without additional training costs. The authors release code and trained models to facilitate adoption and future work on policy conditioning on calibrated uncertainty.
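As a rough illustration of the first two steps named above, the following sketch applies temperature scaling to the logits of a single pixel and computes a normalized-entropy uncertainty. It is not the authors' implementation: the function names, the example logits, and the temperature value are assumptions made for illustration. In practice, the temperature would be fitted on a held-out validation set.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(num_classes), giving a value in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


if __name__ == "__main__":
    pixel_logits = [4.0, 1.0, 0.5]  # made-up logits for a 3-class pixel
    for temperature in (1.0, 2.0):  # 2.0 is a hypothetical fitted temperature
        probs = softmax(pixel_logits, temperature)
        print(temperature, [round(p, 3) for p in probs],
              round(normalized_entropy(probs), 3))
```

A temperature above 1 flattens the distribution, so the top-class confidence drops and the entropy rises. This is the intended effect when the uncalibrated model is overconfident.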

Abstract

Embodied AI has made significant progress acting in unexplored environments. However, tasks such as object search have largely focused on efficient policy learning. In this work, we identify several gaps in current search methods: They largely focus on dated perception models, neglect temporal aggregation, and transfer from ground truth directly to noisy perception at test time, without accounting for the resulting overconfidence in the perceived state. We address the identified problems through calibrated perception probabilities and uncertainty across aggregation and found decisions, thereby adapting the models for sequential tasks. The resulting methods can be directly integrated with pretrained models across a wide family of existing search approaches at no additional training cost. We perform extensive evaluations of aggregation methods across both different semantic perception models and policies, confirming the importance of calibrated uncertainties in both the aggregation and found decisions. We make the code and trained models available at https://semantic-search.cs.uni-freiburg.de.
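To make the idea of using uncertainty in both aggregation and found decisions more concrete, here is a minimal sketch for a single map cell. The weighting rule (one minus the normalized entropy), the `MapCell` class, the `found` function, and the `max_uncertainty` threshold are all assumptions chosen for illustration. The paper evaluates several aggregation methods, and this sketch should not be read as any specific one of them.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(num_classes), giving a value in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


class MapCell:
    """One map cell that fuses per-class probabilities observed over time."""

    def __init__(self, num_classes):
        self.weighted_sum = [0.0] * num_classes
        self.weight_total = 0.0

    def update(self, probs):
        # Assumed weighting: confident observations count more.
        weight = max(0.0, 1.0 - normalized_entropy(probs))
        for cls, p in enumerate(probs):
            self.weighted_sum[cls] += weight * p
        self.weight_total += weight

    def probs(self):
        n = len(self.weighted_sum)
        if self.weight_total == 0.0:
            return [1.0 / n] * n  # nothing informative observed yet
        return [s / self.weight_total for s in self.weighted_sum]


def found(cell, target_class, max_uncertainty=0.5):
    """Declare the target found only if it is the top class AND the cell is certain enough."""
    p = cell.probs()
    top_class = max(range(len(p)), key=p.__getitem__)
    return top_class == target_class and normalized_entropy(p) <= max_uncertainty


if __name__ == "__main__":
    cell = MapCell(num_classes=3)
    for observation in ([0.4, 0.35, 0.25], [0.05, 0.9, 0.05], [0.1, 0.85, 0.05]):
        cell.update(observation)
    print([round(p, 3) for p in cell.probs()], found(cell, target_class=1))
```

In this example the first, nearly uniform observation receives almost no weight, so the two later confident observations dominate the fused estimate. A cell that has only seen the uncertain observation would not trigger a found decision, even though the target is its most likely class.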

Paper Structure (16 sections, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Success rate of an RL agent (fabian22exploration) on the Habitat ObjectNav task with different semantic perception models. The gap from ground truth to learned perception models is often larger than the gap to an optimal policy. We propose uncertainty-based aggregation for sequential decision problems and find that this reduces the perception gap substantially. Ground Truth: ground truth semantic masks, One-Step: latest semantic prediction, Aggregated: best evaluated aggregation method of the model (cf. the experiments section).
  • Figure 2: Overview of modular object search pipelines. First, a semantic segmentation model classifies the current image. A mapping module then fuses this information into a semantic point cloud and integrates it into a global map. From this map, either an egocentric map is extracted for RL agents or the full map is used by a planner. An agent then determines the navigation actions and the found decision for a given target class $c$. We develop general methods to incorporate calibrated uncertainties in this system for temporal aggregation of the semantic perception and consistent found decisions.
  • Figure 3: Expected Calibration Error (left) and Uncertainty Expected Calibration Error (right) of the different semantic perception models on the validation set.
  • Figure 4: Semantic maps showing, from left to right, the ground truth semantics, the aggregated predictions of our Weighted Averaging approach, and the resulting uncertainty map. Circles indicate positions where a target object was falsely detected but, due to the high uncertainty, no false found decision was raised. The uncertainty color scale runs from blue to yellow, corresponding to a normalized entropy of 0.0 to 1.0.
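
For readers unfamiliar with the calibration metric in Figure 3, the sketch below computes the standard Expected Calibration Error with equal-width confidence bins. The function name, the bin count, and the toy data are assumptions for illustration; the binning used in the paper is not stated here, and the uncertainty-based variant shown in the right panel is not covered.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy data: ten predictions at 0.9 confidence, only five of them correct.
    confidences = [0.9] * 10
    correct = [True] * 5 + [False] * 5
    print(round(expected_calibration_error(confidences, correct), 3))
```

A well-calibrated model that is right about 90% of the time when it reports 0.9 confidence would score close to zero, while the overconfident toy model above is penalized by the gap between its confidence and its accuracy.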