Table of Contents
Fetching ...

Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification

Jose-Luis Matez-Bandera, Pepe Ojeda, Javier Monroy, Javier Gonzalez-Jimenez, Jose-Raul Ruiz-Sarmiento

TL;DR

This work proposes Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps that treats neural network predictions as subjective opinions regarding map instances at both geometric and semantic levels, and incorporates a Large Vision-Language Model to perform semantic level disambiguation for instances with high uncertainty.

Abstract

Robots in human-centered environments require accurate scene understanding to perform high-level tasks effectively. This understanding can be achieved through instance-aware semantic mapping, which involves reconstructing elements at the level of individual instances. Neural networks, the de facto solution for scene understanding, still face limitations such as overconfident incorrect predictions with out-of-distribution objects or generating inaccurate masks.Placing excessive reliance on these predictions makes the reconstruction susceptible to errors, reducing the robustness of the resulting maps and hampering robot operation. In this work, we propose Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps. Inspired by the Theory of Evidence, Voxeland treats neural network predictions as subjective opinions regarding map instances at both geometric and semantic levels. These opinions are aggregated over time to form evidences, which are formalized through a probabilistic model. This enables us to quantify uncertainty in the reconstruction process, facilitating the identification of map areas requiring improvement (e.g. reobservation or reclassification). As one strategy to exploit this, we incorporate a Large Vision-Language Model (LVLM) to perform semantic level disambiguation for instances with high uncertainty. Results from the standard benchmarking on the publicly available SceneNN dataset demonstrate that Voxeland outperforms state-of-the-art methods, highlighting the benefits of incorporating and leveraging both instance- and semantic-level uncertainties to enhance reconstruction robustness. This is further validated through qualitative experiments conducted on the real-world ScanNet dataset.

Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification

TL;DR

This work proposes Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps that treats neural network predictions as subjective opinions regarding map instances at both geometric and semantic levels, and incorporates a Large Vision-Language Model to perform semantic level disambiguation for instances with high uncertainty.

Abstract

Robots in human-centered environments require accurate scene understanding to perform high-level tasks effectively. This understanding can be achieved through instance-aware semantic mapping, which involves reconstructing elements at the level of individual instances. Neural networks, the de facto solution for scene understanding, still face limitations such as overconfident incorrect predictions with out-of-distribution objects or generating inaccurate masks.Placing excessive reliance on these predictions makes the reconstruction susceptible to errors, reducing the robustness of the resulting maps and hampering robot operation. In this work, we propose Voxeland, a probabilistic framework for incrementally building instance-aware semantic maps. Inspired by the Theory of Evidence, Voxeland treats neural network predictions as subjective opinions regarding map instances at both geometric and semantic levels. These opinions are aggregated over time to form evidences, which are formalized through a probabilistic model. This enables us to quantify uncertainty in the reconstruction process, facilitating the identification of map areas requiring improvement (e.g. reobservation or reclassification). As one strategy to exploit this, we incorporate a Large Vision-Language Model (LVLM) to perform semantic level disambiguation for instances with high uncertainty. Results from the standard benchmarking on the publicly available SceneNN dataset demonstrate that Voxeland outperforms state-of-the-art methods, highlighting the benefits of incorporating and leveraging both instance- and semantic-level uncertainties to enhance reconstruction robustness. This is further validated through qualitative experiments conducted on the real-world ScanNet dataset.

Paper Structure

This paper contains 14 sections, 11 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Reconstruction at the instance and semantic levels of the sequence 096 from the SceneNN hua2016_scenenn dataset produced by the proposed method. While each different color in the instance map refers to a different instance, the colors in the semantic map represent the specific category indicated in the top-right legend. Additionally, both generated uncertainty maps are shown, being the blue areas those with minimum uncertainty, while red areas are those with the highest uncertainty.
  • Figure 2: Overview of Voxeland, a probabilistic framework for instance-aware semantic mapping. From a sequence of RGB-D images with known camera poses, our approach processes the RGB images with an instance semantic segmentation neural network and generates per-frame subjective opinions. The latter encompasses geometric and semantic levels, which are integrated with the existing map to accumulate evidence. Uncertainty maps at both levels are subsequently derived from their respective accumulated evidences. Please note that, in the uncertainty maps, blue indicates low entropy regions, while red represents high entropy ones.
  • Figure 3: Qualitative results of the instance map for sequence 011 of the SceneNN hua2016_scenenn dataset are shown using two state-of-the-art approaches (Voxblox++ grinvald2019_volumetric and Mascaro et al. mascaro2022_volumetric) and our proposed method. Our method demonstrates a reduction in the over-segmentation problem (highlighted by the blue circle) while improving object delineation without including spurious points (highlighted by the red circle). However, our method exhibits some incomplete areas (highlighted by the magenta circle) due to high uncertainty, specifically because the table was not detected by the neural network in several partial views. Additionally, the green circle in the semantic level uncertainty map indicates an out-of-distribution object (some posters on the wall). For clarification, in the uncertainty map, blue areas represent low entropy while red indicates high entropy regions.
  • Figure 4: Precision versus semantic level Shannon entropy curves for our proposed method with (blue solid line) and without (red dashed line) uncertainty map exploitation. Precision is calculated for each Shannon entropy value, considering only objects with an associated entropy lower than that threshold. The results indicate that instances with higher semantic level uncertainty often correspond to incorrect categorizations, suggesting that directly assigning the top-1 classification reduces overall performance.
  • Figure 5: Illustration of the under-segmentation problem in our proposed method. Our approach represents the entire set of books as a single instance (in green), whereas the ground-truth data identifies each group of books per shelf as separate instances.
  • ...and 2 more figures