LatentBKI: Open-Dictionary Continuous Mapping in Visual-Language Latent Spaces with Quantifiable Uncertainty

Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, Maani Ghaffari

TL;DR

LatentBKI is a novel probabilistic mapping algorithm that enables open-vocabulary mapping with quantifiable uncertainty. Evaluated against comparable explicit semantic mapping and VL mapping frameworks on the Matterport3D and Semantic KITTI datasets, it retains the probabilistic benefits of continuous mapping while adding open-dictionary queries.

Abstract

This paper introduces a novel probabilistic mapping algorithm, LatentBKI, which enables open-vocabulary mapping with quantifiable uncertainty. Traditionally, semantic mapping algorithms focus on a fixed set of semantic categories, which limits their applicability to complex robotic tasks. Vision-Language (VL) models have recently emerged as a technique to jointly model language and visual features in a latent space, enabling semantic recognition beyond a predefined, fixed set of semantic classes. LatentBKI recurrently incorporates neural embeddings from VL models into a voxel map with quantifiable uncertainty, leveraging the spatial correlations of nearby observations through Bayesian Kernel Inference (BKI). LatentBKI is evaluated against similar explicit semantic mapping and VL mapping frameworks on the popular Matterport3D and Semantic KITTI datasets, demonstrating that LatentBKI maintains the probabilistic benefits of continuous mapping with the additional benefit of open-dictionary queries. Real-world experiments demonstrate applicability to challenging indoor environments.
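To make the idea of an open-dictionary query concrete, the following is a minimal sketch, not the paper's implementation. It assumes each voxel stores a latent feature vector and that category names have already been embedded by the VL model's text encoder; the cosine-similarity scoring, the `temperature` parameter, and the function name `open_dictionary_query` are illustrative assumptions.

```python
import numpy as np


def open_dictionary_query(voxel_feats, text_embeds, temperature=0.1):
    """Score voxel features against text embeddings (hypothetical helper).

    voxel_feats: (N, D) array, one latent feature per voxel.
    text_embeds: (C, D) array, one embedding per queried category name.
    Returns (labels, probs): the best category index per voxel and the
    softmax scores over the C queried categories.
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + eps)

    # Cosine similarity between every voxel and every category, shape (N, C).
    logits = (v @ t.T) / temperature

    # Numerically stable softmax over the category axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

Because the category list is only an input to the query, it can be changed after the map is built without touching the stored voxel features, which is the practical meaning of "open-dictionary" here.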

Paper Structure

This paper contains 16 sections, 14 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: LatentBKI enables semantic mapping by leveraging open-dictionary language inference with vision-language (VL) model data. VL networks process exteroceptive data to generate point-wise features, which LatentBKI integrates into a 3D map via Bayesian Kernel Inference (BKI). Unlike prior VL mapping methods, LatentBKI updates nearby voxels using spatial information and maintains quantifiable uncertainty via conjugate priors. As shown, LatentBKI applies to real-world scenes (middle), quantifies semantic uncertainty per voxel (left), and decodes voxel features into categories using the VL network's language-driven decoder (right).
  • Figure 2: This figure demonstrates the overall pipeline of LatentBKI. (a) The input to LatentBKI is 3D points, which can be from LiDAR, RGB-D, or any exteroceptive sensor with 3D input. (b) Points are then processed by an off-the-shelf neural network, which encodes each point into a latent space. (c) By adopting a Gaussian likelihood over the point-wise features, we perform closed-form Bayesian inference on a voxel map where each voxel contains parameters modeling the conjugate prior of the multivariate Gaussian distribution. Additionally, instead of only considering points which fall within a voxel, we consider nearby points weighted through a kernel function. (d) The posterior predictive distribution of each voxel in latent space can then be decoded using the decoder of the neural network, enabling the computation of open-dictionary segmentation predictions with expectation and uncertainty.
  • Figure 3: Effect of spatial smoothing with varying levels of image sparsity. Spatial smoothing, indicated by the filter size $k$, is most effective for sparse images. The original image has a resolution of $720$ by $1080$ pixels.
  • Figure 4: Sparsification plot of segmentation performance compared to quantified uncertainty. With well-calibrated uncertainty, segmentation performance should increase as the most uncertain points are removed.
  • Figure 5: Uncertainty maps for the 5LpN3gDmAk7 MP3D sequence. (a) Covered house mesh in the sequence. (b) Categorical variance map obtained by sampling from the distribution. (c) Variance map using E-optimality in latent space. (d) Variance map using D-optimality in latent space.
  • ...and 2 more figures
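The kernel-weighted voxel update described in the Figure 2 caption and the scalar uncertainty measures named in the Figure 5 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact model: it assumes a Gaussian likelihood with a fixed, known observation variance and a Gaussian prior on the voxel mean (the simplest conjugate case), and it uses the compactly supported sparse kernel common in earlier BKI mapping work. The function names, the `length_scale` default, and the reading of E-optimality as the largest covariance eigenvalue and D-optimality as the log-determinant are all assumptions.

```python
import numpy as np


def sparse_kernel(d, length_scale=0.3, sigma0=1.0):
    """Compactly supported kernel: sigma0 at d = 0, zero for d >= length_scale."""
    d = np.asarray(d, dtype=float)
    r = np.clip(d / length_scale, 0.0, 1.0)
    k = sigma0 * (
        (2.0 + np.cos(2.0 * np.pi * r)) * (1.0 - r) / 3.0
        + np.sin(2.0 * np.pi * r) / (2.0 * np.pi)
    )
    # Force exact zeros outside the support and clamp rounding noise.
    return np.where(d < length_scale, np.maximum(k, 0.0), 0.0)


def bki_update(voxel_center, mean, count, points, feats, length_scale=0.3):
    """One recursive update of a voxel's latent mean and pseudo-count.

    mean:   (D,) current posterior mean of the voxel feature.
    count:  positive scalar pseudo-count (prior strength plus past weights).
    points: (N, 3) observed 3D points; feats: (N, D) their latent features.
    """
    dists = np.linalg.norm(points - voxel_center, axis=1)
    w = sparse_kernel(dists, length_scale)  # nearby points count more
    new_count = count + w.sum()
    new_mean = (count * mean + w @ feats) / new_count
    return new_mean, new_count


def scalar_uncertainty(var_diag):
    """Reduce a diagonal latent covariance to two scalar summaries."""
    e_opt = var_diag.max()          # largest eigenvalue of a diagonal matrix
    d_opt = np.log(var_diag).sum()  # log-determinant of a diagonal matrix
    return e_opt, d_opt
```

Under the fixed-variance assumption, the posterior variance of the voxel mean is the observation variance divided by the pseudo-count, so voxels with many close observations end up with lower uncertainty. Points beyond `length_scale` receive zero weight and leave the voxel unchanged.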

Theorems & Definitions (2)

  • Remark 1
  • Remark 2