OMCL: Open-vocabulary Monte Carlo Localization

Evgenii Kruzhkov, Raphael Memmesheimer, Sven Behnke

TL;DR

OMCL tackles robust localization when maps are built from diverse sensors by grounding pose estimation in open-vocabulary visual–language features rather than fixed semantic categories. It builds an Octree Language Map that stores CLIP-like features and uses ray tracing to evaluate observation-map consistency within a Monte Carlo Localization framework, enabling cross-modal mapping and localization with open-set prompts for global initialization. The main contributions include two mapping options to construct the language map, a cosine-similarity–based measurement model with stratified ray sampling, and a prompt-augmented initialization method that accelerates global localization. Evaluation across Matterport3D, Replica, and SemanticKITTI demonstrates strong indoor/outdoor generalization, superior localization accuracy to semantic baselines, and tangible gains from prompt-guided initialization and sampling strategies.
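
The cosine-similarity measurement model mentioned above can be illustrated with a minimal sketch. It assumes each particle already has one ray-traced map feature per sampled image ray. The function names, the exponential mapping from similarity to weight, and the `sharpness` parameter are illustrative assumptions, not the paper's formula.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def particle_weights(observed, traced_per_particle, sharpness=5.0):
    """Return normalized particle weights.

    observed: list of image feature vectors, one per sampled ray.
    traced_per_particle: for each particle, the map features ray-traced
        along the same rays from that particle's candidate pose.
    sharpness: hypothetical temperature; larger values favor the best match more strongly.
    """
    scores = []
    for traced in traced_per_particle:
        sims = [cosine_similarity(o, m) for o, m in zip(observed, traced)]
        mean_sim = sum(sims) / len(sims)
        # Assumed mapping: higher mean similarity gives exponentially higher weight.
        scores.append(math.exp(sharpness * mean_sim))
    total = sum(scores)
    return [s / total for s in scores]


# Two rays, two particles: the first particle's map features match the image,
# the second particle's features are swapped between the rays.
observed = [[1.0, 0.0], [0.0, 1.0]]
matching = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
weights = particle_weights(observed, [matching, swapped])
```

In this toy setup the particle whose ray-traced features agree with the image receives most of the weight, which is the behavior a resampling step relies on.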

Abstract

Robust robot localization is an important prerequisite for navigation planning. If the environment map was created from different sensors, robot measurements must be robustly associated with map features. In this work, we extend Monte Carlo Localization using vision-language features. These open-vocabulary features make it possible to robustly compute the likelihood of visual observations, given a camera pose and a 3D map created from posed RGB-D images or aligned point clouds. The abstract vision-language features also make it possible to associate observations and map elements from different modalities. Global localization can be initialized by natural language descriptions of the objects present in the vicinity of locations. We evaluate our approach using Matterport3D and Replica for indoor scenes and demonstrate generalization on SemanticKITTI for outdoor scenes.
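
As a rough illustration of language-based initialization, the sketch below seeds particles near the map cells whose features best match a prompt embedding. Everything here is an assumption made for illustration: the 2D cell positions, the precomputed `prompt_feature` (standing in for a text-encoder output), the `top_k` and `pos_sigma` parameters, and the Gaussian spread around each seed.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def init_particles_from_prompt(cells, prompt_feature, n_particles,
                               top_k=2, pos_sigma=0.5, rng=None):
    """Sample (x, y, yaw) particles around the map cells most similar to the prompt.

    cells: list of ((x, y), feature) pairs, a simplified stand-in for map voxels.
    prompt_feature: embedding of the user prompt (assumed to be given).
    """
    rng = rng or random.Random()
    ranked = sorted(cells,
                    key=lambda cell: cosine_similarity(cell[1], prompt_feature),
                    reverse=True)
    seeds = [position for position, _ in ranked[:top_k]]
    particles = []
    for i in range(n_particles):
        seed_x, seed_y = seeds[i % len(seeds)]  # spread particles evenly over seeds
        x = rng.gauss(seed_x, pos_sigma)
        y = rng.gauss(seed_y, pos_sigma)
        yaw = rng.uniform(-math.pi, math.pi)  # heading is unknown, so sample it uniformly
        particles.append((x, y, yaw))
    return particles


# Toy map with three cells; the prompt embedding matches the first cell best.
cells = [((0.0, 0.0), [1.0, 0.0]),
         ((10.0, 10.0), [0.0, 1.0]),
         ((20.0, 0.0), [0.9, 0.1])]
particles = init_particles_from_prompt(cells, [1.0, 0.0], n_particles=50,
                                       top_k=1, pos_sigma=0.1,
                                       rng=random.Random(0))
```

Concentrating the initial particle set near prompt-matching locations reduces the area the filter has to cover compared with uniform random initialization.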

Paper Structure

This paper contains 13 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: OMCL particles are sampled on a Language Map storing open-vocabulary features. Each particle represents a candidate camera pose. Particles are weighted according to how well the VLM-processed RGB input matches the ray-traced map value at the relevant location. The red particle denotes the estimated pose (weighted mean). The Language Map is colored by similarity to the prompted labels.
  • Figure 2: Mapping: We propose two options to create Octree Language Maps. Input Option 1: OMCL derives language features from RGB images and reconstructs a 3D map from them using the corresponding volumetric data (depth images, LiDAR measurements, etc.). Input Option 2: Language features are directly predicted on precomputed 3D point clouds for each point and subsequently converted into the octree representation. Localization: A particle filter uses an RGB image as the only input, weighting particles by the discrepancy between language features extracted from the input image and those ray-traced from the Octree Language Map. Our stratified ray sampling strategy compensates for the imbalance between different object instance sizes in the image. All features are colored for visualization purposes only.
  • Figure 3: Sampled pixels for images of resolution $540 \times 540$. Both images employ the same sampling masks. The left image corresponds to $2^8$ samples per cluster and the right one to $2^{11}$. Small clusters have a higher sampling density, but they receive fewer samples in total because duplicates are discarded.
  • Figure 4: An example of initial locations for global localization based on the user prompt. The red spots correspond to the prompt (toilet, mirror, towel, sink) and the green one to (table, chair, picture, door, tv monitor). OMCL particles can be initialized near the prompt-matching spots instead of at random locations.
  • Figure 5: Example trajectories performed by OMCL on the Matterport3D dataset, with APE indicated by color for each segment. Map projections are shown in brown. The plots demonstrate the performance in scenarios that involve loopy paths, corridors, and long monotonic trajectories.
  • ...and 2 more figures
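
The behavior described in the Figure 3 caption (a fixed sample budget per cluster, with duplicates discarded) can be sketched as follows. How clusters are formed and how samples are drawn is not stated here, so drawing with replacement and then deduplicating is an assumption that merely reproduces the described effect. The function name and the toy label mask are hypothetical.

```python
import random
from collections import defaultdict


def stratified_pixel_samples(label_mask, samples_per_cluster, rng=None):
    """Draw the same number of pixel samples from every cluster, then drop duplicates.

    label_mask: 2D list of cluster ids, one per pixel.
    Returns a dict mapping cluster id to a sorted list of unique (row, col) pixels.
    """
    rng = rng or random.Random()
    pixels_by_cluster = defaultdict(list)
    for row, labels in enumerate(label_mask):
        for col, label in enumerate(labels):
            pixels_by_cluster[label].append((row, col))

    sampled = {}
    for label, pixels in pixels_by_cluster.items():
        # Sampling with replacement: a small cluster is hit repeatedly,
        # so it ends up densely covered but with fewer unique pixels.
        draws = rng.choices(pixels, k=samples_per_cluster)
        sampled[label] = sorted(set(draws))
    return sampled


# Toy 4x4 mask: cluster 1 covers two pixels, cluster 0 covers the other fourteen.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
samples = stratified_pixel_samples(mask, samples_per_cluster=8,
                                   rng=random.Random(0))
```

Because every cluster gets the same budget, small objects are not drowned out by large surfaces such as walls or floors, which is the imbalance the stratified strategy is meant to compensate for.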